Section Ethics Spotlight: Selection Bias

When filtering and subsetting data, we must be careful not to introduce selection bias: a distortion in our results caused by the way we select cases for analysis.

Definition 67.

Selection bias occurs when the data selected for analysis is not representative of the population about which conclusions are to be drawn, leading to systematic error in the findings.
Common forms of selection bias include:
  • Sampling Bias: The sample selected for study doesn’t represent the population of interest.
  • Self-Selection Bias: Participants choose whether to participate, potentially introducing systematic differences between participants and non-participants.
  • Survivorship Bias (also called survival bias): Analysis focuses only on cases that "survived" some process, ignoring those that didn’t.
  • Exclusion Bias: Certain groups or cases are systematically excluded because of methodological choices.
  • Confirmation Bias: A tendency to filter data in ways that confirm preexisting beliefs or hypotheses. Strictly speaking this is a cognitive bias, but it often drives biased selection.

Example 68. Selection Bias in Community Health Research.

In our Community Health project, selection bias might occur if:
  • We exclude neighborhoods with missing data, which happen to be primarily lower-income areas
  • Health survey data includes only responses from residents who volunteered to participate
  • Air quality measurements are taken only on weekdays, missing weekend patterns
  • We focus our analysis only on neighborhoods with good outcomes to identify "best practices"
  • We filter data to include only cases that support our initial hypothesis about environmental factors
Each of these filtering or selection decisions could lead to conclusions that don’t accurately represent the true relationships in the complete population.
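As a rough illustration of the first scenario, the following Python sketch uses a handful of made-up neighborhood records to show how dropping rows with missing data can change who is represented. The field names `income` and `asthma_rate`, the helper `share_lower_income`, and all values are hypothetical and are not taken from the Community Health project.

```python
# Hypothetical neighborhood records; every name and value is invented for illustration.
neighborhoods = [
    {"name": "A", "income": "lower", "asthma_rate": None},
    {"name": "B", "income": "lower", "asthma_rate": 12.0},
    {"name": "C", "income": "lower", "asthma_rate": None},
    {"name": "D", "income": "higher", "asthma_rate": 6.0},
    {"name": "E", "income": "higher", "asthma_rate": 7.0},
    {"name": "F", "income": "higher", "asthma_rate": 5.0},
]


def share_lower_income(rows):
    """Return the fraction of rows whose income group is 'lower'."""
    return sum(r["income"] == "lower" for r in rows) / len(rows)


# The filter under scrutiny: keep only neighborhoods with a recorded rate.
complete = [r for r in neighborhoods if r["asthma_rate"] is not None]
excluded = [r for r in neighborhoods if r["asthma_rate"] is None]

print(f"Lower-income share before filtering: {share_lower_income(neighborhoods):.2f}")
print(f"Lower-income share after filtering:  {share_lower_income(complete):.2f}")
print("Excluded neighborhoods:", [r["name"] for r in excluded])
```

In this toy data the lower-income share falls from 0.50 to 0.25 once incomplete rows are dropped, so any summary computed on the filtered rows would describe a different mix of neighborhoods than the one we started with.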
To minimize selection bias when filtering data:
  • Document all filtering criteria and justify them on substantive grounds rather than convenience
  • Consider how excluded cases differ from included ones
  • Perform a sensitivity analysis by comparing results under different filtering criteria
  • Be transparent about the limitations of your filtered dataset
  • Actively look for potential sources of bias in your selection process
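One way to carry out the sensitivity analysis suggested above is to rerun the same summary under several filtering thresholds and record how many cases each threshold drops. The sketch below is a minimal example under assumed data: the records, the `min_responses` threshold, and the helper `filtered_mean` are all hypothetical, and the summary is a simple unweighted mean.

```python
from statistics import mean

# Hypothetical survey summaries: (neighborhood, number of responses, mean health score).
records = [
    ("A", 5, 58.0),
    ("B", 12, 62.0),
    ("C", 30, 70.0),
    ("D", 45, 74.0),
    ("E", 80, 76.0),
]


def filtered_mean(rows, min_responses):
    """Return (unweighted mean score of kept rows, number of rows dropped)."""
    kept = [score for _, n, score in rows if n >= min_responses]
    return mean(kept), len(rows) - len(kept)


# Rerun the same summary under several candidate filtering criteria.
for threshold in (0, 10, 25, 40):
    avg, dropped = filtered_mean(records, threshold)
    print(f"min_responses={threshold:>2}: mean score={avg:.1f}, dropped={dropped}")
```

If the summary shifts noticeably as the threshold changes, as it does in this invented data, the conclusion depends on the filter and that dependence should be reported alongside the result.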

Checkpoint 69. Identifying Selection Bias.

    Which scenario represents the clearest example of selection bias?
  • A researcher randomly selects 100 participants from a complete list of 1,000 eligible individuals.
    Feedback: This describes proper random sampling rather than selection bias, as every eligible individual had an equal chance of being selected.
  • After collecting data, a researcher finds that one variable has a non-normal distribution.
    Feedback: A non-normal distribution is not necessarily an indication of bias; many variables naturally have skewed or other non-normal distributions.
  • A study finds weak but statistically significant correlations between variables.
    Feedback: Finding weak correlations is not an indication of selection bias; it simply describes the strength of relationships in the data.
  • A survey about internet usage is conducted exclusively through online questionnaires.
    Feedback: Correct! This is a clear example of selection bias. By conducting the survey only online, the study systematically excludes people with limited or no internet access, who likely have different internet usage patterns than those who regularly use the internet.

Activity 20. Identifying Potential Selection Bias.

In this activity, you’ll examine potential selection bias in your own data analysis.

(a)

Review the filtering and subsetting operations you’ve performed on your dataset. For each filter, identify who or what is being excluded.

(b)

Consider how these exclusions might affect your conclusions. Are you systematically excluding certain types of cases?

(c)

Identify at least one potential source of selection bias in your data collection process (before you even received the dataset).

(d)

Propose strategies for addressing or minimizing the selection bias you’ve identified.