Section Ethics Spotlight: Representation in Data

As we plan our data investigations, it’s crucial to consider who is represented in our data and who might be missing or underrepresented.
Key ethical considerations regarding representation include:
  • Selection bias: Does our data systematically exclude certain groups?
  • Sampling fairness: Does our sample adequately represent diverse populations?
  • Historical exclusion: Are we working with data that reflects historical patterns of exclusion?
  • Appropriate categorization: Do our categories respect how people identify themselves?
  • Contextual interpretation: Are we considering social and historical context when interpreting group differences?
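
One way to start checking the first two considerations is to compare each group's share of the sample with its share of a reference population. The sketch below is a minimal illustration, not a full bias audit: the function name `representation_gaps`, the language groups, and the reference shares are all hypothetical, and in practice the reference shares would come from a trusted source such as census figures.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Compare sample shares with reference population shares.

    Returns {group: sample_share - population_share}. A negative value
    means the group is underrepresented in the sample.
    """
    counts = Counter(sample_groups)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample_groups is empty")
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical survey responses, labeled by the respondent's primary language.
responses = ["English"] * 18 + ["Spanish"] * 2

# Made-up reference shares for illustration only.
reference = {"English": 0.70, "Spanish": 0.20, "Other": 0.10}

gaps = representation_gaps(responses, reference)

# List the most underrepresented groups first.
for group, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{group}: {gap:+.2f}")
```

A group that appears in the reference population but never in the sample (here, "Other") still shows up with a negative gap, which is often the most important signal. A comparison like this only covers groups we thought to list; it cannot reveal groups that are missing from both the sample and the reference categories.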

Example 48. Representation in Community Health Data.

In our Community Health dataset, we might need to consider:
  • Whether health surveys reached residents who don’t speak English
  • Whether environmental monitoring stations are distributed equitably across neighborhoods
  • Whether certain communities have historically been excluded from public health research
  • Whether the neighborhood boundaries used in our analysis reflect meaningful community divisions
  • How to interpret health disparities without reinforcing harmful stereotypes

Checkpoint 49. Data Representation Scenarios.

For each scenario, identify the primary ethical concern related to representation in data.
Scenario 1: A city collects feedback about public services through an online survey that requires a smartphone and internet access.
What is the primary ethical concern in this scenario?
  • a. Selection bias
  • b. Privacy violation
  • c. Inappropriate categorization
  • d. Excessive data collection
Scenario 2: A medical research study examines health outcomes using patient data from university hospitals in affluent neighborhoods.
What is the primary ethical concern in this scenario?
  • a. Inappropriate categorization
  • b. Sampling fairness
  • c. Data security
  • d. Transparency
Hint.
Consider who might be included or excluded from each data collection approach, and how that might affect the conclusions drawn from the data.
Answer 1.
a
Answer 2.
b
Solution.
Scenario 1: The primary ethical concern is selection bias.
An online-only survey systematically excludes residents without smartphones or internet access, so it likely underrepresents lower-income and elderly populations.
Scenario 2: The primary ethical concern is sampling fairness.
The sample is drawn only from patients with access to university hospitals in affluent areas, so it may miss health patterns in underserved communities.

Activity 14. Evaluating Representation in Your Dataset.

In this activity, you’ll critically examine representation issues in your chosen dataset.

(a)

Identify at least three ways in which your dataset might not fully represent the population you’re interested in studying.

(b)

Consider how these representation issues might affect the conclusions you can draw from your analysis.

(c)

Propose at least two strategies for acknowledging or addressing these representation issues in your investigation.