Section Understanding Data Complexity and Quality

As students become more sophisticated in their data work, they need to understand that datasets differ in quality and complexity. Some are clean and well organized; others are messy and complex. Learning to work with increasing levels of complexity prepares students for real-world data analysis.

Checkpoint 41. Recognizing Data Complexity.

Consider these different datasets and their complexity levels.

(a)

Dataset A: 25 students, 2 variables (favorite color, grade level), no missing data. What makes this “simple” data?
Answer.
Small size, few variables, complete data, clear categories.

(b)

Dataset B: 200 students, 8 variables (including some with missing data), collected over 3 different time periods. What makes this more complex?
Answer.
Larger size, more variables, missing data, and a time dimension all add complexity.
Students should start with simple, clean datasets and gradually work up to more complex data as their skills develop.

Exploration 15. Teaching Students to Assess Data Quality.

Questions Students Can Ask About Any Dataset:
Completeness: Are there missing values? How much data is missing?
Consistency: Are similar things recorded in similar ways throughout the dataset?
Accuracy: Do the numbers make sense? Are there obvious errors or outliers?
Relevance: Does this data actually help answer our question?
Timeliness: Is this data recent enough to be useful for our investigation?
Elementary Approach: Use simple checklists and visual inspection to assess data quality as a class.
Secondary Approach: Have students create data quality reports identifying strengths and limitations of their datasets.
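
For teachers who want to show secondary students how the completeness and consistency questions can be checked by computer, here is a minimal Python sketch. The records, variable names, and the helper functions `missing_counts` and `distinct_spellings` are hypothetical and made up for illustration; they are not part of any particular curriculum or library.

```python
from collections import Counter

# Hypothetical survey records; None marks a missing value.
records = [
    {"name": "Ana", "grade": "5", "favorite_color": "blue"},
    {"name": "Ben", "grade": "Grade 5", "favorite_color": None},
    {"name": "Cy", "grade": "6", "favorite_color": "Blue"},
    {"name": "Di", "grade": None, "favorite_color": "red"},
]


def missing_counts(rows):
    """Completeness: count missing (None or empty) values per variable."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                counts[key] += 1
    return dict(counts)


def distinct_spellings(rows, key):
    """Consistency: list the distinct ways a variable was recorded."""
    return sorted({row[key] for row in rows if row[key] is not None})


print(missing_counts(records))
print(distinct_spellings(records, "grade"))
print(distinct_spellings(records, "favorite_color"))
```

In this made-up data, the grade column mixes "5" and "Grade 5", and the color column mixes "blue" and "Blue". Students can discuss whether these are the same answer recorded in different ways. The code only surfaces the candidates; deciding what counts as an inconsistency still takes human judgment.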

Checkpoint 42.

Students are using attendance data from their school. They notice that Fridays consistently show lower attendance than other days, but there’s no data recorded for several random dates throughout the year. What should they consider about data quality?
Hint.
Think about what might explain the patterns they’re seeing and how missing data might affect their conclusions.
Solution.
Students should: (1) Consider whether lower Friday attendance reflects a real pattern or data collection issues, (2) Investigate why certain dates are missing (holidays? technical problems? weather closures?), (3) Decide whether to exclude incomplete weeks from analysis, and (4) Consider how missing data might bias their conclusions about attendance patterns.
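
One way students might begin this investigation is sketched below in Python. The attendance numbers and dates are invented, and the helpers `missing_weekdays` and `mean_by_weekday` are hypothetical names chosen for this example. The sketch assumes school meets Monday through Friday and does not know about holidays, so every gap it reports still has to be checked against the school calendar.

```python
from datetime import date, timedelta
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

# Hypothetical attendance counts keyed by date; some school days are unrecorded.
attendance = {
    date(2024, 3, 4): 190,   # Monday
    date(2024, 3, 5): 188,
    date(2024, 3, 7): 189,
    date(2024, 3, 8): 171,   # Friday
    date(2024, 3, 11): 191,
    date(2024, 3, 12): 187,
    date(2024, 3, 13): 190,
    date(2024, 3, 15): 169,  # Friday
}


def missing_weekdays(counts):
    """Return Monday-Friday dates between the first and last record with no entry."""
    day, last = min(counts), max(counts)
    gaps = []
    while day <= last:
        if day.weekday() < 5 and day not in counts:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def mean_by_weekday(counts):
    """Average attendance per weekday, using only the dates that were recorded."""
    by_day = {}
    for day, present in counts.items():
        by_day.setdefault(WEEKDAYS[day.weekday()], []).append(present)
    return {name: mean(values) for name, values in by_day.items()}


print(missing_weekdays(attendance))
print(mean_by_weekday(attendance))
```

Listing the gaps first lets students ask why each date is missing before they interpret the weekday averages. If the missing dates cluster on one weekday, the average for that day rests on fewer observations and deserves extra caution.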

Checkpoint 43.

Think about your students’ current skill level. What level of data complexity would be appropriate for their next investigation? What support would they need to work with slightly more complex data than they’ve used before?