
Section Organizing and Cleaning Data: From Messy to Meaningful

Real-world data collection is messy. Students might record information inconsistently, miss data points, or make recording errors. Teaching students to organize and clean data helps them understand that data analysis requires careful preparation. The videos below, from Delavari, Shelton, Ireland, and Weiland (2025) through Statistical Literacy and Critical Education (SLiCE), cover the key components of data moves.
Data Moves: Filtering, with associated activity: codap.concord.org/releases/latest/static/dg/en/cert/index.html?url=https://concord-consortium.github.io/codap-data/SampleDocs/Science/Biology/27mammals/Mammals_Sample.codap#
Data Moves: Grouping, with associated activity: codap.concord.org/releases/latest/static/dg/en/cert/index.html?url=https://concord-consortium.github.io/codap-data/SampleDocs/Education/School_Children/School_Children.json#
Data Moves: Calculating and Recoding
Data Moves: Summarizing, with associated activity: codap.concord.org/app/static/dg/en/cert/index.html#shared=https%3A%2F%2Fcfm-shared.concord.org%2FXFAFf7va8Vw5t6aXuzH9%2Ffile.json

Checkpoint 35.

Students survey classmates about favorite colors and get these responses: “blue”, “Blue”, “light blue”, “navy”, “red”, “RED”, “green”. What problems do you notice?
Hint.
Look for inconsistencies in how the same type of response is recorded.
Answer.
Inconsistent capitalization and unclear categories (are “blue”, “light blue”, and “navy” the same or different?).
Solution.
This data needs cleaning because: (1) capitalization is inconsistent (“blue” vs “Blue” vs “RED”), and (2) it’s unclear whether “light blue” and “navy” should count as “blue” or be separate categories. Students need to make decisions about how to group responses consistently before they can analyze the data meaningfully.
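For classes that work with code, the two problems in this checkpoint can be separated in a short Python sketch. Fixing capitalization is mechanical, while merging shades is a judgment call. The `shade_to_color` mapping below is a hypothetical class decision, not the only correct answer.

```python
from collections import Counter

# Responses copied from the checkpoint.
responses = ["blue", "Blue", "light blue", "navy", "red", "RED", "green"]

# Problem 1: inconsistent capitalization. Lowercasing is a mechanical fix.
normalized = [r.strip().lower() for r in responses]
counts = Counter(normalized)
print("After fixing capitalization:", dict(counts))

# Problem 2: unclear categories. This mapping is an ASSUMED class decision
# (treat every shade of blue as "blue"); a class could also keep them separate.
shade_to_color = {"light blue": "blue", "navy": "blue"}
grouped = Counter(shade_to_color.get(r, r) for r in normalized)
print("After grouping shades:", dict(grouped))
```

Comparing the two printed tables gives students something concrete to debate: the count for "blue" changes only because of the grouping decision, not because the data changed.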

Exploration 13. Try This Week: Data Organization Practice.

Time needed: 15-20 minutes after data collection
Basic Organization Steps:
1. Check for Completeness: Are there any missing responses? Illegible entries?
2. Standardize Formats: Make capitalization consistent, spell out abbreviations, use consistent units.
3. Group Similar Responses: Decide how to handle responses that are similar but not identical.
4. Create Simple Tables: Organize data into rows and columns with clear labels.
5. Double-Check Numbers: Do counts add up correctly? Are calculations accurate?
Elementary Approach: Work through data organization as a whole class, making decisions together about how to handle problems.
Secondary Approach: Have students work in small groups to clean their data, then compare approaches and discuss which methods work best.
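Secondary classes that use Python could walk through the five Basic Organization Steps in code. The sketch below is one possible version, not a required method. The survey question, the `raw` responses, and the `groups` mapping are all invented for illustration.

```python
from collections import Counter

# Hypothetical raw responses to "How do you get to school?" (invented data).
raw = ["Bus", "bus ", "car", "", "Walk", "walking", "CAR", None, "bike"]

# 1. Check for completeness: set aside blank or missing entries.
def is_missing(entry):
    return entry is None or entry.strip() == ""

missing = [r for r in raw if is_missing(r)]
present = [r for r in raw if not is_missing(r)]

# 2. Standardize formats: trim stray spaces and make capitalization consistent.
standardized = [r.strip().lower() for r in present]

# 3. Group similar responses: record the class decision explicitly.
groups = {"walking": "walk"}  # assumed decision: "walking" counts as "walk"
grouped = [groups.get(r, r) for r in standardized]

# 4. Create a simple table with clear labels.
table = Counter(grouped)
print(f"{'Response':<10}{'Count':>5}")
for response, count in sorted(table.items()):
    print(f"{response:<10}{count:>5}")
print(f"{'(missing)':<10}{len(missing):>5}")

# 5. Double-check numbers: every raw entry should be counted exactly once.
total_counted = sum(table.values()) + len(missing)
print("Counts add up:", total_counted == len(raw))
```

Keeping the missing entries in their own row, rather than silently dropping them, lets students see that the table accounts for everyone who was surveyed.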

Checkpoint 36. Making Data Cleaning Decisions.

Data cleaning often requires making judgment calls about how to handle inconsistent responses.

(a)

Students are collecting data about pets and get responses like: “dog”, “puppy”, “golden retriever”, “cat”, “kitten”, “fish”. How should they group these responses?
Hint.
Consider what level of detail would be most useful for their investigation.
Answer.
Group by general animal type: “dog” (including puppy and golden retriever), “cat” (including kitten), “fish”.

(b)

What if their original question was specifically about dog breeds? How might they handle the data differently?
Answer.
They might keep “golden retriever” separate and ask follow-up questions to get breed information for “dog” and “puppy” responses.
Data cleaning decisions should always connect back to the original question you’re trying to answer.
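A short Python sketch can show how the same pet responses are cleaned differently depending on the question. The two questions in the comments and the names `to_animal_type` and `breed_unknown` are assumptions made for this illustration.

```python
from collections import Counter

# Responses copied from part (a).
responses = ["dog", "puppy", "golden retriever", "cat", "kitten", "fish"]

# Question A (assumed): "What kinds of pets do students have?"
# Group by general animal type.
to_animal_type = {"puppy": "dog", "golden retriever": "dog", "kitten": "cat"}
by_type = Counter(to_animal_type.get(r, r) for r in responses)
print("By animal type:", dict(by_type))

# Question B (assumed): "Which dog breeds do students have?"
# Keep breed names, and flag dog responses that do not name a breed.
dog_responses = [r for r in responses if to_animal_type.get(r, r) == "dog"]
breed_unknown = {"dog", "puppy"}  # these answers do not say which breed
breeds = Counter(r for r in dog_responses if r not in breed_unknown)
needs_follow_up = [r for r in dog_responses if r in breed_unknown]
print("Breeds recorded:", dict(breeds))
print("Ask a follow-up question for:", needs_follow_up)
```

Under Question A all three dog responses collapse into one category, while under Question B only one of them is usable without a follow-up.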

Checkpoint 37.

During a survey about homework time, some students leave the question blank. What are two reasonable ways to handle this missing data?
Hint.
Consider whether missing responses should be ignored, estimated, or treated as a separate category.
Solution.
The best approach depends on how much data is missing and why. If only a few students didn’t respond, two reasonable options are to exclude those responses or to follow up to get complete information. If many students skipped the question, you need to consider whether the question was unclear or too personal. Don’t guess or make up missing data, because that creates false information.
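As a minimal sketch of these options in Python, blank answers can be stored as `None` so they are never mistaken for zero minutes. The `minutes` list below is invented for illustration.

```python
from statistics import mean

# Hypothetical homework times in minutes; None marks a blank answer.
minutes = [30, None, 45, 60, None, 20, 40]

answered = [m for m in minutes if m is not None]
n_missing = len(minutes) - len(answered)

# How much is missing? A large share suggests revisiting the question itself.
share_missing = n_missing / len(minutes)
print(f"{n_missing} of {len(minutes)} responses are blank ({share_missing:.0%})")

# Option 1: exclude blanks, and say so when reporting the result.
average = mean(answered)
print(f"Average of {len(answered)} answered responses: {average} minutes")

# Option 2: follow up. List the survey rows (numbered from 1) to ask again.
follow_up_rows = [i for i, m in enumerate(minutes, start=1) if m is None]
print("Rows to follow up on:", follow_up_rows)
```

Neither option fills the blanks with an invented number, so the reported average describes only what students actually answered.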