Filtering and Subsetting

Section Filtering and Subsetting

Subsection The Purpose of Filtering

Rarely do we analyze an entire dataset at once. More often, we focus on specific subsets of the data that are relevant to particular questions. Filtering and subsetting allow us to:

Focus on specific groups or conditions of interest
Compare different subsets to identify patterns and differences
Remove irrelevant data that might obscure important relationships
Create more manageable subsets for specialized analyses
Test relationships under different conditions

Effective filtering requires clear criteria and an understanding of how the filtering might affect your analysis.

Example 62. Filtering in Community Health Analysis.

In our Community Health dataset, we might apply these filters:

Focus only on neighborhoods with complete data across all health metrics
Compare high-income versus low-income neighborhoods (using median household income)
Examine only neighborhoods with poor air quality to understand health patterns in the most affected areas
Create separate analyses for different regions of the city (north, south, east, west)
Filter out neighborhoods undergoing major redevelopment that might skew environmental measurements
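
The first two filters in this list can be sketched outside CODAP as well. The following is a minimal Python illustration, assuming a handful of made-up neighborhood records with hypothetical field names (`income`, `asthma_rate`, `air_quality`); none of the values come from the actual Community Health dataset.

```python
from statistics import median

# Hypothetical records for illustration only; None marks a missing measurement.
neighborhoods = [
    {"name": "A", "income": 42000, "asthma_rate": 9.1, "air_quality": 61},
    {"name": "B", "income": 78000, "asthma_rate": 5.4, "air_quality": None},
    {"name": "C", "income": 55000, "asthma_rate": 7.2, "air_quality": 48},
    {"name": "D", "income": 91000, "asthma_rate": 4.8, "air_quality": 35},
    {"name": "E", "income": 36000, "asthma_rate": None, "air_quality": 70},
]

# Filter 1: keep only neighborhoods with no missing values.
complete = [n for n in neighborhoods if all(v is not None for v in n.values())]

# Filter 2: split the complete cases at the median household income.
cutoff = median(n["income"] for n in complete)
high_income = [n for n in complete if n["income"] > cutoff]
low_income = [n for n in complete if n["income"] <= cutoff]

print("complete:", [n["name"] for n in complete])
print("median income:", cutoff)
print("high:", [n["name"] for n in high_income])
print("low:", [n["name"] for n in low_income])
```

Note that the order of the filters matters here: the median is computed after incomplete cases are dropped, so the cutoff may differ from the median of the full dataset.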

Checkpoint 63. Purposes of Filtering.

Which of the following is NOT a valid reason to filter or subset data?

To compare outcomes between different demographic groups
This is a valid reason for filtering—creating subsets based on demographic variables can reveal important differences between groups.
To focus analysis on the most recent time period in a longitudinal dataset
This is a valid reason for filtering—focusing on the most recent data can provide insights into current conditions.
To remove data points that contradict your hypothesis
Correct! This is NOT a valid reason for filtering. Removing data merely because it contradicts your hypothesis introduces bias and violates principles of scientific integrity.
To create a more manageable dataset for complex computational methods
This is a valid reason for filtering—some analyses may require smaller datasets due to computational constraints, as long as the subsetting is done in a principled way.

Subsection Filtering Methods in CODAP

CODAP provides several ways to filter data:

Selection from Visualizations: Clicking on points in a graph or cells in a table selects those cases. You can then hide unselected cases or create a new collection with only selected cases.
Filter Using Formulas: Create a filter using a formula like income > 50000 to show only cases meeting that condition.
Creating Subsets: Create a new dataset containing only filtered data, preserving the original dataset.
Hierarchical Organization: Organize data into hierarchical collections, allowing analysis at different levels (e.g., cities → neighborhoods → households).

When filtering data, it’s important to:

Document your filtering criteria clearly
Consider how the filter might affect the representativeness of your data
Be aware of how sample size reduction might impact statistical analyses
Check whether your filtered data still addresses your research questions
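
One way to follow the first three habits is to record the criterion and the sample sizes every time a filter is applied. The sketch below is a hypothetical Python helper (`apply_filter` is not a CODAP feature, and the income values are invented) that returns the filtered cases together with a small audit record and leaves the original list untouched.

```python
def apply_filter(cases, description, predicate):
    """Return the cases that satisfy predicate, plus a short audit record."""
    kept = [c for c in cases if predicate(c)]
    record = {
        "criterion": description,   # human-readable statement of the filter
        "n_before": len(cases),     # sample size before filtering
        "n_after": len(kept),       # sample size after filtering
        "fraction_kept": len(kept) / len(cases) if cases else 0.0,
    }
    return kept, record


# Invented example values, for illustration only.
cases = [{"income": v} for v in (32000, 48000, 51000, 67000, 120000)]

subset, audit = apply_filter(
    cases, "income > 50000", lambda c: c["income"] > 50000
)
print(audit)
```

A sharp drop in `fraction_kept` is a prompt to ask whether the remaining cases are still representative and whether they can still answer the research question.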

Activity 18. Filtering Data in CODAP.

In this activity, you’ll practice filtering data in CODAP using different methods.

(a)

Open your dataset in CODAP and create a scatter plot using two numerical variables.

(b)

Use selection to highlight a cluster of points, then create a new collection containing only these selected cases.

(c)

Create a filter using the formula editor to show only cases meeting specific criteria (e.g., values above a threshold or matching a category).

(d)

Compare summary statistics of your original dataset and filtered subset. How do measures like mean, median, and standard deviation change when you apply your filter?
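
The comparison in step (d) can also be checked by hand or in code. As a rough sketch, assuming a small invented list of values and an arbitrary threshold of 20, Python's standard `statistics` module gives the same three measures:

```python
from statistics import mean, median, stdev


def summarize(values):
    """Summary statistics for a list of numbers (needs at least two values)."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),  # sample standard deviation
    }


# Invented values, for illustration only.
values = [12, 15, 18, 22, 25, 31, 40, 58]
filtered = [v for v in values if v > 20]  # arbitrary example threshold

original_summary = summarize(values)
filtered_summary = summarize(filtered)

print("original:", original_summary)
print("filtered:", filtered_summary)
```

Because this filter removes only low values, the mean and median of the subset are higher than those of the original data, and the sample size is smaller. A filter on a different variable could shift the statistics in less predictable ways.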

Checkpoint 64. Translating Filter Statements.

For each of the following filtering objectives, select the CODAP filter formula that would accomplish it.

Question 1: Show only neighborhoods with both above-average income and above-average green space.

Select the correct CODAP filter formula:

a. Income > AverageIncome OR GreenSpace > AverageGreenSpace
b. Income > AverageIncome AND GreenSpace > AverageGreenSpace
c. Income = AverageIncome AND GreenSpace = AverageGreenSpace
d. Income > AverageIncome

Question 2: Show neighborhoods that are either in the North region or have low pollution levels.

Select the correct CODAP filter formula:

a. Region = 'North' AND PollutionLevel = 'Low'
b. NOT(Region = 'North') OR NOT(PollutionLevel = 'Low')
c. Region = 'North' OR PollutionLevel = 'Low'
d. Region != 'North' AND PollutionLevel != 'Low'

Hint.

Remember the difference between AND and OR operators:

AND requires both conditions to be true
OR requires at least one condition to be true

Also pay attention to comparison operators (=, >, !=) and make sure they match what the question is asking for.

Answer 1.

\(\text{b}\)

Answer 2.

\(\text{c}\)

Solution.

Question 1: The correct formula is Income > AverageIncome AND GreenSpace > AverageGreenSpace

This formula uses AND to require that both conditions (above-average income and above-average green space) must be true for a neighborhood to be included in the filter.

Question 2: The correct formula is Region = 'North' OR PollutionLevel = 'Low'

This formula uses OR to include neighborhoods that meet either condition: being in the North region OR having low pollution levels.
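
The difference between the two operators is easy to see on a small table. The following Python sketch uses four invented rows and hypothetical field names (`region`, `pollution`); it mirrors the logic of the Question 2 options rather than CODAP's own formula syntax.

```python
# Invented rows, for illustration only.
rows = [
    {"name": "P", "region": "North", "pollution": "High"},
    {"name": "Q", "region": "South", "pollution": "Low"},
    {"name": "R", "region": "North", "pollution": "Low"},
    {"name": "S", "region": "East", "pollution": "High"},
]


def is_north(row):
    return row["region"] == "North"


def is_low(row):
    return row["pollution"] == "Low"


# AND keeps a row only when both conditions hold (option a).
both = [r["name"] for r in rows if is_north(r) and is_low(r)]

# OR keeps a row when at least one condition holds (option c).
either = [r["name"] for r in rows if is_north(r) or is_low(r)]

# Option d negates both conditions, which selects exactly the rows OR leaves out.
neither = [r["name"] for r in rows if not is_north(r) and not is_low(r)]

print("AND:", both)
print("OR:", either)
print("neither:", neither)
```

Only row R satisfies the AND filter, while P, Q, and R all satisfy the OR filter. Row S, which meets neither condition, is the only row selected by option d.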

Subsection Creating Meaningful Subsets

Beyond simple filtering, creating meaningful subsets often involves more complex criteria and a deeper understanding of your data. Effective subsetting strategies include:

Comparative Subsets: Create groups for comparison based on key variables (e.g., high vs. low exposure groups, different demographic categories).
Threshold-Based Subsets: Define groups based on meaningful thresholds like regulatory standards or clinical definitions.
Time-Based Subsets: Create groups based on time periods to study changes or compare before/after scenarios.
Cluster-Based Subsets: Use patterns in the data itself to identify natural groupings.
Random Sampling: Create representative subsets through random sampling, particularly for very large datasets.
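
Two of these strategies, threshold-based subsets and random sampling, can be illustrated in a few lines. The sketch below is written in Python with invented readings; the limit of 12.0 is a placeholder, not a real regulatory standard, and the seed is fixed only so that the sample can be reproduced.

```python
import random


def exposure_group(value, limit):
    """Label a measurement relative to a threshold."""
    return "above limit" if value > limit else "at or below limit"


# Invented readings and a placeholder limit, for illustration only.
readings = [8.0, 11.5, 12.0, 15.2, 9.7, 20.1]
LIMIT = 12.0

# Threshold-based subset: every reading gets a group label.
labels = [exposure_group(r, LIMIT) for r in readings]

# Random sampling: draw 3 readings without replacement, using a seeded generator.
rng = random.Random(42)
sample = rng.sample(readings, k=3)

print(list(zip(readings, labels)))
print("sample:", sample)
```

Whether a value exactly at the threshold counts as above or below it is a decision worth documenting, since it changes group membership for boundary cases such as the reading of 12.0 here.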

Example 65. Meaningful Subsets in Community Health.

For our Community Health dataset, meaningful subsets might include:

Income quintiles: Dividing neighborhoods into five equal-sized income groups (bottom 20%, 20-40%, etc.) to examine how health patterns vary across the socioeconomic spectrum
Environmental risk categories: Grouping neighborhoods as "high-risk" or "low-risk" based on combined environmental factors
Geographic regions: Creating subsets based on meaningful geographic divisions like urban core, inner suburbs, outer suburbs
Health outcome groups: Identifying neighborhoods with multiple poor health outcomes versus those with generally good outcomes
Green space access: Comparing neighborhoods with high, medium, and low access to green spaces
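
As an illustration of the first item, the sketch below assigns each neighborhood to an income quintile by rank. It is a minimal Python example with ten invented income values; `assign_quintiles` is a hypothetical helper, and rank-based assignment is only one of several reasonable ways to define quintiles.

```python
def assign_quintiles(values):
    """Return a quintile number (1 = lowest fifth, 5 = highest fifth) per value.

    Values are ranked from smallest to largest; ties keep their original order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    quintile = [0] * len(values)
    for rank, i in enumerate(order):
        quintile[i] = rank * 5 // len(values) + 1
    return quintile


# Invented median household incomes, for illustration only.
incomes = [31000, 87000, 45000, 52000, 120000,
           38000, 64000, 71000, 58000, 96000]

groups = assign_quintiles(incomes)
for income, group in zip(incomes, groups):
    print(income, "-> quintile", group)
```

With ten neighborhoods, each quintile contains exactly two of them. When the number of cases is not a multiple of five, the groups will differ slightly in size.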

Activity 19. Creating Meaningful Subsets.

In this activity, you’ll create and analyze meaningful subsets of your data.

(a)

Identify a key numerical variable in your dataset. Create a new categorical attribute that divides this variable into meaningful groups (e.g., low/medium/high, or quantiles).

(b)

Create visualizations comparing these groups across other variables. Look for patterns or differences between the groups.

(c)

Devise at least two different ways to create subsets from your data that might reveal interesting patterns. Implement these in CODAP and explore the results.

(d)

Write a brief summary of what you learned by examining different subsets of your data. Were there patterns that only became apparent when looking at specific subsets?

Checkpoint 66. Subset Creation Approaches.

A researcher is studying the relationship between exercise habits and health outcomes. Which subsetting approach would be MOST useful for comparing the effects of different exercise levels?

Random sampling to create smaller, more manageable datasets
While random sampling can be useful for creating manageable datasets from large ones, it doesn’t specifically help with comparing different exercise levels.
Time-based subsetting to examine seasonal variations in exercise
While seasonal variations might be interesting, this approach doesn’t directly address comparing different exercise levels and their health effects.
Threshold-based groups dividing participants into categories based on weekly exercise minutes (e.g., sedentary, moderately active, highly active)
Correct! Creating categorical groups based on meaningful exercise thresholds provides a clear way to compare health outcomes across different exercise levels.
Geographic subsetting to compare exercise habits across different regions
While regional comparisons might reveal interesting patterns in exercise habits, this approach focuses on geographic differences rather than directly comparing exercise levels and their health effects.
