Section Planning an Investigation

Subsection Elements of an Investigation Plan

A well-structured data investigation plan helps guide your analysis and ensures you consider all important aspects before diving into the data. Key elements of an investigation plan include:
Research Questions: The statistical questions you aim to answer, clearly stated and refined.
Data Requirements: The specific variables, measurements, and data sources needed to answer your questions.
Data Assessment: Evaluation of the available data’s quality, completeness, and appropriateness for your questions.
Analysis Approach: The methods and techniques you plan to use for data cleaning, exploration, visualization, and statistical analysis.
Potential Challenges: Anticipated difficulties and limitations, along with strategies to address them.
Expected Outcomes: What you hope to learn and how you might apply the findings.

Checkpoint 40. Organizing an Investigation Plan.

Arrange the following steps in a logical order for planning a data investigation.
Hint.
Think about what information you need to know before you can determine your analysis methods.
A thoughtful investigation plan serves as a roadmap for your analysis, though you should be prepared to adapt it as you learn more about the data and discover unexpected patterns or challenges.

Subsection Sample Investigation Plan

Let’s look at a sample investigation plan for our Community Health and Environment project:

Example 41. Community Health Investigation Plan.

Research Questions:
  1. What is the relationship between air quality index (AQI) and asthma rates across neighborhoods?
  2. How does access to green space correlate with obesity rates, controlling for neighborhood income?
  3. Are there clusters of neighborhoods with similar environmental health profiles, and how do these clusters relate to demographic factors?
Data Requirements:
  • Neighborhood-level data on air quality measurements (annual average AQI)
  • Asthma prevalence rates by neighborhood
  • Green space access metrics (percentage of area, proximity to parks)
  • Obesity rates by neighborhood
  • Median household income by neighborhood
  • Additional environmental health indicators (water quality, proximity to pollution sources)
  • Demographic information (age distribution, racial/ethnic composition)
Data Assessment:
  • Our dataset includes most required variables but lacks detailed green space metrics
  • Some neighborhoods have missing data for certain health indicators
  • Air quality measurements were taken at different times of year across neighborhoods
  • Need to verify that health data is age-adjusted for fair comparison
Analysis Approach:
  1. Data cleaning: Handle missing values, check for outliers, standardize variables
  2. Exploratory analysis: Create scatter plots, histograms, and maps to visualize distributions and relationships
  3. Calculate correlation coefficients between environmental factors and health outcomes
  4. Perform regression analysis to examine relationships while controlling for income
  5. Use cluster analysis to identify neighborhoods with similar environmental health profiles
Potential Challenges:
  • Missing data might bias results if not handled appropriately
  • Correlation doesn’t imply causation; many confounding variables might exist
  • Neighborhood boundaries might not align perfectly with environmental exposure patterns
  • Limited sample size (number of neighborhoods) might affect statistical power
Expected Outcomes:
  • Identification of environmental factors most strongly associated with health outcomes
  • Understanding of how socioeconomic factors interact with environmental exposures
  • Visualization of environmental health disparities across the city
  • Insights that could inform targeted public health interventions
This plan provides a comprehensive framework for the investigation, identifying key questions, necessary data, analytical approaches, and potential limitations. It serves as a guide but remains flexible enough to adapt as the investigation proceeds.
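
Steps 3 and 4 of the sample plan's analysis approach can be sketched in a few lines of Python. The sketch below uses only the standard library and computes a Pearson correlation, then a partial correlation that controls for income. Partial correlation is used here as a simple stand-in for the regression step; it is not necessarily the method the project itself would use. The neighborhood values are made up purely for illustration, and the variable names (`green_space`, `obesity`, `income`) are assumptions rather than columns from the project dataset.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values for six neighborhoods (illustration only, not real data).
green_space = [5.0, 12.0, 8.0, 20.0, 15.0, 3.0]   # percent of area
obesity = [31.0, 27.0, 30.0, 22.0, 26.0, 33.0]    # percent of adults
income = [38.0, 55.0, 42.0, 80.0, 60.0, 35.0]     # median income, $1000s

raw = pearson(green_space, obesity)
adjusted = partial_corr(green_space, obesity, income)
print(f"raw correlation: {raw:.2f}")
print(f"controlling for income: {adjusted:.2f}")
```

Comparing the raw and adjusted values shows how much of an apparent relationship might be explained by income alone. Neither number establishes causation.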

Activity 12. Developing Your Investigation Plan.

In this activity, you’ll develop an investigation plan for your own dataset.
(a)
Using the refined statistical questions from the previous activity, outline a complete investigation plan following the structure of the sample plan.
(b)
Assess your dataset critically. Does it contain all the variables you need? Are there quality issues you need to address?
(c)
Specify at least three analysis techniques you plan to use and why they’re appropriate for your research questions.
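
For part (b), a quick programmatic check can support the critical assessment. The following is a minimal sketch, assuming your data is loaded as a list of dictionaries with `None` marking missing values. The function name `assess`, the records, and the variable names are all hypothetical and chosen only for illustration.

```python
def assess(records, required):
    """Report which required variables are absent and how much data is missing."""
    # Collect every variable name that appears in at least one record.
    present = set()
    for row in records:
        present.update(row.keys())
    absent = [name for name in required if name not in present]

    # For variables that do exist, compute the share of records with no value.
    missing_share = {}
    for name in required:
        if name in absent:
            continue
        n_missing = sum(1 for row in records if row.get(name) is None)
        missing_share[name] = n_missing / len(records)
    return absent, missing_share

# Hypothetical records (illustration only, not real data).
records = [
    {"neighborhood": "A", "aqi": 42.0, "asthma_rate": 8.1},
    {"neighborhood": "B", "aqi": None, "asthma_rate": 9.4},
    {"neighborhood": "C", "aqi": 55.0, "asthma_rate": None},
    {"neighborhood": "D", "aqi": 61.0, "asthma_rate": 11.2},
]
required = ["aqi", "asthma_rate", "green_space_pct"]

absent, missing_share = assess(records, required)
print("absent variables:", absent)
print("share missing:", missing_share)
```

A check like this only flags absent variables and missing values. Judging whether the measurements are appropriate for your questions (for example, whether rates are age-adjusted) still requires reading the data documentation.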

Checkpoint 42. Identifying Data Requirements.

    To investigate the question "How does public transportation usage vary with income level across neighborhoods?", which of the following variables would be LEAST essential?
  • Average number of public transportation trips per resident by neighborhood
    Feedback: This is a key measure of public transportation usage, which is central to the research question.
  • Median household income by neighborhood
    Feedback: Income level is explicitly mentioned in the research question, making this variable essential.
  • Average home value by neighborhood
    Feedback: Correct! While home value might correlate with income, it’s not directly mentioned in the research question and is less essential than variables that directly measure transportation usage and income.
  • Availability of public transportation (e.g., bus stops per square mile) by neighborhood
    Feedback: This is important to consider because availability can influence usage, making it a potential confounding variable that should be accounted for.