Background
As part of my graduate studies at UC Berkeley, I took a great class on Causal Inference & Data-Adaptive Methods applied to problems in public health. This series builds on what I learned in that class to help cement and develop my understanding. In the course, we defined a target causal parameter, tried to identify it as a parameter of the observed data under a statistical model, and then estimated it with several estimators. The estimators we used for the average treatment effect (ATE) included:
- G-Computation (aka simple substitution)
- Inverse Probability Weighting (IPW)
- Targeted Maximum Likelihood Estimator (TMLE)
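To make the first two estimators in the list concrete, here is a minimal sketch on simulated data. It is not the course's or this series' implementation: the data-generating probabilities, the single binary confounder `W`, and all function names are assumptions made up for illustration, and TMLE is left out because it needs an extra targeting step.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate(n):
    """Toy data: binary confounder W, exposure A, outcome Y (invented probabilities)."""
    rows = []
    for _ in range(n):
        w = 1 if random.random() < 0.4 else 0
        a = 1 if random.random() < (0.6 if w else 0.2) else 0
        # By construction the true risk difference for A is 0.10 in every stratum of W.
        y = 1 if random.random() < 0.05 + 0.10 * a + 0.05 * w else 0
        rows.append((w, a, y))
    return rows


def outcome_means(rows):
    """Empirical E[Y | A=a, W=w] for each (a, w) stratum."""
    cells = {}
    for w, a, y in rows:
        cells.setdefault((a, w), []).append(y)
    return {key: sum(ys) / len(ys) for key, ys in cells.items()}


def propensity(rows):
    """Empirical P(A=1 | W=w) for each stratum of W."""
    by_w = {}
    for w, a, _ in rows:
        by_w.setdefault(w, []).append(a)
    return {w: sum(a_vals) / len(a_vals) for w, a_vals in by_w.items()}


def g_computation(rows):
    """Simple substitution: average the predicted risk difference over observed W."""
    q = outcome_means(rows)
    return sum(q[(1, w)] - q[(0, w)] for w, _, _ in rows) / len(rows)


def ipw(rows):
    """Inverse probability weighting: weight each outcome by 1 / P(A=a | W)."""
    g = propensity(rows)
    total = 0.0
    for w, a, y in rows:
        if a == 1:
            total += y / g[w]
        else:
            total -= y / (1 - g[w])
    return total / len(rows)


rows = simulate(20_000)
print("G-computation:", round(g_computation(rows), 4))
print("IPW:          ", round(ipw(rows), 4))
```

Because both the outcome regression and the propensity score are estimated here with fully stratified (saturated) means, the two estimators coincide in this toy setting. They generally differ once parametric or machine-learning models are used for either piece.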
In this introductory post to the series, I’ll provide a bit of information on data-adaptive methods vs. traditional modelling, pose a research question, then identify a data set to clean and prepare for my analysis. Much of this work will be based on the work of researchers at UC Berkeley’s School of Public Health, particularly Mark van der Laan (Laan and Rose 2011), Maya Petersen (M. L. Petersen and Laan 2014), and Laura Balzer (M. Petersen and Balzer 2023).
Research Question
Existing studies find that infants born at low birth weight (LBW) are at an increased risk of physical disabilities and impaired cognitive development. While genetic factors contribute to LBW, maternal smoking during pregnancy has been identified as the most significant modifiable risk factor. We seek to answer the following question: what is the effect of maternal smoking during pregnancy on the likelihood of having a LBW infant?
The target population for this study is live singleton first births in the US in 2015. We are limiting the population to singleton first births because multiples are associated with lower birth weight, and infants from subsequent pregnancies have been shown to have higher birth weights than those from first pregnancies.
Target Causal Parameter
We aim to estimate the causal risk difference: \(\Psi^*(P^*) = P^*(Y_1 = 1) - P^*(Y_0 = 1) = E^*(Y_1) - E^*(Y_0)\)
The target causal parameter is the difference in the counterfactual risk of LBW if all expectant mothers in the population smoked during pregnancy vs. if all expectant mothers in the population did not smoke during pregnancy.
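As a small illustration of this definition, the sketch below computes the causal risk difference for a made-up population in which both counterfactual outcomes are known for every unit. That is never the case with real data, which is why identification and estimation are needed. The function name and the four example units are hypothetical.

```python
def causal_risk_difference(units):
    """Return E[Y_1] - E[Y_0] given (y1, y0) counterfactual pairs for every unit."""
    n = len(units)
    risk_if_all_exposed = sum(y1 for y1, _ in units) / n
    risk_if_none_exposed = sum(y0 for _, y0 in units) / n
    return risk_if_all_exposed - risk_if_none_exposed


# Hypothetical population of four births: (LBW if mother smoked, LBW if she did not).
units = [(1, 0), (1, 1), (0, 0), (0, 0)]
print(causal_risk_difference(units))
```

In this toy population, two of the four infants would be LBW under smoking and one of four without it, so the function returns the difference between those two counterfactual risks.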
Data Exploration
First, we import the data set for 2022 and inspect it, including the variables it makes available.
Variables
| Variable Name | Type | Descriptive summary of measure |
|---|---|---|
| smoked | Exposure (A, binary) | The intervention or exposure of interest: whether the mother was considered a smoker (at least 1 cigarette/day) during any of the three trimesters. |
| lbw | Outcome (Y, binary) | The outcome, derived from the infant's weight at the time of birth: coded 1 (low birth weight) when the birth weight was below 2500 grams, and 0 when it was 2500 grams or more. |
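The two definitions above can be sketched as small recoding functions. This is only an illustration: the inputs (a list of cigarettes per day for each trimester and a birth weight in grams), the use of `None` for unknown values, and the rule for partially unknown smoking data are assumptions, not the data set's actual field names or coding scheme.

```python
LBW_THRESHOLD_G = 2500  # births below this weight (in grams) count as low birth weight


def recode_smoked(cigs_per_trimester):
    """A = 1 if at least 1 cigarette/day in any trimester, 0 if none, None if unknown."""
    if any(c is not None and c >= 1 for c in cigs_per_trimester):
        return 1  # smoking reported in at least one trimester
    if any(c is None for c in cigs_per_trimester):
        return None  # no smoking reported, but at least one trimester is unknown
    return 0


def recode_lbw(birth_weight_g):
    """Y = 1 if birth weight is below 2500 g, 0 otherwise, None if unknown."""
    if birth_weight_g is None:
        return None
    return 1 if birth_weight_g < LBW_THRESHOLD_G else 0


print(recode_smoked([0, 2, 0]), recode_lbw(2400))
print(recode_smoked([0, 0, 0]), recode_lbw(3100))
```

Treating a record as unknown whenever no smoking is reported but a trimester is missing is one possible choice. It keeps the exposure definition ("any of the three trimesters") from silently counting incomplete records as non-smokers.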
Data Cleaning
Next, we recode the variables of interest into the exposure and outcome variables A and Y. We also prepare the covariates and other endogenous variables for analysis by recoding them into indicator (dummy) variables. Finally, we drop records with missing or unknown values, which is a very conservative approach. Future analyses may use imputation, but given the large number of records in this data set and the relatively small share of missing or unknown values, the more conservative approach is taken for this assignment.
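The cleaning steps described here (dropping incomplete records, then dummy-coding categorical covariates) might look roughly like the sketch below. The `education` covariate, its levels, and the helper names are hypothetical stand-ins, since the actual covariates are not listed in this section.

```python
def complete_cases(records, fields):
    """Keep only records where every listed field is present and not None."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


def dummy_code(records, field, reference):
    """Replace a categorical field with 0/1 indicators, omitting the reference level."""
    levels = sorted({r[field] for r in records} - {reference})
    coded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != field}
        for level in levels:
            row[f"{field}_{level}"] = 1 if r[field] == level else 0
        coded.append(row)
    return coded


# Hypothetical records; None marks a missing or unknown value.
records = [
    {"A": 1, "Y": 0, "education": "hs"},
    {"A": 0, "Y": None, "education": "college"},
    {"A": 0, "Y": 1, "education": None},
    {"A": 0, "Y": 0, "education": "college"},
]
kept = complete_cases(records, ["A", "Y", "education"])
coded = dummy_code(kept, "education", reference="hs")
print(len(records), "records before,", len(kept), "after dropping incomplete ones")
print(coded)
```

Omitting one reference level per covariate avoids perfectly collinear indicators if the dummies are later used in a regression.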
Descriptive Statistics
To better understand the data we’re working with and get a sense of the distributions of the variables (W1, W2, W3, A, Y), we can look at the summary presented in Table 1 below.
Marginal Distributions of Exposure and Outcome
Expected Challenges
What’s Next?
Citation
@online{reyes2023,
author = {Reyes, Matthew},
title = {Causal {Inference} and {Machine} {Learning:} {Part} 1},
date = {2023-09-08},
url = {https://blog.mreyes.info/posts/Big Data/causal1.html},
langid = {en}
}