Causal Inference and Machine Learning: Part 1

Introduction to an analysis of US Birth vital statistics using data-adaptive models
Categories: Quarto, R, Machine Learning, Causal Inference
Author: Matthew Reyes
Published: September 8, 2023

Background

As part of my graduate studies at UC Berkeley, I took a great class on Causal Inference & Data-Adaptive Methods applied to problems in public health. This series builds on what I learned in that class, to help cement and extend my understanding. In the course, we defined a target causal parameter, identified it with a parameter of a statistical model, and then estimated it with several estimators. The estimators of the average treatment effect (ATE) included:

  • G-Computation (aka simple substitution)
  • Inverse Probability Weighting (IPW)
  • Targeted Maximum Likelihood Estimator (TMLE)
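To make the first two estimators concrete, here is a minimal sketch in Python, purely for illustration; it is not the code behind this analysis. It assumes a single discrete covariate W and uses saturated (stratum-mean) models for both the outcome regression and the propensity score. The toy counts and the helper names (`g_computation`, `ipw`, `make_rows`) are hypothetical. TMLE is left out because it needs an additional targeting step.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def g_computation(rows):
    """Simple substitution estimate of the ATE.

    rows: list of (w, a, y) tuples with binary a and y and discrete w.
    """
    # Estimate E[Y | A=a, W=w] with the observed mean in each (w, a) cell.
    cells = defaultdict(list)
    for w, a, y in rows:
        cells[(w, a)].append(y)
    q = {cell: mean(ys) for cell, ys in cells.items()}
    # Average the predicted difference over the empirical distribution of W.
    return mean(q[(w, 1)] - q[(w, 0)] for w, _, _ in rows)


def ipw(rows):
    """Inverse probability weighted estimate of the ATE."""
    # Estimate g(1 | w) = P(A=1 | W=w) with the observed proportion per stratum.
    # Assumes positivity: every stratum contains both exposed and unexposed rows.
    exposure_by_w = defaultdict(list)
    for w, a, _ in rows:
        exposure_by_w[w].append(a)
    g1 = {w: mean(a_values) for w, a_values in exposure_by_w.items()}
    return mean(
        (a / g1[w] - (1 - a) / (1 - g1[w])) * y
        for w, a, y in rows
    )


def make_rows(w, a, n, n_events):
    """Build n rows in stratum (w, a), n_events of which have y = 1."""
    return [(w, a, 1)] * n_events + [(w, a, 0)] * (n - n_events)


# Hypothetical toy data: exposure is more common, and risk is higher, when W = 1.
toy = (
    make_rows(0, 1, 2, 1) + make_rows(0, 0, 8, 2)
    + make_rows(1, 1, 6, 4) + make_rows(1, 0, 4, 2)
)

print("G-computation:", g_computation(toy))
print("IPW:", ipw(toy))
```

On this toy data the two estimates coincide because both nuisance models are saturated. With parametric or machine-learned models fit to real data they will generally differ.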

In this introductory post to the series, I’ll provide a bit of background on data-adaptive methods vs. traditional modelling, pose a research question, then identify a data set and clean and prepare it for my analysis. Much of this draws on the work of researchers at UC Berkeley’s School of Public Health, particularly Mark van der Laan (Laan and Rose 2011), Maya Petersen (M. L. Petersen and Laan 2014), and Laura Balzer (M. Petersen and Balzer 2023).

Laan, Mark J. van der, and Sherri Rose. 2011. Targeted Learning. Springer New York. https://doi.org/10.1007/978-1-4419-9782-1.
Petersen, Maya L., and Mark J. van der Laan. 2014. “Causal Models and Learning from Data.” Epidemiology 25 (3): 418–26. https://doi.org/10.1097/ede.0000000000000078.
Petersen, Maya, and Laura Balzer. 2023. “Introduction to Causal Inference.” UC Berkeley. 2023. https://www.ucbbiostat.com/about.

Research Question

Existing studies find that infants born at low birth weight (LBW) are at an increased risk of physical disabilities and impaired cognitive development. While genetic factors contribute to LBW, maternal smoking during pregnancy has been identified as the most significant modifiable risk factor. We seek to answer the following question: what is the effect of maternal smoking during pregnancy on the likelihood of having a LBW infant?

The target population for this study is live singleton first births in the US in 2015. We are limiting the population to singleton first births because multiples are associated with lower birth weight, and infants from subsequent pregnancies have been shown to have higher birth weights than those from first pregnancies.

Target Causal Parameter

We aim to estimate the causal risk difference: \(\Psi^*(P^*) = P^*(Y_1 = 1) - P^*(Y_0 = 1) = E^*(Y_1) - E^*(Y_0)\)

The target causal parameter is the difference in the counterfactual risk of LBW if all expectant mothers in the population smoked during pregnancy vs. if all expectant mothers in the population did not smoke during pregnancy.
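For reference, the counterfactual quantity above is usually linked to the observed data through a statistical estimand. The expression below is a sketch that holds only under identification assumptions not stated in this section: no unmeasured confounding given the measured covariates, and positivity. The symbols \(W\) (the measured covariates) and \(P_0\) (the observed-data distribution, with expectation \(E_0\)) are notation assumed here.

\[
\Psi(P_0) = E_0\big[\, E_0(Y \mid A = 1, W) - E_0(Y \mid A = 0, W) \,\big]
\]

This is the G-computation formula: the inner expectations are the conditional risks of LBW under each exposure level, and the outer expectation averages their difference over the covariate distribution.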

Data Exploration

First, we import the data set for 2022 and inspect it, including the variables available.

Variables

Variable Name | Type | Descriptive summary of measure
smoked | Exposure (A, binary) | The intervention or exposure of interest: whether the mother was considered a smoker (at least 1 cigarette/day) during any of the three trimesters.
lbw | Outcome (Y, binary) | The outcome, derived from the infant's weight at time of birth: coded as low birth weight (1) when the birth weight was below 2500 grams, and as 0 when it was 2500 grams or more.
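The two definitions above can be sketched as small coding functions. This is Python for illustration only. The input field names (`birth_weight_grams`, `cigs_by_trimester`) are hypothetical and do not refer to the actual column names in the source file, and treating exactly 2500 grams as not low birth weight is an assumption based on the "below 2500 grams" definition.

```python
LBW_THRESHOLD_GRAMS = 2500


def code_lbw(birth_weight_grams):
    """Return 1 if the birth weight is below 2500 g, else 0."""
    return 1 if birth_weight_grams < LBW_THRESHOLD_GRAMS else 0


def code_smoked(cigs_per_day_by_trimester):
    """Return 1 if the mother smoked at least 1 cigarette/day in any trimester."""
    return 1 if any(c >= 1 for c in cigs_per_day_by_trimester) else 0


# Hypothetical record: field names are illustrative only.
record = {"birth_weight_grams": 2410, "cigs_by_trimester": (0, 2, 0)}
coded = {
    "smoked": code_smoked(record["cigs_by_trimester"]),
    "lbw": code_lbw(record["birth_weight_grams"]),
}
print(coded)
```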

Data Cleaning

Then, we recode the variables of interest into the exposure and outcome variables A and Y. We also prepare the covariates and endogenous variables for analysis by recoding them into indicator (dummy) variables. Finally, we drop records with missing or unknown values, which is a very conservative approach. Future analyses may use imputation instead, but given the large number of records in this data set and the relatively small share with missing or unknown values, the more conservative approach is taken for the purpose of this assignment.
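The cleaning steps described above can be sketched as follows, again in Python for illustration only. The sentinel `UNKNOWN`, the `education` field and its levels, and the helper names are all hypothetical; the real file's coding of unknown values may differ.

```python
UNKNOWN = "U"  # hypothetical sentinel for a missing/unknown value


def complete_cases(records, fields):
    """Keep only records with a known value in every listed field."""
    return [
        r for r in records
        if all(r.get(f) not in (None, UNKNOWN) for f in fields)
    ]


def dummy_code(records, field, reference):
    """Add a 0/1 indicator for each level of `field` except `reference`."""
    levels = sorted({r[field] for r in records} - {reference})
    return [
        {**r, **{f"{field}_{lvl}": int(r[field] == lvl) for lvl in levels}}
        for r in records
    ]


# Hypothetical raw records; two of them contain unknown/missing values.
raw = [
    {"education": "hs", "smoked": 1, "lbw": 0},
    {"education": "college", "smoked": 0, "lbw": 0},
    {"education": UNKNOWN, "smoked": 0, "lbw": 1},
    {"education": "college", "smoked": None, "lbw": 0},
]

kept = complete_cases(raw, ["education", "smoked", "lbw"])
clean = dummy_code(kept, "education", reference="hs")
print(len(raw) - len(kept), "records dropped")
print(clean)
```

Reporting how many records are dropped, as the sketch does, is a cheap way to check that the share of missing or unknown values really is small.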

Descriptive Statistics

To better understand the data we’re working with and get a sense of the distributions across variables (W1, W2, W3, A, Y), see Table 1 below.
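For binary indicators, the summaries in such a table reduce to proportions. The sketch below (Python, illustration only) computes the marginal proportion of a variable and the unadjusted risk of the outcome within each exposure group. The toy records and function names are hypothetical and are not values from the birth data.

```python
def proportion(records, field):
    """Share of records with field == 1 (for a 0/1 coded field)."""
    return sum(r[field] for r in records) / len(records)


def risk_by_exposure(records, exposure="A", outcome="Y"):
    """Return {exposure level: P(outcome = 1 | exposure level)}.

    Assumes both exposure levels are present in `records`.
    """
    risks = {}
    for level in (0, 1):
        group = [r for r in records if r[exposure] == level]
        risks[level] = proportion(group, outcome)
    return risks


# Hypothetical toy records, not taken from the real data set.
toy_records = [
    {"W1": 1, "A": 1, "Y": 1},
    {"W1": 0, "A": 1, "Y": 0},
    {"W1": 1, "A": 0, "Y": 0},
    {"W1": 0, "A": 0, "Y": 0},
    {"W1": 0, "A": 0, "Y": 1},
]

for name in ("W1", "A", "Y"):
    print(name, proportion(toy_records, name))
print(risk_by_exposure(toy_records))
```

The difference between the two group risks is the unadjusted association, which is not the causal risk difference unless there is no confounding.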

Marginal Distributions of Exposure and Outcome

Expected Challenges

What’s Next?

Citation

BibTeX citation:
@online{reyes2023,
  author = {Reyes, Matthew},
  title = {Causal {Inference} and {Machine} {Learning:} {Part} 1},
  date = {2023-09-08},
  url = {https://blog.mreyes.info/posts/Big Data/causal1.html},
  langid = {en}
}
For attribution, please cite this work as:
Reyes, Matthew. 2023. “Causal Inference and Machine Learning: Part 1.” September 8, 2023. https://blog.mreyes.info/posts/Big Data/causal1.html.