
🔍 The Challenge — Missing Covariates Are Everywhere
Datasets built from text, images, merged surveys, and voter files often lack key covariates because those features are latent (for example, sentiment in text) or simply not collected (for example, race in voter files).
⚠️ A Common Shortcut — And Why It Fails
A widespread approach is to hand-label the true covariate for a subset of observations, train a machine learning model to predict that covariate for the remainder, and then plug those predictions into regressions. Doing so without accounting for prediction error leads to biased, inconsistent, and overconfident inference.
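To make the failure concrete, the following is a minimal simulation sketch of the naive plug-in step, not the procedure proposed in the paper. Every name in it (`simulate_plug_in_bias`, `error_rate`, and so on) is hypothetical, and it assumes a binary covariate whose predictions are wrong at random with a fixed probability. Real prediction errors may also be correlated with the outcome, which this sketch does not model.

```python
import numpy as np


def ols_slope(x, y):
    """Return the OLS slope from regressing y on an intercept and x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def simulate_plug_in_bias(n=20000, beta=2.0, error_rate=0.2, seed=0):
    """Compare the slope using the true covariate with the slope using
    noisy predictions plugged in as if they were the truth.

    Assumption (illustrative only): predictions flip the true binary
    covariate independently with probability `error_rate`.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.binomial(1, 0.5, size=n).astype(float)
    y = 1.0 + beta * x_true + rng.normal(0.0, 1.0, size=n)

    # Stand-in for a machine learning classifier with random errors.
    flip = rng.random(n) < error_rate
    x_pred = np.where(flip, 1.0 - x_true, x_true)

    oracle = ols_slope(x_true, y)  # regression on the true covariate
    naive = ols_slope(x_pred, y)   # regression on the predictions
    return oracle, naive


oracle, naive = simulate_plug_in_bias()
print(f"slope with true covariate:      {oracle:.3f}")
print(f"slope with predicted covariate: {naive:.3f}")
```

Under these assumptions (balanced classes, symmetric 20% error), the plug-in slope converges to beta × (1 − 2 × 0.2) = 1.2 rather than the true value of 2, and collecting more data does not close the gap. That is the sense in which the naive estimator is inconsistent.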
🛠️ A Practical Fix That Restores Validity
This work characterizes how severe the problems from prediction error can be and describes a procedure that avoids these inconsistencies under comparatively general assumptions. Key features:
🔬 How the Method Is Evaluated
Performance is demonstrated through:
💾 Tools for Applied Researchers
Software implementing the proposed approach is provided to facilitate adoption.
📈 Why It Matters
When machine learning is used to impute missing covariates across text, images, merged surveys, or voter files, naive plug-in of predictions into regressions can produce misleading results. The proposed procedure enables valid effect estimation and accurate uncertainty reporting in such settings.

Machine Learning Predictions as Regression Covariates was authored by Christian Fong and Matthew Tyler. It was published by Cambridge University Press in Political Analysis in 2021.
