📌 Overview
Racial identification often must be inferred from ecological data, a process that is vulnerable to bias and error. Bayesian Improved Surname Geocoding (BISG) greatly improves those inferences by combining surname and geographic demographic data, but the geographic unit used varies widely in practice and the trade-offs are not well quantified. This letter validates BISG on Georgia's voter file, compares geocoded and nongeocoded approaches, and introduces ZIP codes as an intermediate geography for BISG.
📊 What Was Compared
Comparison: Geocoded Versus Nongeocoded BISG on a State Voter File
- Data: Georgia voter file used as the validation dataset.
- Methods: BISG applied under multiple geography levels and procedures: surname-only estimation, county-level approximations, nongeocoded ZIP-code-based estimation, and geocoded census-block-level BISG.
- Aim: Quantify accuracy trade-offs across geography levels and assess missingness and bias implications of each approach.
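The methods above share one Bayesian update and differ only in which geography supplies the demographic term. A minimal Python sketch of that update, assuming the standard BISG form in which P(race | surname, geography) is proportional to P(race | surname) × P(geography | race). The function name `bisg_posterior` and all probabilities below are made up for illustration; they are not values from the Georgia voter file or from any census table.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    p_race_given_surname: dict mapping race -> P(race | surname)
    p_geo_given_race:     dict mapping race -> P(geography | race), i.e. the
                          share of each group's population living in the
                          chosen unit (block, ZIP code, or county)
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no group has positive probability for this record")
    # Normalize so the posterior sums to 1 across groups.
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs, for illustration only.
surname_prior = {"white": 0.60, "black": 0.30, "hispanic": 0.05, "asian": 0.05}
geo_likelihood = {"white": 0.001, "black": 0.004, "hispanic": 0.001, "asian": 0.001}

posterior = bisg_posterior(surname_prior, geo_likelihood)
```

With these invented numbers the surname alone favors White, but the geography term shifts the most probable group to Black. Swapping a block-level `geo_likelihood` for a ZIP-code or county one changes only the inputs, which is why the geography level can be compared directly.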
🔍 Key Findings
- Nongeocoded ZIP-code BISG is an acceptable alternative to census-block geocoding for estimating White and Black racial identification.
- Census-block geocoded BISG yields the most accurate imputations for Asian and Hispanic voters, outperforming ZIP-code and larger-area approaches for these groups.
- The choice of geography involves trade-offs between accuracy, data availability, and missingness; smaller geographies reduce bias for some groups but are more likely to be missing or unavailable.
- The results identify an order in which to apply BISG procedures that maximizes correct racial identification while minimizing data missingness and bias across groups.
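One way to picture a sequential procedure is as a fallback from finer to coarser geographies. The sketch below is an assumption for illustration, not the letter's exact sequence: it tries census block first, then ZIP code, then county, and falls back to surname-only estimation when no geography is usable. The names `GEOGRAPHY_ORDER` and `choose_geography` and the placeholder IDs are hypothetical.

```python
# Finest to coarsest; this ordering is an assumption for illustration.
GEOGRAPHY_ORDER = ["block", "zip", "county"]


def choose_geography(record, reference_ids):
    """Return the finest usable geography level for one voter record.

    record:        dict that may hold "block", "zip", and "county" IDs
    reference_ids: dict mapping level -> set of IDs that have demographic data
    """
    for level in GEOGRAPHY_ORDER:
        geo_id = record.get(level)
        if geo_id is not None and geo_id in reference_ids.get(level, set()):
            return level
    # No usable geography: fall back to the surname-only estimate.
    return "surname_only"


# Placeholder IDs, not real census or postal codes.
reference_ids = {"block": {"B1"}, "zip": {"Z1"}, "county": {"C1"}}

geocoded = {"block": "B1", "zip": "Z1", "county": "C1"}
not_geocoded = {"block": None, "zip": "Z1", "county": "C1"}
no_address = {}

levels = [
    choose_geography(geocoded, reference_ids),
    choose_geography(not_geocoded, reference_ids),
    choose_geography(no_address, reference_ids),
]
```

Here `levels` is `["block", "zip", "surname_only"]`: each record keeps the most precise estimate available to it, so no record is dropped for lack of a geocode.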
⚖️ Why It Matters
- The letter offers practical guidance for researchers and practitioners who must impute race from surnames and geography. When geocoding is unavailable, ZIP-code-level BISG can suffice for many analyses focused on White and Black populations; analyses centered on Asian or Hispanic populations should prioritize census-block geocoding where possible.
- The findings clarify the efficiency and limitations of common BISG implementations and offer a data-driven basis for selecting geography levels in race-imputation tasks.