Insights from the Field

Improve Proportion Estimates From Text With Matching And Continuous Features


nonparametric
content analysis
matching
text features
classification
Methodology
Pol. An.
2 archives
Dataverse
"An Improved Method of Automated Nonparametric Content Analysis for Social Science" was written by Connor Jerzak, Gary King, and Anton Strezhnev and published by Cambridge University Press in Political Analysis in 2023.

🔎 What problem this paper addresses

Many scholars classify documents into categories, but social scientists often need accurate estimates of the proportion of documents in each category. Parametric “classify-and-count” approaches can be highly model-dependent and may produce greater bias in proportion estimates even when the percent of documents correctly classified rises. Direct, nonparametric estimation of proportions avoids some of those model-dependence problems but can fail when language meaning shifts between training and test sets or when categories use very similar language.
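A small arithmetic sketch may help show why higher classification accuracy does not guarantee a better proportion estimate. The function name and the sensitivity and specificity values below are hypothetical, chosen only for illustration; they do not come from the paper.

```python
def classify_and_count(p_true, sensitivity, specificity):
    """Expected classify-and-count estimate and accuracy for a binary classifier.

    p_true      -- true share of documents in the positive category
    sensitivity -- P(labeled positive | truly positive)
    specificity -- P(labeled negative | truly negative)
    """
    # Share of documents the classifier labels positive: true positives plus false positives.
    estimate = p_true * sensitivity + (1 - p_true) * (1 - specificity)
    # Share of documents labeled correctly: true positives plus true negatives.
    accuracy = p_true * sensitivity + (1 - p_true) * specificity
    return estimate, accuracy


p = 0.5  # hypothetical true proportion in the test set

# Classifier A: errors are symmetric, so they cancel in the count.
est_a, acc_a = classify_and_count(p, sensitivity=0.8, specificity=0.8)

# Classifier B: more accurate overall, but all of its errors point one way.
est_b, acc_b = classify_and_count(p, sensitivity=0.7, specificity=1.0)

print(f"A: accuracy={acc_a:.2f}, estimated proportion={est_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, estimated proportion={est_b:.2f}")
```

Under these assumed numbers, classifier B labels 85% of documents correctly versus 80% for classifier A, yet B's estimated proportion is 0.35 against a true value of 0.50, while A's errors cancel and leave its estimate unbiased.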

🧪 How the improved method works

This paper develops an improved direct estimation approach that mitigates those weaknesses by:

  • optimizing continuous text features rather than relying only on discrete labels or simple counts;
  • incorporating a form of matching adapted from the causal inference literature to align training and test distributions; and
  • maintaining a nonparametric focus on estimating category proportions rather than classifying individual documents.
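
The ideas in the list above can be sketched on toy data. This is a minimal illustration under strong assumptions, not the authors' software or algorithm: the two-dimensional features are hand-picked rather than optimized, the matching step is plain nearest-neighbour pruning, there are only two categories, and every function name is hypothetical. It relies on the identity that the mean feature vector of the test set is a proportion-weighted mix of the category-specific mean feature vectors, so the proportion can be estimated without labeling any individual test document.

```python
from math import dist


def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def match_training(train, test_features, k=3):
    """Keep training documents that are among the k nearest neighbours of any test document."""
    keep = set()
    for x in test_features:
        ranked = sorted(range(len(train)), key=lambda i: dist(train[i][0], x))
        keep.update(ranked[:k])
    return [train[i] for i in sorted(keep)]


def estimate_proportion(train, test_features):
    """Estimate the share of category 'A' in the test set.

    Solves mean(test) ~= p * mean(A) + (1 - p) * mean(B) for p by least squares
    and clips the result to [0, 1]. Assumes both categories appear in `train`
    and that their mean feature vectors differ.
    """
    mean_a = mean_vector([x for x, y in train if y == "A"])
    mean_b = mean_vector([x for x, y in train if y == "B"])
    mean_t = mean_vector(test_features)
    diff = [a - b for a, b in zip(mean_a, mean_b)]
    num = sum((t - b) * d for t, b, d in zip(mean_t, mean_b, diff))
    den = sum(d * d for d in diff)
    return min(1.0, max(0.0, num / den))


# Labeled training documents as (continuous features, category).
# The fourth "A" document stands in for language unlike anything in the test set.
train = [
    ([0.9, 0.1], "A"), ([0.8, 0.2], "A"), ([1.0, 0.0], "A"), ([3.0, 3.0], "A"),
    ([0.1, 0.9], "B"), ([0.2, 0.8], "B"), ([0.0, 1.0], "B"),
]

# Unlabeled test documents; three of the four resemble category A (true share 0.75).
test_features = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0], [0.1, 0.9]]

p_raw = estimate_proportion(train, test_features)
matched = match_training(train, test_features, k=3)
p_matched = estimate_proportion(matched, test_features)

print(f"without matching: {p_raw:.2f}")
print(f"with matching:    {p_matched:.2f}")
```

In this toy setup the unmatched estimate is pulled away from the true share by the one training document that resembles nothing in the test set, and pruning the training set to documents near the test set removes that distortion. A real application would need more than two categories, a principled choice of features, and a check that every category survives the matching step.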

📊 What was tested and what happened

  • The approach was evaluated on a diverse collection of 73 datasets.
  • Results show the new method substantially improves performance compared to standard classify-and-count and prior direct-estimation approaches, especially when language meaning shifts or categories are textually similar.

💡 Why this matters

  • Reduces the model dependence and bias that can afflict parametric classify-and-count techniques.
  • Preserves the advantages of direct, nonparametric proportion estimation while making it robust to changes in language and subtle category distinctions.
  • Offers a practical solution for social scientists focused on population-level inferences from text data.

🧰 Tools available

Easy-to-use software implementing the method is available, enabling replication and application to other text analysis tasks.
