
🔎 What problem this paper addresses
Many scholars classify documents into categories, but social scientists often need accurate estimates of the proportion of documents in each category. Parametric “classify-and-count” approaches can be highly model-dependent and may produce greater bias in proportion estimates even when the percent of documents correctly classified rises. Direct, nonparametric estimation of proportions avoids some of those model-dependence problems but can fail when language meaning shifts between training and test sets or when categories use very similar language.
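The contrast between the two families of estimators can be made concrete with a toy calculation. The sketch below is not the paper's improved method and not its software. It only illustrates the general ideas named above, under loudly stated assumptions: all numbers are invented, the function names `classify_and_count` and `direct_estimate` are hypothetical, and the direct estimator is a plain least-squares solve of the accounting identity "unlabeled feature means = per-category feature means × category proportions", followed by a crude clip-and-renormalize step.

```python
import numpy as np

def classify_and_count(true_props, confusion):
    """Expected category shares reported by classify-and-count.

    confusion[i, j] = P(predicted category i | true category j),
    so each column sums to 1 (illustrative assumption).
    """
    return confusion @ true_props

def direct_estimate(means_by_category, means_unlabeled):
    """Estimate proportions without classifying individual documents.

    Solves means_unlabeled ~= means_by_category @ props by least squares,
    then clips negatives and renormalizes so the shares sum to 1.
    means_by_category has shape (n_features, n_categories).
    """
    props, *_ = np.linalg.lstsq(means_by_category, means_unlabeled, rcond=None)
    props = np.clip(props, 0.0, None)
    return props / props.sum()

# Invented example: two categories with true shares 20% and 80%.
true_props = np.array([0.2, 0.8])

# A classifier that is 90% accurate within each category.
confusion = np.array([[0.9, 0.1],
                      [0.1, 0.9]])
cc_est = classify_and_count(true_props, confusion)

# Invented per-category feature means from a labeled set (3 features).
means_by_category = np.array([[0.7, 0.1],
                              [0.2, 0.5],
                              [0.4, 0.4]])
# Unlabeled-set feature means, generated assuming the same
# feature-given-category distribution as in the labeled set.
means_unlabeled = means_by_category @ true_props
direct_est = direct_estimate(means_by_category, means_unlabeled)

print("classify-and-count:", np.round(cc_est, 3))
print("direct estimate:   ", np.round(direct_est, 3))
```

In this toy setup, classify-and-count reports about 26% for the first category even though the classifier is 90% accurate, while the direct estimate recovers the 20% share. That recovery depends on the feature distribution within each category being the same in the labeled and unlabeled sets, and on the categories having distinguishable feature means, which are exactly the conditions the summary says can break down in practice.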
🧪 How the improved method works
This paper develops an improved direct estimation approach that mitigates those weaknesses.
📊 What was tested and what happened
💡 Why this matters
🧰 Tools available
Easy-to-use software implementing all described ideas is provided, enabling replication and application across varied text analysis tasks.

An Improved Method of Automated Nonparametric Content Analysis for Social Science was authored by Connor Jerzak, Gary King and Anton Strezhnev. It was published by Cambridge University Press in Political Analysis in 2023.
