🔎 What problem this paper addresses
Many scholars classify individual documents into categories, but social scientists often need accurate estimates of the proportion of documents in each category. Parametric “classify-and-count” approaches can be highly model-dependent and can yield more biased proportion estimates even as the percentage of correctly classified documents rises. Direct, nonparametric estimation of proportions avoids some of those model-dependence problems but can fail when the meaning of language shifts between the training and test sets or when categories use very similar language.
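The point that higher classification accuracy need not mean lower bias in the estimated proportions can be illustrated with a small arithmetic sketch. Everything in it is an assumption made for illustration: the two binary classifiers, their true and false positive rates, and the true share of 0.20 are hypothetical numbers, not results from the paper.

```python
def classify_and_count_share(true_share, tpr, fpr):
    """Expected share of documents a binary classifier labels positive."""
    return true_share * tpr + (1.0 - true_share) * fpr


def accuracy(true_share, tpr, fpr):
    """Expected share of documents the classifier labels correctly."""
    return true_share * tpr + (1.0 - true_share) * (1.0 - fpr)


TRUE_SHARE = 0.20  # hypothetical true proportion of the positive category

# Hypothetical classifiers: (true positive rate, false positive rate).
classifiers = {"A": (0.60, 0.10), "B": (0.50, 0.00)}

for name, (tpr, fpr) in classifiers.items():
    estimate = classify_and_count_share(TRUE_SHARE, tpr, fpr)
    print(
        f"classifier {name}: accuracy={accuracy(TRUE_SHARE, tpr, fpr):.2f}, "
        f"estimated share={estimate:.2f}, bias={estimate - TRUE_SHARE:.2f}"
    )
```

Under these assumed rates, classifier B labels more documents correctly than classifier A (about 0.90 versus 0.84), yet its classify-and-count estimate is about 0.10 rather than the true 0.20, because its errors all fall in one direction while A's errors happen to cancel.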
🧪 How the improved method works
This paper develops an improved direct estimation approach that mitigates those weaknesses by:
- optimizing continuous text features rather than relying only on discrete labels or simple counts;
- incorporating a form of matching adapted from the causal inference literature to align training and test distributions; and
- maintaining a nonparametric focus on estimating category proportions rather than classifying individual documents.
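As a rough illustration of what estimating proportions directly (without classifying individual documents) can look like, the sketch below uses the accounting identity that the mean feature vector of the test set equals the category-conditional mean feature vectors weighted by the unknown category proportions. It is a minimal sketch under stated assumptions, not the paper's method: the feature optimization and matching steps listed above are not implemented, the solver (projected gradient descent onto the probability simplex) is a choice made here for self-containment, and the names `project_to_simplex`, `estimate_proportions`, `X`, and `s` as well as all numbers are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) == 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(X, s, steps=2000):
    """Minimize ||X p - s||^2 over the simplex by projected gradient descent.

    X[f][c]: mean of feature f among training documents in category c.
    s[f]:    mean of feature f among unlabeled test documents.
    """
    n_feat, n_cat = len(X), len(X[0])
    # Step size 1 / ||X||_F^2 is no larger than 1 / (largest eigenvalue of X'X).
    lr = 1.0 / sum(x * x for row in X for x in row)
    p = [1.0 / n_cat] * n_cat  # start from equal proportions
    for _ in range(steps):
        resid = [sum(X[f][c] * p[c] for c in range(n_cat)) - s[f]
                 for f in range(n_feat)]
        grad = [sum(X[f][c] * resid[f] for f in range(n_feat))
                for c in range(n_cat)]
        p = project_to_simplex([p[c] - lr * grad[c] for c in range(n_cat)])
    return p


# Hypothetical category-conditional feature means (4 features, 3 categories).
X = [
    [0.9, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.8],
    [0.5, 0.1, 0.4],
]
true_p = [0.5, 0.3, 0.2]  # hypothetical test-set proportions
# Test-set feature means implied by the identity s = X p.
s = [sum(X[f][c] * true_p[c] for c in range(3)) for f in range(4)]

estimated_p = estimate_proportions(X, s)
print([round(x, 4) for x in estimated_p])
```

In this noiseless toy setup the identity holds exactly, so the solver recovers the assumed proportions. The identity relies on each category using language the same way in the training and test sets, and on categories having distinguishable feature means, which are the two conditions whose failure the improvements listed above are meant to address.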
📊 What was tested and what happened
- The approach was evaluated on a diverse collection of 73 datasets.
- Results show that the new method substantially outperforms standard classify-and-count and prior direct estimation approaches, especially when language meaning shifts or categories are textually similar.
💡 Why this matters
- Reduces the model dependence and bias that can afflict parametric classify-and-count techniques.
- Preserves the advantages of direct, nonparametric proportion estimation while making it robust to changes in language and subtle category distinctions.
- Offers a practical solution for social scientists focused on population-level inferences from text data.
🧰 Tools available
Easy-to-use software that implements all of the ideas described in the paper is provided, enabling replication and application to a wide range of text analysis tasks.