
How RoBERTa Finds Government Rationales for Internment—And Where It Fails

Tags: Text Analysis | RoBERTa | text classification | internment without trial | archival OCR errors | human rights abuses | Methodology | @JOPDataverse

What the Authors Ask

Sarah Dreier, Sofia Serrano, Emily Gade, and Noah A. Smith test whether recent advances in natural language processing (NLP) improve researchers' ability to detect politically meaningful concepts in messy historical archives—specifically, government rationalizations for internment without trial. The question matters for scholars who want scalable ways to mine political texts for evidence of rights abuses and official justifications.

The Data and The Challenge

The team works with imperfectly digitized archival texts that contain spelling errors, OCR noise, and context-specific language. The target concept—official rationalizations for internment without trial—is subtle and often embedded in bureaucratic prose, making automated detection challenging.
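
As a small illustration of why such noise defeats word-level methods (my example, not one from the article): an OCR misreading like "internrnent" for "internment" no longer matches any keyword list or bag-of-words vocabulary entry, while a subword tokenizer still decomposes it into pieces a contextual model can work with.

```python
# Illustrative only: assumes the Hugging Face transformers library and the
# public "roberta-base" checkpoint; the misspelling simulates a typical OCR error.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("internment"))    # clean word splits into known pieces, e.g. ['intern', 'ment']
print(tok.tokenize("internrnent"))   # OCR-garbled form still yields usable subword pieces
```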

How They Tested Models

  • The authors evaluate a modern contextual-embedding model (RoBERTa) against conventional supervised text-classification approaches (a hedged sketch of this kind of head-to-head appears after this list).
  • They assess the reliability of automated labels and explore how model-specification choices and manual interventions (targeted human review and annotation) affect performance.
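
To make the comparison concrete, here is a minimal sketch of such a head-to-head: a TF-IDF plus logistic-regression baseline against a fine-tuned RoBERTa classifier. This is not the authors' replication code; the tiny corpus, the labels, the "roberta-base" checkpoint, and the hyperparameters are all illustrative assumptions, and it assumes scikit-learn, datasets, and transformers are installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical placeholder data; the real corpus is hand-labeled OCR'd archive text,
# coded 1 ("contains an internment rationalization") or 0 (does not).
train_texts = [
    "detention without trial is necessary to restore public order",
    "the minister visited the facility on tuesday",
    "internment protects the community from imminent violence",
    "routine correspondence regarding staffing levels",
]
train_labels = [1, 0, 1, 0]
test_texts = ["emergency powers justify holding suspects indefinitely",
              "minutes of the quarterly budget meeting"]
test_labels = [1, 0]

# Conventional baseline: bag-of-words TF-IDF features + logistic regression.
vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(train_texts), train_labels)
print("baseline F1:", f1_score(test_labels, clf.predict(vec.transform(test_texts))))

# Contextual-embedding alternative: fine-tune RoBERTa for sequence classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels}).map(encode, batched=True)
test_ds = Dataset.from_dict({"text": test_texts, "label": test_labels}).map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-internment", num_train_epochs=3,
                           per_device_train_batch_size=4, report_to="none"),
    train_dataset=train_ds,
    eval_dataset=test_ds,
)
trainer.train()  # fine-tune; then score trainer.predict(test_ds) with F1 as above
```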

Key Findings

  • RoBERTa outperforms conventional supervised methods at identifying and classifying government rationalizations for internment in these archival texts.
  • Despite its relative advantage, RoBERTa remains inadequate for some research objectives that require near-perfect accuracy or fine-grained causal claims.
  • Combining RoBERTa with targeted manual annotation and careful model specification can substantially reduce the manual-coding burden while preserving usable data quality (a minimal triage sketch follows this list).
  • The authors note that RoBERTa and similar models would likely perform even better on cleaner, contemporary text sources.
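
The reduced-coding-burden finding suggests a triage workflow: trust the model where it is confident and send ambiguous documents to human annotators. The sketch below is one illustration of that idea, not the authors' procedure; `predict_proba` stands in for any fitted classifier's positive-class probability, and the 0.9/0.1 cutoffs are arbitrary placeholders, not values from the article.

```python
from typing import Callable, Iterable, List, Tuple

def triage(docs: Iterable[str],
           predict_proba: Callable[[str], float],
           hi: float = 0.9, lo: float = 0.1) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Accept confident machine labels; route ambiguous documents to humans."""
    auto_labeled: List[Tuple[str, int]] = []
    needs_review: List[str] = []
    for doc in docs:
        p = predict_proba(doc)  # P(document contains an internment rationalization)
        if p >= hi:
            auto_labeled.append((doc, 1))   # confident positive: keep machine label
        elif p <= lo:
            auto_labeled.append((doc, 0))   # confident negative: keep machine label
        else:
            needs_review.append(doc)        # ambiguous: targeted manual annotation
    return auto_labeled, needs_review
```

Only the `needs_review` bucket is hand-coded, which is where the savings come from; widening the [lo, hi] band trades more annotation effort for higher label quality.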

Practical Guidance for Researchers

The article shows when contextual NLP is worth applying and when human-in-the-loop workflows remain necessary. It also offers concrete advice on model choice, specification, and annotation strategy for political scientists working with context-specific policy discussions or poorly digitized historical records.

Why This Matters

This work clarifies both the promise and limits of state-of-the-art NLP for classifying politically salient concepts—helping researchers decide when machine labeling can responsibly replace or augment manual coding in studies of repression, civil liberties, and archival political texts.

Troubles in Text: Using Natural Language Processing to recognize government rationalizations for rights abuses was authored by Sarah K. Dreier, Sofia Serrano, Emily K. Gade, and Noah A. Smith. It was published by the University of Chicago Press in the Journal of Politics in 2025.
Find on Google Scholar
Find on University of Chicago Press