
📌 What was attempted:
The study presents a machine-learning approach intended to match double-blind human coding, the gold standard for content analysis in comparative politics. The goal was to classify front-page articles of a leading Hungarian daily, based on their full text, into one of 21 policy topics from the Comparative Agendas Project codebook.
🗞️ What was analyzed (data and target):
- Front-page articles from a leading Hungarian daily newspaper
- Each article's full text, to be assigned to one of 21 policy topics from the Comparative Agendas Project codebook
🔍 How the hybrid binary snowball approach worked:
- Combined supervised machine learning with limited human coding effort
- Converted the multiclass (21-way) problem into a series of binary classification tasks
- Used a snowball procedure that augmented the training set with machine-classified observations after each successful round and also between corpora
- Designed specifically to handle strongly imbalanced topic classes while minimizing human labor
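The steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the smoothed token log-odds scorer, the `threshold` and `max_rounds` parameters, the function names, and the toy English articles are all assumptions made for illustration. It shows only the general mechanism: one binary (topic versus rest) model per topic, with confidently classified articles fed back into the training set for later rounds.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_binary(docs, topic):
    """Fit smoothed token log-odds for `topic` versus all other topics."""
    pos, neg = Counter(), Counter()
    for text, label in docs:
        (pos if label == topic else neg).update(tokenize(text))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        t: math.log((pos[t] + 1) / pos_total) - math.log((neg[t] + 1) / neg_total)
        for t in vocab
    }


def score(model, text):
    """Mean log-odds over known tokens; -inf if no token is known."""
    known = [t for t in tokenize(text) if t in model]
    return sum(model[t] for t in known) / len(known) if known else float("-inf")


def snowball(labeled, unlabeled, topics, threshold=0.5, max_rounds=5):
    """Hypothetical snowball loop: accept confident positives, retrain, repeat."""
    train = list(labeled)   # human-coded seed set, grows with machine labels
    pool = list(unlabeled)  # articles still without a label
    machine = {}            # article text -> machine-assigned topic
    for _ in range(max_rounds):
        added = False
        for topic in topics:
            model = train_binary(train, topic)
            accepted = [d for d in pool if score(model, d) > threshold]
            for d in accepted:
                machine[d] = topic
                train.append((d, topic))  # the snowball step
            pool = [d for d in pool if d not in machine]
            added = added or bool(accepted)
        if not added:  # stop once a full round adds nothing
            break
    return machine, pool


# Toy data, invented for this example only.
seed = [("budget tax deficit", "economy"), ("hospital doctor nurse", "health")]
articles = [
    "tax budget inflation",
    "doctor hospital strike",
    "inflation wages",
    "football match result",
]
machine, unlabeled_left = snowball(seed, articles, ["economy", "health"])
```

In this toy run, "inflation wages" is labeled only in the second round, after "tax budget inflation" has joined the training set. "football match result" never clears the threshold and stays unlabeled, which mirrors the limited coverage that comes with a precision-first design. The same feedback loop would also propagate any early misclassification, so the acceptance threshold matters.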
🧾 Key results:
- Precision exceeded 80% for most topic codes
- Precision was higher than is customary for human coders and for most computer-assisted coding projects
- High precision came with limited coverage: fewer than 60% of articles were labeled by the system
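To make the two reported quantities concrete, the sketch below computes precision and coverage for a classifier that is allowed to abstain. The function name and the toy labels are assumptions for illustration, not figures from the study, and the study reports precision per topic code, whereas this example pools all labels for brevity.

```python
def precision_and_coverage(gold, predicted):
    """Precision over labeled items and share of items labeled.

    `predicted` holds None wherever the system abstained.
    """
    labeled = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    coverage = len(labeled) / len(gold)
    precision = sum(g == p for g, p in labeled) / len(labeled) if labeled else 0.0
    return precision, coverage


# Invented toy labels: 10 articles, 5 labeled by the system, 4 of those correct.
gold = ["econ", "econ", "health", "health", "law", "law", "econ", "health", "law", "econ"]
predicted = ["econ", "econ", "health", "law", None, None, "econ", None, None, None]
precision, coverage = precision_and_coverage(gold, predicted)
```

Here precision is 4/5 and coverage is 5/10: the system is usually right when it commits to a label, but half of the articles are left for human coders.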
⚖️ Why this matters:
- Demonstrates a practical workflow that trades broader coverage for higher precision when human resources are constrained
- Offers a scalable option for high-precision topic labeling in comparative politics, with explicit trade-offs between label quality and the share of articles labeled