Supervised cross-domain topic classification trains a model on a labeled source corpus and applies it to an unlabeled target corpus from a different domain. This approach leverages existing labeled data to reduce effort compared with collecting new within-domain training data, while offering clearer, research-targeted topics than unsupervised methods.
🔍 The Approach
- A supervised classifier is trained to assign topics in a labeled source corpus and then used to extrapolate those topic labels to documents in an unlabeled target corpus from another domain.
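As a rough illustration of this train-then-transfer step, the minimal sketch below fits a simple text classifier on a toy labeled "platform" corpus and applies it to toy "speech" texts. The scikit-learn pipeline, the column names, and the example documents are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal cross-domain sketch (toy data; not the paper's actual pipeline).
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Labeled source corpus: party-platform sentences with topic codes (hypothetical).
platforms = pd.DataFrame({
    "text": ["Expand public health insurance.", "Fund more rural hospitals.",
             "Cut taxes for small businesses.", "Balance the federal budget."],
    "topic": ["health", "health", "economy", "economy"],
})
# Unlabeled target corpus: parliamentary speech excerpts (hypothetical).
speeches = pd.DataFrame({
    "text": ["Hospital waiting lists keep growing.",
             "This budget raises taxes on working families."],
})

# Train on the labeled source domain...
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(platforms["text"], platforms["topic"])

# ...then extrapolate topic labels to the unlabeled target domain.
speeches["predicted_topic"] = clf.predict(speeches["text"])
print(speeches)
```

Any probabilistic text classifier could stand in for the logistic regression here; the essential move is fitting on the labeled domain and predicting on the unlabeled one.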
📚 Data Used: Party Platforms to Parliamentary Speeches
- Source corpus: labeled party platforms.
- Target corpus: unlabeled parliamentary speeches.
🧪 How Performance Was Evaluated
- Standard within-domain error metrics were reported.
- Cross-domain performance was additionally validated by manually labeling a subset of the target-corpus documents and comparing those hand labels against the classifier's assignments.
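A hedged sketch of both checks, reusing the toy setup from the first snippet; the `hand_label` column standing in for the manually coded target subset is hypothetical, as are the metrics and data.

```python
# Illustrative evaluation sketch (toy data; shown only to make the two
# validation steps concrete).
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

platforms = pd.DataFrame({
    "text": ["Expand public health insurance.", "Fund more rural hospitals.",
             "Cut taxes for small businesses.", "Balance the federal budget."],
    "topic": ["health", "health", "economy", "economy"],
})
# Hypothetical manually coded subset of the target corpus.
speeches_labeled = pd.DataFrame({
    "text": ["Hospital waiting lists keep growing.",
             "This budget raises taxes on working families."],
    "hand_label": ["health", "economy"],
})

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# (1) Standard within-domain error: cross-validated accuracy on the source corpus.
cv_acc = cross_val_score(clf, platforms["text"], platforms["topic"], cv=2)
print("within-domain CV accuracy:", cv_acc.mean())

# (2) Cross-domain validation: fit on all source data and compare the
#     classifier's assignments against the hand-labeled target subset.
clf.fit(platforms["text"], platforms["topic"])
preds = clf.predict(speeches_labeled["text"])
print("cross-domain accuracy vs. hand labels:",
      accuracy_score(speeches_labeled["hand_label"], preds))
```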
📈 Key Findings
- The classifier can accurately assign topics to parliamentary speeches.
- Accuracy varies substantially by topic, indicating that some topics transfer across domains better than others (see the per-topic sketch after this list).
- Reusing existing labeled data makes this method substantially more efficient than building a new within-domain supervised model, which would require labeling fresh training data.
- Compared with unsupervised topic models, the supervised cross-domain method can be more precisely targeted to a research question and yields topics that are easier to validate and interpret.
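A small sketch of the kind of per-topic breakdown behind the second finding, using hypothetical hand labels and predictions; in practice these vectors would come from the manually coded target subset described above.

```python
# Per-topic agreement between hand labels and classifier assignments
# (hypothetical vectors for illustration).
import pandas as pd

hand = pd.Series(["health", "health", "economy", "defense", "defense", "defense"])
pred = pd.Series(["health", "economy", "economy", "defense", "health", "defense"])

# Share of documents where the classifier matches the hand label, by topic;
# low-agreement topics are the ones that transfer poorly across domains.
per_topic_agreement = (hand == pred).groupby(hand).mean()
print(per_topic_agreement.sort_values())
```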
⚙️ Tools and Applications
- Diagnostic tools are proposed to evaluate when cross-domain classification will perform well and to identify problematic topics (a generic stand-in check is sketched after this list).
- Two case studies illustrate substantive use: how electoral rules and the gender of parliamentarians influence the choice of speech topics.
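To make both bullets concrete, here is a self-contained sketch with toy data: a generic confidence-gap check stands in for the proposed diagnostics (which this summary does not detail), and a simple topic-share comparison stands in for the case studies. The `gender` and `electoral_rule` columns and all example texts are hypothetical.

```python
# Illustrative sketch for this section (toy data; not the paper's actual
# diagnostics or case-study results). Column names are hypothetical.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Labeled source corpus (party platforms) and target corpus (speeches).
platforms = pd.DataFrame({
    "text": ["Expand public health insurance.", "Fund more rural hospitals.",
             "Cut taxes for small businesses.", "Balance the federal budget."],
    "topic": ["health", "health", "economy", "economy"],
})
speeches = pd.DataFrame({
    "text": ["Hospital waiting lists keep growing.",
             "This budget raises taxes on working families.",
             "Our clinics need more nurses.",
             "Small businesses are struggling under these taxes."],
    "gender": ["female", "male", "female", "male"],        # hypothetical metadata
    "electoral_rule": ["PR", "SMD", "PR", "SMD"],          # hypothetical metadata
})

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(platforms["text"], platforms["topic"])
speeches["predicted_topic"] = clf.predict(speeches["text"])

# Generic diagnostic idea: a large drop in classifier confidence on the target
# corpus relative to the source corpus can flag poor transfer, and low
# confidence concentrated in particular predicted topics can flag problematic
# topics.
source_conf = clf.predict_proba(platforms["text"]).max(axis=1)
target_conf = clf.predict_proba(speeches["text"]).max(axis=1)
print("mean top-class probability, source:", source_conf.mean())
print("mean top-class probability, target:", target_conf.mean())

# Case-study-style application: compare predicted topic shares across groups
# of speakers, e.g. by gender or by electoral rule.
print(pd.crosstab(speeches["gender"], speeches["predicted_topic"], normalize="index"))
print(pd.crosstab(speeches["electoral_rule"], speeches["predicted_topic"], normalize="index"))
```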
💡 Why It Matters
- Enables reuse of labeled resources to extend topic measurement across domains, saving time and improving interpretability.
- Provides a practical workflow and diagnostics for researchers studying political texts across different institutional contexts.