
Why Preprocessing Choices Can Make or Break Unsupervised Text Analysis

Preprocessing | Unsupervised Learning | Sensitivity Analysis | Feature Selection | Methodology | Pol. An. | 14 R files | Dataverse

🔎 Problem and Contribution

Unsupervised text-as-data methods are widely used in political science, yet systematic attention to preprocessing choices is scarce. Those choices — from how text is tokenized to which features are retained — can have profound effects on the results of real models applied to real data. Substantive theory is often too vague to guide feature selection in these settings, and lessons from the supervised learning literature are not necessarily applicable to unsupervised tasks.

🛠️ How Sensitivity Is Evaluated

A statistical procedure and accompanying software are introduced to examine how findings change under alternate preprocessing regimes. The approach does not replace substantive judgment; instead, it provides tools to quantify and visualize the variability that different preprocessing decisions induce when analyzing a particular dataset.

The procedure helps to:

  • Systematically compare model outputs across multiple preprocessing pipelines
  • Characterize the degree of variability in results attributable to preprocessing
  • Produce diagnostics useful for assessing robustness and informing feature selection
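The general idea behind such a sensitivity check can be sketched in a few lines. The code below is an illustrative toy example, not the authors' actual procedure or software: it enumerates a small factorial space of binary preprocessing choices (lowercasing, stopword removal, crude stemming), applies every resulting pipeline to a toy corpus, and reports how much each pairwise document similarity varies across pipelines. Large spreads flag conclusions that may hinge on arbitrary preprocessing decisions.

```python
# Illustrative sketch only (not the published R software): measure how
# pairwise document similarities vary across preprocessing pipelines.
import itertools
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def preprocess(text, lowercase, remove_stopwords, stem):
    """Apply one combination of binary preprocessing choices."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem:  # crude suffix stripping stands in for a real stemmer
        tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    return tokens

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(v * c2[k] for k, v in c1.items())
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

def pairwise_similarities(corpus, choices):
    bows = [Counter(preprocess(doc, *choices)) for doc in corpus]
    n = len(bows)
    return [cosine(bows[i], bows[j])
            for i in range(n) for j in range(i + 1, n)]

corpus = [
    "The senators debated the spending bill in committee.",
    "Senators are debating spending bills and amendments.",
    "The court reviewed the amendment to the law.",
]

# 2^3 = 8 pipelines from three binary preprocessing choices.
results = {choices: pairwise_similarities(corpus, choices)
           for choices in itertools.product([False, True], repeat=3)}

# Spread of each document-pair similarity across the 8 pipelines.
n_pairs = len(next(iter(results.values())))
for pair_idx in range(n_pairs):
    vals = [sims[pair_idx] for sims in results.values()]
    print(f"pair {pair_idx}: min={min(vals):.2f} "
          f"max={max(vals):.2f} spread={max(vals) - min(vals):.2f}")
```

The published procedure is considerably richer (it considers more preprocessing steps and model-based diagnostics), but the structure is the same: hold the data fixed, vary the pipeline, and summarize the induced variability.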

📌 Key Findings

  • Preprocessing decisions materially alter results from unsupervised models on empirical political text data.
  • Substantive theory typically lacks the specificity required for reliable feature selection in unsupervised settings.
  • Advice drawn from supervised learning is not always appropriate for unsupervised analyses.
  • The proposed procedure and software complement substantive expertise by revealing when results are stable versus when they are driven by arbitrary preprocessing choices, thereby facilitating more transparent reporting and replication.

🌟 Why It Matters

Making scholars aware of the sensitivity of their results to preprocessing choices improves the credibility and replicability of unsupervised text-as-data research. The tools offered enable researchers to report not just point estimates from a single pipeline but a characterization of how those estimates might vary under plausible alternative preprocessing decisions.

Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do About It, by Matthew Denny and Arthur Spirling, was published by Cambridge University Press in Political Analysis in 2018.
Find on Google Scholar
Find on JSTOR
Find on CUP