
Google Translate Works for Bag-of-Words Text Analysis

machine translation | bag-of-words | topic models | Europarl | LDA | Methodology | @Pol. An. | Dataverse

Comparative text analysis faces a basic hurdle: texts are written in different languages. Some researchers have proposed translating all texts into English using Google Translate before analysis. This study tests whether that shortcut undermines bag-of-words approaches such as topic models or whether machine translation preserves the features scholars rely on.

🔍 What Was Compared

  • Two versions of the same multilingual corpus (Europarl): a gold-standard human-translated English corpus and a machine-translated English corpus produced by Google Translate.
  • Two analytical outputs: term–document matrices (TDMs) and Latent Dirichlet Allocation (LDA) topic models.
  • Evaluation at both the document level and the overall corpus level to capture fine-grained and aggregate effects.
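
To make the comparison design concrete, here is a minimal sketch of comparing two term–document matrices at the document level and at the corpus level. This summary does not say which similarity measure the authors used, so cosine similarity is an assumption here, and the sentences, function names, and whitespace tokenizer are hypothetical stand-ins rather than Europarl data or the study's pipeline.

```python
import math
from collections import Counter

def tdm(docs):
    """Return one bag-of-words Counter per document (lowercased, whitespace tokens)."""
    return [Counter(doc.lower().split()) for doc in docs]

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy sentences standing in for the two English versions.
human = ["the council adopted the budget", "members debated fishing quotas"]
machine = ["the council approved the budget", "members discussed fishing quotas"]

human_tdm = tdm(human)
machine_tdm = tdm(machine)

# Document level: compare each aligned pair of documents.
doc_level = [cosine(h, m) for h, m in zip(human_tdm, machine_tdm)]

# Corpus level: pool term counts over all documents, then compare once.
corpus_level = cosine(sum(human_tdm, Counter()), sum(machine_tdm, Counter()))

print(doc_level, corpus_level)
```

The document-level scores show where individual translations diverge, while the pooled score shows whether those differences wash out in aggregate.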

🧪 Key Findings

  • TDMs from human-translated and machine-translated texts are highly similar, with only minor differences across languages.
  • A substantial proportion of features (terms) overlap between the gold-standard and machine-translated corpora.
  • LDA topic models show strong resemblance in both topical prevalence (how common topics are) and topical content (what topics look like), again with only small cross-language differences.
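
The feature-overlap finding can be illustrated with a short sketch. It assumes overlap is measured as the share of gold-standard terms that also appear in the machine-translated vocabulary, with Jaccard similarity as a symmetric alternative. Both measures, the toy sentences, and the function names are illustrative assumptions, not the authors' reported procedure.

```python
def vocabulary(docs):
    """Set of distinct lowercased whitespace tokens across all documents."""
    return {tok for doc in docs for tok in doc.lower().split()}

def overlap_share(reference_docs, candidate_docs):
    """Share of reference-corpus terms that also occur in the candidate corpus."""
    ref = vocabulary(reference_docs)
    cand = vocabulary(candidate_docs)
    return len(ref & cand) / len(ref) if ref else 0.0

def jaccard(docs_a, docs_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = vocabulary(docs_a), vocabulary(docs_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical toy sentences, not drawn from Europarl.
gold = ["the council adopted the budget", "members debated fishing quotas"]
machine = ["the council approved the budget", "members discussed fishing quotas"]

share = overlap_share(gold, machine)
jac = jaccard(gold, machine)
print(share, jac)
```

The directional share answers "how much of the gold-standard vocabulary survives machine translation", whereas Jaccard also penalizes terms that appear only in the machine-translated corpus.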

💡 Why It Matters

  • For researchers using bag-of-words techniques, Google Translate provides a practical and reliable way to harmonize multilingual corpora without substantially distorting term-level or topic-level results.
  • These results support the practice of translating non-English texts into English for comparative bag-of-words applications, while acknowledging small language-specific deviations that merit caution in sensitive or high-stakes contexts.
No Longer Lost in Translation: Evidence That Google Translate Works for Comparative Bag-of-Words Text Applications, by Erik De Vries, Martijn Schoonvelde and Gijs Schumacher, was published by Cambridge University Press in Political Analysis in 2018.