
The Problem
Data sets that quantify social-scientific concepts often rely on multiple experts to code latent, ordinal traits. Reporting the simple average across experts is common practice, but experts can differ both in their reliability and in how they interpret rating scales, and these differences can make the mean a biased or inaccurate summary of the underlying concept.
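As a toy illustration of the scale-interpretation problem, the sketch below (hypothetical numbers, using NumPy's `digitize` to mimic expert cutpoints) shows two experts who observe identical latent values but map them onto a 1-4 ordinal scale with different thresholds, so their average codes diverge even though they agree about the underlying cases.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=1_000)  # the same underlying cases, seen by both experts

def code(values, cutpoints):
    """Translate latent values into ordinal codes 1..4 via expert-specific cutpoints."""
    return np.digitize(values, cutpoints) + 1

strict_codes  = code(latent, cutpoints=[-0.5, 0.5, 1.5])   # harsh reading of the scale
lenient_codes = code(latent, cutpoints=[-1.5, -0.5, 0.5])  # generous reading of the scale

# Identical latent values, yet the average rating depends on which expert did the coding,
# so a simple mean across experts mixes substance with scale interpretation.
print(strict_codes.mean(), lenient_codes.mean())
```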
Data Sources: V-Dem and Realistic Simulations
How Models Were Compared
A range of item-response theory (IRT) models was evaluated against the standard practice of reporting average expert codes. Comparisons focused on the ability of each approach to recover underlying latent concepts when experts vary in scale interpretation or reliability, and when differential item functioning (DIF) is present.
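As a rough sketch of the modeling idea (not the authors' exact specification or estimator), the code below simulates sparse expert coding with expert-specific cutpoints, recovers the latent concept with a simple ordered-logit IRT model fit by maximum a posteriori estimation, and compares that estimate with the per-case average code. All variable names, priors, and sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
n_cases, n_experts, n_cats = 150, 12, 4

theta_true = rng.normal(size=n_cases)                            # latent concept per case
base_cuts = np.array([-1.0, 0.0, 1.0])
tau_true = base_cuts + rng.normal(0, 0.7, size=(n_experts, 1))   # DIF: shifted cutpoints

def category_probs(theta, tau):
    """P(y = k) for every case-expert pair under an ordered-logit response."""
    upper = expit(tau[None, :, :] - theta[:, None, None])        # P(y <= k) for k < K-1
    cum = np.concatenate([np.zeros_like(upper[..., :1]), upper,
                          np.ones_like(upper[..., :1])], axis=-1)
    return np.diff(cum, axis=-1)

# Each case is rated by only a few experts, so scale differences do not average out
rated = rng.random((n_cases, n_experts)) < 0.3
rated[np.arange(n_cases), rng.integers(n_experts, size=n_cases)] = True  # >= 1 rater per case
y = (rng.random((n_cases, n_experts, 1)) >
     category_probs(theta_true, tau_true).cumsum(axis=-1)).sum(axis=-1)

def neg_log_post(params):
    """Negative log posterior: ordered-logit likelihood plus a N(0,1) prior on theta."""
    theta = params[:n_cases]
    tau = np.sort(params[n_cases:].reshape(n_experts, n_cats - 1), axis=1)  # keep cutpoints ordered
    p = np.take_along_axis(category_probs(theta, tau), y[..., None], axis=-1)[..., 0]
    ll = np.where(rated, np.log(p + 1e-12), 0.0).sum()
    return -(ll - 0.5 * (theta ** 2).sum())

naive_mean = np.nanmean(np.where(rated, y, np.nan), axis=1)      # standard practice: average code

init = np.concatenate([np.zeros(n_cases), np.tile(base_cuts, n_experts)])
theta_hat = minimize(neg_log_post, init, method="L-BFGS-B").x[:n_cases]

# The IRT estimate typically tracks the true latent values more closely than the mean
# when experts interpret the scale differently and rate different subsets of cases.
print("corr(average code, truth):", round(np.corrcoef(naive_mean, theta_true)[0, 1], 3))
print("corr(IRT estimate, truth):", round(np.corrcoef(theta_hat, theta_true)[0, 1], 3))
```

A fuller treatment would also estimate expert-specific reliability (discrimination) parameters and use Bayesian sampling rather than a point estimate; the sketch above only models cutpoint differences.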
Key Findings
Why It Matters
Item-response theory offers an intuitive, practical way to account for differences in expert judgment and scale interpretation when aggregating ordinal expert ratings. Given the superior performance of IRT models under realistic patterns of expert disagreement, and their robustness when such disagreement is absent, cross-national data producers are advised to adopt IRT techniques for aggregating expert-coded measures of latent concepts.

"IRT Models for Expert-Coded Panel Data" was authored by Kyle Marquardt and Daniel Pemstein and published in Political Analysis (Cambridge University Press) in 2018.
