🔒 Privacy Problem Exposed
De-identification—removing names and direct identifiers—has long been the standard way to share survey data. Recent work shows these procedures do not stop intentional re-identification attacks, creating a real risk for large survey programs in academia, government, and industry. This risk is especially acute in political science because respondents’ political beliefs are among the most sensitive information they provide.
🔎 How Re-identification Was Tested
A practical demonstration confirms the threat: individuals were re-identified from a de-identified survey about a controversial referendum on declaring that life begins at conception. Key points about the demonstration:
- The target dataset was a survey on a politically sensitive referendum.
- Conventional de-identification (removing direct identifiers) was insufficient to prevent intentional re-identification.
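The summary does not say how the re-identification was carried out. As an illustration only, the sketch below shows one well-known approach, a linkage attack, in which remaining attributes (quasi-identifiers) are joined against an outside list that still carries names. The field names, the `link_records` helper, and all records are hypothetical and are not taken from the study.

```python
from collections import defaultdict

# Hypothetical quasi-identifiers left in the survey after names were removed.
QUASI_IDENTIFIERS = ("zip_code", "birth_year", "gender")


def link_records(survey_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, response) pairs for survey rows whose quasi-identifiers
    match exactly one record in the public list.

    Simplification: assumes every respondent appears in the public list.
    """
    index = defaultdict(list)
    for person in public_rows:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for row in survey_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((candidates[0]["name"], row["response"]))
    return matches


# Toy, made-up data: a "de-identified" survey and a named public list.
survey = [
    {"zip_code": "00001", "birth_year": 1980, "gender": "F", "response": "oppose"},
    {"zip_code": "00002", "birth_year": 1975, "gender": "M", "response": "support"},
]
public_list = [
    {"name": "Person A", "zip_code": "00001", "birth_year": 1980, "gender": "F"},
    {"name": "Person B", "zip_code": "00002", "birth_year": 1975, "gender": "M"},
    {"name": "Person C", "zip_code": "00002", "birth_year": 1975, "gender": "M"},
]

matches = link_records(survey, public_list)
print(matches)
```

In this toy case the first respondent is matched to a single name, while the second stays ambiguous because two public records share the same attributes. Removing the name column alone does nothing to prevent the first match.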
🛡️ A Practical Fix Built on Differential Privacy
A set of new data-sharing procedures, grounded in the formal notion of differential privacy, is proposed to address the problem. These procedures provide:
- Mathematical guarantees that individual respondents’ privacy is protected against a wide class of re-identification attacks.
- Statistical-validity guarantees that allow social scientists to analyze the released, differentially private data while accounting for the privacy-induced noise.
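To make the two guarantees concrete, here is a minimal sketch using randomized response on a single yes/no item. This is an assumption made for illustration: the summary does not name the mechanism the proposed procedures use, and randomized response is just one simple mechanism known to satisfy ε-differential privacy. The function names and the parameter `epsilon` are hypothetical.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true answer under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def privatize(answers, epsilon, rng):
    """Flip each 0/1 answer with probability 1 - keep_probability(epsilon)."""
    p = keep_probability(epsilon)
    return [a if rng.random() < p else 1 - a for a in answers]


def estimate_proportion(reported, epsilon):
    """Debiased estimate of the true share of 1s, with its standard error.

    E[reported mean] = share * (2p - 1) + (1 - p), so we invert that map.
    The standard error is scaled by 1 / (2p - 1) to account for the noise.
    """
    p = keep_probability(epsilon)
    n = len(reported)
    raw = sum(reported) / n
    estimate = (raw - (1.0 - p)) / (2.0 * p - 1.0)
    std_error = math.sqrt(raw * (1.0 - raw) / n) / (2.0 * p - 1.0)
    return estimate, std_error


# Usage with simulated answers (30% true "yes"), not real survey data.
rng = random.Random(0)
true_answers = [1] * 6000 + [0] * 14000
released = privatize(true_answers, epsilon=1.0, rng=rng)
estimate, std_error = estimate_proportion(released, epsilon=1.0)
print(f"estimate={estimate:.3f}, std_error={std_error:.4f}")
```

The released list is what an analyst would receive. The correction in `estimate_proportion` is what "accounting for the privacy-induced noise" looks like in this simple setting: analyzing the released answers as if they were the true ones would bias the estimate toward 0.5.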
⚖️ Trade-offs and Implications
The primary cost of deploying differential privacy for survey data is larger standard errors in estimates derived from the privatized data. However, this cost has a clear remedy: larger sample sizes reduce the privacy-induced loss of precision. Implications include:
- A necessary shift in data-sharing practice from ad hoc de-identification to formally private release mechanisms.
- A planning consideration for survey designers and funders to budget for larger samples when differential privacy is required.
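The sample-size trade-off can be sketched numerically. The calculation below assumes, purely for illustration, that a single yes/no item is protected with randomized response at privacy level `epsilon`. The summary does not specify the mechanism, so the exact numbers are not those of the proposed procedures. The helper names are hypothetical.

```python
import math


def variance_inflation(true_share: float, epsilon: float) -> float:
    """Ratio of the privatized estimator's variance to the non-private one,
    for a proportion estimated from randomized-response answers."""
    if not 0.0 < true_share < 1.0:
        raise ValueError("true_share must be strictly between 0 and 1")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))  # prob. of a truthful report
    q = true_share * (2.0 * p - 1.0) + (1.0 - p)       # expected reported share
    private_var = q * (1.0 - q) / (2.0 * p - 1.0) ** 2
    plain_var = true_share * (1.0 - true_share)
    return private_var / plain_var


def required_sample_size(plain_n: int, true_share: float, epsilon: float) -> int:
    """Sample size needed to match the precision of plain_n non-private answers."""
    return math.ceil(plain_n * variance_inflation(true_share, epsilon))


for eps in (0.5, 1.0, 2.0):
    n_needed = required_sample_size(1000, true_share=0.5, epsilon=eps)
    print(f"epsilon={eps}: about {n_needed} respondents to match n=1000")
```

Under these assumptions, stronger privacy (smaller `epsilon`) raises the required sample size, while the estimator's variance still shrinks in proportion to 1/n. For a true share of 0.5 and `epsilon` equal to ln 3, the inflation factor is about 4.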
💡 Why It Matters
Adopting differential privacy preserves respondent confidentiality with provable guarantees while keeping survey data usable for research. Without it, traditional de-identification leaves respondents vulnerable to re-identification—undermining trust in survey research and threatening the viability of studies that collect highly sensitive political information.