
🧾 About the Facebook URLs dataset and its privacy noise
The Facebook URLs Dataset contains over 40 trillion cell values, making it one of the largest social science research datasets ever assembled. The release applies a version of differential privacy that adds specially calibrated random noise, providing mathematical guarantees for the privacy of individual research subjects while aiming to preserve aggregate patterns useful to social scientists.
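To make "specially calibrated random noise" concrete, here is a minimal sketch of one classic differentially private mechanism, the Laplace mechanism, in which noise scale is calibrated to a query's sensitivity divided by the privacy parameter ε. This is illustrative only: the actual Facebook URLs release uses its own noise mechanism, and the function name and parameters below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0):
    """Release a count with Laplace noise calibrated to sensitivity/epsilon.

    Smaller epsilon -> larger noise scale -> stronger privacy protection.
    (Illustrative sketch; not the mechanism used in the actual release.)
    """
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

# A hypothetical cell value of 10,000, released under epsilon = 0.5:
noisy = laplace_mechanism(10_000, sensitivity=1.0, epsilon=0.5)
```

The key property, exploited by the correction methods described below, is that the distribution of the added noise is known exactly to the analyst.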
⚠️ Why standard analyses can be misleading
Random noise in the release acts like measurement error, which induces statistical bias in conventional analyses. Typical distortions can include regression coefficients attenuated toward zero, estimates that are exaggerated or even wrongly signed, and standard errors and confidence intervals that misstate uncertainty.
⚙️ How bias is corrected at scale
Methods originally developed to correct naturally occurring measurement error are adapted to the specifics of the differentially private release, with special attention to computational efficiency for extremely large datasets. A key methodological advantage is that, unlike natural measurement error, the distribution of the privacy noise is known exactly, so point estimates and uncertainty estimates can be corrected accordingly while keeping computations efficient enough for trillions of cells.
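As a sketch of the general idea, here is the textbook method-of-moments errors-in-variables correction for a simple-regression slope when the noise variance is known, as it would be in a differentially private release. This is a simplified stand-in for the paper's more general estimators; the variable names and settings are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
sigma2 = 1.0  # noise variance, assumed published with the DP release
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)            # true slope = 2
x_noisy = x + rng.normal(0, np.sqrt(sigma2), n)

def corrected_slope(x_noisy, y, sigma2):
    """Moment-based errors-in-variables correction with known noise variance.

    Subtracting sigma2 from the noisy covariate's variance recovers an
    estimate of the true covariate's variance, undoing attenuation.
    """
    s_xx = np.var(x_noisy, ddof=1)
    s_xy = np.cov(x_noisy, y, ddof=1)[0, 1]
    return s_xy / (s_xx - sigma2)

naive = np.cov(x_noisy, y, ddof=1)[0, 1] / np.var(x_noisy, ddof=1)
corrected = corrected_slope(x_noisy, y, sigma2)  # close to the true slope 2
```

Because the correction uses only sufficient statistics (a variance and a covariance), it can be computed in one pass over the data, which is what makes this style of correction feasible at the scale of trillions of cells.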
⭐ Why it matters
These methods reconcile strong formal privacy protections with credible social-science inference, enabling researchers to draw reliable conclusions from massive differentially private data releases such as the Facebook URLs Dataset.

*Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset*, by Georgina Evans and Gary King, was published in *Political Analysis* (Cambridge University Press) in 2023.
