Releasing private population analytics: How should we do it?
Giordana Verrengia
Apr 15, 2024
Population analytics are widely used by governments and nonprofits to allocate resources within communities and to understand their makeup and demographics. However, these analytics can leak sensitive data about individual members of a population. As a result, private population analytics are increasingly being used to communicate these quantities. Most notably, the U.S. Census Bureau recently started using a technique called differential privacy to release randomized statistics from the census while protecting the privacy of individual respondents. As this trend grows, an important question is: how exactly should we release differentially-private population analytics? The main challenge is that, if done carelessly, these techniques can harm the usefulness of the data.
A team of Carnegie Mellon University researchers conducted an empirical comparison of two existing classes of methods used to release differentially-private population analytics. The group, led by Aadyaa Maddi, who earned a master's in privacy engineering in 2023, sought to determine which of the two classes of algorithms did a better job at preserving both data privacy and accuracy. The research was done as part of the Upanzi Network and in collaboration with CMU's CyLab Security and Privacy Institute.
The first method in question, the TopDown algorithm, is already well established; in fact, it is used to protect the data released by the U.S. Census Bureau every ten years. The TopDown algorithm answers queries by releasing noisy responses: random "noise" slightly perturbs the numbers in a dataset before they are released for public consumption, so that no individual's record can be inferred from the published statistics. This is an important feature given the sensitive nature of census data, which covers categories like age, race, and sex across the U.S. population.
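To make the idea of noisy responses concrete, here is a minimal sketch of answering a single count query with differential privacy using the Laplace mechanism. It is not the Census Bureau's actual TopDown implementation; the function name, the example count, and the epsilon value are illustrative assumptions.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to a privacy budget epsilon.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so noise drawn from Laplace(1/epsilon) suffices
    for an epsilon-differentially-private release.
    """
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: a block-level population count of 1,200 released under epsilon = 0.5
print(noisy_count(1200, epsilon=0.5))
```

Smaller values of epsilon mean stricter privacy and therefore larger noise, which is the accuracy-versus-privacy tension the study examines.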
"Because the downstream applications of this type of data are so important, we want to make sure that statistics can be as accurate as possible even with differential privacy measures in place," says Maddi. Some of the downstream applications of population data include funding allocations and deciding the number of representatives per district to send to Congress.
In comparison, the second method, synthetic data release, preserves privacy by generating an artificial dataset that mimics the statistical properties of the original but whose records do not correspond to real individuals.
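As a rough illustration of the synthetic data idea, the sketch below builds a differentially-private noisy histogram of one attribute and samples an artificial dataset from it. This is a simplified stand-in, not one of the synthetic data algorithms benchmarked in the study; the attribute, bin count, and epsilon value are assumptions.

```python
import numpy as np

def synthetic_release(records: np.ndarray, n_bins: int, epsilon: float, size: int) -> np.ndarray:
    """A minimal sketch of synthetic data release via a noisy histogram.

    1. Build a histogram of the sensitive records.
    2. Add Laplace noise to each bin (disjoint bins have sensitivity 1).
    3. Sample an artificial dataset from the resulting distribution.
    """
    counts, edges = np.histogram(records, bins=n_bins)
    noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=n_bins)
    probs = np.clip(noisy, 0, None)          # negative noisy counts are clipped
    probs = probs / probs.sum()              # sketch assumes at least one positive bin
    bin_idx = np.random.choice(n_bins, size=size, p=probs)
    # Sample uniformly within each chosen bin to produce synthetic values
    return np.random.uniform(edges[bin_idx], edges[bin_idx + 1])

ages = np.random.randint(0, 100, size=5000)   # stand-in for a sensitive age column
synthetic_ages = synthetic_release(ages, n_bins=20, epsilon=1.0, size=5000)
print(synthetic_ages[:5])
```

Analysts can then run any query against the synthetic records, which is why this approach can handle queries it was never tuned for.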
The head-to-head comparison tested the TopDown algorithm and synthetic data release to see which could answer familiar queries (those used to tune the algorithms) and unfamiliar queries (new ones) more accurately while adhering to the same privacy guarantees.
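One simple way to picture such a comparison is to measure the error of each mechanism's answers on a set of counting queries. The hypothetical harness below shows only that bookkeeping: the helper functions, the toy age attribute, and the Laplace-perturbed stand-in answers are illustrative assumptions, not the team's benchmark code.

```python
import numpy as np

def answer_queries(dataset, predicates):
    """True answers to a set of counting queries (one predicate per query)."""
    return np.array([np.sum(pred(dataset)) for pred in predicates], dtype=float)

def mean_abs_error(truth, released):
    """Average absolute error between true and privately released answers."""
    return float(np.mean(np.abs(truth - released)))

# Toy dataset: a single age attribute (illustrative assumption).
ages = np.random.randint(0, 100, size=5000)

# "Familiar" queries the mechanism was tuned for vs. an "unfamiliar" one.
familiar = [lambda d: d < 18, lambda d: (d >= 18) & (d < 65)]
unfamiliar = [lambda d: d >= 65]

truth_f = answer_queries(ages, familiar)
truth_u = answer_queries(ages, unfamiliar)

# In an actual benchmark, released answers would come from TopDown or from
# querying a synthetic dataset; Laplace-perturbed answers stand in here.
released_f = truth_f + np.random.laplace(scale=2.0, size=truth_f.shape)
released_u = truth_u + np.random.laplace(scale=2.0, size=truth_u.shape)

print("familiar-query error:  ", mean_abs_error(truth_f, released_f))
print("unfamiliar-query error:", mean_abs_error(truth_u, released_u))
```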
According to the results of the study, the TopDown algorithm did a better job of maintaining accuracy on familiar queries while satisfying the privacy guarantees, regardless of the type of query and of how strict those guarantees were. The synthetic data release method sacrificed accuracy under stricter privacy guarantees, but was better able to handle unfamiliar queries.
Given these results, the team established recommendations for organizations that want to release private population data. There is no one-size-fits-all approach to handling sensitive information, and the best tool depends on an organization's priorities. The TopDown algorithm remains the better option for stakeholders who would like to release numbers that are closer to the original dataset, as long as they only need to answer familiar queries.
"This research is important because synthetic data is being discussed as a possible technique for releasing differentially-private analytics, as a replacement for TopDown," says Giulia Fanti, an assistant professor of electrical and computer engineering. "Our work shows that existing synthetic data algorithms still have some way to go before they can compete with TopDown on in-distribution (or familiar) queries."
Swadhin Routray, who earned a master's in privacy engineering in 2023, and Alexander Goldberg, a Ph.D. student in the School of Computer Science, also worked on this research project. The article, "Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown," was featured as a poster at the Fifth AAAI Workshop on Privacy-Preserving Artificial Intelligence in Vancouver, Canada.