A National Science Foundation Secure and Trustworthy Cyberspace Project
This prototype system will allow researchers with sensitive datasets to make differentially private statistics about their data available through data repositories using the Dataverse platform.
Our prototype system will allow researchers to:
- upload private data to a secured Dataverse archive,
- decide what statistics they would like to release about that data, and
- release privacy-preserving versions of those statistics to the repository, where they can be explored through a curator interface, including via interactive queries, without releasing the raw data.
A paper describing our system can be found here. This system was created by the Privacy Tools for Sharing Research Data project. Differential privacy is a mathematical framework for enabling statistical analysis of sensitive datasets while ensuring that individual-level information cannot be leaked. The project website contains resources for learning more about differential privacy.
The first part of this system is a tool that helps both data depositors and data analysts distribute a global privacy budget across many statistics. Users select which statistics they would like to calculate and are given estimates of how accurately each statistic can be computed. They can also redistribute their privacy budget according to which statistics they think are most valuable in their dataset. This work has motivated new theoretical results from our group that maximize the utility achievable when using differential privacy to share many statistics about a research dataset.
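The budget-distribution step above can be illustrated with a small sketch. This is not the project's actual tool (which is implemented in R); it is a hypothetical Python illustration of the two ideas involved: splitting a global epsilon across statistics according to user-chosen weights, and reporting an accuracy estimate for each allocation under the Laplace mechanism.

```python
import math

def laplace_error(sensitivity, epsilon, confidence=0.95):
    # For Laplace noise with scale b = sensitivity/epsilon,
    # P(|noise| > t) = exp(-t/b), so with probability `confidence`
    # the error is at most b * ln(1/(1-confidence)).
    return (sensitivity / epsilon) * math.log(1.0 / (1.0 - confidence))

def split_budget(total_epsilon, weights):
    # Allocate the global epsilon proportionally to user-chosen weights
    # (basic sequential composition: the per-statistic epsilons sum to
    # the global budget).
    total = sum(weights.values())
    return {name: total_epsilon * w / total for name, w in weights.items()}

# Hypothetical example: the depositor weights the mean twice as
# heavily as each of two histograms, under a global epsilon of 1.0.
weights = {"mean_income": 2.0, "hist_age": 1.0, "hist_edu": 1.0}
budget = split_budget(1.0, weights)
for name, eps in sorted(budget.items()):
    print(name, round(eps, 3), round(laplace_error(1.0, eps), 3))
```

Reweighting a statistic and re-running the loop shows the trade-off the interface exposes: more budget for one statistic tightens its error bound while loosening the others'.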
Once the data depositor has distributed their privacy budget, the second part of our system draws differentially private versions of the selected statistical summaries from a library of differentially private routines (which we created in the R statistical language and also make available for use by the R community) and stores them in the metadata associated with that file on Dataverse. Future researchers who wish to explore restricted social science data can then access these privacy-preserving summary statistics either from the metadata or through the TwoRavens graphical data exploration tool built for Dataverse, which we have adapted for differentially private statistics.
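To make concrete what "drawing a differentially private version" of a statistic means, here is a minimal sketch of one such routine, a Laplace-mechanism mean. This is a hypothetical Python illustration, not the project's R library; the clamping range and epsilon are assumptions a depositor would supply.

```python
import random

def dp_mean(values, lower, upper, epsilon):
    """Differentially private mean via the Laplace mechanism (sketch).

    Values are clamped to the declared range [lower, upper] so that one
    person's record can shift the mean by at most (upper - lower) / n,
    which bounds the sensitivity."""
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / n
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # A Laplace(0, scale) sample is the difference of two exponentials.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_mean + noise
```

A release stored in the Dataverse metadata would be one such noisy value, e.g. `dp_mean(incomes, 0, 500_000, 0.1)`, alongside the range and epsilon used, so future readers can interpret its accuracy.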
Our system will allow some of the privacy budget to be reserved for future data analysts, who can choose their own differentially private statistics to calculate (selected from the library of differentially private algorithms provided by the system). Differential privacy ensures that even if these queries are chosen adversarially, individual-level information will not be leaked. This currently works through an interactive command-line system; a graphical user interface is under development.
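The reserved-budget mechanism can be sketched as a simple accountant. This is an illustrative Python sketch of the underlying idea (basic sequential composition), not the system's actual implementation: each answered query spends part of the reserved epsilon, and queries that would overspend are refused, so adaptively chosen queries can never exceed the global guarantee.

```python
class BudgetAccountant:
    """Tracks the epsilon reserved for future analysts (sketch).

    Under basic sequential composition, the epsilons of all answered
    queries simply add up, so refusing any query that would push the
    total past the reserve preserves the global privacy guarantee."""

    def __init__(self, reserved_epsilon):
        self.remaining = reserved_epsilon

    def charge(self, epsilon):
        # Refuse the query outright rather than partially answer it.
        if epsilon > self.remaining:
            raise RuntimeError("reserved privacy budget exhausted")
        self.remaining -= epsilon
        return self.remaining
```

An interactive session would wrap each library routine with a `charge` call, so the command-line system can report how much budget an analyst has left after every query.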