National Institute of Standards and Technology: “Databases across the country include information with potentially important research implications and uses, e.g. contingency planning in disaster scenarios, identifying safety risks in aviation, assist in tracking contagious diseases, identifying patterns of violence in local communities. However, included in these datasets are personally identifiable information (PII) and it is not enough to simply remove PII from these datasets. It is well known that using auxiliary and possibly completely unrelated datasets, in combination with records in the dataset, can correspond to uniquely identifiable individuals (known as a linkage attack). Today’s efforts to remove PII do not provide adequate protection against linkage attacks. With the advent of “big data” and technological advances in linking data, there are far too many other possible data sources related to each of us that can lead to our identity being uncovered.
Get Involved – How to Participate
The Unlinkable Data Challenge is a multi-stage Challenge. This first stage of the Challenge is intended to source detailed concepts for new approaches, inform the final design in the two subsequent stages, and provide recommendations for matching stage 1 competitors into teams for subsequent stages. Teams will predict and justify where their algorithm fails with respect to the utility-privacy frontier curve.
In this stage, competitors are asked to propose how to de-identify a dataset using less than the available privacy budget, while also maintaining the dataset’s utility for analysis. For example, the de-identified data, when put through the same analysis pipeline as the original dataset, produces comparable results (i.e. similar coefficients in a linear regression model, or a classifier that produces similar predictions on sub-samples of the data).
This stage of the Challenge seeks Conceptual Solutions that describe how to use and/or combine methods in differential privacy to mitigate privacy loss when publicly releasing datasets in a variety of industries such as public safety, law enforcement, healthcare/biomedical research, education, and finance. We are limiting the scope to addressing research questions and methodologies that require regression, classification, and clustering analysis on datasets that contain numerical, geo-spatial, and categorical data.
To compete in this stage, we are asking that you propose a new algorithm utilizing existing or new randomized mechanisms with a justification of how this will optimize privacy and utility across different analysis types. We are also asking you to propose a dataset that you believe would make a good use case for your proposed algorithm, and provide a means of comparing your algorithm and other algorithms.