Gillian Tett at the Financial Times: “Wilbur Ross suffered the political equivalent of a small(ish) black eye last month: a federal judge blocked the US commerce secretary’s attempts to insert a question about citizenship into the 2020 census and accused him of committing “egregious” legal violations.
The Supreme Court has agreed to hear the administration’s appeal in April. But while this high-profile fight unfolds, there is a second, less noticed, census issue about data privacy emerging that could have big implications for businesses (and citizens). Last weekend John Abowd, the Census Bureau’s chief scientist, told an academic gathering that statisticians had uncovered shortcomings in the protection of personal data in past censuses. There is no public evidence that anyone has actually used these weaknesses to hack records, and
The crucial problem revolves around what is known as “re-identification” risk. When companies and government institutions amass sensitive information about individuals, they typically protect privacy in two ways: they hide the full data set from outside eyes or they release it in an “anonymous” manner, stripped of identifying details. The census bureau does both: it is required by law to publish detailed data and protect confidentiality. Since 1990, it has tried to resolve these contradictory mandates by using “household-level swapping” — moving some households from one geographic location to another to generate enough uncertainty to prevent re-identification. This used to work. But today there are so many commercially-available data sets and computers are so powerful that it is possible to re-identify “anonymous” data by combining data sets. …
Thankfully, statisticians think there is a solution. The Census Bureau now plans to use a technique known as “differential privacy” which would introduce “noise” into the public statistics, using complex algorithms. This technique is expected to create just enough statistical fog to protect personal confidentiality in published data — while also preserving information in an encrypted form that statisticians can later unscramble, as needed. Companies such as Google, Microsoft