A new way to look at data privacy


Article by Adam Zewe: “Imagine that a team of scientists has developed a machine-learning model that can predict whether a patient has cancer from lung scan images. They want to share this model with hospitals around the world so clinicians can start using it in diagnosis.

But there’s a problem. To teach their model how to predict cancer, they showed it millions of real lung scan images, a process called training. Those sensitive data, which are now encoded into the inner workings of the model, could potentially be extracted by a malicious agent. The scientists can prevent this by adding noise, or generic randomness, to the model, which makes it harder for an adversary to guess the original data. However, perturbation reduces a model’s accuracy, so the less noise one can add, the better.

MIT researchers have developed a technique that enables the user to potentially add the smallest amount of noise possible, while still ensuring the sensitive data are protected.

The researchers created a new privacy metric, which they call Probably Approximately Correct (PAC) Privacy, and built a framework based on this metric that can automatically determine the minimal amount of noise that needs to be added. Moreover, this framework does not need knowledge of the inner workings of a model or its training process, which makes it easier to use for different types of models and applications.

In several cases, the researchers show that the amount of noise required to protect sensitive data from adversaries is far less with PAC Privacy than with other approaches. This could help engineers create machine-learning models that provably hide training data, while maintaining accuracy in real-world settings…

A fundamental question in data privacy is: How much sensitive data could an adversary recover from a machine-learning model with noise added to it?

Differential Privacy, one popular privacy definition, says privacy is achieved if an adversary who observes the released model cannot infer whether an arbitrary individual’s data was used in the training process. But provably preventing an adversary from distinguishing data usage often requires large amounts of noise to obscure it. This noise reduces the model’s accuracy.

PAC Privacy looks at the problem a bit differently. It characterizes how hard it would be for an adversary to reconstruct any part of randomly sampled or generated sensitive data after noise has been added, rather than only focusing on the distinguishability problem…(More)”
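As a rough illustration of the black-box calibration idea described above, the sketch below treats training as an opaque function, measures how much its released output varies across random subsamples of the data, and scales Gaussian noise to that observed spread. It is a toy illustration of the general intuition, not the authors’ PAC Privacy algorithm; the function names (`train`, `calibrate_noise`, `release`) and the stand-in “model” are our own assumptions.

```python
# Toy sketch only: not the authors' PAC Privacy implementation.
# Idea: treat training as a black box, observe how much its output varies
# across random subsamples of the data, and calibrate Gaussian noise to that
# spread before releasing the trained parameters.
import numpy as np

rng = np.random.default_rng(0)

def train(data):
    # Stand-in "model": the released output is just the column means.
    # Any black-box training routine returning a parameter vector would do.
    return data.mean(axis=0)

def calibrate_noise(data, n_trials=200, subsample=0.5):
    """Estimate per-coordinate variability of the output over random subsamples."""
    n = len(data)
    outputs = []
    for _ in range(n_trials):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        outputs.append(train(data[idx]))
    # A larger spread means the output reveals more about which records were
    # used, so more noise is needed; a smaller spread means less noise suffices.
    return np.std(outputs, axis=0)

def release(data, noise_multiplier=1.0):
    scale = calibrate_noise(data)
    params = train(data)
    return params + rng.normal(0.0, noise_multiplier * scale, size=params.shape)

if __name__ == "__main__":
    sensitive = rng.normal(size=(1000, 3))  # placeholder for private records
    print("noisy release:", release(sensitive))
```

The point of the sketch is only the calibration loop: the noise is tuned to how variable the output actually is for this particular pipeline, rather than being fixed in advance.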

How do we know how smart AI systems are?


Article by Melanie Mitchell: “In 1967, Marvin Minsky, a founder of the field of artificial intelligence (AI), made a bold prediction: “Within a generation…the problem of creating ‘artificial intelligence’ will be substantially solved.” Assuming that a generation is about 30 years, Minsky was clearly overoptimistic. But now, nearly two generations later, how close are we to the original goal of human-level (or greater) intelligence in machines?

Some leading AI researchers would answer that we are quite close. Earlier this year, deep-learning pioneer and Turing Award winner Geoffrey Hinton told Technology Review, “I have suddenly switched my views on whether these things are going to be more intelligent than us. I think they’re very close to it now and they will be much more intelligent than us in the future.” His fellow Turing Award winner Yoshua Bengio voiced a similar opinion in a recent blog post: “The recent advances suggest that even the future where we know how to build superintelligent AIs (smarter than humans across the board) is closer than most people expected just a year ago.”

These are extraordinary claims that, as the saying goes, require extraordinary evidence. However, it turns out that assessing the intelligence—or more concretely, the general capabilities—of AI systems is fraught with pitfalls. Anyone who has interacted with ChatGPT or other large language models knows that these systems can appear quite intelligent. They converse with us in fluent natural language, and in many cases seem to reason, to make analogies, and to grasp the motivations behind our questions. Despite their well-known unhumanlike failings, it’s hard to escape the impression that behind all that confident and articulate language there must be genuine understanding…(More)”.

Questions as a Device for Data Responsibility: Toward a New Science of Questions to Steer and Complement the Use of Data Science for the Public Good in a Polycentric Way


Paper by Stefaan G. Verhulst: “We are at an inflection point today in our search to responsibly handle data in order to maximize the public good while limiting both private and public risks. This paper argues that the way we formulate questions should be given more consideration as a device for modern data responsibility. We suggest that designing a polycentric process for co-defining the right questions can play an important role in ensuring that data are used responsibly, and with maximum positive social impact. In making these arguments, we build on two bodies of knowledge—one conceptual and the other more practical. These observations are supplemented by the author’s own experience as founder and lead of “The 100 Questions Initiative.” The 100 Questions Initiative uses a unique participatory methodology to identify the world’s 100 most pressing, high-impact questions across a variety of domains—including migration, gender inequality, air quality, the future of work, disinformation, food sustainability, and governance—that could be answered by unlocking datasets and other resources. This initiative provides valuable practical insights and lessons into building a new “science of questions” and builds on theoretical and practical knowledge to outline a set of benefits of using questions for data responsibility. More generally, this paper argues that, combined with other methods and approaches, questions can help achieve a variety of key data responsibility goals, including data minimization and proportionality, increasing participation, and enhancing accountability…(More)”.

Building Responsive Investments in Gender Equality using Gender Data System Maturity Models


Tools and resources by Data2X and Open Data Watch: “…to help countries check the maturity of their gender data systems and set priorities for gender data investments. The new Building Responsive Investments in Data for Gender Equality (BRIDGE) tool is designed for use by gender data focal points in national statistical offices (NSOs) of low- and middle-income countries and by their partners within the national statistical system (NSS) to communicate gender data priorities to domestic sources of financing and international donors.

The BRIDGE results will help gender data stakeholders understand the current maturity level of their gender data system, diagnose strengths and weaknesses, and identify priority areas for improvement. They will also serve as an input to any roadmap or action plan developed in collaboration with key stakeholders within the NSS.

Below are links to and explanations of our ‘Gender Data System Maturity Model’ briefs (a long and short version), our BRIDGE assessment and tools methodology, how-to guide, questionnaire, and scoring form that will provide an overall assessment of system maturity and insight into potential action plans to strengthen gender data systems…(More)”.

How Statisticians Should Grapple with Privacy in a Changing Data Landscape


Article by Joshua Snoke and Claire McKay Bowen: “Suppose you had a data set that contained records of individuals, including demographics such as their age, sex, and race. Suppose also that these data contained additional in-depth personal information, such as financial records, health status, or political opinions. Finally, suppose that you wanted to glean relevant insights from these data using machine learning, causal inference, or survey sampling adjustments. What methods would you use? What best practices would you ensure you followed? Where would you seek information to help guide you in this process?…(More)”

AI tools are designing entirely new proteins that could transform medicine


Article by Ewen Callaway: “‘OK. Here we go.’ David Juergens, a computational chemist at the University of Washington (UW) in Seattle, is about to design a protein that, in 3-billion-plus years of tinkering, evolution has never produced.

On a video call, Juergens opens a cloud-based version of an artificial intelligence (AI) tool he helped to develop, called RFdiffusion. This neural network, and others like it, are helping to bring the creation of custom proteins — until recently a highly technical and often unsuccessful pursuit — to mainstream science.

These proteins could form the basis for vaccines, therapeutics and biomaterials. “It’s been a completely transformative moment,” says Gevorg Grigoryan, the co-founder and chief technical officer of Generate Biomedicines in Somerville, Massachusetts, a biotechnology company applying protein design to drug development.

The tools are inspired by AI software that synthesizes realistic images, such as the Midjourney software that, this year, was famously used to produce a viral image of Pope Francis wearing a designer white puffer jacket. A similar conceptual approach, researchers have found, can churn out realistic protein shapes to criteria that designers specify — meaning, for instance, that it’s possible to speedily draw up new proteins that should bind tightly to another biomolecule. And early experiments show that when researchers manufacture these proteins, a useful fraction do perform as the software suggests.

The tools have revolutionized the process of designing proteins in the past year, researchers say. “It is an explosion in capabilities,” says Mohammed AlQuraishi, a computational biologist at Columbia University in New York City, whose team has developed one such tool for protein design. “You can now create designs that have sought-after qualities.”

“You’re building a protein structure customized for a problem,” says David Baker, a computational biophysicist at UW whose group, which includes Juergens, developed RFdiffusion. The team released the software in March 2023, and a paper describing the neural network appears this week in Nature¹. (A preprint version was released in late 2022, at around the same time that several other teams, including AlQuraishi’s² and Grigoryan’s³, reported similar neural networks)…(More)”.
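For readers unfamiliar with the generative technique behind these tools, the sketch below shows a simplified, Langevin-style version of the core idea (start from pure noise and iteratively denoise toward plausible samples) on a deliberately trivial target: a 2D Gaussian whose score function is known in closed form. It is our own conceptual toy, not RFdiffusion or any of the networks cited above; in the protein tools a trained neural network plays the role of `score` and the samples are 3D backbone structures generated under design constraints rather than 2D points.

```python
# Conceptual toy only: a simplified, Langevin-style version of "start from
# noise and iteratively denoise toward plausible samples". The "plausible
# samples" here are just points from a 2D Gaussian, whose score (gradient of
# the log-density) is known exactly; real protein-design tools learn the score
# with a neural network and operate on 3D structures.
import numpy as np

rng = np.random.default_rng(1)

MU = np.array([1.0, -2.0])   # centre of the toy "plausible" distribution
SIGMA = 0.5                  # its spread

def score(x):
    """Gradient of log p(x) for p = N(MU, SIGMA^2 * I)."""
    return -(x - MU) / SIGMA**2

def generate(steps=1000, step_size=5e-3):
    x = rng.normal(size=2)   # start from pure noise
    for _ in range(steps):
        # Drift toward higher-probability samples, plus fresh randomness.
        x = x + step_size * score(x) + np.sqrt(2 * step_size) * rng.normal(size=2)
    return x

if __name__ == "__main__":
    samples = np.array([generate() for _ in range(200)])
    print("sample mean (should be close to MU):", samples.mean(axis=0))
```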

Attacks on Tax Privacy: How the Tax Prep Industry Enabled Meta to Harvest Millions of Taxpayers’ Sensitive Data


Congressional Report: “The investigation revealed that:

  • Tax preparation companies shared millions of taxpayers’ data with Meta, Google, and other Big Tech firms: The tax prep companies used computer code – known as pixels – to send data to Meta and Google. While most websites use pixels, it is particularly reckless for online tax preparation websites to use them on webpages where tax return information is entered unless further steps are taken to ensure that the pixels do not access sensitive information. TaxAct, TaxSlayer, and H&R Block confirmed that they had used the Meta Pixel and had been using it “for at least a couple of years,” and all three companies had been using Google Analytics (GA) for even longer.
  • Tax prep companies shared extraordinarily sensitive personal and financial information with Meta, which used the data for diverse advertising purposes: TaxAct, H&R Block, and TaxSlayer each revealed, in response to this Congressional inquiry, that they shared taxpayer data via their use of the Meta Pixel and Google’s tools. Although the tax prep companies and Big Tech firms claimed that all shared data was anonymous, the FTC and experts have indicated that the data could easily be used to identify individuals, or to create a dossier on them that could be used for targeted advertising or other purposes. 
  • Tax prep companies and Big Tech firms were reckless about their data sharing practices and their treatment of sensitive taxpayer data: The tax prep companies indicated that they installed the Meta and Google tools on their websites without fully understanding the extent to which they would send taxpayer data to these tech firms, without consulting with independent compliance or privacy experts, and without full knowledge of Meta’s use of and disposition of the data. 
  • Tax prep companies may have violated taxpayer privacy laws by sharing taxpayer data with Big Tech firms: Under the law, “a tax return preparer may not disclose or use a taxpayer’s tax return information prior to obtaining a written consent from the taxpayer,” – and they failed to do so when it came to the information that was turned over to Meta and Google. Tax prep companies can also turn over data to “auxiliary service providers in connection with the preparation of a tax return.” But Meta and Google likely do not meet the definition of “auxiliary service providers” and the data sharing with Meta was for advertising purposes – not “in connection with the preparation of a tax return.”…(More)”.

Combining Human Expertise with Artificial Intelligence: Experimental Evidence from Radiology


Paper by Nikhil Agarwal, Alex Moehring, Pranav Rajpurkar & Tobias Salz: “While Artificial Intelligence (AI) algorithms have achieved performance levels comparable to human experts on various predictive tasks, human experts can still access valuable contextual information not yet incorporated into AI predictions. Humans assisted by AI predictions could therefore outperform both humans alone and AI alone. We conduct an experiment with professional radiologists that varies the availability of AI assistance and contextual information to study the effectiveness of human-AI collaboration and to investigate how to optimize it. Our findings reveal that (i) providing AI predictions does not uniformly increase diagnostic quality, and (ii) providing contextual information does increase quality. Radiologists do not fully capitalize on the potential gains from AI assistance because of large deviations from the benchmark Bayesian model with correct belief updating. The observed errors in belief updating can be explained by radiologists’ partially underweighting the AI’s information relative to their own and not accounting for the correlation between their own information and AI predictions. In light of these biases, we design a collaborative system between radiologists and AI. Our results demonstrate that, unless the documented mistakes can be corrected, the optimal solution involves assigning cases either to humans or to AI, but rarely to a human assisted by AI…(More)”.
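To make the Bayesian benchmark in the abstract concrete, here is a small sketch (our own toy formulation, not the paper’s model or code) that treats the radiologist’s read and the AI prediction as two noisy, correlated signals of the same underlying finding. The correlation matters because both are derived from the same scan; the generalized-least-squares weights below are the standard optimal linear combination, and the “naive” weights show what ignoring the correlation does. All parameter values are hypothetical.

```python
# Toy formulation of the belief-updating benchmark: combine two unbiased,
# correlated signals of a latent finding theta (the radiologist's own read
# and the AI prediction). Not the paper's estimation code.
import numpy as np

def optimal_weights(sigma_h, sigma_a, rho):
    """GLS weights for combining two unbiased signals with correlated errors."""
    cov = np.array([[sigma_h**2, rho * sigma_h * sigma_a],
                    [rho * sigma_h * sigma_a, sigma_a**2]])
    w = np.linalg.solve(cov, np.ones(2))  # proportional to Sigma^{-1} 1
    return w / w.sum()

def naive_weights(sigma_h, sigma_a):
    """Inverse-variance weights that treat the two signals as independent."""
    w = np.array([1 / sigma_h**2, 1 / sigma_a**2])
    return w / w.sum()

if __name__ == "__main__":
    # Hypothetical noise levels: the AI is somewhat more accurate than the human.
    sigma_h, sigma_a, rho = 1.0, 0.8, 0.6
    print("optimal (correlation-aware):", optimal_weights(sigma_h, sigma_a, rho))
    print("naive   (correlation-blind):", naive_weights(sigma_h, sigma_a))
    # Treating the signals as independent overstates the precision of the
    # combined judgment and mis-weights the two reads; the paper documents
    # radiologists underweighting the AI relative to this kind of benchmark
    # and not accounting for the correlation.
```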

Weather Warning Inequity: Lack of Data Collection Stations Imperils Vulnerable People


Article by Chelsea Harvey: “Devastating floods and landslides triggered by extreme downpours killed hundreds of people in Rwanda and the Democratic Republic of Congo in May, when some areas saw more than 7 inches of rain in a day.

Climate change is intensifying rainstorms throughout much of the world, yet scientists haven’t been able to show that the event was influenced by warming.

That’s because they don’t have enough data to investigate it.

Weather stations are sparse across Africa, making it hard for researchers to collect daily information on rainfall and other weather variables. The data that does exist often isn’t publicly available.

“The main issue in some countries in Africa is funding,” said Izidine Pinto, a senior researcher on weather and climate at the Royal Netherlands Meteorological Institute. “The meteorological offices don’t have enough funding.”

There’s often too little money to build or maintain weather stations, and strapped-for-cash governments often choose to sell the data they do collect rather than make it free to researchers.

That’s a growing problem as the planet warms and extreme weather worsens. Reliable forecasts are needed for early warning systems that direct people to take shelter or evacuate before disasters strike. And long-term climate data is necessary for scientists to build computer models that help make predictions about the future.

The science consortium World Weather Attribution is the latest research group to run into problems. It investigates the links between climate change and individual extreme weather events all over the globe. In the last few months alone, the organization has demonstrated the influence of global warming on extreme heat in South Asia and the Mediterranean, floods in Italy, and drought in eastern Africa.

Most of its research finds that climate change is making weather events more likely to occur or more intense.

The group recently attempted to investigate the influence of climate change on the floods in Rwanda and Congo. But the study was quickly mired in challenges.

The team was able to acquire some weather station data, mainly in Rwanda, Joyce Kimutai, a research associate at Imperial College London and a co-author of the study, said at a press briefing announcing the findings Thursday. But only a few stations provided sufficient data, making it impossible to define the event or to be certain that climate model simulations were accurate…(More)”.

AI and the automation of work


Essay by Benedict Evans: “…We should start by remembering that we’ve been automating work for 200 years. Every time we go through a wave of automation, whole classes of jobs go away, but new classes of jobs get created. There is frictional pain and dislocation in that process, and sometimes the new jobs go to different people in different places, but over time the total number of jobs doesn’t go down, and we have all become more prosperous.

When this is happening to your own generation, it seems natural and intuitive to worry that this time, there aren’t going to be those new jobs. We can see the jobs that are going away, but we can’t predict what the new jobs will be, and often they don’t exist yet. We know (or should know), empirically, that there always have been those new jobs in the past, and that they weren’t predictable either: no-one in 1800 would have predicted that in 1900 a million Americans would work on ‘railways’ and no-one in 1900 would have predicted ‘video post-production’ or ‘software engineer’ as employment categories. But it seems insufficient to take it on faith that this will happen now just because it always has in the past. How do you know it will happen this time? Is this different?

At this point, any first-year economics student will tell us that this is answered by, amongst other things, the ‘Lump of Labour’ fallacy.

The Lump of Labour fallacy is the misconception that there is a fixed amount of work to be done, and that if some work is taken by a machine then there will be less work for people. But if it becomes cheaper to use a machine to make, say, a pair of shoes, then the shoes are cheaper, more people can buy shoes and they have more money to spend on other things besides, and we discover new things we need or want, and new jobs. The efficiency gain isn’t confined to the shoe: generally, it ripples outward through the economy and creates new prosperity and new jobs. So, we don’t know what the new jobs will be, but we have a model that says, not just that there always have been new jobs, but why that is inherent in the process. Don’t worry about AI!

The most fundamental challenge to this model today, I think, is to say that no, what’s really been happening for the last 200 years of automation is that we’ve been moving up the scale of human capability…(More)”.