The Emergence of National Data Initiatives: Comparing proposals and initiatives in the United Kingdom, Germany, and the United States


Article by Stefaan Verhulst and Roshni Singh: “Governments are increasingly recognizing data as a pivotal asset for driving economic growth, enhancing public service delivery, and fostering research and innovation. This recognition has intensified as policymakers acknowledge that data serves as the foundational element of artificial intelligence (AI) and that advancing AI sovereignty necessitates a robust data ecosystem. However, substantial portions of generated data remain inaccessible or underutilized. In response, several nations are initiating or exploring the launch of comprehensive national data strategies designed to consolidate, manage, and utilize data more effectively and at scale. As these initiatives evolve, discernible patterns in their objectives, governance structures, data-sharing mechanisms, and stakeholder engagement frameworks reveal both shared principles and country-specific approaches.

This blog offers some initial research on the emergence of national data initiatives by examining three of them and exploring their strategic orientations and broader implications. They include:

The British state is blind


The Economist: “Britain is a bit bigger than it thought. In 2023 net migration stood at 906,000 people, rather more than the 740,000 previously estimated, according to the Office for National Statistics. It is equivalent to discovering an extra Slough. New numbers for 2022 also arrived. At first the ONS thought net migration stood at 606,000. Now it reckons the figure was 872,000, a difference roughly the size of Stoke-on-Trent, a small English city.

If statistics enable the state to see, then the British government is increasingly short-sighted. Fundamental questions, such as how many people arrive each year, are now tricky to answer. How many people are in work? The answer is fuzzy. Just how big is the backlog of court cases? The Ministry of Justice will not say, because it does not know. Britain is a blind state.

This causes all sorts of problems. The Labour Force Survey, once a gold standard of data collection, now struggles to provide basic figures. At one point the Resolution Foundation, an economic think-tank, reckoned the ONS had underestimated the number of workers by almost 1m since 2019. Even after the ONS rejigged its tally on December 3rd, the discrepancy is still perhaps 500,000, Resolution reckons. Things are so bad that Andrew Bailey, the governor of the Bank of England, makes jokes about the inaccuracy of Britain’s job-market stats in after-dinner speeches—akin to a pilot bursting out of the cockpit mid-flight and asking to borrow a compass, with a chuckle.

Sometimes the sums in question are vast. When the Department for Work and Pensions put out a new survey on household income in the spring, it was missing about £40bn ($51bn) of benefit income, roughly 1.5% of GDP or 13% of all welfare spending. This makes things like calculating the rate of child poverty much harder. Labour MPs want this line to go down. Yet it has little idea where the line is to begin with.

Even small numbers are hard to count. Britain has a backlog of court cases. How big no one quite knows: the Ministry of Justice has not published any data on it since March. In the summer, concerned about reliability, it held back the numbers (which means the numbers it did publish are probably wrong, says the Institute for Government, another think-tank). And there is no way of tracking someone from charge to court to prison to probation. Justice is meant to be blind, but not to her own conduct…(More)”.

Impact Inversion


Blog by Victor Zhenyi Wang: “The very first project I worked on when I transitioned from commercial data science to development was during the nadir between South Africa’s first two COVID waves. A large international foundation was interested in working with the South African government and a technology non-profit to build an early warning system for COVID. The non-profit operated a WhatsApp based health messaging service that served about 2 million people in South Africa. The platform had run a COVID symptoms questionnaire which the foundation hoped could help the government predict surges in cases.

This kind of data-based “nowcasting” proved a useful tool in a number of other places e.g. some cities in the US. Yet in the context of South Africa, where the National Department of Health was mired in serious capacity constraints, government stakeholders were bearish about the usefulness of such a tool. Nonetheless, since the foundation was interested in funding this project, we went ahead with it anyway. The result was that we pitched this “early warning system” a handful of times to polite public health officials but it was otherwise never used. A classic case of development practitioners rendering problems technical and generating non-solutions that primarily serve the strategic objectives of the funders.
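The mechanics of symptom-based “nowcasting” can be sketched in a few lines: fit a scaling relationship between past symptom reports and confirmed cases, then apply it to the latest reports to estimate cases before official counts arrive. This is an illustrative toy with synthetic numbers, a deliberate simplification of the models used in practice, not the system described in the blog.

```python
# Toy symptom-based nowcasting: estimate current cases from self-reported
# symptom tallies. All numbers are synthetic; the one-parameter linear
# model is a deliberate simplification of real nowcasting approaches.

def fit_scaling(symptom_counts, case_counts):
    """Least-squares scaling factor mapping symptom reports to cases."""
    num = sum(s * c for s, c in zip(symptom_counts, case_counts))
    den = sum(s * s for s in symptom_counts)
    return num / den

def nowcast(symptom_count, scale):
    """Estimate cases for a period whose official count is not yet known."""
    return scale * symptom_count

# Synthetic history: weekly symptom reports and lab-confirmed cases.
symptoms = [120, 240, 480, 400, 220]
cases = [300, 610, 1190, 1010, 545]

scale = fit_scaling(symptoms, cases)
estimate = nowcast(500, scale)  # this week's reports; cases not yet confirmed
```

In a real deployment the relationship would also account for reporting lags, changing user bases, and test availability, which is partly why such tools need buy-in from the health officials expected to act on them.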

The technology non-profit did however express interest in a different kind of service — what about a language model that helps users answer questions about COVID? The non-profit’s WhatsApp messaging service is menu-based and they thought that a natural language interface could provide a better experience for users by letting them engage with health content on their own terms. Since we had ample funding from the foundation for the early warning system, we decided to pursue the chatbot project.

The project has now expanded to multiple other services run by the same non-profit, including the largest digital health service in South Africa. The project has won multiple grants and partnerships, including with Google, and has spun out into its own open source library. In many ways, in terms of sheer number of lives affected, this is the most impactful project I have had the privilege of supporting in my career in development, and I am deeply grateful to have been part of the team involved in bringing it into existence.

Yet the truth is, the “impact” of this class of interventions remains unclear. Even though a large randomized controlled trial was done to assess the impact of the WhatsApp service, such an evaluation only captures the performance of the service on outcome variables determined by the non-profit, not on whether these outcomes are appropriate. It certainly does not tell us whether the service was the best means available to achieve the ultimate goal of improving the lives of those in communities underserved by health services.

This project, and many others that I have worked on as a data scientist in development, rests on an implicit framework for impact that I describe as the design-to-impact pipeline. A technology is designed and developed, then its impact on the world is assessed. There is a strong emphasis on reform: on improving the design, development, and deployment of development technologies. Development practitioners have a broad range of techniques to make sure that the process of creation is ethical and responsible — in some sense, legitimate. With the broad adoption of data-based methods of program evaluation, e.g. randomized controlled trials, we might even make knowledge claims that an intervention truly ought to bring certain benefits to the communities in which it is placed. This view imagines that technologies, once this process is completed, are simply unleashed onto the world, and that their impact is simply what was assessed ex ante. An industry of monitoring and evaluation surrounds their subsequent deployment; the relative success of interventions depends on the performance of benchmark indicators…(More)”.

Data for Better Governance: Building Government Analytics Ecosystems in Latin America and the Caribbean


Report by the World Bank: “Governments in Latin America and the Caribbean face significant development challenges, including insufficient economic growth, inflation, and institutional weaknesses. Overcoming these issues requires identifying systemic obstacles through data-driven diagnostics and equipping public officials with the skills to implement effective solutions.

Although public administrations in the region often have access to valuable data, they frequently fall short in analyzing it to inform decisions. The cost of this gap is substantial: inefficiencies in procurement, misdirected transfers, and poorly managed human resources result in an estimated waste of 4% of GDP, equivalent to 17% of all public spending.

The report “Data for Better Governance: Building Government Analytical Ecosystems in Latin America and the Caribbean” outlines a roadmap for developing government analytics, focusing on key enablers such as data infrastructure and analytical capacity, and offers actionable strategies for improvement…(More)”.

An Open Source Python Library for Anonymizing Sensitive Data


Paper by Judith Sáinz-Pardo Díaz & Álvaro López García: “Open science is a fundamental pillar to promote scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet in compliance with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied to a given dataset by specifying the set of identifiers, quasi-identifiers, generalization hierarchies and allowed level of suppression, along with the sensitive attribute and the level of anonymity required. The library has been implemented following best practices for integration and continuous development, as well as the use of workflows to test code coverage based on unit and functional tests…(More)”.
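To make the abstract's vocabulary concrete, here is a minimal, independent toy illustration of quasi-identifiers, a one-level generalization hierarchy, and a k-anonymity check. This is not the paper's library or its API; the column names and hierarchy are invented for the example.

```python
# Toy k-anonymity sketch: generalize a quasi-identifier (age) and measure
# the smallest equivalence class over the quasi-identifier columns.
# Independent illustration only, not the API of the paper's library.
from collections import Counter

def generalize_age(age):
    """Generalization hierarchy, level 1: exact age -> decade band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# "diagnosis" plays the role of the sensitive attribute; "zip" is already
# partially suppressed, "age" still needs generalizing.
records = [
    {"age": 34, "zip": "390**", "diagnosis": "A"},
    {"age": 36, "zip": "390**", "diagnosis": "B"},
    {"age": 52, "zip": "390**", "diagnosis": "A"},
    {"age": 57, "zip": "390**", "diagnosis": "C"},
]
for r in records:
    r["age"] = generalize_age(r["age"])

k = k_anonymity(records, ["age", "zip"])
```

A real library would additionally search over hierarchy levels and apply a bounded amount of record suppression to reach the anonymity level the user requested.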

Garden city: A synthetic dataset and sandbox environment for analysis of pre-processing algorithms for GPS human mobility data


Paper by Thomas H. Li and Francisco Barreras: “Human mobility datasets have seen increasing adoption in the past decade, enabling diverse applications that leverage the high precision of measured trajectories relative to other human mobility datasets. However, there are concerns about whether the high sparsity in some commercial datasets can introduce errors due to lack of robustness in processing algorithms, which could compromise the validity of downstream results. The scarcity of “ground-truth” data makes it particularly challenging to evaluate and calibrate these algorithms. To overcome these limitations and allow for an intermediate form of validation of common processing algorithms, we propose a synthetic trajectory simulator and sandbox environment meant to replicate the features of commercial datasets that could cause errors in such algorithms, and which can be used to compare algorithm outputs with “ground-truth” synthetic trajectories and mobility diaries. Our code is open-source and is publicly available alongside tutorial notebooks and sample datasets generated with it…(More)”
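The validation idea behind such a sandbox can be sketched simply: simulate a dense ground-truth trajectory, subsample it to mimic the sparsity of commercial GPS pings, and then compare what a processing algorithm recovers against the truth. The sketch below is a minimal toy under invented parameters, not the Garden City simulator itself.

```python
# Toy version of the sandbox idea: a dense "ground-truth" random-walk
# trajectory is subsampled into sparse pings, giving a known truth against
# which a pre-processing algorithm's output could be scored. Parameters
# (step size, keep probability) are illustrative assumptions.
import random

def ground_truth_trajectory(steps, seed=0):
    """Dense synthetic trajectory: one (t, x, y) point per time step."""
    rng = random.Random(seed)
    x = y = 0.0
    traj = []
    for t in range(steps):
        x += rng.uniform(-1, 1)
        y += rng.uniform(-1, 1)
        traj.append((t, x, y))
    return traj

def sparse_pings(traj, keep_prob, seed=1):
    """Keep each ping independently with probability keep_prob."""
    rng = random.Random(seed)
    return [p for p in traj if rng.random() < keep_prob]

truth = ground_truth_trajectory(1440)   # one ping per minute for a day
observed = sparse_pings(truth, 0.05)    # only ~5% of pings survive
```

Because `truth` is known exactly, any stop-detection or home-location algorithm run on `observed` can be scored directly, which is the intermediate form of validation the paper argues real commercial data cannot provide.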

AI, huge hacks leave consumers facing a perfect storm of privacy perils


Article by Joseph Menn: “Hackers are using artificial intelligence to mine unprecedented troves of personal information dumped online in the past year, along with unregulated commercial databases, to trick American consumers and even sophisticated professionals into giving up control of bank and corporate accounts.

Armed with sensitive health information, calling records, and hundreds of millions of Social Security numbers, criminals and operatives of countries hostile to the United States are crafting emails, voice calls and texts that purport to come from government officials, co-workers or relatives needing help, or familiar financial organizations trying to protect accounts instead of draining them.

“There is so much data out there that can be used for phishing and password resets that it has reduced overall security for everyone, and artificial intelligence has made it much easier to weaponize,” said Ashkan Soltani, executive director of the California Privacy Protection Agency, the only such state-level agency.

The losses reported to the FBI’s Internet Crime Complaint Center nearly tripled from 2020 to 2023, to $12.5 billion, and a number of sensitive breaches this year have only increased internet insecurity. The recently discovered Chinese government hacks of U.S. telecommunications companies AT&T, Verizon and others, for instance, were deemed so serious that government officials are being told not to discuss sensitive matters on the phone, some of those officials said in interviews. A Russian ransomware gang’s breach of Change Healthcare in February captured data on millions of Americans’ medical conditions and treatments, and in August, a small data broker, National Public Data, acknowledged that it had lost control of hundreds of millions of Social Security numbers and addresses now being sold by hackers.

Meanwhile, the capabilities of artificial intelligence are expanding at breakneck speed. “The risks of a growing surveillance industry are only heightened by AI and other forms of predictive decision-making, which are fueled by the vast datasets that data brokers compile,” U.S. Consumer Financial Protection Bureau Director Rohit Chopra said in September…(More)”.

Why ‘open’ AI systems are actually closed, and why this matters


Paper by David Gray Widder, Meredith Whittaker & Sarah Myers West: “This paper examines ‘open’ artificial intelligence (AI). Claims about ‘open’ AI often lack precision, frequently eliding scrutiny of substantial industry concentration in large-scale AI development and deployment, and often incorrectly applying understandings of ‘open’ imported from free and open-source software to AI systems. At present, powerful actors are seeking to shape policy using claims that ‘open’ AI is either beneficial to innovation and democracy, on the one hand, or detrimental to safety, on the other. When policy is being shaped, definitions matter. To add clarity to this debate, we examine the basis for claims of openness in AI, and offer a material analysis of what AI is and what ‘openness’ in AI can and cannot provide: examining models, data, labour, frameworks, and computational power. We highlight three main affordances of ‘open’ AI, namely transparency, reusability, and extensibility, and we observe that maximally ‘open’ AI allows some forms of oversight and experimentation on top of existing models. However, we find that openness alone does not perturb the concentration of power in AI. Just as many traditional open-source software projects were co-opted in various ways by large technology companies, we show how rhetoric around ‘open’ AI is frequently wielded in ways that exacerbate rather than reduce concentration of power in the AI sector…(More)”.

Scientists Scramble to Save Climate Data from Trump—Again


Article by Chelsea Harvey: “Eight years ago, as the Trump administration was getting ready to take office for the first time, mathematician John Baez was making his own preparations.

Together with a small group of friends and colleagues, he was arranging to download large quantities of public climate data from federal websites in order to safely store them away. Then-President-elect Donald Trump had repeatedly denied the basic science of climate change and had begun nominating climate skeptics for cabinet posts. Baez, a professor at the University of California, Riverside, was worried the information — everything from satellite data on global temperatures to ocean measurements of sea-level rise — might soon be destroyed.

His effort, known as the Azimuth Climate Data Backup Project, archived at least 30 terabytes of federal climate data by the end of 2017.

In the end, the precaution proved unnecessary.

The first Trump administration altered or deleted numerous federal web pages containing public-facing climate information, according to monitoring efforts by the nonprofit Environmental Data and Governance Initiative (EDGI), which tracks changes on federal websites. But federal databases, containing vast stores of globally valuable climate information, remained largely intact through the end of Trump’s first term.

Yet as Trump prepares to take office again, scientists are growing more worried.

Federal datasets may be in bigger trouble this time than they were under the first Trump administration, they say. And they’re preparing to begin their archiving efforts anew.

“This time around we expect them to be much more strategic,” said Gretchen Gehrke, EDGI’s website monitoring program lead. “My guess is that they’ve learned their lessons.”

The Trump transition team didn’t respond to a request for comment.

Like Baez’s Azimuth project, EDGI was born in 2016 in response to Trump’s first election. They weren’t the only ones…(More)”.

Can AI review the scientific literature — and figure out what it all means?


Article by Helen Pearson: “When Sam Rodriques was a neurobiology graduate student, he was struck by a fundamental limitation of science. Even if researchers had already produced all the information needed to understand a human cell or a brain, “I’m not sure we would know it”, he says, “because no human has the ability to understand or read all the literature and get a comprehensive view.”

Five years later, Rodriques says he is closer to solving that problem using artificial intelligence (AI). In September, he and his team at the US start-up FutureHouse announced that an AI-based system they had built could, within minutes, produce syntheses of scientific knowledge that were more accurate than Wikipedia pages1. The team promptly generated Wikipedia-style entries on around 17,000 human genes, most of which previously lacked a detailed page.

Rodriques is not the only one turning to AI to help synthesize science. For decades, scholars have been trying to accelerate the onerous task of compiling bodies of research into reviews. “They’re too long, they’re incredibly intensive and they’re often out of date by the time they’re written,” says Iain Marshall, who studies research synthesis at King’s College London. The explosion of interest in large language models (LLMs), the generative-AI programs that underlie tools such as ChatGPT, is prompting fresh excitement about automating the task…(More)”.