Beyond data egoism: let’s embrace data altruism


Blog by Frank Hamerlinck: “When it comes to data sharing, there’s often a gap between ambition and reality. Many organizations recognize the potential of data collaboration, yet when it comes down to sharing their own data, hesitation kicks in. The concern? Costs, risks, and unclear returns. At the same time, there’s strong enthusiasm for accessing data.

This is the paradox we need to break. Because if data egoism rules, real innovation is out of reach, making the need for data altruism more urgent than ever.

…More and more leaders recognize that unlocking data is essential to staying competitive on a global scale, and they understand that we must do so while upholding our European values. However, the real challenge lies in translating this growing willingness into concrete action. Many acknowledge its importance in principle, but few are ready to take the first step. And that’s a challenge we need to address – not just as organizations but as a society…

To break down barriers and accelerate data-driven innovation, we’re launching the FTI Data Catalog – a step toward making data sharing easier, more transparent, and more impactful.

The catalog provides a structured, accessible overview of available datasets, from location data and financial data to well-being data. It allows organizations to discover, understand, and responsibly leverage data with ease. Whether you’re looking for insights to fuel innovation, enhance decision-making, drive new partnerships or unlock new value from your own data, the catalog is built to support open and secure data exchange.

Feeling curious? Explore the catalog

By making data more accessible, we’re laying the foundation for a culture of collaboration. The road to data altruism is long, but it’s one worth walking. The future belongs to those who dare to share!…(More)”.

Paths to Social Licence for Tracking-data Analytics


Paper by Joshua P. White: “While tracking-data analytics can be a goldmine for institutions and companies, the inherent privacy concerns also form a legal, ethical and social minefield. We present a study that seeks to understand the extent and circumstances under which tracking-data analytics is undertaken with social licence — that is, with broad community acceptance beyond formal compliance with legal requirements. Taking a University campus environment as a case, we enquire about the social licence for Wi-Fi-based tracking-data analytics. Staff and student participants answered a questionnaire presenting hypothetical scenarios involving Wi-Fi tracking for university research and services. Our results present a Bayesian logistic mixed-effects regression of acceptability judgements as a function of participant ratings on 11 privacy dimensions. Results show widespread acceptance of tracking-data analytics on campus and suggest that trust, individual benefit, data sensitivity, risk of harm and institutional respect for privacy are the most predictive factors determining this acceptance judgement…(More)”.
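
For readers unfamiliar with the modelling approach, here is a minimal sketch of how a Bayesian logistic mixed-effects regression of binary acceptability judgements could be specified with the bambi/PyMC stack. It is an illustration only, not the authors’ code: the file name, the column names, the three predictors (standing in for the paper’s 11 privacy dimensions), and the random-effects structure are all hypothetical.

    # Illustrative sketch only -- not the study's actual analysis code.
    # File name, column names, and predictors are hypothetical placeholders.
    import bambi as bmb
    import arviz as az
    import pandas as pd

    # Hypothetical long-format data: one row per participant x scenario,
    # with a binary acceptability judgement and ratings on privacy dimensions.
    df = pd.read_csv("wifi_tracking_responses.csv")

    # Random intercepts for participants and scenarios handle repeated measures;
    # fixed effects capture the privacy-dimension ratings.
    model = bmb.Model(
        "acceptable ~ trust + individual_benefit + data_sensitivity"
        " + (1|participant) + (1|scenario)",
        df,
        family="bernoulli",
    )
    idata = model.fit(draws=2000, chains=4)

    # Posterior summaries show which dimensions predict acceptance.
    print(az.summary(idata, var_names=["trust", "individual_benefit", "data_sensitivity"]))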

The Measure of Progress: Counting What Really Matters


Book by Diane Coyle: “The ways that statisticians and governments measure the economy were developed in the 1940s, when the urgent economic problems were entirely different from those of today. In The Measure of Progress, Diane Coyle argues that the framework underpinning today’s economic statistics is so outdated that it functions as a distorting lens, or even a set of blinkers. When policymakers rely on such an antiquated conceptual tool, how can they measure, understand, and respond with any precision to what is happening in today’s digital economy? Coyle makes the case for a new framework, one that takes into consideration current economic realities.

Coyle explains why economic statistics matter. They are essential for guiding better economic policies; they involve questions of freedom, justice, life, and death. Governments use statistics that affect people’s lives in ways large and small. The metrics for economic growth were developed when a lack of physical rather than natural capital was the binding constraint on growth, intangible value was less important, and the pressing economic policy challenge was managing demand rather than supply. Today’s challenges are different. Growth in living standards in rich economies has slowed, despite remarkable innovation, particularly in digital technologies. As a result, politics is contentious and democracy strained.

Coyle argues that to understand the current economy, we need different data collected in a different framework of categories and definitions, and she offers some suggestions about what this would entail. Only with a new approach to measurement will we be able to achieve the right kind of growth for the benefit of all…(More)”.

Data Collaborations How-to-Guide


Resource by The Data for Children Collaborative: “… excited to share a consolidation of 5 years of learning in one handy how-to-guide. Our ethos is to openly share tools, approaches and frameworks that may benefit others working in the Data for Good space. We have developed this guide specifically to support organisations working on complex challenges that may have data-driven solutions. The How-to-Guide provides advice and examples of how to plan and execute collaboration on a data project effectively…  

Data collaboration can provide organisations with high-quality, evidence-based insights that drive policy and practice while bringing together diverse perspectives to solve problems. It also fosters innovation, builds networks for future collaboration, and ensures effective implementation of solutions on the ground…(More)”.

How crawlers impact the operations of the Wikimedia projects


Article by the Wikimedia Foundation: “Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we’ll discuss the reasons for this trend and its impact.

The Wikimedia projects are the largest collection of open knowledge in the world. Our sites are an invaluable destination for humans searching for information, and for all kinds of businesses that access our content automatically as a core input to their products. Most notably, the content has been a critical component of search engine results, which in turn has brought users back to our sites. But with the rise of AI, the dynamic is changing: We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone. 

When Jimmy Carter died in December 2024, his page on English Wikipedia saw more than 2.8 million views over the course of a day. This was relatively high, but manageable. At the same time, quite a few users played a 1.5-hour-long video of Carter’s 1980 presidential debate with Ronald Reagan. This caused a surge in network traffic, doubling its normal rate. As a consequence, for about one hour a small number of Wikimedia’s connections to the Internet filled up entirely, causing slow page load times for some users. The sudden traffic surge alerted our Site Reliability team, who were swiftly able to address this by changing the paths our internet connections go through to reduce the congestion. But still, this should not have caused any issues, as the Foundation is well equipped to handle high traffic spikes during exceptional events. So what happened?…

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs…(More)”.
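
As a point of contrast with the anonymous, high-volume scraping described above, the sketch below shows what self-identifying, rate-limited automated access to Wikimedia content can look like, using the public REST API. It is an illustration, not Wikimedia’s prescribed tooling: the bot name and contact address are invented, and for genuinely large-scale reuse the Foundation’s bulk-download dumps are the appropriate route rather than per-page requests.

    # A minimal sketch of polite automated access: identify the client with a
    # descriptive User-Agent and throttle requests instead of bursting.
    # The bot name and contact address below are placeholders.
    import time
    import requests

    HEADERS = {
        "User-Agent": "ExampleResearchBot/0.1 (contact: research@example.org)"
    }

    def fetch_summary(title: str) -> dict:
        """Fetch a page summary from the Wikimedia REST API."""
        url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        return resp.json()

    for title in ["Jimmy_Carter", "Wikimedia_Commons"]:
        summary = fetch_summary(title)
        print(summary["title"], "-", summary.get("description", ""))
        time.sleep(1)  # spread requests out rather than hammering the servers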

AI, Innovation and the Public Good: A New Policy Playbook


Paper by Burcu Kilic: “When Chinese start-up DeepSeek released R1 in January 2025, the groundbreaking open-source artificial intelligence (AI) model rocked the tech industry as a more cost-effective alternative to models running on more advanced chips. The launch coincided with industrial policy gaining popularity as a strategic tool for governments aiming to build AI capacity and competitiveness. Once dismissed under neoliberal economic frameworks, industrial policy is making a strong comeback with more governments worldwide embracing it to build digital public infrastructure and foster local AI ecosystems. This paper examines how the national innovation system framework can guide AI industrial policy to foster innovation and reduce reliance on dominant tech companies…(More)”.

DOGE comes for the data wonks


The Economist: “For nearly three decades the federal government has painstakingly surveyed tens of thousands of Americans each year about their health. Door-knockers collect data on the financial toll of chronic conditions like obesity and asthma, and probe the exact doses of medications sufferers take. The result, known as the Medical Expenditure Panel Survey (MEPS), is the single most comprehensive, nationally representative portrait of American health care, a balkanised and unwieldy $5trn industry that accounts for some 17% of GDP.

MEPS is part of a largely hidden infrastructure of government statistics collection now in the crosshairs of the Department of Government Efficiency (DOGE). In mid-March officials at a unit of the Department of Health and Human Services (HHS) that runs the survey told employees that DOGE had slated them for an 80-90% reduction in staff and that this would “not be a negotiation”. Since then scores of researchers have taken voluntary buyouts. Those left behind worry about the integrity of MEPS. “Very unclear whether or how we can put on MEPS” with roughly half of the staff leaving, one said. On March 27th, the health secretary, Robert F. Kennedy junior, announced an overall reduction of 10,000 personnel at the department, in addition to those who took buyouts.

There are scores of underpublicised government surveys like MEPS that document trends in everything from house prices to the amount of lead in people’s blood. Many provide standard-setting datasets and insights into the world’s largest economy that the private sector has no incentive to replicate.

Even so, America’s system of statistics research is overly analogue and needs modernising. “Using surveys as the main source of information is just not working” because it is too slow and suffers from declining rates of participation, says Julia Lane, an economist at New York University. In a world where the economy shifts by the day, the lags in traditional surveys—whose results can take weeks or even years to refine and publish—are unsatisfactory. One practical reform DOGE might encourage is better integration of administrative data such as tax records and social-security filings, which often capture the entire population and are collected as a matter of course.

As in so many other areas, however, DOGE’s sledgehammer is more likely to cause harm than to achieve improvements. And for all its clunkiness, America’s current system manages a spectacular feat. From Inuits in remote corners of Alaska to Spanish-speakers in the Bronx, it measures the country and its inhabitants remarkably well, given that the population is highly diverse and spread out over 4m square miles. Each month surveys from the federal government reach about 1.5m people, a number roughly equivalent to the population of Hawaii or West Virginia…(More)”.
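
The “integration of administrative data” mentioned above is, at its simplest, a linkage exercise: joining survey responses to records such as tax filings through a shared identifier. The sketch below is purely hypothetical (invented file and column names, a naive exact-key merge) and glosses over the access controls, de-identification, and probabilistic record-linkage methods a real statistical agency would require.

    # Hypothetical illustration of linking survey responses to administrative
    # records via a shared identifier. File and column names are invented;
    # real linkage involves strict governance and record-linkage methods
    # well beyond an exact-key merge.
    import pandas as pd

    survey = pd.read_csv("survey_responses.csv")       # e.g., self-reported medical spending
    admin = pd.read_csv("administrative_records.csv")  # e.g., tax or claims records

    linked = survey.merge(admin, on="person_id", how="left", validate="one_to_one")

    # Administrative totals can then benchmark, or eventually replace,
    # slow and costly survey items.
    print(linked[["person_id", "reported_spending", "recorded_spending"]].head())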

Researching data discomfort: The case of Statistics Norway’s quest for billing data


Paper by Lisa Reutter: “National statistics offices are increasingly exploring the possibilities of utilizing new data sources to position themselves in emerging data markets. In 2022, Statistics Norway announced that the national agency will require the biggest grocers in Norway to hand over all collected billing data to produce consumer behavior statistics which had previously been produced by other sampling methods. An online article discussing this proposal sparked a surprisingly (at least to Statistics Norway) high level of interest among readers, many of whom expressed concerns about this intended change in data practice. This paper focuses on the multifaceted online discussions of the proposal, as these enable us to study citizens’ reactions and feelings towards increased data collection and emerging public-private data flows in a Nordic context. Through an explorative empirical analysis of comment sections, this paper investigates what is discussed by commenters and reflects upon why this case sparked so much interest among citizens in the first place. It therefore contributes to the growing literature of citizens’ voices in data-driven administration and to a wider discussion on how to research public feeling towards datafication. I argue that this presents an interesting case of discomfort voiced by citizens, which demonstrates the contested nature of data practices among citizens – and their ability to regard data as deeply intertwined with power and politics. This case also reminds researchers to pay attention to seemingly benign and small changes in administration beyond artificial intelligence…(More)”.

Oxford Intersections: AI in Society


Series edited by Philipp Hacker: “…provides an interdisciplinary corpus for understanding artificial intelligence (AI) as a global phenomenon that transcends geographical and disciplinary boundaries. Edited by a consortium of experts hailing from diverse academic traditions and regions, the 11 edited and curated sections provide a holistic view of AI’s societal impact. Critically, the work goes beyond the often Eurocentric or U.S.-centric perspectives that dominate the discourse, offering nuanced analyses that encompass the implications of AI for a range of regions of the world. Taken together, the sections of this work seek to move beyond the state of the art in three specific respects. First, they venture decisively beyond existing research efforts to develop a comprehensive account and framework for the rapidly growing importance of AI in virtually all sectors of society. Going beyond a mere mapping exercise, the curated sections assess opportunities, critically discuss risks, and offer solutions to the manifold challenges AI harbors in various societal contexts, from individual labor to global business, law and governance, and interpersonal relationships. Second, the work tackles specific societal and regulatory challenges triggered by the advent of AI and, more specifically, large generative AI models and foundation models, such as ChatGPT or GPT-4, which have so far received limited attention in the literature, particularly in monographs or edited volumes. Third, the novelty of the project is underscored by its decidedly interdisciplinary perspective: each section, whether covering Conflict; Culture, Art, and Knowledge Work; Relationships; or Personhood—among others—will draw on various strands of knowledge and research, crossing disciplinary boundaries and uniting perspectives most appropriate for the context at hand…(More)”.

Legal frictions for data openness


Paper by Ramya Chandrasekhar: “…investigates the legal entanglements of re-use when data and content from the open web are used to train foundation AI models. Based on conversations with AI researchers and practitioners, an online workshop, and legal analysis of a repository of 41 legal disputes relating to copyright and data protection, this report highlights tensions between legal imaginations of data flows and computational processes involved in training foundation models.

To realise the promise of the open web as open for all, this report argues that efforts oriented solely towards techno-legal openness of training datasets are not enough. Techno-legal openness of datasets facilitates easy re-use of data. But certain well-resourced actors like Big Tech are able to take advantage of data flows on the open web to train proprietary foundation models, while giving little to no value back to either the maintenance of shared informational resources or communities of commoners. At the same time, open licenses no longer accommodate changing community preferences for the sharing and re-use of data and content.
In addition to techno-legal openness of training datasets, there is a need for certain limits on the extractive power of well-resourced actors like Big Tech, combined with increased recognition of community data sovereignty. Alternative licensing frameworks, such as the Nwulite Obodo License, Kaitiakitanga Licenses, the Montreal License, the OpenRAIL Licenses, the Open Data Commons License, and the AI2Impact Licenses, hold valuable insights in this regard. While these licensing frameworks impose more obligations on re-users and necessitate more collective thinking on interoperability, they are nonetheless necessary for the creation of healthy digital and data commons, to realise the original promise of the open web as open for all…(More)”.