Bridging the Data Provenance Gap Across Text, Speech and Video


Paper by Shayne Longpre et al.: “Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities (popular text, speech, and video datasets), from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. First, we find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Second, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video…(More)”.
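
The licensing finding turns on how restrictions propagate through dataset derivations: a dataset compiled from other datasets inherits the terms of its sources, even when its own license is permissive. Below is a minimal sketch of that inheritance logic; the dataset names and the simplified three-tier license ordering are hypothetical illustrations, not the paper’s released audit schema.

```python
# Minimal sketch: a dataset is only as permissive as the most restrictive
# source anywhere in its derivation chain. Dataset names and the three-tier
# license ordering below are hypothetical, not the paper's released schema.

RESTRICTIVENESS = {"commercial": 0, "unspecified": 1, "non-commercial": 2}

# Hypothetical derivation graph: dataset -> upstream sources it was built from.
DERIVED_FROM = {
    "chat_corpus_v2": ["chat_corpus_v1", "web_crawl_2023"],
    "chat_corpus_v1": ["forum_dump"],
    "web_crawl_2023": [],
    "forum_dump": [],
}

# License declared on each artifact itself.
DECLARED = {
    "chat_corpus_v2": "commercial",     # permissively licensed re-release ...
    "chat_corpus_v1": "unspecified",
    "web_crawl_2023": "commercial",
    "forum_dump": "non-commercial",     # ... restricted by an upstream source
}

def effective_license(name: str) -> str:
    """Most restrictive license along the full derivation chain."""
    worst = DECLARED[name]
    for source in DERIVED_FROM.get(name, []):
        upstream = effective_license(source)
        if RESTRICTIVENESS[upstream] > RESTRICTIVENESS[worst]:
            worst = upstream
    return worst

print(effective_license("chat_corpus_v2"))  # -> non-commercial
```

Under this reading, fewer than 33% of datasets can be restrictively licensed at the dataset level while over 80% of the underlying source content still carries non-commercial terms inherited from upstream.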

Reconciling open science with technological sovereignty


Paper by C. Huang & L. Soete: “Historically, open science has been effective in facilitating knowledge sharing and in promoting and diffusing innovations. However, as a result of geopolitical tensions, technological sovereignty has recently been increasingly emphasized in various countries’ science and technology policymaking, posing a challenge to open science policy. In this paper, we argue that the European Union significantly benefits from and contributes to open science and should continue to support it. Similarly, China embraced foreign technologies and engaged in open science as its economy developed rapidly over the last 40 years. Today, both economies could learn from each other in finding the right balance between open science and technological sovereignty, particularly given their very different policy experiences and the urgency of implementing new technologies to address grand challenges, such as climate change, faced by mankind…(More)”.

Nurturing innovation through intelligent failure: The art of failing on purpose


Paper by Alessandro Narduzzo and Valentina Forrer: “Failure, even in the context of innovation, is primarily conceived and experienced as an inevitable (e.g., innovation funnel) or unintended (e.g., unexpected drawbacks) outcome. This paper aims to provide a more systematic understanding of innovation failure by considering and problematizing the case of ‘intelligent failures’, namely experiments that are intentionally designed and implemented to explore technological and market uncertainty. We conceptualize intelligent failure through an epistemic perspective that recognizes its contribution to challenging and revising the organizational knowledge system. We also outline an original process model of intelligent failure that fully reveals its potential and distinctiveness in the context of learning from failure (i.e., failure as an outcome vs. failure of expectations and initial beliefs), analyzing and comparing intended and unintended innovation failures. By positioning intelligent failure in the context of innovation and explaining its critical role in enhancing the ability of innovative firms to achieve breakthroughs, we identify important landmarks for practitioners in designing an intelligent failure approach to innovation…(More)”.

A Roadmap to Accessing Mobile Network Data for Statistics


Guide by Global Partnership for Sustainable Development Data: “… introduces milestones on the path to mobile network data access. While it is aimed at stakeholders in national statistical systems and across national governments in general, the lessons should resonate with others seeking to take this route. The steps in this guide are written in the order in which they should be taken, and some readers who have already embarked on this journey may find they have completed some of these steps. 

The roadmap is meant to be followed step by step, but readers may start, stop, and return to points on the path at any time.

The path to mobile network data access has three milestones:

  1. Evaluating the opportunity – setting clear goals for the desired impact of data innovation.
  2. Engaging with stakeholders – getting critical stakeholders to support your cause.
  3. Executing collaboration agreements – signing a written agreement among partners…(More)”

Announcing the Youth Engagement Toolkit for Responsible Data Reuse: An Innovative Methodology for the Future of Data-Driven Services


Blog by Elena Murray, Moiz Shaikh, and Stefaan G. Verhulst: “Young people seeking essential services — whether mental health support, education, or government benefits — often face a critical challenge: they are asked to share their data without having a say in how it is used or for what purpose. While the responsible use of data can help tailor services to better meet their needs and ensure that vulnerable populations are not overlooked, a lack of trust in data collection and usage can have the opposite effect.

When young people feel uncertain or uneasy about how their data is being handled, they may adopt privacy-protective behaviors — choosing not to seek services at all or withholding critical information out of fear of misuse. This risks deepening existing inequalities rather than addressing them.

To build trust, those designing and delivering services must engage young people meaningfully in shaping data practices. Understanding their concerns, expectations, and values is key to aligning data use with their preferences. But how can this be done effectively?

This question was at the heart of a year-long global collaboration through the NextGenData project, which brought together partners worldwide to explore solutions. Today, we are releasing a key deliverable of that project: The Youth Engagement Toolkit for Responsible Data Reuse.

Developed and piloted during the NextGenData project, the Toolkit describes an innovative methodology for engaging young people on responsible data reuse practices, with the aim of improving services that matter to them…(More)”.

Presenting the StanDat database on international standards: improving data accessibility on marginal topics


Article by Solveig Bjørkholt: “This article presents an original database on international standards, constructed using modern data gathering methods. StanDat facilitates studies into the role of standards in the global political economy by (1) being a source for descriptive statistics, (2) enabling researchers to assess scope conditions of previous findings, and (3) providing data for new analyses, for example the exploration of the relationship between standardization and trade, as demonstrated in this article. The creation of StanDat aims to stimulate further research into the domain of standards. Moreover, by exemplifying data collection and dissemination techniques applicable to investigating less-explored subjects in the social sciences, it serves as a model for gathering, systematizing, and sharing data in areas where information is plentiful yet not readily accessible for research…(More)”.
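
To illustrate the third use, the sketch below joins a country-year count of standards with trade figures and fits a simple regression, the kind of standardization-and-trade analysis the article demonstrates. All table layouts, column names, and numbers here are invented placeholders, not StanDat’s actual schema or results.

```python
# Illustrative sketch only: all tables, column names, and figures below are
# hypothetical placeholders, not StanDat's actual schema or data.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical country-year panel: standards adopted per country.
standards = pd.DataFrame({
    "country": ["NO", "NO", "DE", "DE"],
    "year": [2010, 2011, 2010, 2011],
    "n_standards": [120, 135, 480, 505],
})

# Hypothetical trade figures for the same panel.
trade = pd.DataFrame({
    "country": ["NO", "NO", "DE", "DE"],
    "year": [2010, 2011, 2010, 2011],
    "exports_usd_bn": [130, 160, 1260, 1470],
})

# Join on the panel keys and fit a simple association between
# standardization and exports.
panel = standards.merge(trade, on=["country", "year"])
model = smf.ols("exports_usd_bn ~ n_standards", data=panel).fit()
print(model.params)
```

A real analysis would of course need many more observations, controls, and fixed effects; the point is only how readily a published panel like StanDat plugs into this workflow.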

Diversifying Professional Roles in Data Science


Policy Briefing by Emma Karoune and Malvika Sharan: “The interdisciplinary nature of the data science workforce extends beyond the traditional notion of a ‘data scientist.’ A successful data science team requires a wide range of technical expertise, domain knowledge, and leadership capabilities. To strengthen such a team-based approach, this note recommends that institutions, funders, and policymakers invest in developing and professionalising diverse roles, fostering a resilient data science ecosystem for the future.


By recognising the diverse specialist roles that collaborate within interdisciplinary teams, organisations can leverage deep expertise across multiple skill sets, enhancing responsible decision-making and fostering innovation at all levels. Ultimately, this note seeks to shift the perception of data science professionals from the conventional view of individual data scientists to a competency-based model of specialist roles within a team, each essential to the success of data science initiatives…(More)”.

Future of AI Research


Report by the Association for the Advancement of Artificial Intelligence: “As AI capabilities evolve rapidly, AI research is also undergoing a fast and significant transformation along many dimensions, including its topics, its methods, the research community, and the working environment. Topics such as AI reasoning and agentic AI have been studied for decades but now have an expanded scope in light of current AI capabilities and limitations. AI ethics and safety, AI for social good, and sustainable AI have become central themes in all major AI conferences. Moreover, research on AI algorithms and software systems is becoming increasingly tied to substantial amounts of dedicated AI hardware, notably GPUs, leading to AI architecture co-creation in a way that is more prominent now than over the last three decades.

Related to this shift, more and more AI researchers work in corporate environments, where the necessary hardware and other resources are more easily available than in academia, raising questions about the role of academic AI research, student retention, and faculty recruiting. The pervasive use of AI in our daily lives and its impact on people, society, and the environment make AI a socio-technical field of study, highlighting the need for AI researchers to work with experts from other disciplines, such as psychologists, sociologists, philosophers, and economists. The growing focus on emergent AI behaviors, rather than on designed and validated properties of AI systems, renders principled empirical evaluation more important than ever; hence the need for well-designed benchmarks, test methodologies, and sound processes for inferring conclusions from the results of computational experiments.

The exponentially increasing quantity of AI research publications and the speed of AI innovation are testing the resilience of the peer-review system, with the immediate release of papers without peer review having become widely accepted across many areas of AI research. Legacy and social media increasingly cover AI research advancements, often with contradictory statements that confuse readers and blur the line between the reality and the perception of AI capabilities. All this is happening in a geopolitical environment in which companies and countries compete fiercely and globally to lead the AI race. This rivalry may affect access to research results and infrastructure as well as global governance efforts, underscoring the need for international cooperation in AI research and innovation.

In this overwhelming, multi-dimensional, and fast-changing scenario, it is important to be able to identify the trajectory of AI research clearly and in a structured way. Such an effort can define the current trends and the research challenges still ahead of us in making AI more capable and reliable, so that we can use it safely in mundane and, most importantly, in high-stakes scenarios.

This study aims to do so by covering 17 topics related to AI research, spanning most of the transformations mentioned above. Each chapter of the study is devoted to one of these topics, sketching its history, current trends, and open challenges…(More)”.

AI could supercharge human collective intelligence in everything from disaster relief to medical research


Article by Hao Cui and Taha Yasseri: “Imagine a large city recovering from a devastating hurricane. Roads are flooded, the power is down, and local authorities are overwhelmed. Emergency responders are doing their best, but the chaos is massive.

AI-controlled drones survey the damage from above, while intelligent systems process satellite images and data from sensors on the ground and in the air to identify which neighbourhoods are most vulnerable.

Meanwhile, AI-equipped robots are deployed to deliver food, water and medical supplies into areas that human responders can’t reach. Emergency teams, guided and coordinated by AI and the insights it produces, are able to prioritise their efforts, sending rescue squads where they’re needed most.

This is no longer the realm of science fiction. In a recent paper published in the journal Patterns, we argue that it’s an emerging and inevitable reality.

Collective intelligence is the shared intelligence of a group or groups of people working together. Different groups with diverse skills, such as firefighters and drone operators, work together to generate better ideas and solutions. AI can enhance this human collective intelligence and transform how we approach large-scale crises. It’s a form of what’s called hybrid collective intelligence.

Instead of simply relying on human intuition or traditional tools, experts can use AI to process vast amounts of data, identify patterns and make predictions. By enhancing human decision-making, AI systems offer faster and more accurate insights – whether in medical research, disaster response, or environmental protection.

AI can do this by, for example, processing large datasets and uncovering insights that would take humans much longer to identify. AI can also take on physical tasks: in manufacturing, AI-powered robots can automate assembly lines, helping improve efficiency and reduce downtime.

Equally crucial is information exchange, where AI enhances the flow of information, helping human teams coordinate more effectively and make data-driven decisions faster. Finally, AI can act as a social catalyst, facilitating more effective collaboration within human teams or even helping to build hybrid teams of humans and machines working alongside one another…(More)”.

China wants tech companies to monetize data, but few are buying in


Article by Lizzi C. Lee: “Chinese firms generate staggering amounts of data daily, from ride-hailing trips to online shopping transactions. A recent policy allowed Chinese companies to record data as assets on their balance sheets, the first such regulation in the world, paving the way for data to be traded in a marketplace and to boost company valuations.

But uptake has been slow. When China Unicom, one of the world’s largest mobile operators, reported its earnings recently, eagle-eyed accountants spotted that the company had listed 204 million yuan ($28 million) in data assets on its balance sheet. The state-owned operator was the first Chinese tech giant to take advantage of the Ministry of Finance’s new corporate data policy, which permits companies to classify data as inventory or intangible assets. 

“No other country is trying to do this on a national level. It could drive global standards of data management and accounting,” Ran Guo, an affiliated researcher at the Asia Society Policy Institute specializing in data governance in China, told Rest of World. 

In 2023 alone, China generated 32.85 zettabytes — more than 27% of the global total, according to a government survey. To put that in perspective, storing this volume on standard 1-terabyte hard drives would require more than 32 billion units…. Tech companies that are data-rich are well positioned to benefit from logging data as assets, turning the formalized assets into tradable commodities, said Guo. But companies must first invest in secure storage and show that the data is legally obtained in order to meet strict government rules on data security.
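
As a quick sanity check on that comparison (assuming decimal units, where one zettabyte is a billion terabytes):

```python
# Quick check of the storage comparison, assuming decimal units
# (1 zettabyte = 1e9 terabytes); the 32.85 ZB figure is the one cited above.
zb_generated = 32.85               # zettabytes generated in China in 2023
tb_per_zb = 1e9                    # 10**21 bytes / 10**12 bytes
drives = zb_generated * tb_per_zb  # number of 1-terabyte drives required
print(f"{drives / 1e9:.2f} billion 1 TB drives")  # 32.85 billion
```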

“This can be costly and complex,” he said. “Not all data qualifies as an asset, and companies must meet stringent requirements.” 

Even China Unicom, a state-owned enterprise, is likely complying with the new policy due to political pressure rather than economic incentive, said Guo, who conducted field research in China last year on the government push for data resource development. The telecom operator did not respond to a request for comment. 

Private technology companies in China, meanwhile, tend to be protective of their data. A Chinese government statement in 2022 pushed private enterprises to “open up their data.” But smaller firms could lack the resources to meet the stringent data storage and consumer protection standards, experts and Chinese tech company employees told Rest of World…(More)”.