How crawlers impact the operations of the Wikimedia projects


Article by the Wikimedia Foundation: “Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we’ll discuss the reasons for this trend and its impact.

The Wikimedia projects are the largest collection of open knowledge in the world. Our sites are an invaluable destination for humans searching for information, and for all kinds of businesses that access our content automatically as a core input to their products. Most notably, the content has been a critical component of search engine results, which in turn has brought users back to our sites. But with the rise of AI, the dynamic is changing: We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases. Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads. This expansion happened largely without sufficient attribution, which is key to driving new users to participate in the movement, and is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.

When Jimmy Carter died in December 2024, his page on English Wikipedia saw more than 2.8 million views over the course of a day. This was relatively high, but manageable. At the same time, quite a few users played a 1.5-hour-long video of Carter’s 1980 presidential debate with Ronald Reagan. This caused network traffic to surge to double its normal rate. As a consequence, for about one hour a small number of Wikimedia’s connections to the Internet filled up entirely, causing slow page load times for some users. The sudden traffic surge alerted our Site Reliability team, who swiftly addressed it by rerouting traffic onto less congested internet connections. But this alone should not have caused any issues, as the Foundation is well equipped to handle high traffic spikes during exceptional events. So what happened?…

Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons catalog of openly licensed images to feed AI models. Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs…(More)”.
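The excerpt doesn’t describe Wikimedia’s countermeasures, but a common first line of defense against this kind of scraper load is per-client rate limiting, often implemented as a token bucket. Below is a minimal Python sketch — ours, not anything from the article, with hypothetical names and thresholds:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # budget exhausted: reject, delay, or serve a cached copy

# One bucket per client; keying on IP or user agent is a hypothetical policy.
buckets: dict[str, TokenBucket] = {}

def should_serve(client_id: str) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=5, capacity=20))
    return bucket.allow()
```

Real deployments enforce this at the CDN or load balancer rather than in application code, and typically treat declared crawlers and anonymous scrapers differently.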

Web 3.0 Requires Data Integrity


Article by Bruce Schneier and Davi Ottenheimer: “If you’ve ever taken a computer security class, you’ve probably learned about the three legs of computer security—confidentiality, integrity, and availability—known as the CIA triad. When we talk about a system being secure, that’s what we’re referring to. All are important, but to different degrees in different contexts. In a world populated by artificial intelligence (AI) systems and AI agents, integrity will be paramount.

What is data integrity? It’s ensuring that no one can modify data—that’s the security angle—but it’s much more than that. It encompasses accuracy, completeness, and quality of data—all over both time and space. It’s preventing accidental data loss; the “undo” button is a primitive integrity measure. It’s also making sure that data is accurate when it’s collected—that it comes from a trustworthy source, that nothing important is missing, and that it doesn’t change as it moves from format to format. The ability to restart your computer is another integrity measure.
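As a concrete aside (ours, not the authors’): the anti-tamper slice of integrity is what a cryptographic hash provides. Record a digest when data is collected, and any later modification becomes detectable. A minimal Python sketch with an invented sample record:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest; it changes if even one bit of `data` changes."""
    return hashlib.sha256(data).hexdigest()

# Record the fingerprint when the data is first collected...
original = b"sensor-42,2025-01-01T00:00:00Z,21.7C"  # hypothetical record
stored_digest = fingerprint(original)

# ...and verify it before trusting the data later on.
received = b"sensor-42,2025-01-01T00:00:00Z,21.7C"
if fingerprint(received) == stored_digest:
    print("integrity check passed: data is unmodified")
else:
    print("integrity check FAILED: data was altered or corrupted")
```

A hash only covers the tamper-detection angle; as the authors stress, integrity also means the data was accurate and complete when collected, which no digest can certify after the fact.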

The CIA triad has evolved with the Internet. The first iteration of the Web—Web 1.0 of the 1990s and early 2000s—prioritized availability. This era saw organizations and individuals rush to digitize their content, creating what has become an unprecedented repository of human knowledge. Organizations worldwide established their digital presence, leading to massive digitization projects where quantity took precedence over quality. The emphasis on making information available overshadowed other concerns.

As Web technologies matured, the focus shifted to protecting the vast amounts of data flowing through online systems. This is Web 2.0: the Internet of today. Interactive features and user-generated content transformed the Web from a read-only medium to a participatory platform. The increase in personal data, and the emergence of interactive platforms for e-commerce, social media, and online everything demanded both data protection and user privacy. Confidentiality became paramount.

We stand at the threshold of a new Web paradigm: Web 3.0. This is a distributed, decentralized, intelligent Web. Peer-to-peer social-networking systems promise to break the tech monopolies’ control over how we interact with each other. Tim Berners-Lee’s open W3C protocol, Solid, represents a fundamental shift in how we think about data ownership and control. A future filled with AI agents requires verifiable, trustworthy personal data and computation. In this world, data integrity takes center stage…(More)”.

Should AGI-preppers embrace DOGE?


Blog by Henry Farrell: “…AGI-prepping is reshaping our politics. Wildly ambitious claims for AGI have not only shaped America’s grand strategy, but are plausibly among the justifying reasons for DOGE.

After the announcement of DOGE, but before it properly got going, I talked to someone who was not formally affiliated, but was very definitely DOGE adjacent. I put it to this individual that tearing out the decision making capacities of government would not be good for America’s ability to do things in the world. Their response (paraphrased slightly) was: so what? We’ll have AGI by late 2026. And indeed, one of DOGE’s major ambitions, as described in a new article in WIRED, appears to have been to pull as much government information as possible into a large model that could then provide useful information across the totality of government.

The point – which I don’t think is understood nearly widely enough – is that radical institutional revolutions such as DOGE follow naturally from the AGI-prepper framework. If AGI is right around the corner, we don’t need a massive federal government apparatus organizing funding for science via the National Science Foundation and the National Institutes of Health. After all, in Amodei and Pottinger’s prediction:

By 2027, AI developed by frontier labs will likely be smarter than Nobel Prize winners across most fields of science and engineering. … It will be able to … complete complex tasks that would take people months or years, such as designing new weapons or curing diseases.

Who needs expensive and cumbersome bureaucratic institutions for organizing funding for scientists in a near future where a “country of geniuses [will be] contained in a data center,” ready to solve whatever problems we ask them to? Indeed, if these bottled geniuses are cognitively superior to humans across most or all tasks, why do we need human expertise at all, beyond describing and explaining human wants? From this perspective, most human-based institutions are obsolescing assets that need to be ripped out, and DOGE is only the barest of beginnings…(More)”.

Expanding the Horizons of Collective Artificial Intelligence (CAI): From Individual Nudges to Relational Cognition


Blog by Evelien Verschroeven: “As AI continues to evolve, it is essential to move beyond focusing solely on individual behavior changes. The individual input — whether through behavior, data, or critical content — remains important. New data and fresh perspectives are necessary for AI to continue learning, growing, and improving its relevance. However, as we head into what some are calling the golden years of AI, it’s critical to acknowledge a potential challenge: within five years, it is predicted that 50% of AI-generated content will be based on AI-created material, creating a risk of inbreeding where AI learns from itself, rather than from the diversity of human experience and knowledge.
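An aside from us, not the author: this “inbreeding” risk is often called model collapse, and its mechanism can be shown without any real model. In the toy Python simulation below, each generation trains only on the previous generation’s typical outputs, and the spread of the data measurably shrinks every round:

```python
import random
import statistics

random.seed(1)
data = [random.gauss(0, 1) for _ in range(10_000)]  # "human" data, wide variation
print(f"generation 0: stdev = {statistics.stdev(data):.3f}")

for generation in range(1, 6):
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    # Generative models favor typical outputs over rare ones; caricature that
    # by resampling only from points near the current mean.
    typical = [x for x in data if abs(x - mu) < 1.5 * sigma]
    data = [random.choice(typical) for _ in range(10_000)]
    print(f"generation {generation}: stdev = {statistics.stdev(data):.3f}")
# The standard deviation falls every generation: diversity drains away.
```

The diverse human input the post calls for plays the role of the generation-0 data here; without fresh injections of it, the feedback loop only narrows.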

Platforms like Google’s AI for Social Good and Unanimous AI’s Swarm play pivotal roles in breaking this cycle. By encouraging the aggregation of real-world data, they add new content that can influence and shape AI’s evolution. While they focus on individual data contributions, they also help keep AI systems grounded in real-world scenarios, ensuring that the content remains critical and diverse.

However, human oversight is key. AI systems, even with the best intentions, are still learning from patterns that humans provide. It’s essential that AI continues to receive diverse human input, so that its understanding remains grounded in real-world perspectives. AI should be continuously checked and guided by human creativity, critical thinking, and social contexts, to ensure that it doesn’t become isolated or too self-referential.

As we continue advancing AI, it is crucial to embrace relational cognition and collective intelligence. This approach will allow AI to address both individual and collective needs, enhancing not only personal development but also strengthening social bonds and fostering more resilient, adaptive communities…(More)”.

Bridging Digital Divides: How PescaData is Connecting Small-Scale Fishing Cooperatives to the Blue Economy


Article by Stuart Fulton: “In this research project, we examine how digital platforms – specifically PescaData – can be leveraged to connect small-scale fishing cooperatives with impact investors and donors, creating new pathways for sustainable blue economy financing, while simultaneously ensuring fair data practices that respect data sovereignty and traditional ecological knowledge.

PescaData emerged as a pioneering digital platform that enables fishing communities to collect more accurate data to support sustainable fisheries. Since its launch, PescaData has evolved to provide software as a service to fishing cooperatives and to let fishers document their solutions to environmental and economic challenges. Since 2022, small-scale fishers have used it to document nearly 300 initiatives that contribute to multiple Sustainable Development Goals.

Respecting Data Sovereignty in the Digital Age

One critical aspect of our research acknowledges the unique challenges of implementing digital tools in traditional cooperative settings. Unlike conventional tech implementations that often extract value from communities, PescaData’s approach centers on data sovereignty – the principle that fishing communities should maintain ownership and control over their data. As the PescaData case study demonstrates, a humanity-centric rather than merely user-centric approach is essential. This means designing with compassion and establishing clear governance around data from the very beginning. The data generated by fishing cooperatives represents not just information, but traditional knowledge accumulated over generations of resource management.

The fishers themselves have articulated clear principles for data governance in a cooperative model (a code sketch of the ownership principle follows the list):

  • Ownership: Fishers, as data producers, decide who has access and under what conditions.
  • Transparency: Clear agreements on data use.
  • Knowledge assessment: Highlighting fishers’ contributions and placing them in decision-making positions.
  • Co-design: Ensuring the platform meets their specific needs.
  • Security: Protecting collected data…(More)”.
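Principles like these are social commitments first, but the ownership rule can also be enforced in software. A minimal Python sketch — entirely hypothetical, not a description of PescaData’s actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class CatchRecord:
    """A data record controlled by the cooperative that produced it."""
    owner: str                                   # the data-producing cooperative
    content: dict
    grants: dict = field(default_factory=dict)   # grantee -> permitted purpose

    def grant(self, grantee: str, purpose: str) -> None:
        """Record the owner's decision to share; a real system would also
        authenticate that the caller actually is the owner."""
        self.grants[grantee] = purpose

    def read(self, requester: str, purpose: str) -> dict:
        # Access requires being the owner or holding a purpose-specific grant.
        if requester == self.owner or self.grants.get(requester) == purpose:
            return self.content
        raise PermissionError(f"{requester} has no grant for '{purpose}'")

record = CatchRecord(owner="cooperative_a", content={"species": "snapper", "kg": 120})
record.grant("university_lab", purpose="stock_assessment")
print(record.read("university_lab", purpose="stock_assessment"))  # permitted
# record.read("ad_broker", purpose="marketing") would raise PermissionError
```

The point of the sketch is that “fishers decide who has access and under what conditions” becomes a checkable property of the system, not just a policy statement.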

What is a fair exchange for access to public data?


Blog and policy brief by Jeni Tennison: “The most obvious approach to get companies to share value back to the public sector in return for access to data is to charge them. However, there are a number of challenges with a “pay to access” approach: it’s hard to set the right price; it creates access barriers, particularly for cash-poor start-ups; and it creates a public perception that the government is willing to sell people’s data, and might be tempted to loosen privacy-protecting governance controls in exchange for cash.

Are there other options? The policy brief explores a range of other approaches and assesses these against five goals that a value-sharing framework should ideally meet, to:

  • Encourage use of public data, including by being easy for organisations to understand and administer.
  • Provide a return on investment for the public sector, offsetting at least some of the costs of supporting the National Data Library (NDL) infrastructure and minimising administrative costs.
  • Promote equitable innovation and economic growth in the UK, which might mean particularly encouraging smaller, home-grown businesses.
  • Create social value, particularly towards this Government’s other missions, such as achieving Net Zero or unlocking opportunity for all.
  • Build public trust by being easily explainable, avoiding misaligned incentives that encourage the breaking of governance guardrails, and feeling like a fair exchange.

In brief, alternatives to a pay-to-access model that still provide direct financial returns include:

  • Discounts: the public sector could secure discounts on products and services created using public data. However, this could be difficult to administer and enforce.
  • Royalties: taking a percentage of charges for products and services created using public data might be similarly hard to administer and enforce, but applies to more companies.
  • Equity: taking equity in startups can provide long-term returns and align with public investment goals.
  • Levies: targeted taxes on businesses that use public data can provide predictable revenue and encourage data use.
  • General taxation: general taxation can fund data infrastructure, but it may lack the targeted approach and public visibility of other methods.

It’s also useful to consider non-financial conditions that could be put on organisations accessing public data…(More)”.

Being heard: Shaping digital futures for and with children


Blog by Laura Betancourt Basallo, Kim R. Sylwander and Sonia Livingstone: “One in three internet users is a child. Digital technologies are shaping children’s present and future, yet most digital spaces are designed by adults, for adults. Despite this disconnect, digital platforms have emerged as important spaces for children’s participation in political and cultural life, partly because this is often limited in traditional spaces.

Children’s access to and participation in the digital environment is not just desirable: the UN Convention on the Rights of the Child applies equally online and offline. Article 12 outlines children’s right to be heard in ways that genuinely influence the decisions affecting their lives. In 2021, the Committee on the Rights of the Child published its General comment No. 25, the authoritative framework on how children’s rights should be applied in relation to the digital environment—this emphasises the importance of children’s right to be heard, and to participation in the digital sphere.

Core elements for meaningful participation

Creating meaningful and rights-respecting opportunities for child and youth participation in research, policymaking, and product design demands strategic planning and practical actions. As scholar Laura Lundy explains, these opportunities should guarantee children:

  • SPACE: Children must be allowed to express their views.
  • VOICE: Children must be facilitated to express their views.
  • AUDIENCE: Their views must be listened to.
  • INFLUENCE: Their views must be acted upon as appropriate.

This rights-based approach emphasises the importance of not just collecting children’s views but actively listening to them and ensuring that their input is meaningfully acted upon, while avoiding the pitfalls of tokenism, manipulation or unsafe practices. Implementing such engagement requires careful consideration of safeguards regarding privacy, freedom of thought, and inclusive access for children with limited digital skills or access.

Here we provide a curated list of resources for conducting consultations with children, both through digital technologies and about the digital environment…(More)”.

How data can transform government in Latin America and the Caribbean


Article by William Maloney, Daniel Rogger, and Christian Schuster: “Governments across Latin America and the Caribbean are grappling with deep governance challenges that threaten progress and stability, including the need to improve efficiency, accountability and transparency.

Amid these obstacles, however, the region possesses a powerful, often underutilized asset: the administrative data it collects as part of its everyday operations.

When harnessed effectively using data analytics, this data has the potential to drive transformative change, unlock new opportunities for growth and help address some of the most pressing issues facing the region. It’s time to tap into this potential and use data to chart a path forward. To help governments make the most of the opportunities that this data presents, the World Bank has embarked on a decade-long project to synthesize the latest knowledge on how to measure and improve government performance. We have found that governments already have a lot of the data they need to dramatically improve public services while conserving scarce resources.

But it’s not enough to collect data. It must also be put to good use to improve decision making, design better public policy and strengthen public sector functioning. We call these tools and practices for repurposing government data government analytics…(More)”.

Announcing the Youth Engagement Toolkit for Responsible Data Reuse: An Innovative Methodology for the Future of Data-Driven Services


Blog by Elena Murray, Moiz Shaikh, and Stefaan G. Verhulst: “Young people seeking essential services — whether mental health support, education, or government benefits — often face a critical challenge: they are asked to share their data without having a say in how it is used or for what purpose. While the responsible use of data can help tailor services to better meet their needs and ensure that vulnerable populations are not overlooked, a lack of trust in data collection and usage can have the opposite effect.

When young people feel uncertain or uneasy about how their data is being handled, they may adopt privacy-protective behaviors — choosing not to seek services at all or withholding critical information out of fear of misuse. This risks deepening existing inequalities rather than addressing them.

To build trust, those designing and delivering services must engage young people meaningfully in shaping data practices. Understanding their concerns, expectations, and values is key to aligning data use with their preferences. But how can this be done effectively?

This question was at the heart of a year-long global collaboration through the NextGenData project, which brought together partners worldwide to explore solutions. Today, we are releasing a key deliverable of that project: the Youth Engagement Toolkit for Responsible Data Reuse.

Developed and piloted during the NextGenData project, the Toolkit describes an innovative methodology for engaging young people on responsible data reuse practices, to improve services that matter to them…(More)”.

Redesigning Public Organizations: From “what” to “how”


Essay by the Transition Collective: “Government organizations and their leaders are in a pinch. They are caught between pressures from politicians, citizens and increasingly complex external environments on the one hand — and from civil servants calling for new ways of working, thriving and belonging on the other hand. They have to enable meaningful, joined-up and efficient services for people, leveraging digital and physical resources, while building an attractive organizational culture. Indeed, the challenge is to build systems as human as the people they are intended to serve.

While this creates massive challenges for public sector organizations, this is also an opportunity to reimagine our institutions to meet the challenges of today and the future. To succeed, we must not only think about other models of organization — we also have to think of other ways of changing them.

Traditionally, we think of the organization as something static, a goal we arrive at or a fixed model we decide upon. If asked to describe their organization, most civil servants will point to an organigram — and more often than not it will consist of a number of boxes and lines, ordered in a hierarchy.

But in today’s world of complex challenges, accelerated frequency of change and dynamic interplay between the public sector and its surroundings, such a fixed model is less and less fit for the purposes it must fulfill. Not only does it prevent the collective intelligence and creativity of the organization’s members from being fully unleashed, it also does not allow for the speed and adaptability required by today’s turbulent environment. It does not allow for truly joined-up, meaningful human services.

Unfreezing the organization

Rather than thinking mainly about models and forms, we should think of organizational design as an act or a series of actions. In other words, we should think about the organization not just as a what but also as a how: less as a set of boxes describing a power hierarchy, and more as a set of living, organic roles and relationships. We need to thaw our organizations from their frozen state — and keep them warmer and more fluid.

In this piece, we suggest that many efforts to reimagine public sector organizations have failed because the challenge of transforming an organization has been underestimated. We draw on concrete experiences from working with international and Danish public sector institutions, in particular in health and welfare services.

We propose a set of four approaches which, taken together, can support the work of redesigning organizations to be more ambitious, free, human, creative and self-managing — and thus better suited to meet the ever more complex challenges they are faced with…(More)”.