OECD Report: “Protecting consumers when they are most vulnerable has long been a core focus of consumer policy. This report first discusses the nature and scale of consumer vulnerability in the digital age, including its evolving conceptualisation, the role of emerging digital trends, and implications for consumer policy. It finds that in the digital age, vulnerability may be experienced not only by some consumers, but increasingly by most, if not all, consumers. Accordingly, it sets out several measures to address the vulnerability of specific consumer groups and all consumers, and concludes with avenues for more research on the topic…(More)”.
Training Data for the Price of a Sandwich
Article by Stefan Baack: “Common Crawl (henceforth also referred to as CC) is an organization that has been essential to the technological advancements of generative AI, but is largely unknown to the broader public. This California nonprofit with only a handful of employees has crawled billions of web pages since 2008, and it makes this data available without charge via Amazon Web Services (AWS). Because of the enormous size and diversity (in terms of sources and formats) of the data, it has been pivotal as a source for training data for many AI builders. Generative AI in its current form would probably not be possible without Common Crawl, given that the vast majority of data used to train the original model behind OpenAI’s ChatGPT, the generative AI product that set off the current hype, came from it (Brown et al. 2020). The same is true for many models published since then.
Although pivotal, Common Crawl has so far received relatively little attention for its contribution to generative AI…(More)”.
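For readers who want to see what “available without charge” means in practice: Common Crawl exposes a per-crawl CDX index over plain HTTP. The sketch below is illustrative only (the crawl label CC-MAIN-2023-50 is one example among many, and the field names follow the publicly documented CDX API); it simply looks up where captures of a URL live inside the crawl’s WARC files.

```python
import json
import requests

# Query Common Crawl's public CDX index for captures of a URL.
# CC-MAIN-2023-50 is one crawl label; newer crawls use newer labels.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

resp = requests.get(INDEX, params={"url": "example.com", "output": "json"}, timeout=30)
resp.raise_for_status()

# The index returns one JSON object per line; each record says which WARC
# file in the crawl holds the capture and at what byte offset/length,
# so the page itself can then be fetched with an HTTP range request.
for line in resp.text.splitlines()[:3]:
    record = json.loads(line)
    print(record["timestamp"], record["filename"], record["offset"], record["length"])
```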
Outpacing Pandemics: Solving the First and Last Mile Challenges of Data-Driven Policy Making
Article by Stefaan Verhulst, Daniela Paolotti, Ciro Cattuto, and Alessandro Vespignani: “As society continues to emerge from the legacy of COVID-19, a dangerous complacency seems to be setting in. Amidst recurrent surges of cases, each serving as a reminder of the virus’s persistence, there is a noticeable decline in collective urgency to prepare for future pandemics. This situation represents not just a lapse in memory but a significant shortfall in our approach to pandemic preparedness. It dramatically underscores the urgent need to develop novel and sustainable approaches and responses and to reinvent how we approach public health emergencies.
Among the many lessons learned from previous infectious disease outbreaks, the potential and utility of data, and particularly non-traditional forms of data, are surely among the most important. Among other benefits, data has proven useful in providing intelligence and situational awareness in early stages of outbreaks, empowering citizens to protect their health and the health of vulnerable community members, advancing compliance with non-pharmaceutical interventions to mitigate societal impacts, tracking vaccination rates and the availability of treatment, and more. A variety of research now highlights the particular role played by open source data (and other non-traditional forms of data) in these initiatives.
Although multiple data sources are useful at various stages of outbreaks, we focus on two critical stages proven to be especially challenging: what we call the first mile and the last mile.
We argue that focusing on these two stages (or chokepoints) can help pandemic responses and rationalize resources. In particular, we highlight the role of Data Stewards at both stages and in overall pandemic response effectiveness…(More)”.
Data4Philanthropy
New Resource and Peer-to-Peer Learning Network: “Today’s global challenges have become increasingly complex and interconnected–from a global pandemic to the climate crisis. Solving these complex problems not only requires new solutions; it also demands new methods for developing solutions and making decisions. By responsibly analyzing and using data, we can transform how we understand and address societal issues and drive impact through our work.
However, many of these data-driven methods have not yet been adopted by the social sector or integrated across the grant-making cycle.
So we asked: how can innovations in data-driven methods and tools from multiple sectors transform decision making within philanthropy and improve the act of grant giving?
DATA4Philanthropy is a peer-to-peer learning network that aims to identify and advance the responsible use and value of data innovations across philanthropic functions.
Philanthropies can learn about the potential of data for their sector, whom to connect with to learn more about data, and how innovations in data-driven methods and tools are increasingly relevant across the cycle from strategy to grant making to impact.
Rapid changes in both the supply of data and the methods for using it can now be integrated across philanthropy, civil society and government decision-making cycles–from developing joint priorities to improving implementation efficacy to evaluating the impact of investments…(More)”

Nobody knows how to audit AI
Axios: “Some legislators and experts are pushing independent auditing of AI systems to minimize risks and build trust, Ryan reports.
Why it matters: Consumers don’t trust big tech to self-regulate and government standards may come slowly or never.
The big picture: Failure to manage risk and articulate values early in the development of an AI system can lead to problems ranging from biased outcomes from unrepresentative data to lawsuits alleging stolen intellectual property.
Driving the news: Sen. John Hickenlooper (D-Colo.) announced in a speech on Monday that he will push for the auditing of AI systems, because AI models are using our data “in ways we never imagined and certainly never consented to.”
- “We need qualified third parties to effectively audit generative AI systems,” Hickenlooper said. “We cannot rely on self-reporting alone.” We should “trust but verify” claims of compliance with federal laws and regulations, he said.
Catch up quick: The National Institute of Standards and Technology (NIST) developed an AI Risk Management Framework to help organizations think about and measure AI risks, but it does not certify or validate AI products.
- President Biden’s executive order on AI mandated that NIST expand its support for generative AI developers and “create guidance and benchmarks for evaluating and auditing AI capabilities,” especially in risky areas such as cybersecurity and bioweapons.
What’s happening: A growing range of companies provide services that evaluate whether AI models are complying with local regulations or promises made by their developers — but some AI companies remain committed to their own internal risk research and processes.
- NIST is only the “tip of the spear” in AI safety, Hickenlooper said. He now wants to establish criteria and a path to certification for third-party auditors.
The “Big Four” accounting firms — Deloitte, EY, KPMG and PwC — sense business opportunities in applying audit methodologies to AI systems, Nicola Morini Bianzino, EY’s global chief technology officer, tells Axios.
- Morini Bianzino cautions that AI audits might “look more like risk management for a financial institution, as opposed to audit as a certifying mark. Because, honestly, I don’t know technically how we would do that.”
- Laura Newinski, KPMG’s COO, tells Axios the firm is developing AI auditing services and “attestation about whether data sets are accurate and follow certain standards.”
Established players such as IBM and startups such as Credo provide AI governance dashboards that tell clients in real time where AI models could be causing problems — around data privacy, for example.
- Anthropic believes NIST should focus on “building a robust and standardized benchmark for generative AI systems” that all private AI companies can adhere to.
Market leader OpenAI announced in October that it’s creating a “risk-informed development policy” and has invited experts to apply to join its OpenAI Red Teaming Network.
- OpenAI also released a paper Jan. 31 purporting to examine whether its models increase the risk of bioweapons. The company’s answer: not really.
- NYU professor Gary Marcus argues the paper is misleading. “The more I look at the results, the more worried I become,” Marcus wrote in his blog. “Company white papers are not peer-reviewed articles,” he notes.
Yes, but: An AI audit industry without clear standards could be a recipe for confusion, both for corporate customers and consumers using AI…(More)”.
Data Is What Data Does: Regulating Based on Harm and Risk Instead of Sensitive Data
Paper by Daniel J. Solove: “Heightened protection for sensitive data is becoming quite trendy in privacy laws around the world. Originating in European Union (EU) data protection law and included in the EU’s General Data Protection Regulation, sensitive data singles out certain categories of personal data for extra protection. Commonly recognized special categories of sensitive data include racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, health, sexual orientation and sex life, and biometric and genetic data.
Although heightened protection for sensitive data appropriately recognizes that not all situations involving personal data should be protected uniformly, the sensitive data approach is a dead end. The sensitive data categories are arbitrary and lack any coherent theory for identifying them. The borderlines of many categories are so blurry that they are useless. Moreover, it is easy to use nonsensitive data as a proxy for certain types of sensitive data.
Personal data is akin to a grand tapestry, with different types of data interwoven to a degree that makes it impossible to separate out the strands. With Big Data and powerful machine learning algorithms, most nonsensitive data give rise to inferences about sensitive data. In many privacy laws, data giving rise to inferences about sensitive data is also protected as sensitive data. Arguably, then, nearly all personal data can be sensitive, and the sensitive data categories can swallow up everything. As a result, most organizations are currently processing a vast amount of data in violation of the laws.
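Solove’s proxy point is easy to demonstrate. The sketch below is a minimal illustration on entirely synthetic data: “nonsensitive” features are generated so they correlate with a “sensitive” label, much as purchase histories can correlate with health status, and a plain logistic regression recovers that label well above chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "nonsensitive" features: counts of purchases in three product
# categories. The "sensitive" label is correlated with them by construction,
# standing in for real-world proxy relationships.
n = 5000
X = rng.poisson(lam=[2.0, 1.0, 0.5], size=(n, 3)).astype(float)
logits = 0.9 * X[:, 0] - 0.6 * X[:, 1] + 1.2 * X[:, 2] - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Even a trivial model recovers the "sensitive" label well above chance,
# which is exactly why the nonsensitive/sensitive borderline is porous.
print(f"accuracy inferring the sensitive attribute: {model.score(X_test, y_test):.2f}")
```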
This Article argues that the problems with the sensitive data approach make it unworkable and counterproductive as well as expose a deeper flaw at the root of many privacy laws. These laws make a fundamental conceptual mistake—they embrace the idea that the nature of personal data is a sufficiently useful focal point for the law. But nothing meaningful for regulation can be determined solely by looking at the data itself. Data is what data does.
To be effective, privacy law must focus on harm and risk rather than on the nature of personal data…(More)”.
The story of the R number: How an obscure epidemiological figure took over our lives
Article by Gavin Freeguard: “Covid-19 did not only dominate our lives in April 2020. It also dominated the list of new words entered into the Oxford English Dictionary.
Alongside Covid-19 itself (noun, “An acute respiratory illness in humans caused by a coronavirus”), the vocabulary of the virus included “self-quarantine”, “social distancing”, “infodemic”, “flatten the curve”, “personal protective equipment”, “elbow bump”, “WFH” and much else. But nestled among this pantheon of new pandemic words was a number, one that would shape our conversations, our politics, our lives for the next 18 months like no other: “Basic reproduction number (R0): The average number of cases of an infectious disease arising by transmission from a single infected individual, in a population that has not previously encountered the disease.”
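To make the definition concrete: a crude way to track the reproduction number over time is to compare today’s new cases with new cases one serial interval earlier. The sketch below uses made-up numbers and a fixed five-day serial interval; real estimators (such as the Cori et al. method used by many health agencies) weight over a full serial-interval distribution rather than taking a single ratio.

```python
import numpy as np

# Toy daily case counts (synthetic, for illustration only).
cases = np.array([10, 12, 15, 20, 27, 35, 44, 52, 57, 58, 55, 49])

# Back-of-envelope estimator: with a ~5-day serial interval, R at day t is
# roughly new cases at day t divided by new cases one serial interval earlier.
serial_interval = 5
r_estimates = cases[serial_interval:] / cases[:-serial_interval]
print(np.round(r_estimates, 2))  # values above 1 mean the epidemic is growing
```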

“There have been many important figures in this pandemic,” wrote The Times in January 2021, “but one has come to tower over the rest: the reproduction rate. The R number, as everyone calls it, has been used by the government to justify imposing and lifting lockdowns. Indeed while there are many important numbers — gross domestic product, parliamentary majorities, interest rates — few can compete right now with R” (tinyurl.com/v7j6cth9).
Descriptions of it at the start of the pandemic made R the star of the disaster movie reality we lived through. And it wasn’t just a breakout star of the UK’s coronavirus press conferences; in Germany, (then) Chancellor Angela Merkel made the most of her scientific background to explain the meaning of R and its consequences to the public (tinyurl.com/mva7urw5).
But for others, the “obsession” (Professor Linda Bauld, University of Edinburgh) with “the pandemic’s misunderstood metric” (Nature: tinyurl.com/y3sr6n6m) has been “a distraction”, an “unhelpful focus”; as the University of Edinburgh’s Professor Mark Woolhouse told one parliamentary select committee, “we’ve created a monster”.
How did this epidemiological number come to dominate our discourse? How useful is it? And where does it come from?…(More)”.
Future-Proofing Transparency: Re-Thinking Public Record Governance For the Age of Big Data
Paper by Beatriz Botero Arcila: “Public records, public deeds, and even open data portals often include personal information that can now be easily accessed online. Yet, for all the recent attention given to informational privacy and data protection, scant literature exists on the governance of personal information that is available in public documents. This Article examines the critical issue of balancing privacy and transparency within public record governance in the age of Big Data.
With Big Data and powerful machine learning algorithms, personal information in public records can easily be used to infer sensitive data about people or aggregated to create a comprehensive personal profile of almost anyone. This information is public and open, however, for many good reasons: ensuring political accountability, facilitating democratic participation, enabling economic transactions, and combating illegal activities such as money laundering and terrorism financing, among others. Can the interest in record publicity coexist with the growing ease of deanonymizing and revealing sensitive information about individuals?
This Article addresses this question from a comparative perspective, focusing on US and EU access to information law. The Article shows that in the past, notwithstanding the presumptively public nature of records, privacy was protected in practice because most people would not trouble themselves to go to public offices to review them, and it was practically impossible to aggregate them to draw extensive profiles about people. Drawing from this insight and contemporary debates on data governance, this Article challenges the binary classification of data as either published or not and proposes a risk-based framework that re-inserts that natural friction into public record governance by leveraging techno-legal methods in how information is published and accessed…(More)”.
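The loss of that natural friction is easy to illustrate. In the synthetic sketch below, two individually innocuous “public record” extracts are joined on quasi-identifiers (name and address) to start forming a profile, a join that once required trips to separate public offices.

```python
import pandas as pd

# Two synthetic "public record" extracts; each looks innocuous on its own.
property_deeds = pd.DataFrame({
    "name": ["A. Rivera", "B. Chen"],
    "address": ["12 Oak St", "9 Elm Ave"],
    "purchase_price": [410_000, 265_000],
})
court_filings = pd.DataFrame({
    "name": ["A. Rivera"],
    "address": ["12 Oak St"],
    "case_type": ["debt collection"],
})

# The "natural friction" of paper records made this join laborious;
# with digitized open records it is a one-liner.
profile = property_deeds.merge(court_filings, on=["name", "address"], how="left")
print(profile)
```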
Creating Real Value: Skills Data in Learning and Employment Records
Article by Nora Heffernan: “Over the last few months, I’ve asked the same question to corporate leaders from human resources, talent acquisition, learning and development, and management backgrounds. The question is this:
What kind of data needs to be included in learning and employment records to be of greatest value to you in your role and to your organization?
By data, I’m talking about credential attainment, employment history, and, emphatically, verified skills data: showing at an individual level what a candidate or employee knows and is able to do.
The answer varies slightly by industry and position, but unanimously, the employers I’ve talked to would find the greatest value in utilizing learning and employment records that include verified skills data. There is no equivocation.
And as the national conversation about skills-first talent management continues to ramp up, with half of companies indicating they plan to eliminate degree requirements for some jobs in the next year, the call for verified skills data will only get louder. Employers value skills data for multiple reasons…(More)”.
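As a rough illustration of what such a record might contain, here is a hypothetical, deliberately simplified data structure. Real learning and employment record standards (for instance, those built on W3C Verifiable Credentials) are far richer; every field name below is an assumption for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical, deliberately simplified fields; real LER standards are richer.
@dataclass
class VerifiedSkill:
    name: str         # e.g. "SQL"
    evidence: str     # assessment, project, or credential backing the claim
    verified_by: str  # the organization that verified the skill

@dataclass
class LearningEmploymentRecord:
    holder: str
    credentials: list[str] = field(default_factory=list)         # credential attainment
    employment_history: list[str] = field(default_factory=list)  # employment history
    verified_skills: list[VerifiedSkill] = field(default_factory=list)

record = LearningEmploymentRecord(
    holder="Jane Doe",
    credentials=["BS, Statistics"],
    employment_history=["Analyst, Acme Co. (2021-2024)"],
    verified_skills=[VerifiedSkill("SQL", "scored assessment", "ExampleCert")],
)
print(len(record.verified_skills), "verified skill(s) on record")
```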
Defending the rights of refugees and migrants in the digital age
Primer by Amnesty International: “This is an introduction to the pervasive and rapid deployment of digital technologies in asylum and migration management systems across the globe, including in the United States, United Kingdom and the European Union. Defending the rights of refugees and migrants in the digital age highlights some of the key digital technology developments in asylum and migration management systems, in particular systems that process large quantities of data, and the human rights issues arising from their use. This introductory briefing aims to build our collective understanding of these emerging technologies and hopes to add to wider advocacy efforts to stem their harmful effects…(More)”.