Data Collaboratives as an enabling infrastructure for AI for Good


Blog Post by Stefaan G. Verhulst: “…The value of data collaboratives stems from the fact that the supply of and demand for data are generally widely dispersed — spread across government, the private sector, and civil society — and often poorly matched. This failure (a form of “market failure”) results in tremendous inefficiencies and lost potential. Much data that is released is never used. And much data that is actually needed is never made accessible to those who could productively put it to use.

Data collaboratives, when designed responsibly, are the key to addressing this shortcoming. They draw together otherwise siloed data and a dispersed range of expertise, helping match supply and demand, and ensuring that the correct institutions and individuals are using and analyzing data in ways that maximize the possibility of new, innovative social solutions.

Roadmap for Data Collaboratives

Despite their clear potential, the evidence base for data collaboratives is thin. There’s an absence of a systemic, structured framework that can be replicated across projects and geographies, and there’s a lack of clear understanding about what works, what doesn’t, and how best to maximize the potential of data collaboratives.

At the GovLab, we’ve been working to address these shortcomings. For emerging economies considering the use of data collaboratives, whether in pursuit of Artificial Intelligence or other solutions, we present six steps that can be considered in order to create data collaborative that are more systematic, sustainable, and responsible.

The need for making Data Collaboratives Systematic, Sustainable and Responsible
  • Increase Evidence and Awareness
  • Increase Readiness and Capacity
  • Address Data Supply and Demand Inefficiencies and Uncertainties
  • Establish a New “Data Stewards” Function
  • Develop and strengthen policies and governance practices for data collaboration

Safeguards for human studies can’t cope with big data


Nathaniel Raymond at Nature: “One of the primary documents aiming to protect human research participants was published in the US Federal Register 40 years ago this week. The Belmont Report was commissioned by Congress in the wake of the notorious Tuskegee syphilis study, in which researchers withheld treatment from African American men for years and observed how the disease caused blindness, heart disease, dementia and, in some cases, death.

The Belmont Report lays out core principles now generally required for human research to be considered ethical. Although technically governing only US federally supported research, its influence reverberates across academia and industry globally. Before academics with US government funding can begin research involving humans, their institutional review boards (IRBs) must determine that the studies comply with regulation largely derived from a document that was written more than a decade before the World Wide Web and nearly a quarter of a century before Facebook.

It is past time for a Belmont 2.0. We should not be asking those tasked with protecting human participants to single-handedly identify and contend with the implications of the digital revolution. Technological progress, including machine learning, data analytics and artificial intelligence, has altered the potential risks of research in ways that the authors of the first Belmont report could not have predicted. For example, Muslim cab drivers can be identified from patterns indicating that they stop to pray; the Ugandan government can try to identify gay men from their social-media habits; and researchers can monitor and influence individuals’ behaviour online without enrolling them in a study.

Consider the 2014 Facebook ‘emotional contagion study’, which manipulated users’ exposure to emotional content to evaluate effects on mood. That project, a collaboration with academic researchers, led the US Department of Health and Human Services to launch a long rule-making process that tweaked some regulations governing IRBs.

A broader fix is needed. Right now, data science overlooks risks to human participants by default….(More)”.

Synthetic data: innovation for public good


Blog Post by Catrin Cheung: “What is synthetic data, and how can it be used for public good? ….Synthetic data are artificially generated data that have the look and structure of real data, but do not contain any information on individuals. They also contain more general characteristics that are used to find patterns in the data.

They are modelled on real data, but designed in a way which safeguards the legal, ethical and confidentiality requirements of the original data. Given their resemblance to the original data, synthetic data are useful in a range of situations, for example when data is sensitive or missing. They are used widely as teaching materials, to test code or mathematical models, or as training data for machine learning models….

There’s currently a wealth of research emerging from the health sector, as the nature of data published is often sensitive. Public Health England have synthesised cancer data which can be freely accessed online. NHS Scotland are making advances in cutting-edge machine learning methods such as Variational Auto Encoders and Generative Adversarial Networks (GANs).

There is growing interest in this area of research, and its influence extends beyond the statistical community. While the Data Science Campus have also used GANs to generate synthetic data in their latest research, its power is not limited to data generation. It can be trained to construct features almost identical to our own across imagery, music, speech and text. In fact, GANs have been used to create a painting of Edmond de Belamy, which sold for $432,500 in 2018!

Within the ONS, a pilot to create synthetic versions of securely held Labour Force Survey data has been carried out using a package in R called “synthpop”. This synthetic dataset can be shared with approved researchers to de-bug codes, prior to analysis of data held in the Secure Research Service….

Although much progress is done in this field, one challenge that persists is guaranteeing the accuracy of synthetic data. We must ensure that the statistical properties of synthetic data match properties of the original data.

Additional features, such as the presence of non-numerical data, add to this difficult task. For example, if something is listed as “animal” and can take the possible values “dog”,”cat” or “elephant”, it is difficult to convert this information into a format suitable for precise calculations. Furthermore, given that datasets have different characteristics, there is no straightforward solution that can be applied to all types of data….particular focus was also placed on the use of synthetic data in the field of privacy, following from the challenges and opportunities identified by the National Statistician’s Quality Review of privacy and data confidentiality methods published in December 2018….(More)”.

Digital Data for Development


LinkedIn: “The World Bank Group and LinkedIn share a commitment to helping workers around the world access opportunities that make good use of their talents and skills. The two organizations have come together to identify new ways that data from LinkedIn can help inform policymakers who seek to boost employment and grow their economies.

This site offers data and automated visuals of industries where LinkedIn data is comprehensive enough to provide an emerging picture. The data complements a wealth of official sources and can offer a more real-time view in some areas particularly for new, rapidly changing digital and technology industries.

The data shared in the first phase of this collaboration focuses on 100+ countries with at least 100,000 LinkedIn members each, distributed across 148 industries and 50,000 skills categories. In the near term, it will help World Bank Group teams and government partners pinpoint ways that developing countries could stimulate growth and expand opportunity, especially as disruptive technologies reshape the economic landscape. As LinkedIn’s membership and digital platforms continue to grow in developing countries, this collaboration will assess the possibility to expand the sectors and countries covered in the next annual update.

This site offers downloadable data, visualizations, and an expanding body of insights and joint research from the World Bank Group and LinkedIn. The data is being made accessible as a public good, though it will be most useful for policy analysts, economists, and researchers….(More)”.

Statistics Estonia to coordinate data governance


Article by Miriam van der Sangen at CBS: “In 2018, Statistics Estonia launched a new strategy for the period 2018-2022. This strategy addresses the organisation’s aim to produce statistics more quickly while minimising the response burden on both businesses and citizens. Another element in the strategy is addressing the high expectations in Estonian society regarding the use of data. ‘We aim to transform Statistics Estonia into a national data agency,’ says Director General Mägi. ‘This means our role as a producer of official statistics will be enlarged by data governance responsibilities in the public sector. Taking on such responsibilities requires a clear vision of the whole public data ecosystem and also agreement to establish data stewards in most public sector institutions.’…

the Estonian Parliament passed new legislation that effectively expanded the number of official tasks for Statistics Estonia. Mägi elaborates: ‘Most importantly, we shall be responsible for coordinating data governance. The detailed requirements and conditions of data governance will be specified further in the coming period.’ Under the new Act, Statistics Estonia will also have more possibilities to share data with other parties….

Statistics Estonia is fully committed to producing statistics which are based on big data. Mägi explains: ‘At the moment, we are actively working on two big data projects. One project involves the use of smart electricity meters. In this project, we are looking into ways to visualise business and household electricity consumption information. The second project involves web scraping of prices and enterprise characteristics. This project is still in an initial phase, but we can already see that the use of web scraping can improve the efficiency of our production process.’ We are aiming to extend the web scraping project by also identifying e-commerce and innovation activities of enterprises.’

Yet another ambitious goal for Statistics Estonia lies in the field of data science. ‘Similarly to Statistics Netherlands, we established experimental statistics and data mining activities years ago. Last year, we developed a so-called think-tank service, providing insights from data into all aspects of our lives. Think of birth, education, employment, et cetera. Our key clients are the various ministries, municipalities and the private sector. The main aim in the coming years is to speed up service time thanks to visualisations and data lake solutions.’ …(More)”.

New Data-Driven Map Shows Spread of Participation in Democracy


Loren Peabody at the Participatory Budgeting Project: “As we celebrate the first 30 years of participatory budgeting (PB) in the world and the first 10 years of the Participatory Budgeting Project (PBP), we reflect on how far and wide PB has spread–and how it continues to grow! We’re thrilled to introduce a new tool to help us look back as we plan for the next 30+ years of PB. And so we’re introducing a map of PB across the U.S. and Canada. Each dot on the map represents a place where democracy has been deepened by bringing people together to decide together how to invest public resources in their community….

This data sheds light on larger questions, such as what is the relationship between the size of PB budgets and the number of people who participate? Looking at PBP data on processes in counties, cities, and urban districts, we find a positive correlation between the size of the PB budget per person and the number of people who take part in a PB vote (r=.22, n=245). In other words, where officials make a stronger commitment to funding PB, more people take part in the process–all the more reason to continue growing PB!….(More)”.

Open Justice: Public Entrepreneurs Learn to Use New Technology to Increase the Efficiency, Legitimacy, and Effectiveness of the Judiciary


The GovLab: “Open justice is a growing movement to leverage new technologies – including big data, digital platforms, blockchain and more – to improve legal systems by making the workings of courts easier to understand, scrutinize and improve. Through the use of new technology, open justice innovators are enabling greater efficiency, fairness, accountability and a reduction in corruption in the third branch of government. For example, the open data portal ‘Atviras Teismas’ Lithuania (translated ‘open court’ Lithuania) is a platform for monitoring courts and judges through performance metrics’. This portal serves to make the courts of Lithuania transparent and benefits both courts and citizens by presenting comparative data on the Lithuanian Judiciary.

To promote more Open Justice projects, the GovLab in partnership with the Electoral Tribunal of the Federal Judiciary (TEPJF) of Mexico, launched an historic, first of its kind, online course on Open Justice. Designed primarily for lawyers, judges, and public officials – but also intended to appeal to technologists, and members of the public – the Spanish-language course consists of 10 modules.

Each of the ten modules comprises:

  1. A short video-based lecture
  2. An original Open Justice reader
  3. Associated additional readings
  4. A self-assessment quiz
  5. A demonstration of a platform or tool
  6. An interview with a global practitioner

Among those featured in the interviews are Felipe Moreno of Jusbrasil, Justin Erlich of OpenJustice California, Liam Hayes of Aurecon, UK, Steve Ghiassi of Legaler, Australia, and Sara Castillo of Poder Judicial, Chile….(More)”.

Facebook’s AI team maps the whole population of Africa


Devin Coldewey at TechCrunch: “A new map of nearly all of Africa shows exactly where the continent’s 1.3 billion people live, down to the meter, which could help everyone from local governments to aid organizations. The map joins others like it from Facebook  created by running satellite imagery through a machine learning model.

It’s not exactly that there was some mystery about where people live, but the degree of precision matters. You may know that a million people live in a given region, and that about half are in the bigger city and another quarter in assorted towns. But that leaves hundreds of thousands only accounted for in the vaguest way.

Fortunately, you can always inspect satellite imagery and pick out the spots where small villages and isolated houses and communities are located. The only problem is that Africa is big. Really big. Manually labeling the satellite imagery even from a single mid-sized country like Gabon or Malawi would take a huge amount of time and effort. And for many applications of the data, such as coordinating the response to a natural disaster or distributing vaccinations, time lost is lives lost.

Better to get it all done at once then, right? That’s the idea behind Facebook’s Population Density Maps project, which had already mapped several countries over the last couple of years before the decision was made to take on the entire African continent….

“The maps from Facebook ensure we focus our volunteers’ time and resources on the places they’re most needed, improving the efficacy of our programs,” said Tyler Radford, executive director of the Humanitarian OpenStreetMap Team, one of the project’s partners.

The core idea is straightforward: Match census data (how many people live in a region) with structure data derived from satellite imagery to get a much better idea of where those people are located.

“With just the census data, the best you can do is assume that people live everywhere in the district – buildings, fields, and forests alike,” said Facebook engineer James Gill. “But once you know the building locations, you can skip the fields and forests and only allocate the population to the buildings. This gives you very detailed 30 meter by 30 meter population maps.”

That’s several times more accurate than any extant population map of this size. The analysis is done by a machine learning agent trained on OpenStreetMap data from all over the world, where people have labeled and outlined buildings and other features.

First the huge amount of Africa’s surface that obviously has no structure had to be removed from consideration, reducing the amount of space the team had to evaluate by a factor of a thousand or more. Then, using a region-specific algorithm (because things look a lot different in coastal Morocco than they do in central Chad), the model identifies patches that contain a building….(More)”.

Rethink government with AI


Helen Margetts and Cosmina Dorobantu at Nature: “People produce more than 2.5 quintillion bytes of data each day. Businesses are harnessing these riches using artificial intelligence (AI) to add trillions of dollars in value to goods and services each year. Amazon dispatches items it anticipates customers will buy to regional hubs before they are purchased. Thanks to the vast extractive might of Google and Facebook, every bakery and bicycle shop is the beneficiary of personalized targeted advertising.

But governments have been slow to apply AI to hone their policies and services. The reams of data that governments collect about citizens could, in theory, be used to tailor education to the needs of each child or to fit health care to the genetics and lifestyle of each patient. They could help to predict and prevent traffic deaths, street crime or the necessity of taking children into care. Huge costs of floods, disease outbreaks and financial crises could be alleviated using state-of-the-art modelling. All of these services could become cheaper and more effective.

This dream seems rather distant. Governments have long struggled with much simpler technologies. Flagship policies that rely on information technology (IT) regularly flounder. The Affordable Care Act of former US president Barack Obama nearly crumbled in 2013 when HealthCare.gov, the website enabling Americans to enrol in health insurance plans, kept crashing. Universal Credit, the biggest reform to the UK welfare state since the 1940s, is widely regarded as a disaster because of its failure to pay claimants properly. It has also wasted £837 million (US$1.1 billion) on developing one component of its digital system that was swiftly decommissioned. Canada’s Phoenix pay system, introduced in 2016 to overhaul the federal government’s payroll process, has remunerated 62% of employees incorrectly in each fiscal year since its launch. And My Health Record, Australia’s digital health-records system, saw more than 2.5 million people opt out by the end of January this year over privacy, security and efficacy concerns — roughly 1 in 10 of those who were eligible.

Such failures matter. Technological innovation is essential for the state to maintain its position of authority in a data-intensive world. The digital realm is where citizens live and work, shop and play, meet and fight. Prices for goods are increasingly set by software. Work is mediated through online platforms such as Uber and Deliveroo. Voters receive targeted information — and disinformation — through social media.

Thus the core tasks of governments, such as enforcing regulation, setting employment rights and ensuring fair elections require an understanding of data and algorithms. Here we highlight the main priorities, drawn from our experience of working with policymakers at The Alan Turing Institute in London….(More)”.

Innovation Meets Citizen Science


Caroline Nickerson at SciStarter: “Citizen science has been around as long as science, but innovative approaches are opening doors to more and deeper forms of public participation.

Below, our editors spotlight a few projects that feature new approaches, novel research, or low-cost instruments. …

Colony B: Unravel the secrets of microscopic life! Colony B is a mobile gaming app developed at McGill University that enables you to contribute to research on microbes. Collect microbes and grow your colony in a fast-paced puzzle game that advances important scientific research.

AirCasting: AirCasting is an open-source, end-to-end solution for collecting, displaying, and sharing health and environmental data using your smartphone. The platform consists of wearable sensors, including a palm-sized air quality monitor called the AirBeam, that detect and report changes in your environment. (Android only.)

LingoBoingo: Getting computers to understand language requires large amounts of linguistic data and “correct” answers to language tasks (what researchers call “gold standard annotations”). Simply by playing language games online, you can help archive languages and create the linguistic data used by researchers to improve language technologies. These games are in English, French, and a new “multi-lingual” category.

TreeSnap: Help our nation’s trees and protect human health in the process. Invasive diseases and pests threaten the health of America’s forests. With the TreeSnap app, you can record the location and health of particular tree species–those unharmed by diseases that have wiped out other species. Scientists then use the collected information to locate candidates for genetic sequencing and breeding programs. Tag trees you find in your community, on your property, or out in the wild to help scientists understand forest health….(More)”.