We’ll soon know the exact air pollution from every power plant in the world. That’s huge.


David Roberts at Vox: “A nonprofit artificial intelligence firm called WattTime is going to use satellite imagery to precisely track the air pollution (including carbon emissions) coming out of every single power plant in the world, in real time. And it’s going to make the data public.

This is a very big deal. Poor monitoring and gaming of emissions data have made it difficult to enforce pollution restrictions on power plants. This system promises to effectively eliminate poor monitoring and gaming of emissions data….

The plan is to use data from satellites that make their data publicly available (like the European Union’s Copernicus network and the US Landsat network), as well as data from a few private companies that charge for their data (like Digital Globe). The data will come from a variety of sensors operating at different wavelengths, including thermal infrared that can detect heat.

The images will be processed by various algorithms to detect signs of emissions. It has already been demonstrated that a great deal of pollution can be tracked simply through identifying visible smoke. WattTime says it can also use infrared imaging to identify heat from smokestack plumes or cooling-water discharge. Sensors that can directly track NO2 emissions are in development, according to WattTime executive director Gavin McCormick.

Between visible smoke, heat, and NO2, WattTime will be able to derive exact, real-time emissions information, including information on carbon emissions, for every power plant in the world. (McCormick says the data may also be used to derive information about water pollutants like nitrates or mercury.)

Google.org, Google’s philanthropic wing, is getting the project off the ground (pardon the pun) with a $1.7 million grant; it was selected through the Google AI Impact Challenge….(More)”.
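
To make the approach concrete, here is a minimal, hypothetical sketch of the kind of signal-to-estimate step described above: a thermal-band heat anomaly over a known plant footprint is scaled by a per-plant calibration factor into a rough emissions figure. The function, values, and calibration constant are illustrative assumptions, not WattTime’s actual algorithms.

```python
import numpy as np

def estimate_emissions_rate(thermal_tile: np.ndarray,
                            background_temp_k: float,
                            emissions_factor: float) -> float:
    """Very rough CO2 rate (tonnes/hour) for one satellite pass over one plant.

    thermal_tile: brightness temperatures (Kelvin) over the plant footprint.
    background_temp_k: typical temperature of the surrounding land or water.
    emissions_factor: hypothetical per-plant calibration constant, assumed to be
        fitted against historical reported emissions.
    """
    # Heat anomaly: how much hotter the footprint is than its surroundings.
    anomaly = np.clip(thermal_tile - background_temp_k, 0.0, None)
    activity_proxy = anomaly.mean()
    return activity_proxy * emissions_factor

# Toy example: a 5x5 tile with a warm plume over a 290 K background.
tile = np.full((5, 5), 290.0)
tile[2, 2:4] = 305.0  # hot pixels near the stack / cooling-water discharge
print(round(estimate_emissions_rate(tile, 290.0, 12.5), 2), "t CO2/hr (illustrative)")
```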

The EU Wants to Build One of the World’s Largest Biometric Databases. What Could Possibly Go Wrong?


Grace Dobush at Fortune: “China and India have built the world’s largest biometric databases, but the European Union is about to join the club.

The Common Identity Repository (CIR) will consolidate biometric data on almost all visitors and migrants to the bloc, as well as some EU citizens—connecting existing criminal, asylum, and migration databases and integrating new ones. It has the potential to affect hundreds of millions of people.

The plan for the database, first proposed in 2016 and approved by the EU Parliament on April 16, was sold as a way to better track and monitor terrorists, criminals, and unauthorized immigrants.

The system will initially target the fingerprints and identity data of visitors and immigrants, and represents the first step towards building a truly EU-wide citizen database. At the same time, though, critics argue its mere existence will increase the potential for hacks, leaks, and law enforcement abuse of the information….

The European Parliament and the European Council have promised to address those concerns, through “proper safeguards” to protect personal privacy and to regulate officers’ access to data. In 2016, they passed a law regarding law enforcement’s access to personal data, alongside the General Data Protection Regulation (GDPR).

But total security is a tall order. Germany is currently dealing with multiple instances of police officers allegedly leaking personal information to far-right groups. Meanwhile, a Swedish hacker went to prison for hacking into Denmark’s public records system in 2012 and dumping online the personal data of hundreds of thousands of citizens and migrants….(More)”.


Facebook will open its data up to academics to see how it impacts elections


MIT Technology Review: “More than 60 researchers from 30 institutions will get access to Facebook user data to study its impact on elections and democracy, and how it’s used by advertisers and publishers.

A vast trove: Facebook will let academics see which websites its users linked to from January 2017 to February 2019. Notably, that means they won’t be able to look at the platform’s impact on the US presidential election in 2016, or on the Brexit referendum in the UK in the same year.

Despite this slightly glaring omission, it’s still hard to wrap your head around the scale of the data that will be shared, given that Facebook is used by 1.6 billion people every day. That’s more people than live in all of China, the most populous country on Earth. It will be one of the largest data sets on human behavior online to ever be released.

The process: Facebook didn’t pick the researchers. They were chosen by the Social Science Research Council, a US nonprofit. Facebook has been working on this project for over a year, as it tries to balance research interests against user privacy and confidentiality.

Privacy: In a blog post, Facebook said it will use a number of statistical techniques to make sure the data set can’t be used to identify individuals. Researchers will be able to access it only via a secure portal that uses a VPN and two-factor authentication, and there will be limits on the number of queries they can each run….(More)”.
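
Facebook’s post does not spell out the statistical techniques, but a common way to keep aggregate releases from identifying individuals is to add calibrated noise to query results, as in differential privacy. The sketch below is a minimal, hypothetical illustration of that general idea (a Laplace mechanism on a count query), not Facebook’s actual implementation.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release an aggregate count with Laplace noise added.

    With sensitivity 1 (one person changes the count by at most 1), noise drawn
    with scale sensitivity/epsilon gives epsilon-differential privacy for this query.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: number of users who shared a given URL, released with epsilon = 0.5.
print(noisy_count(true_count=12345, epsilon=0.5))
```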

Nowcasting the Local Economy: Using Yelp Data to Measure Economic Activity


Paper by Edward L. Glaeser, Hyunjin Kim and Michael Luca: “Can new data sources from online platforms help to measure local economic activity? Government datasets from agencies such as the U.S. Census Bureau provide the standard measures of economic activity at the local level. However, these statistics typically appear only after multi-year lags, and the public-facing versions are aggregated to the county or ZIP code level. In contrast, crowdsourced data from online platforms such as Yelp are often contemporaneous and geographically finer than official government statistics. Glaeser, Kim, and Luca present evidence that Yelp data can complement government surveys by measuring economic activity in close to real time, at a granular level, and at almost any geographic scale. Changes in the number of businesses and restaurants reviewed on Yelp can predict changes in the number of overall establishments and restaurants in County Business Patterns. An algorithm using contemporaneous and lagged Yelp data can explain 29.2 percent of the residual variance after accounting for lagged CBP data, in a testing sample not used to generate the algorithm. The algorithm is more accurate for denser, wealthier, and more educated ZIP codes….(More)”.
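
As a rough illustration of the nowcasting setup (not the authors’ exact specification), the sketch below fits a model that predicts changes in CBP establishment counts from lagged CBP changes plus contemporaneous and lagged Yelp changes, and reports accuracy on a held-out sample; the data here are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder panel: the real inputs would be ZIP-level changes in
# County Business Patterns (CBP) establishment counts and in Yelp listings/reviews.
rng = np.random.default_rng(0)
n = 2000
lagged_cbp_change = rng.normal(size=n)
yelp_change_now = 0.6 * lagged_cbp_change + rng.normal(scale=0.8, size=n)
yelp_change_lagged = rng.normal(size=n)
cbp_change = 0.5 * lagged_cbp_change + 0.4 * yelp_change_now + rng.normal(scale=0.5, size=n)

X = np.column_stack([lagged_cbp_change, yelp_change_now, yelp_change_lagged])
X_train, X_test, y_train, y_test = train_test_split(X, cbp_change, random_state=0)

# Fit on the training sample, evaluate on the held-out testing sample.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("Held-out R^2:", round(r2_score(y_test, model.predict(X_test)), 3))
```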

See all papers presented at the NBER Conference on Big Data for 21st Century Economic Statistics here.

A weather tech startup wants to do forecasts based on cell phone signals


Douglas Heaven at MIT Technology Review: “On 14 April, Chicago got more snow than it had seen in nearly 40 years. Weather services didn’t see it coming: they forecast one or two inches at worst. But when the late winter snowstorm came it caused widespread disruption, dumping enough snow that airlines had to cancel more than 700 flights across all of the city’s airports.

One airline did better than most, however. Instead of relying on the usual weather forecasts, it listened to ClimaCell – a Boston-based “weather tech” start-up that claims it can predict the weather more accurately than anyone else. According to the company, its correct forecast of the severity of the coming snowstorm allowed the airline to better manage its schedules and minimize losses due to delays and diversions. 

Founded in 2015, ClimaCell has spent the last few years developing the technology and business relationships that allow it to tap into millions of signals from cell phones and other wireless devices around the world. It uses the quality of these signals as a proxy for local weather conditions, such as precipitation and air quality. It also analyzes images from street cameras. It is offering a weather forecasting service to subscribers that it claims is 60 percent more accurate than that of existing providers, such as NOAA.

The internet of weather

The approach makes sense, in principle. Other forecasters use proxies, such as radar signals. But by using information from millions of everyday wireless devices, ClimaCell claims it has a far more fine-grained view of most of the globe than other forecasters get from the existing network of weather sensors, which range from ground-based devices to satellites. (ClimaCell taps into these, too.)…(More)”.
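
ClimaCell’s methods are proprietary, but the underlying physical idea — rain attenuates microwave signals, so excess attenuation on a wireless link can be inverted into a rain-rate estimate via a power-law relation — is well established in the research literature. The sketch below illustrates that inversion with placeholder coefficients; it is not ClimaCell’s algorithm.

```python
def rain_rate_from_attenuation(excess_attenuation_db: float,
                               path_length_km: float,
                               k: float = 0.12,
                               alpha: float = 1.06) -> float:
    """Invert the power-law relation A = k * R**alpha to estimate rain rate.

    A is specific attenuation in dB/km and R is rain rate in mm/hr; k and alpha
    depend on link frequency and polarization, and the defaults here are
    illustrative placeholders rather than calibrated constants.
    """
    specific_attenuation = excess_attenuation_db / path_length_km  # dB per km
    return (specific_attenuation / k) ** (1.0 / alpha)

# Example: 3 dB of rain-induced attenuation observed over a 5 km link.
print(round(rain_rate_from_attenuation(3.0, 5.0), 1), "mm/hr (illustrative)")
```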

Introducing the Contractual Wheel of Data Collaboration


Blog by Andrew Young and Stefaan Verhulst: “Earlier this year we launched the Contracts for Data Collaboration (C4DC) initiative — an open collaborative with charter members from The GovLab, UN SDSN Thematic Research Network on Data and Statistics (TReNDS), University of Washington and the World Economic Forum. C4DC seeks to address the inefficiencies of developing contractual agreements for public-private data collaboration by developing and making available a shared repository of relevant contractual clauses taken from existing legal agreements, informing and guiding those seeking to establish a data collaborative. Today TReNDS published “Partnerships Founded on Trust,” a brief capturing some initial findings from the C4DC initiative.

The Contractual Wheel of Data Collaboration [beta] — Stefaan G. Verhulst and Andrew Young, The GovLab

As part of the C4DC effort, and to support Data Stewards in the private sector and decision-makers in the public and civil sectors seeking to establish Data Collaboratives, The GovLab developed the Contractual Wheel of Data Collaboration [beta]. The Wheel seeks to capture key elements involved in data collaboration while demystifying contracts and moving beyond the type of legalese that can create confusion and barriers to experimentation.

The Wheel was developed based on an assessment of existing legal agreements, engagement with The GovLab-facilitated Data Stewards Network, and analysis of the key elements of our Data Collaboratives Methodology. It features 22 legal considerations organized across 6 operational categories that can act as a checklist for the development of a legal agreement between parties participating in a Data Collaborative:…(More)”.

San Francisco teams up with Uber, location tracker on 911 call responses


Gwendolyn Wu at San Francisco Chronicle: “In an effort to shorten emergency response times in San Francisco, the city announced on Monday that it is now using location data from RapidSOS, a New York-based public safety tech company, and ride-hailing company Uber to improve location coordinates generated from 911 calls.

An increasing number of emergency calls are made from cell phones, said Michelle Cahn, RapidSOS’s director of community engagement. The new technology should allow emergency responders to narrow down the location of such callers and replace existing 911 technology that was built for landlines and tied to home addresses.

Cell phone location data currently given to dispatchers when they receive a 911 call can be vague, especially if the person can’t articulate their exact location, according to the Department of Emergency Management.

But if a dispatcher can narrow down where the emergency is happening, that increases the chance of a timely response and better result, Cahn said.

“It doesn’t matter what’s going on with the emergency if we don’t know where it is,” she said.

RapidSOS shares its location data — collected by Apple and Google for their in-house map apps — with public safety agencies free of charge. San Francisco’s 911 call center adopted the data service in September 2018.

The Federal Communications Commission estimates agencies could save as many as 10,000 lives a year if they shave a minute off response times. Federal officials issued new rules to improve wireless 911 calls in 2015, asking mobile carriers to provide more accurate locations to call centers. Carriers are required to find a way to triangulate the caller’s location within 50 meters — a much smaller radius than the eight-block area that city officials were initially presented with in October when a caller dialed 911…(More)”.
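
As a simple illustration of what the accuracy requirement means in practice, the hypothetical sketch below checks whether a device-supplied location fix meets the FCC’s 50-meter horizontal-accuracy benchmark and compares it with the legacy eight-block radius (assumed here to be roughly 100 meters per block); it is not RapidSOS’s API.

```python
from dataclasses import dataclass

FCC_HORIZONTAL_ACCURACY_M = 50.0         # benchmark cited in the article
LEGACY_EIGHT_BLOCK_RADIUS_M = 8 * 100.0  # assumption: roughly 100 m per block

@dataclass
class LocationFix:
    latitude: float
    longitude: float
    uncertainty_m: float  # horizontal uncertainty radius reported by the device

def meets_fcc_benchmark(fix: LocationFix) -> bool:
    """True if the device-supplied fix is at least as tight as the 50 m benchmark."""
    return fix.uncertainty_m <= FCC_HORIZONTAL_ACCURACY_M

fix = LocationFix(latitude=37.7793, longitude=-122.4193, uncertainty_m=22.0)
print("Usable for dispatch:", meets_fcc_benchmark(fix))
print("Improvement vs. legacy radius: %.0fx" % (LEGACY_EIGHT_BLOCK_RADIUS_M / fix.uncertainty_m))
```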

Characterizing the Biomedical Data-Sharing Landscape


Paper by Angela G. Villanueva et al: “Advances in technologies and biomedical informatics have expanded capacity to generate and share biomedical data. With a lens on genomic data, we present a typology characterizing the data-sharing landscape in biomedical research to advance understanding of the key stakeholders and existing data-sharing practices. The typology highlights the diversity of data-sharing efforts and facilitators and reveals how novel data-sharing efforts are challenging existing norms regarding the role of individuals whom the data describe.

Technologies such as next-generation sequencing have dramatically expanded capacity to generate genomic data at a reasonable cost, while advances in biomedical informatics have created new tools for linking and analyzing diverse data types from multiple sources. Further, many research-funding agencies now mandate that grantees share data. The National Institutes of Health’s (NIH) Genomic Data Sharing (GDS) Policy, for example, requires NIH-funded research projects generating large-scale human genomic data to share those data via an NIH-designated data repository such as the Database of Genotypes and Phenotypes (dbGaP). Another example is the Parent Project Muscular Dystrophy, a non-profit organization that requires applicants to propose a data-sharing plan and takes an applicant’s history of data sharing into account.

The flow of data to and from different projects, institutions, and sectors is creating a medical information commons (MIC), a data-sharing ecosystem consisting of networked resources sharing diverse health-related data from multiple sources for research and clinical uses. This concept aligns with the 2018 NIH Strategic Plan for Data Science, which uses the term “data ecosystem” to describe “a distributed, adaptive, open system with properties of self-organization, scalability and sustainability” and proposes to “modernize the biomedical research data ecosystem” by funding projects such as the NIH Data Commons. Consistent with Elinor Ostrom’s discussion of nested institutional arrangements, an MIC is both singular and plural and may describe the ecosystem as a whole or individual components contributing to the ecosystem. Thus, resources like the NIH Data Commons with its associated institutional arrangements are MICs, and also form part of the larger MIC that encompasses all such resources and arrangements.

Although many research funders incentivize data sharing, in practice, progress in making biomedical data broadly available to maximize its utility is often hampered by a broad range of technical, legal, cultural, normative, and policy challenges that include achieving interoperability, changing the standards for academic promotion, and addressing data privacy and security concerns. Addressing these challenges requires multi-stakeholder involvement. To identify relevant stakeholders and advance understanding of the contributors to an MIC, we conducted a landscape analysis of existing data-sharing efforts and facilitators. Our work builds on typologies describing various aspects of data sharing that focused on biobanks, research consortia, or where data reside (e.g., degree of data centralization). While these works are informative, we aimed to capture the biomedical data-sharing ecosystem with a wider scope. Understanding the components of an MIC ecosystem and how they interact, and identifying emerging trends that test existing norms (such as norms respecting the role of the individuals whom the data describe), is essential to fostering effective practices, policies and governance structures, guiding resource allocation, and promoting the overall sustainability of the MIC….(More)”

The Importance of Data Access Regimes for Artificial Intelligence and Machine Learning


JRC Digital Economy Working Paper by Bertin Martens: “Digitization triggered a steep drop in the cost of information. The resulting data glut created a bottleneck because human cognitive capacity is unable to cope with large amounts of information. Artificial intelligence and machine learning (AI/ML) triggered a similar drop in the cost of machine-based decision-making and helps in overcoming this bottleneck. Substantial change in the relative price of resources puts pressure on ownership and access rights to these resources. This explains pressure on access rights to data. ML thrives on access to big and varied datasets. We discuss the implications of access regimes for the development of AI in its current form of ML. The economic characteristics of data (non-rivalry, economies of scale and scope) favour data aggregation in big datasets. Non-rivalry implies the need for exclusive rights in order to incentivise data production when it is costly. The balance between access and exclusion is at the centre of the debate on data regimes. We explore the economic implications of several modalities for access to data, ranging from exclusive monopolistic control to monopolistic competition and free access. Regulatory intervention may push the market beyond voluntary exchanges, either towards more openness or reduced access. This may generate private costs for firms and individuals. Society can choose to do so if the social benefits of this intervention outweigh the private costs.

We briefly discuss the main EU legal instruments that are relevant for data access and ownership, including the General Data Protection Regulation (GDPR) that defines the rights of data subjects with respect to their personal data and the Database Directive (DBD) that grants ownership rights to database producers. These two instruments leave a wide legal no-man’s land where data access is ruled by bilateral contracts and Technical Protection Measures that give exclusive control to de facto data holders, and by market forces that drive access, trade and pricing of data. The absence of exclusive rights might facilitate data sharing and access or it may result in a segmented data landscape where data aggregation for ML purposes is hard to achieve. It is unclear if incompletely specified ownership and access rights maximize the welfare of society and facilitate the development of AI/ML…(More)”

Data Trusts: More Data than Trust? The Perspective of the Data Subject in the Face of a Growing Problem


Paper by Christine Rinik: “In the recent report, Growing the Artificial Intelligence Industry in the UK, Hall and Pesenti suggest the use of a ‘data trust’ to facilitate data sharing. Whilst government and corporations are focusing on their need to facilitate data sharing, the perspective of many individuals is that too much data is being shared. The issue is not only about data, but about power. The individual does not often have a voice when issues relating to data sharing are tackled. Regulators can cite the ‘public interest’ when data governance is discussed, but the individual’s interests may diverge from that of the public.

This paper considers the data subject’s position with respect to data collection, leading to considerations about surveillance and datafication. Proposals for data trusts will be considered, applying principles of English trust law, to possibly mitigate the imbalance of power between large data users and individual data subjects. Finally, the possibility of a workable remedy in the form of a class action lawsuit which could give the data subjects some collective power in the event of a data breach will be explored. Despite regulatory efforts to protect personal data, there is a lack of public trust in the current data sharing system….(More)”.