Vincent Duclos in Medicine Anthropology Theory: “In the last few years, tracking systems that harvest web data to identify trends, calculate predictions, and warn about potential epidemic outbreaks have proliferated. These systems integrate crowdsourced data and digital traces, collecting information from a variety of online sources, and they promise to change the way governments, institutions, and individuals understand and respond to health concerns. This article examines some of the conceptual and practical challenges raised by the online algorithmic tracking of disease by focusing on the case of Google Flu Trends (GFT). Launched in 2008, GFT was Google’s flagship syndromic surveillance system, specializing in ‘real-time’ tracking of outbreaks of influenza. GFT mined massive amounts of data about online search behavior to extract patterns and anticipate the future of viral activity. But it did a poor job, and Google shut the system down in 2015. This paper focuses on GFT’s shortcomings, which were particularly severe during flu epidemics, when GFT struggled to make sense of the unexpected surges in the number of search queries. I suggest two reasons for GFT’s difficulties. First, it failed to keep track of the dynamics of contagion, at once biological and digital, as it affected what I call here the ‘googling crowds’. Search behavior during epidemics in part stems from a sort of viral anxiety not easily amenable to algorithmic anticipation, to the extent that the algorithm’s predictive capacity remains dependent on past data and patterns. Second, I suggest that GFT’s troubles were the result of how it collected data and performed what I call ‘epidemic reality’. GFT’s data became severed from the processes Google aimed to track, and the data took on a life of their own: a trackable life, in which there was little flu left. The story of GFT, I suggest, offers insight into contemporary tensions between the indomitable intensity of collective life and stubborn attempts at its algorithmic formalization….(More)”.
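The modeling idea at the core of a system like GFT can be made concrete with a short sketch. The snippet below is a deliberately simplified, hypothetical illustration (not Google’s actual model; the numbers are invented). It fits a regression between the weekly share of flu-related search queries and officially reported influenza-like-illness (ILI) rates on past data, then extrapolates to a new week, which makes visible the dependence on historical patterns that Duclos identifies as one source of GFT’s failure.

```python
# Simplified, hypothetical GFT-style estimator (not Google's actual model).
# It learns a mapping from the share of flu-related search queries to reported
# influenza-like-illness (ILI) rates using historical weeks, then extrapolates.
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical weekly data (invented for illustration): fraction of all queries
# that are flu-related, and the reported ILI rate for the same weeks.
query_fraction = np.array([0.002, 0.004, 0.007, 0.011, 0.009, 0.005]).reshape(-1, 1)
ili_rate = np.array([1.1, 1.8, 2.9, 4.6, 3.8, 2.2])  # % of physician visits

# Fit on past data; this is where the dependence on historical patterns enters.
model = LinearRegression().fit(np.log(query_fraction), np.log(ili_rate))

# Predict a new week. If an epidemic (or media coverage) changes *why* people
# search, e.g. anxious but healthy users googling symptoms, the learned mapping
# no longer holds -- the failure mode the article describes.
new_week_fraction = np.array([[0.020]])
predicted_ili = np.exp(model.predict(np.log(new_week_fraction)))
print(f"Predicted ILI rate: {predicted_ili[0]:.1f}%")
```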
Leveraging Private Data for Public Good: A Descriptive Analysis and Typology of Existing Practices

New report by Stefaan Verhulst, Andrew Young, Michelle Winowatan, and Andrew J. Zahuranec: “To address the challenges of our times, we need both new solutions and new ways to develop those solutions. The responsible use of data will be key toward that end. Since pioneering the concept of “data collaboratives” in 2015, The GovLab has studied and experimented with innovative ways to leverage private-sector data to tackle various societal challenges, such as urban mobility, public health, and climate change.
While we have seen an uptake in normative discussions on how data should be shared, little analysis exists of the actual practice. This paper seeks to address that gap by answering the following question: What are the variables and models that determine functional access to private sector data for public good? In Leveraging Private Data for Public Good: A Descriptive Analysis and Typology of Existing Practices, we describe the emerging universe of data collaboratives and develop a typology of six practice areas. Our goal is to provide insight into current applications to accelerate the creation of new data collaboratives. The report outlines dozens of examples, as well as a set of recommendations to enable more systematic, sustainable, and responsible data collaboration….(More)”
Internet of Water
About: “Water is the essence of life and vital to the well-being of every person, economy, and ecosystem on the planet. But around the globe and here in the United States, water challenges are mounting as climate change, population growth, and other drivers of water stress increase. Many of these challenges are regional in scope and larger than any one organization (or even states), such as the depletion of multi-state aquifers, basin-scale flooding, or the widespread accumulation of nutrients leading to dead zones. Much of the infrastructure built to address these problems decades ago, including our data infrastructure, is struggling to meet these challenges. Much of our water data exists in paper formats unique to the organization collecting the data. Often, these organizations existed long before the personal computer was created (1975) or the internet became mainstream (mid-1990s). As organizations adopted data infrastructure in the late 1990s, it was with the mindset of “normal infrastructure” at the time. It was built to last for decades, rather than adapt to rapid technological changes.
New water data infrastructure with new technologies that enable data to flow seamlessly between users and generate information for real-time management are needed to meet our growing water challenges. Decision-makers need accurate, timely data to understand current conditions, identify sustainability problems, illuminate possible solutions, track progress, and adapt along the way. Stakeholders need easy-to-understand metrics of water conditions so they can make sure managers and policymakers protect the environment and the public’s water supplies. The water community needs to continually improve how they manage this complex resource by using data and communicating information to support decision-making. In short, a sustained effort is required to accelerate the development of open data and information systems to support sustainable water resources management. The Internet of Water (IoW) is designed to be just such an effort….(More)”.
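One small illustration of the kind of open, machine-readable water data infrastructure the IoW envisions: the sketch below pulls a real-time streamflow reading from the USGS Instantaneous Values web service. The endpoint, parameters, and response layout are given as the editor understands the public USGS API and should be checked against current documentation; the site and parameter codes are just examples.

```python
# Illustrative sketch: pulling a real-time streamflow reading from the USGS
# Instantaneous Values service, one example of water data exposed as an open,
# machine-readable web service rather than paper records or siloed databases.
# Endpoint, parameters, and response layout are assumptions based on the public
# USGS API as the editor recalls it; verify against current documentation.
import requests

USGS_IV_URL = "https://waterservices.usgs.gov/nwis/iv/"
params = {
    "format": "json",
    "sites": "01646500",     # example gauge: Potomac River near Washington, DC
    "parameterCd": "00060",  # discharge, cubic feet per second
}

response = requests.get(USGS_IV_URL, params=params, timeout=30)
response.raise_for_status()
data = response.json()

# Walk the (assumed) response structure down to the most recent reading.
series = data["value"]["timeSeries"][0]
latest = series["values"][0]["value"][-1]
print(series["sourceInfo"]["siteName"])
print(f"Discharge: {latest['value']} cfs at {latest['dateTime']}")
```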
Waze launches data-sharing integration for cities with Google Cloud
Ryan Johnston at StateScoop: “Thousands of cities across the world that rely on externally-sourced traffic data from Waze, the route-finding mobile app, will now have access to the data through the Google Cloud suite of analytics tools instead of a raw feed, making it easier for city transportation and planning officials to reach data-driven decisions.
Waze said Tuesday that the anonymized data is now available through Google Cloud, with the goal of making curbside management, roadway maintenance and transit investment easier for small to midsize cities that don’t have the resources to invest in enterprise data-analytics platforms of their own. Since 2014, Waze — which became a Google subsidiary in 2013 — has submitted traffic data to its partner cities through its “Waze for Cities” program, but those data sets arrived in raw feeds without any built-in analysis or insights.
While some cities have built their own analysis tools to understand the free data from the company, others have struggled to stay afloat in the sea of data, said Dani Simons, Waze’s head of public sector partnerships.
“[What] we’ve realized is providing the data itself isn’t enough for our city partners or for a lot of our city and state partners,” Simons said. “We have been asked over time for better ways to analyze and turn that raw data into something more actionable for our public partners, and that’s why we’re doing this.”
The data will now arrive automatically integrated with Google’s free data analysis tool, BigQuery, and a visualization tool, Data Studio. Cities can use the tools to analyze up to a terabyte of data and store up to 10 gigabytes a month for free, but they can also choose to continue to use in-house analysis tools, Simons said.
The integration was also designed with input from Waze’s top partner cities, including Los Angeles; Seattle; and San Jose, California. One of Waze’s private sector partners, Genesis Pulse, which designs software for emergency responders, reported that Waze users identified 40 percent of roadside accidents an average of 4.5 minutes before those incidents were reported to 911 or public safety.
The integration is Waze’s attempt at solving two of the biggest data problems that cities have today, Simons told StateScoop. For some cities in the U.S., Waze is one of several private companies sharing transit data with them. Other cities are drowning in data from traffic sensors, city-owned fleet data or private mobility companies….(More)”.
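For cities using the new integration, the analysis itself happens in standard BigQuery tooling. The sketch below is hypothetical (the project, dataset, table, and column names are assumptions, not Waze’s published schema) and shows how a transportation analyst might count recent traffic-jam alerts by street with the google-cloud-bigquery Python client.

```python
# Hypothetical sketch: querying Waze for Cities data shared into a BigQuery
# project. The project, dataset, table, and column names are illustrative
# assumptions, not Waze's published schema.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

query = """
    SELECT street, COUNT(*) AS jam_alerts
    FROM `my-city-project.waze_feed.alerts`          -- assumed table name
    WHERE type = 'JAM'
      AND DATE(publish_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY street
    ORDER BY jam_alerts DESC
    LIMIT 20
"""

for row in client.query(query).result():
    print(f"{row.street}: {row.jam_alerts} jam alerts in the last 30 days")
```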
From Transactions Data to Economic Statistics: Constructing Real-Time, High-Frequency, Geographic Measures of Consumer Spending
Paper by Aditya Aladangady et al: “Access to timely information on consumer spending is important to economic policymakers. The Census Bureau’s monthly retail trade survey is a primary source for monitoring consumer spending nationally, but it is not well suited to study localized or short-lived economic shocks. Moreover, lags in the publication of the Census estimates and subsequent, sometimes large, revisions diminish its usefulness for real-time analysis. Expanding the Census survey to include higher frequencies and subnational detail would be costly and would add substantially to respondent burden. We take an alternative approach to fill these information gaps. Using anonymized transactions data from a large electronic payments technology company, we create daily estimates of retail spending at detailed geographies. Our daily estimates are available only a few days after the transactions occur, and the historical time series are available from 2010 to the present. When aggregated to the national level, the pattern of monthly growth rates is similar to the official Census statistics. We discuss two applications of these new data for economic analysis: First, we describe how our monthly spending estimates are useful for real-time monitoring of aggregate spending, especially during the government shutdown in 2019, when Census data were delayed and concerns about the economy spiked. Second, we show how the geographic detail allowed us to quantify in real time the spending effects of Hurricanes Harvey and Irma in 2017….(More)”.
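The aggregation step the authors describe, rolling anonymized transactions up into daily, geography-level spending series, can be sketched roughly as follows. The column names, toy records, and the simple indexing to a baseline mean are the editor’s illustrative assumptions, not the authors’ pipeline.

```python
# Rough sketch of the aggregation step described in the paper: collapse
# anonymized transaction records into daily spending by geography and index
# each area to its own baseline. Column names and toy data are illustrative
# assumptions, not the authors' actual pipeline.
import pandas as pd

transactions = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-24", "2017-08-24", "2017-08-25", "2017-08-25"]),
    "county_fips": ["48201", "48201", "48201", "12086"],
    "amount": [42.10, 13.75, 8.20, 55.00],
})

# Daily spending by county.
daily = (transactions
         .groupby(["county_fips", "date"], as_index=False)["amount"]
         .sum()
         .rename(columns={"amount": "spending"}))

# Index each county's series to its own mean so levels are comparable across
# geographies (one common way to build a spending index).
baseline = daily.groupby("county_fips")["spending"].transform("mean")
daily["spending_index"] = 100 * daily["spending"] / baseline
print(daily)
```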
Addressing the Challenges of Drafting Contracts for Data Collaboration
Blog post by Andrew Young, Andrew J. Zahuranec, Stephen Burley Tubman, William Hoffman, and Stefaan Verhulst at Data & Society: “To deal with complex public challenges, organizations increasingly seek to leverage data across sectors in new and innovative ways — from establishing prize-backed challenges around the use of diverse datasets to creating cross-sector federated data systems. These and other forms of data collaboratives are part of a new paradigm in data-driven innovation in which participants from different sectors provide access to data for the creation of public value. It provides an essential new problem-solving approach for our increasingly datafied society. However, the operational challenges associated with creating such partnerships often prevent the transformative potential of data collaboration from being achieved.
One such operational challenge relates to developing data sharing agreements — through contracts and other legal documentation. The current practice suffers from large inefficiencies and transaction costs resulting from (i) the lack of a common understanding of what the core issues are with data exchange; (ii) lack of common language or models; (iii) large heterogeneity in agreements used; (iv) lack of familiarity among lawyers of the technologies involved and (v) a sense that every initiative needs to (re)invent the wheel. Removing these barriers may enable collaborators to partner more systematically and responsibly around the re-use of data assets. Contracts for Data Collaboration (C4DC) is a new initiative seeking to address these barriers to data collaboration…
In the longer term, participants focused on three major themes that, if addressed, could steer contracting for data collaboration toward greater effectiveness and legitimacy.
Data Stewardship and Responsibility: First, much of the discussion centered on the need to promote responsible data practices through data stewardship. Though part of this work involves creating teams and individuals empowered to share, it also means empowering them to operationalize ethical principles.
By developing international standards and moving beyond the bare minimum legal obligation, these actors can build trust between parties, a quality that has often been difficult to foster. Such relationships are key in engaging intermediaries or building complex contractual agreements between multiple organizations. It is also essential to come to an agreement about which practices are legitimate and illegitimate.
Incorporation of the Citizen Perspective: Trust is also needed between the actors in a data collaborative and the general public. In light of many recent stories about the misuse of data, many people are suspicious of, if not outright hostile to, data partnerships. Many data subjects don’t understand why organizations want their data or how the information can be valuable in advancing the public good.
In data-sharing arrangements, all actors need to explain intended uses and outcomes to data subjects. Attendees spoke about the need to explain the data’s utility in clear and accessible terms. They also noted data collaborative contracts are more legitimate if they incorporate citizen perspectives, especially those of marginalized groups. To take this work a step further, the public could be brought into the contract writing process by creating mechanisms capable of soliciting their views and concerns.
Improving Internal and External Collaboration: Lastly, participants discussed the need for actors across the data ecosystem to strengthen relationships inside and outside their organizations. Part of this work entails securing internal buy-in for data collaboration, ensuring that the different components of an organization understand what assets are being shared and why.
It also entails engaging with intermediaries to fill gaps. Each actor has limitations to their capacities and expertise and, by engaging with start-ups, funders, NGOs, and others, organizations can improve the odds of a successful collaboration. Together, organizations can create norms and shared languages that allow for more effective data flows.
…(More)”.
Becoming a data steward
Shalini Kurapati at the LSE Impact Blog: “In the context of higher education, data stewards are the first point of reference for all data-related questions. In my role as a data steward at TU Delft, I was able to advise, support and train researchers on various aspects of data management throughout the life cycle of a research project, from initial planning to post-publication. This included storing, managing and sharing research outputs such as data, images, models and code.
Data stewards also advise researchers on the ethical, policy and legal considerations during data collection, processing and dissemination. In a way, they are general practitioners for research data management and can usually solve most problems faced by academics. In cases that require specialist intervention, they also serve as a key point for referral (e.g., IT, patent, or legal experts).
Data stewardship is often organised centrally through the university library. (Subject) data librarians, research data consultants and research data officers usually perform similar roles to data stewards. However, TU Delft operates a decentralised model, where data stewards are placed within faculties as disciplinary experts with research experience. This allows data stewards to provide discipline-specific support to researchers, which is particularly beneficial, as the concept of what data is itself varies across disciplines….(More)”.
Breaking Down Information Silos with Big Data: A Legal Analysis of Data Sharing
Chapter by Giovanni De Gregorio and Sofia Ranchordas in J. Cannataci, V. Falce & O. Pollicino (Eds), New Legal Challenges of Big Data (Edward Elgar, 2020, Forthcoming): “In the digital society, individuals play different roles depending on the situation they are placed in: they are consumers when they purchase a good, citizens when they vote in elections, content providers when they post information on a platform, and data subjects when their data is collected. Public authorities have thus far regulated citizens and the data collected on their different roles in silos (e.g., bankruptcy registrations, social welfare databases), resulting in inconsistent decisions, redundant paperwork, and delays in processing citizen requests. Data silos are considered to be inefficient both for companies and governments. Big data and data analytics are disrupting these silos, allowing the different roles of individuals and the respective data to converge. In practice, this happens in several countries with data sharing arrangements or ad hoc data requests. However, breaking down the existing structure of information silos in the public sector remains problematic. While big data disrupts artificial silos that may not make sense in the digital society and promotes a truly efficient digitalization of data, removing information out of its original context may alter its meaning and violate the privacy of citizens. In addition, silos ensure that citizens are not assessed in one field by information generated in a totally different context. This chapter discusses how big data and data analytics are changing information silos and how digital technology is challenging citizens’ autonomy and right to privacy and data protection. This chapter also explores the need for a more integrated approach to the study of information, particularly in the public sector….(More)”.
Data Fiduciary in Order to Alleviate Principal-Agent Problems in the Artificial Big Data Age
Paper by Julia M. Puaschunder: “The classic principal-agent problem in political science and economics describes agency dilemmas or problems when one person, the agent, is put in a situation to make decisions on behalf of another entity, the principal. A dilemma occurs in situations when individual profit maximization or principal and agent are pitted against each other. This so-called moral hazard is nowadays emerging in the artificial big data age, when big data reaping entities have to act on behalf of agents, who provide their data with trust in the principal’s integrity and responsible big data conduct. Yet to this day, no data fiduciary has been clearly described and established to protect the agent from misuse of data. This article introduces the agent’s predicament between utility derived from information sharing and dignity in privacy as well as hyper-hyperbolic discounting fallibilities to not clearly foresee what consequences information sharing can have over time and in groups. The principal’s predicament between secrecy and selling big data insights or using big data for manipulative purposes will be outlined. Finally, the article draws a clear distinction between manipulation and nudging in relation to the potential social class division of those who nudge and those who are nudged…(More)”.
Risk identification and management for the research use of government administrative data
Paper by Elizabeth Shepherd, Anna Sexton, Oliver Duke-Williams, and Alexandra Eveleigh: “Government administrative data have enormous potential for public and individual benefit through improved educational and health services to citizens, medical research, environmental and climate interventions and exploitation of scarce energy resources. Administrative data is usually “collected primarily for administrative (not research) purposes by government departments and other organizations for the purposes of registration, transaction and record keeping, during the delivery of a service” such as health care, vehicle licensing, tax and social security systems (https://esrc.ukri.org/funding/guidance-for-applicants/research-ethics/useful-resources/key-terms-glossary/). Administrative data are usually distinguished from data collected for statistical use such as the census. Unlike administrative records, they do not provide evidence of activities and generally lack metadata and context relating to provenance. Administrative data, unlike open data, are not routinely made open or accessible, but access can be provided only on request to named researchers for specified research projects through research access protocols that often take months to negotiate and are subject to significant constraints around re-use such as the use of safe havens. Researchers seldom make use of freedom of information or access to information protocols to access such data because they need specific datasets and particular levels of granularity and an ability to re-process data, which are not made generally available. This study draws on research undertaken by the authors as part of the Administrative Data Research Centre in England (ADRC-E). The research examined perspectives on the sharing, linking and re-use (secondary use) of administrative data in England, viewed through three analytical themes: trust, consent and risk. This study presents the analysis of the identification and management of risk in the research use of government administrative data and presents a risk framework. Risk management (i.e. coordinated activities that allow organizations to control risks, Lemieux, 2010) enables us to think about the balance between risk and benefit for the public good and for other stakeholders. Mitigating activities or management mechanisms used to control the identified risks depend on the resources available to implement the options, on the risk appetite or tolerance of the community and on the cost and likely effectiveness of the mitigation. Mitigation and risk do not work in isolation and should be holistically viewed by keeping the whole information infrastructure in balance across the administrative data system and between multiple stakeholders.
This study seeks to establish a clearer picture of risk with regard to government administrative data in England. It identifies and categorizes the risks arising from the research use of government administrative data. It identifies mitigating risk management activities, linked to five key stakeholder communities, and discusses the locus of responsibility for risk management actions. The identification of the risks and of mitigation strategies is derived from the viewpoints of the interviewees and associated documentation; therefore, they reflect their lived experience. The five stakeholder groups identified from the data are as follows: individual researchers; employers of researchers; wider research community; data creators and providers; and data subjects and the broader public. The primary sections of the study, following the methodology and research context, set out the seven identified types of risk events in the research use of administrative data, present a stakeholder mapping of the communities in this research affected by the risks, and discuss the findings related to managing and mitigating the risks identified. The conclusion presents the elements of a new risk framework to inform future actions by the government data community and enable researchers to exploit the power of administrative data for public good….(More)”.
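One way to make such a framework concrete is as a simple risk register that links each risk event to the stakeholder groups it affects, the mitigating activities that control it, and a locus of responsibility. The sketch below is the editor’s illustration rather than the authors’ instrument; the five stakeholder groups are those named in the study, while the example risk event and mitigations are hypothetical.

```python
# Editor's illustrative sketch of a risk register in the spirit of the paper's
# framework: each entry links a risk event to affected stakeholder groups (the
# five named in the study), mitigating activities, and a locus of responsibility.
# The example entry itself is hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import List

STAKEHOLDERS = [
    "individual researchers",
    "employers of researchers",
    "wider research community",
    "data creators and providers",
    "data subjects and the broader public",
]

@dataclass
class RiskEntry:
    event: str              # description of the risk event
    affected: List[str]     # stakeholder groups the risk touches
    mitigations: List[str]  # coordinated activities that control the risk
    responsible: str        # locus of responsibility for the mitigation

register = [
    RiskEntry(
        event="Re-identification of individuals through dataset linkage (hypothetical example)",
        affected=["data subjects and the broader public", "data creators and providers"],
        mitigations=["analysis within a safe haven", "output disclosure control"],
        responsible="data creators and providers",
    ),
]

for entry in register:
    print(f"{entry.event} -> responsible: {entry.responsible}")
```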