Wanted: Data Stewards: (Re-)Defining The Roles and Responsibilities of Data Stewards for an Age of Data Collaboration



Stefaan G. Verhulst, Andrew Zahuranec, Andrew Young and Michelle Winowatan at Data & Policy: “As data grows increasingly prevalent in our economy, it is increasingly clear, too, that tremendous societal value can be derived from reusing and combining previously separate datasets. One avenue that holds particular promise is data collaboratives. Data collaboratives are a new form of partnership in which data (such as data owned by corporations) or data expertise is made accessible to external parties (such as academics or statistical offices) working in the public interest. By bringing a wide range of inter-sectoral expertise to bear on the data, collaboration can result in new insights and innovations, and can help unlock the public good potential of previously siloed data or expertise.

Yet, not all data collaboratives are successful or go beyond pilots. Based on research and analysis of hundreds of data collaboratives, one factor seems to stand out as determinative of success above all others — whether there exist individuals or teams within data-holding organizations who are empowered to proactively initiate, facilitate and coordinate data collaboratives toward the public interest. We call these individuals and teams “data stewards.”

They systematize the process of partnering, and help scale efforts when there are fledgling signs of success. Data stewards are essential for accelerating the re-use of data in the public interest by providing functional access and, more generally, for unlocking the potential of our data age. Data stewards form an important — and new — link in the data value chain.

In its final report, the European Commission’s High-Level Expert Group on Business-to-Government (B2G) Data Sharing also noted the need for data stewards to enable responsible, accountable data sharing for the public interest. In their report, they write:

“A key success factor in setting up sustainable and responsible B2G partnerships is the existence, within both public- and private-sector organisations, of individuals or teams that are empowered to proactively initiate, facilitate and coordinate B2G data sharing when necessary. As such, ‘data stewards’ should become a recognised function.”

The report goes on to acknowledge the need to scope, design, and establish a network or community of practice around data stewardship.

Wanted: Data Stewards

A new position paper released by The GovLab in the context of the UN Statistical Commission High-Level Forum on Official Statistics, which focused on “Data stewardship — a solution for official statistics’ predicament?”, seeks to begin that work. The paper, titled “Wanted: Data Stewards: (Re-)Defining The Roles and Responsibilities of Data Stewards for an Age of Data Collaboration,” tackles questions regarding the profile and potential of data stewards. It aims to provide an operational roadmap to support the implementation (or expansion) of data stewardship functions in public- and private-sector entities, and to start building a community of expertise.

Moreover, it addresses the tendency to conflate the role of data stewards with that of individuals or groups who might better be described as chief privacy, chief data or chief security officers. This slippage is perhaps understandable, but the data steward’s role is somewhat broader. While data management, privacy and security are key components of trusted and effective data collaboratives, the real goal is to re-use data for broader social goals (while preventing any potential harms that may result from sharing).

In particular, the position paper — which captures the lived experience of numerous data stewards — seeks to provide more clarity on how data stewards can accomplish these duties by:

  • Defining the responsibilities of a data steward; and
  • Identifying the roles which a data steward must fill to achieve these responsibilities…(More)”.

Is Your Data Being Collected? These Signs Will Tell You Where


Flavie Halais at Wired: “Alphabet’s Sidewalk Labs is testing icons that provide “digital transparency” when information is collected in public spaces….

As cities incorporate digital technologies into their landscapes, they face the challenge of informing people of the many sensors, cameras, and other smart technologies that surround them. Few people have the patience to read through the lengthy privacy notice on a website or smartphone app. So how can a city let them know how they’re being monitored?

Sidewalk Labs, the Google sister company that applies technology to urban problems, is taking a shot. Through a project called Digital Transparency in the Public Realm, or DTPR, the company is demonstrating a set of icons, to be displayed in public spaces, that shows where and what kinds of data are being collected. The icons are being tested as part of Sidewalk Labs’ flagship project in Toronto, where it plans to redevelop a 12-acre stretch of the city’s waterfront. The signs would be displayed at each location where data would be collected—streets, parks, businesses, and courtyards.

Data collection is a core feature of the project, called Sidewalk Toronto, and the source of much of the controversy surrounding it. In 2017, Waterfront Toronto, the organization in charge of administering the redevelopment of the city’s eastern waterfront, awarded Sidewalk Labs the contract to develop the waterfront site. The project has ambitious goals: It says it could create 44,000 direct jobs by 2040 and has the potential to be the largest “climate-positive” community—removing more CO2 from the atmosphere than it produces—in North America. It will make use of new urban technology like modular street pavers and underground freight delivery. Sensors, cameras, and Wi-Fi hotspots will monitor and control traffic flows, building temperature, and crosswalk signals.

All that monitoring raises inevitable concerns about privacy, which Sidewalk aims to address—at least partly—by posting signs in the places where data is being collected.

The signs display a set of icons in the form of stackable hexagons, derived in part from a set of design rules developed by Google in 2014. Some describe the purpose for collecting the data (mobility, energy efficiency, or waste management, for example). Others refer to the type of data that’s collected, such as photos, air quality, or sound. When the data is identifiable, meaning it can be associated with a person, the hexagon is yellow. When the information is stripped of personal identifiers, the hexagon is blue…(More)”.
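The colour rule described in the excerpt is simple enough to express as a small data structure. The sketch below is illustrative only: the field names are assumptions for this example, not the actual DTPR taxonomy, which is maintained by Sidewalk Labs.

```python
from dataclasses import dataclass

@dataclass
class DTPRIcon:
    """One hexagonal signage icon (hypothetical schema for illustration)."""
    purpose: str        # e.g. "mobility", "energy efficiency", "waste management"
    data_type: str      # e.g. "photo", "air quality", "sound"
    identifiable: bool  # can the collected data be linked to a person?

    @property
    def colour(self) -> str:
        # Per the scheme described above: yellow marks identifiable data,
        # blue marks data stripped of personal identifiers.
        return "yellow" if self.identifiable else "blue"
```

For instance, a camera capturing raw photos would render as a yellow hexagon, while an air-quality sensor would render as blue.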

Eurobarometer survey shows support for sustainability and data sharing


Press Release: “Europeans want their digital devices to be easier to repair or recycle and are willing to share their personal information to improve public services, as a special Eurobarometer survey shows. The survey, released today, measured attitudes towards the impact of digitalisation on the daily lives of Europeans in 27 EU Member States and the United Kingdom. It covers several different areas including digitalisation and the environment, sharing personal information, disinformation, digital skills and the use of digital ID….

Overall, 59% of respondents would be willing to share some of their personal information securely to improve public services. In particular, most respondents are willing to share their data to improve medical research and care (42%), to improve the response to crises (31%) or to improve public transport and reduce air pollution (26%).

An overwhelming majority of respondents who use their social media accounts to log in to other online services (74%) want to know how their data is used. A large majority would consider it useful to have a secure single digital ID that could serve for all online services and give them control over the use of their data….

In addition to the Special Eurobarometer report, the latest iteration of the Standard Eurobarometer, conducted in November 2019, also tested public perceptions related to Artificial Intelligence. The findings were also published in a separate report today.

Around half of the respondents (51%) said that public policy intervention is needed to ensure ethical applications. Half of the respondents (50%) mention the healthcare sector as the area where AI could be most beneficial. A strong majority (80%) of the respondents think that they should be informed when a digital service or mobile application uses AI in various situations….(More)”.

Beyond Randomized Controlled Trials


Iqbal Dhaliwal, John Floretta & Sam Friedlander at SSIR: “…In its post-Nobel phase, one of J-PAL’s priorities is to unleash the treasure troves of big digital data in the hands of governments, nonprofits, and private firms. Primary data collection is by far the most time-, money-, and labor-intensive component of the vast majority of experiments that evaluate social policies. Randomized evaluations have been constrained by simple numbers: Some questions are just too big or expensive to answer. Leveraging administrative data has the potential to dramatically expand the types of questions we can ask and the experiments we can run, as well as to implement quicker, less expensive, larger, and more reliable RCTs: an invaluable opportunity to scale up evidence-informed policymaking massively without dramatically increasing evaluation budgets.

Although administrative data hasn’t always been of the highest quality, recent advances have significantly increased the reliability and accuracy of GPS coordinates, biometrics, and digital methods of collection. But despite good intentions, many implementers—governments, businesses, and big NGOs—aren’t currently using the data they already collect on program participants and outcomes to improve anti-poverty programs and policies. This may be because they aren’t aware of its potential, don’t have the in-house technical capacity necessary to create data-use and privacy guidelines or analyze the data, or don’t have established partnerships with researchers who can collaborate to design innovative programs and run rigorous experiments to determine which are the most impactful.

At J-PAL, we are leveraging this opportunity through a new global research initiative we are calling the “Innovations in Data and Experiments for Action” Initiative (IDEA). IDEA supports implementers to make their administrative data accessible, analyze it to improve decision-making, and partner with researchers in using this data to design innovative programs, evaluate impact through RCTs, and scale up successful ideas. IDEA will also build the capacity of governments and NGOs to conduct these types of activities with their own data in the future….(More)”.

Car Data Facts


About: “Welcome to CarDataFacts.eu! This website provides a fact-based overview on everything related to the sharing of vehicle-generated data with third parties. Through a series of educational infographics, this website answers the most common questions about access to car data in a clear and simple way.

CarDataFacts.eu also addresses consumer concerns about sharing data in a safe and secure way, as well as explaining some of the complex and technical terminology surrounding the debate.

CarDataFacts.eu is brought to you by ACEA, the European Automobile Manufacturers’ Association, which represents the 15 Europe-based car, van, truck and bus makers….(More)”.

Invest 5% of research funds in ensuring data are reusable


Barend Mons at Nature: “It is irresponsible to support research but not data stewardship…

Many of the world’s hardest problems can be tackled only with data-intensive, computer-assisted research. And I’d speculate that the vast majority of research data are never published. Huge sums of taxpayer funds go to waste because such data cannot be reused. Policies for data reuse are falling into place, but fixing the situation will require more resources than the scientific community is willing to face.

In 2013, I was part of a group of Dutch experts from many disciplines that called on our national science funder to support data stewardship. Seven years later, policies that I helped to draft are starting to be put into practice. These require data created by machines and humans to meet the FAIR principles (that is, they are findable, accessible, interoperable and reusable). I now direct an international Global Open FAIR office tasked with helping communities to implement the guidelines, and I am convinced that doing so will require a large cadre of professionals, about one for every 20 researchers.

Even when data are shared, the metadata, expertise, technologies and infrastructure necessary for reuse are lacking. Most published data sets are scattered into ‘supplemental files’ that are often impossible for machines or even humans to find. These and other sloppy data practices keep researchers from building on each other’s work. In cases of disease outbreaks, for instance, this might even cost lives….(More)”.

Tesco Grocery 1.0, a large-scale dataset of grocery purchases in London


Paper by Luca Maria Aiello, Daniele Quercia, Rossano Schifanella & Lucia Del Prete: “We present the Tesco Grocery 1.0 dataset: a record of 420 M food items purchased by 1.6 M fidelity card owners who shopped at the 411 Tesco stores in Greater London over the course of the entire year of 2015, aggregated at the level of census areas to preserve anonymity. For each area, we report the number of transactions and nutritional properties of the typical food item bought including the average caloric intake and the composition of nutrients.

The set of global trade international numbers (barcodes) for each food type is also included. To establish data validity we: i) compare food purchase volumes to population from census to assess representativeness, and ii) match nutrient and energy intake to official statistics of food-related illnesses to appraise the extent to which the dataset is ecologically valid. Given its unprecedented scale and geographic granularity, the data can be used to link food purchases to a number of geographically-salient indicators, which enables studies on health outcomes, cultural aspects, and economic factors….(More)”.
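The census-area aggregation the paper describes can be sketched in a few lines. This is a minimal illustration of the idea, not the released dataset's pipeline: the keys "area" and "kcal" are placeholders, and the actual schema includes many more nutrient fields.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_area(transactions):
    """Roll individual purchases up to census-area level: per-area
    transaction counts and the mean caloric content of the typical
    item bought, preserving anonymity by dropping individual records."""
    by_area = defaultdict(list)
    for t in transactions:
        by_area[t["area"]].append(t)
    return {
        area: {
            "transactions": len(items),
            "mean_kcal": mean(t["kcal"] for t in items),
        }
        for area, items in by_area.items()
    }
```

The validation step the authors describe would then compare each area's transaction count against its census population, and its mean nutrient profile against official statistics on food-related illness.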

Monitoring of the Venezuelan exodus through Facebook’s advertising platform


Paper by Palotti et al: “Venezuela is going through the worst economic, political and social crisis in its modern history. Basic products like food or medicine are scarce and hyperinflation is combined with economic depression. This situation is creating an unprecedented refugee and migrant crisis in the region. Governments and international agencies have not been able to consistently leverage reliable information using traditional methods. Therefore, to organize and deploy any kind of humanitarian response, it is crucial to evaluate new methodologies to measure the number and location of Venezuelan refugees and migrants across Latin America.

In this paper, we propose to use Facebook’s advertising platform as an additional data source for monitoring the ongoing crisis. We estimate and validate national and sub-national numbers of refugees and migrants and break down their socio-economic profiles to further understand the complexity of the phenomenon. Although limitations exist, we believe that the presented methodology can be of value for real-time assessment of refugee and migrant crises worldwide….(More)”.
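The core adjustment behind such audience-based estimates can be illustrated in one step: scale the platform's estimated "expats" audience in a host country by an assumed platform-penetration rate there. This is a deliberate simplification of the paper's methodology, which involves further corrections and validation against official registers; the function name and inputs are hypothetical.

```python
def estimate_migrants(fb_expat_audience: int, fb_penetration: float) -> int:
    """Approximate a total migrant population from an advertising-platform
    audience estimate (people tagged as living in the host country who
    previously lived in the origin country), adjusted for the share of
    the host population on the platform. Illustrative sketch only."""
    if not 0 < fb_penetration <= 1:
        raise ValueError("penetration rate must be in (0, 1]")
    return round(fb_expat_audience / fb_penetration)
```

For example, an audience estimate of 500,000 in a country where roughly half the population uses the platform would suggest about one million migrants, before any demographic corrections.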

Experts say privately held data available in the European Union should be used better and more


European Commission: “Data can solve problems from traffic jams to disaster relief, but European countries are not yet using this data to its full potential, experts say in a report released today. More secure and regular data sharing across the EU could help public administrations use private sector data for the public good.

In order to increase Business-to-Government (B2G) data sharing, the experts advise making data sharing in the EU easier by taking policy, legal and investment measures in three main areas:

  1. Governance of B2G data sharing across the EU: such as putting in place national governance structures, setting up a recognised function (‘data stewards’) in public and private organisations, and exploring the creation of a cross-EU regulatory framework.
  2. Transparency, citizen engagement and ethics: such as making B2G data sharing more citizen-centric, developing ethical guidelines, and investing in training and education.
  3. Operational models, structures and technical tools: such as creating incentives for companies to share data, carrying out studies on the benefits of B2G data sharing, and providing support to develop the technical infrastructure through the Horizon Europe and Digital Europe programmes.

They also revised the principles on private sector data sharing in B2G contexts and included new principles on accountability and on fair and ethical data use, which should guide B2G data sharing for the public interest. Examples of successful B2G data sharing partnerships in the EU include an open forest data system in Finland to help manage the ecosystem, mapping of EU fishing activities using ship tracking data, and genome sequencing data of breast cancer patients to identify new personalised treatments. …

The High-Level Expert Group on Business-to-Government Data Sharing was set up in autumn 2018 and includes members from a broad range of interests and sectors. The recommendations presented today in its final report feed into the European strategy for data and can be used as input for other possible future Commission initiatives on Business-to-Government data sharing….(More)”.

New privacy-protected Facebook data for independent research on social media’s impact on democracy


Chaya Nayak at Facebook: “In 2018, Facebook began an initiative to support independent academic research on social media’s role in elections and democracy. This first-of-its-kind project seeks to provide researchers access to privacy-preserving data sets in order to support research on these important topics.

Today, we are announcing that we have substantially increased the amount of data we’re providing to 60 academic researchers across 17 labs and 30 universities around the world. This release delivers on the commitment we made in July 2018 to share a data set that enables researchers to study information and misinformation on Facebook, while also ensuring that we protect the privacy of our users.

This new data release supplants data we released in the fall of 2019. That 2019 data set consisted of links that had been shared publicly on Facebook by at least 100 unique Facebook users. It included information about share counts, ratings by Facebook’s third-party fact-checkers, and user reporting on spam, hate speech, and false news associated with those links. We have expanded the data set to now include more than 38 million unique links with new aggregated information to help academic researchers analyze how many people saw these links on Facebook and how they interacted with that content – including views, clicks, shares, likes, and other reactions. We’ve also aggregated these shares by age, gender, country, and month. And, we have expanded the time frame covered by the data from January 2017 – February 2019 to January 2017 – August 2019.

With this data, researchers will be able to understand important aspects of how social media shapes our world. They’ll be able to make progress on the research questions they proposed, such as “how to characterize mainstream and non-mainstream online news sources in social media” and “studying polarization, misinformation, and manipulation across multiple platforms and the larger information ecosystem.”

In addition to the data set of URLs, researchers will continue to have access to CrowdTangle and Facebook’s Ad Library API to augment their analyses. Per the original plan for this project, outside of a limited review to ensure that no confidential or user data is inadvertently released, these researchers will be able to publish their findings without approval from Facebook.

We are sharing this data with researchers while continuing to prioritize the privacy of people who use our services. This new data set, like the data we released before it, is protected by a method known as differential privacy. Researchers have access to data tables from which they can learn about aggregated groups, but where they cannot identify any individual user. As Harvard University’s Privacy Tools project puts it:

“The guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset — anything the algorithm might output on a database containing some individual’s information is almost as likely to have come from a database without that individual’s information. … This gives a formal guarantee that individual-level information about participants in the database is not leaked.” …(More)”
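The guarantee quoted above can be made concrete with the classic Laplace mechanism for a count query: the true count is perturbed with noise calibrated to the query's sensitivity, so the output distribution barely changes when one person is added or removed. This is a minimal textbook sketch, not Facebook's actual implementation, which applies differential privacy across full data tables.

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Differentially private count: true count plus Laplace noise.

    A count query has sensitivity 1 (adding or removing one record
    changes the answer by at most 1), so the noise scale is 1/epsilon.
    Smaller epsilon means stronger privacy and noisier answers.
    """
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(0, 1/epsilon) by inverse-CDF from Uniform(-0.5, 0.5).
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

A researcher querying such a table learns an accurate aggregate (here, a count off by a few units at epsilon = 1) while no single user's presence or absence meaningfully shifts what the query returns.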