Ten lessons for data sharing with a data commons


Article by Robert L. Grossman: “..Lesson 1. Build a commons for a specific community with a specific set of research challenges

Although a few data repositories that serve the general scientific community have proved successful, data commons that target a specific user community have generally been the most successful. The first lesson is to build a data commons for a specific research community that is struggling to answer specific research challenges with data. As a consequence, a data commons is a partnership between the data scientists developing and supporting the commons and the disciplinary scientists with the research challenges.

Lesson 2. Successful commons curate and harmonize the data

Successful commons curate and harmonize the data and produce data products of broad interest to the community. It’s time consuming, expensive, and labor intensive to curate and harmonize data, but much of the value of a data commons is centralizing this work so that it can be done once instead of many times by each group that needs the data. These days, it is very easy to think of a data commons as a platform containing data, not spend the time curating or harmonizing it, and then be surprised that the data in the commons is not more widely used and its impact is not as high as expected.

Lesson 3. It’s ultimately about the data and its value to generate new research discoveries

Despite the importance of a study, few scientists will try to replicate previously published studies. Instead, data is usually accessed if it can lead to a new high-impact paper. For this reason, data commons play two different but related roles. First, they preserve data for reproducible science; this accounts for a small fraction of data access but plays a critical role. Second, data commons make data available for new high-value science.

Lesson 4. Reduce barriers to access to increase usage

A useful rule of thumb is that every barrier to data access cuts down access by a factor of 10. Common barriers that reduce use of a commons include: registration vs no-registration; open access vs controlled access; click through agreements vs signing of data usage agreements and approval by data access committees; license restrictions on the use of the data vs no license restrictions…(More)”.

Satellite data: The other type of smartphone data you might not know about


Article by Tommy Cooke et al: “Smartphones determine your location in several ways. The first way involves phones triangulating distances between cell towers or Wi-Fi routers.

The second way involves smartphones interacting with navigation satellites. When satellites pass overhead, they transmit signals to smartphones, which allows smartphones to calculate their own location. This process uses a specialized piece of hardware called the Global Navigation Satellite System (GNSS) chipset. Every smartphone has one.

When these GNSS chipsets process navigation satellite signals, they output data in two standardized formats (known as protocols or languages): the GNSS raw measurement protocol and the National Marine Electronics Association protocol (NMEA 0183).

GNSS raw measurements include data such as the distance between satellites and cellphones and measurements of the signal itself.

NMEA 0183 contains similar information to GNSS raw measurements, but also includes additional information such as satellite identification numbers, the number of satellites in a constellation, what country owns a satellite, and the position of a satellite.
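
To make the NMEA 0183 side of this concrete, here is a minimal sketch in Python of how a device might parse a single GGA position sentence (one of several NMEA sentence types). The sentence shown uses made-up values, and real receivers also validate the trailing checksum, which this sketch skips.

```python
# Minimal sketch: parsing an NMEA 0183 GGA sentence (illustrative values only).
# A GGA sentence reports UTC time, position, fix quality and satellites in view.

def parse_gga(sentence: str) -> dict:
    """Parse a $GPGGA sentence into a small dictionary of fields."""
    body, _, checksum = sentence.strip().lstrip("$").partition("*")  # checksum not validated here
    fields = body.split(",")
    if not fields[0].endswith("GGA"):
        raise ValueError("not a GGA sentence")

    def to_degrees(value: str, hemisphere: str) -> float:
        # NMEA encodes latitude as ddmm.mmmm and longitude as dddmm.mmmm
        head, minutes = divmod(float(value), 100)
        degrees = head + minutes / 60
        return -degrees if hemisphere in ("S", "W") else degrees

    return {
        "utc_time": fields[1],
        "latitude": to_degrees(fields[2], fields[3]),
        "longitude": to_degrees(fields[4], fields[5]),
        "fix_quality": int(fields[6]),
        "satellites_in_view": int(fields[7]),
        "altitude_m": float(fields[9]),
    }

# Example with fabricated values: a fix computed from 8 satellites.
print(parse_gga("$GPGGA,123519,4413.000,N,07629.000,W,1,08,0.9,76.0,M,-34.0,M,,*47"))
```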

NMEA 0183 was created and is governed by the NMEA, a not-for-profit lobby group that is also a marine electronics trade organization. The NMEA was formed at the 1957 New York Boat Show when boating equipment manufacturers decided to build stronger relationships within the electronic manufacturing industry.

In the decades since, the NMEA 0183 data standard has improved marine electronics communications and is now found on a wide variety of non-marine communications devices, including smartphones…

It is difficult to know who has access to data produced by these protocols. Access to NMEA protocols is only available under licence to businesses for a fee.

GNSS raw measurements, on the other hand, are a universal standard and can be read by different devices in the same way without a license. In 2016, Google opened access to these raw measurements to industry to foster innovation around device-tracking accuracy and precision, analytics about how we move in real time, and predictions about our future movements.

While automated processes can quietly harvest location data — like when a French-based company extracted location data from Salaat First, a Muslim prayer app — these data don’t need to be taken directly from smartphones to be exploited.

Data can be modelled, experimented with, or emulated in licensed devices in labs for innovation and algorithmic development.

Satellite-driven raw measurements from our devices were used to power global surveillance networks like STRIKE3, a now defunct European-led initiative that monitored and reported perceived threats to navigation satellites…(More)”.

Data sharing during coronavirus: lessons for government


Report by Gavin Freeguard and Paul Shepley: “This report synthesises the lessons from six case studies and other research on government data sharing during the pandemic. It finds that current legislation, such as the Digital Economy Act and UK General Data Protection Regulation (GDPR), does not constitute a barrier to data sharing and that while technical barriers – incompatible IT systems, for example – can slow data sharing, they do not prevent it. 

Instead, the pandemic forced changes to standard working practice that enabled new data sharing agreements to be created quickly. This report focuses on what these changes were and how they can lead to improvements in future practice.

The report recommends: 

  • The government should retain data protection officers and data protection impact assessments within the Data Protection and Digital Information Bill, and consider strengthening provisions around citizen engagement and how to ensure data flows during emergency response.
  • The Department for Levelling Up, Housing and Communities should consult on how to improve working around data between central and local government in England. This should include the role of the proposed Office for Local Government, data skills and capabilities at the local level, reform of the Single Data List and the creation of a data brokering function to facilitate two-way data sharing between national and local government.
  • The Central Digital and Data Office (CDDO) should create a data sharing ‘playbook’ to support public servants building new services founded on data. The playbook should contain templates for standard documents, links to relevant legislation and codes of practice (like those from the Information Commissioner’s Office), guidance on public engagement and case studies covering who to engage and when whilst setting up a new service.
  • The Centre for Data Ethics and Innovation, working with CDDO, should take the lead on guidance and resources on how to engage the public at every stage of data sharing…(More)”.

How an Open-Source Disaster Map Helped Thousands of Earthquake Survivors


Article by Eray Gündoğmuş: “On February 6, 2023, earthquakes measuring 7.7 and 7.6 hit the Kahramanmaraş region of Turkey, affecting 10 cities and resulting in more than 42,000 deaths and 120,000 injuries as of February 21.

In the hours following the earthquake, a group of programmers quickly came together on the Discord server called “Açık Yazılım Ağı”, inviting IT professionals to volunteer and develop a project that could serve as a resource for rescue teams, earthquake survivors, and those who wanted to help: afetharita.com. It literally means “disaster map”.

As there was a lack of preparation in the first few days after such a huge earthquake, disaster victims in distress started making urgent aid requests on social media. With the help of thousands of volunteers, we utilized technologies such as artificial intelligence and machine learning to transform these aid requests into readable data and visualized them on afetharita.com. Later, we gathered critical data related to the disaster from the relevant institutions and added them to the map.
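
As a purely illustrative sketch (not the afetharita.com codebase), the core pipeline the article describes — turning free-text aid requests into structured, mappable records — might look something like the Python below. The `extract_request` and `geocode` functions are stand-in stubs for the ML and geocoding steps, and all names, addresses and coordinates are assumed for the example.

```python
# Illustrative sketch only: convert free-text aid requests into structured
# records and plot them on an interactive map with folium (pip install folium).
import folium

def extract_request(text: str) -> dict:
    """Stub for the ML step: pull an address and a need out of a social media post.
    The real pipeline used trained models plus volunteer verification."""
    return {"address": "Example Mah. 123. Sk. No:4, Kahramanmaraş", "need": "rescue"}

def geocode(address: str) -> tuple[float, float]:
    """Stub geocoder: map an address string to (lat, lon).
    The real system combined geocoding services with manual correction."""
    return (37.5753, 36.9228)  # approximate coordinates for Kahramanmaraş

posts = ["We are trapped under rubble at Example Mah. 123. Sk. No:4, please help!"]

disaster_map = folium.Map(location=[37.5753, 36.9228], zoom_start=8)
for post in posts:
    record = extract_request(post)
    lat, lon = geocode(record["address"])
    folium.Marker([lat, lon], popup=f'{record["need"]}: {record["address"]}').add_to(disaster_map)

disaster_map.save("disaster_map_sketch.html")  # open in a browser to view the markers
```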

Disaster Map, which received a total of 35 million requests and 627,000 unique visitors, played a significant role in providing software support during the most urgent and critical periods of the disaster, and helped NGOs, volunteers, and disaster victims to access important information. I wanted to share the process, our experiences, and technical details of this project clearly in writing…(More)”.

COVID isn’t going anywhere, neither should our efforts to increase responsible access to data


Article by Andrew J. Zahuranec, Hannah Chafetz and Stefaan Verhulst: “..Moving forward, institutions will need to consider how to embed non-traditional data capacity into their decision-making to better understand the world around them and respond to it.

For example, wastewater surveillance programmes that emerged during the pandemic continue to provide valuable insights about outbreaks before they are reported by clinical testing and have the potential to be used for other emerging diseases.

We need these and other programmes now more than ever. Governments and their partners need to maintain and, in many cases, strengthen the collaborations they established through the pandemic.

To address future crises, we need to institutionalize new data capacities – particularly those involving non-traditional datasets that may capture digital information that traditional health surveys and statistical methods often miss.

The figure above summarizes the types and sources of non-traditional data sources that stood out most during the COVID-19 response.

[Figure: The types and sources of non-traditional data sources that stood out most during the COVID-19 response. Image: The GovLab]

In our report, we suggest four pathways to advance the responsible access to non-traditional data during future health crises…(More)”.

Data solidarity: why sharing is not always caring 


Essay by Barbara Prainsack: “To solve these problems, we need to think about data governance in new ways. It is no longer enough to assume that asking people to consent to how their data is used is sufficient to prevent harm. In our example of telehealth, and in virtually all data-related scandals of the last decade, from Cambridge Analytica to Robodebt, informed consent did not avoid, or could not have avoided, the problem. We all regularly agree to data uses that we know are problematic – not because we do not care about privacy. We agree because this is the only way to get access to benefits, a mortgage, or teachers and health professionals. In a world where face-to-face assessments are unavailable or excessively expensive, opting out of digital practices would no longer be an option (Prainsack, 2017, pp. 126-131; see also Oudshoorn, 2011).

Solidarity-based data governance (in short: data solidarity) can help us to distribute the risks and the benefits of digital practices more equitably. The details of the framework are spelled out in full elsewhere (Prainsack et al., 2022a, b). In short, data solidarity seeks to facilitate data uses that create significant public value, and at the same time prevent and mitigate harm (McMahon et al., 2020). One important step towards both goals is to stop ascribing risks to data types, and to distinguish between different types of data use instead. In some situations, harm can be prevented by making sure that data is not used for harmful purposes, such as online tracking. In other contexts, however, harm prevention can require that we do not collect the data in the first place. Not recording something, making it invisible and uncountable to others, can be the most responsible way to act in some contexts.

This means that recording and sharing data should not become a default. More data is not always better. Instead, policymakers need to consider carefully – in a dialogue with the people and communities that have a stake in it – what should be recorded, where it will be stored and who governs the data once it has been collected – if at all (see also Kukutai and Taylor, 2016)…(More)”.

Researchers scramble as Twitter plans to end free data access


Article by Heidi Ledford: “Akin Ünver has been using Twitter data for years. He investigates some of the biggest issues in social science, including political polarization, fake news and online extremism. But earlier this month, he had to set aside time to focus on a pressing emergency: helping relief efforts in Turkey and Syria after the devastating earthquake on 6 February.

Aid workers in the region have been racing to rescue people trapped by debris and to provide health care and supplies to those displaced by the tragedy. Twitter has been invaluable for collecting real-time data and generating crucial maps to direct the response, says Ünver, a computational social scientist at Özyeğin University in Istanbul.

So when he heard that Twitter was about to end its policy of providing free access to its application programming interface (API) — a pivotal set of rules that allows people to extract and process large amounts of data from the platform — he was dismayed. “Couldn’t come at a worse time,” he tweeted. “Most analysts and programmers that are building apps and functions for Turkey earthquake aid and relief, and are literally saving lives, are reliant on Twitter API.”..

Twitter has long offered academics free access to its API, an unusual approach that has been instrumental in the rise of computational approaches to studying social media. So when the company announced on 2 February that it would end that free access in a matter of days, it sent the field into a tailspin. “Thousands of research projects running over more than a decade would not be possible if the API wasn’t free,” says Patty Kostkova, who specializes in digital health studies at University College London…(More)”.
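
For context, the kind of API call at stake can be sketched as below, using the third-party tweepy library against Twitter's v2 recent-search endpoint. The bearer token and query are placeholders, and whether such calls remain available — and at what price — now depends on Twitter's new access tiers.

```python
# Minimal sketch of a typical research query against the Twitter v2 API using tweepy.
# The bearer token is a placeholder; under the new pricing this call may require a paid tier.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Pull recent (last ~7 days) original tweets about earthquake relief.
response = client.search_recent_tweets(
    query="earthquake relief -is:retweet",
    max_results=100,
    tweet_fields=["created_at", "geo", "lang"],
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.text[:80])
```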

Data from satellites is starting to spur climate action


Miriam Kramer and Alison Snyder at Axios: “Data from space is being used to try to fight climate change by optimizing shipping lanes, adjusting rail schedules and pinpointing greenhouse gas emissions.

Why it matters: Satellite data has been used to monitor how human activities are changing Earth’s climate. Now it’s being used to attempt to alter those activities and take action against that change.

  • “Pixels are great but nobody really wants pixels except as a step to answering their questions about how the world is changing and how that should assess and inform their decisionmaking,” Steven Brumby, CEO and co-founder of Impact Observatory, which uses AI to create maps from satellite data, tells Axios in an email.

What’s happening: Several satellite companies are beginning to use their capabilities to guide on-the-ground actions that contribute to greenhouse gas emissions cuts.

  • UK-based satellite company Inmarsat, which provides telecommunications to the shipping and agriculture industries, is working with Brazilian railway operator Rumo to optimize train trips — and reduce fuel use.
  • Maritime shipping, which relies on heavy fuel oil, is another sector where satellites could help to reduce emissions by routing ships more efficiently and preventing communications-caused delays, says Inmarsat’s CEO Rajeev Suri. The industry contributes 3% of global greenhouse gas emissions.
  • Carbon capture, innovations in steel and cement production and other inventions are important for addressing climate change, Suri says. But using satellites is “potentially low-hanging fruit because these technologies are already available.”

Other satellites are also tracking emissions of methane — a strong greenhouse gas — from landfills and oil and gas production.

  • “It’s a needle in a haystack problem. There are literally millions of potential leak points all over the world,” says Stéphane Germain, founder and CEO of GHGSat, which monitors methane emissions from its six satellites in orbit.
  • A satellite dedicated to homing in on carbon dioxide emissions is due to launch later this year…(More)”.

Federated machine learning in data-protection-compliant research


Paper by Alissa Brauneck et al.: “In recent years, interest in machine learning (ML) as well as in multi-institutional collaborations has grown, especially in the medical field. However, strict application of data-protection laws reduces the size of training datasets, hurts the performance of ML systems and, in the worst case, can prevent the implementation of research insights in clinical practice. Federated learning can help overcome this bottleneck through decentralised training of ML models within the local data environment, while maintaining the predictive performance of ‘classical’ ML. Thus, federated learning provides immense benefits for cross-institutional collaboration by avoiding the sharing of sensitive personal data (Fig. 1). Because existing regulations (especially the General Data Protection Regulation 2016/679 of the European Union, or GDPR) set stringent requirements for medical data and rather vague rules for ML systems, researchers are faced with uncertainty. In this comment, we provide recommendations for researchers who intend to use federated learning, a privacy-preserving ML technique, in their research. We also point to areas where regulations are lacking, discussing some fundamental conceptual problems with ML regulation through the GDPR, related especially to notions of transparency, fairness and error-free data. We then provide an outlook on how implications from data-protection laws can be directly incorporated into federated learning tools…(More)”.
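
As a rough sketch of the underlying idea (not the tooling the authors discuss), federated averaging lets each institution fit a model on its own data and share only model parameters, which a coordinator then combines. The synthetic data, plain linear model and learning-rate choices below are purely illustrative.

```python
# Illustrative sketch of federated averaging (FedAvg) with a linear model:
# each site trains locally on its own private data and shares only parameters,
# never the underlying record-level data.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])

def make_local_data(n):
    """Generate one site's private (X, y) sample around a shared ground truth."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

sites = [make_local_data(n) for n in (200, 500, 300)]  # three institutions

def local_update(w, X, y, lr=0.1, epochs=20):
    """Gradient-descent steps on one site's data, starting from the global model."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

w_global = np.zeros(2)
for _ in range(10):  # federated rounds
    # Each site refines the current global model locally...
    local_models = [local_update(w_global, X, y) for X, y in sites]
    # ...and the coordinator aggregates the parameters, weighted by sample size.
    sizes = np.array([len(y) for _, y in sites])
    w_global = np.average(local_models, axis=0, weights=sizes)

print("federated estimate:", w_global, "true weights:", true_w)
```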

Predicting Socio-Economic Well-being Using Mobile Apps Data: A Case Study of France


Paper by Rahul Goel, Angelo Furno, and Rajesh Sharma: “Socio-economic indicators provide context for assessing a country’s overall condition. These indicators contain information about education, gender, poverty, employment, and other factors. Therefore, reliable and accurate information is critical for social research and government policymaking. Most data sources available today, such as censuses, have sparse population coverage or are updated infrequently. Nonetheless, alternative data sources, such as call data records (CDR) and mobile app usage, can serve as cost-effective and up-to-date sources for identifying socio-economic indicators.
This work investigates mobile app data to predict socio-economic features. We present a large-scale study using data that captures the traffic of thousands of mobile applications by approximately 30 million users distributed over 550,000 km² and served by over 25,000 base stations. The dataset covers the whole of France and spans more than 2.5 months, from 16 March 2019 to 6 June 2019. Using the app usage patterns, our best model can estimate socio-economic indicators (attaining an R-squared score of up to 0.66). Furthermore, using model explainability, we discover that mobile app usage patterns have the potential to reveal socio-economic disparities at the IRIS level. Insights from this study suggest several avenues for future interventions, including temporal analysis of users’ networks and exploration of alternative data sources…(More)”.
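
The modelling setup described in the abstract can be illustrated with a small hedged sketch: a regression from per-area app-usage features to a socio-economic indicator, evaluated with R² and inspected through feature importances. The data is synthetic and the scikit-learn random forest, feature names and figures here are assumptions for illustration; the authors' actual features, models and explainability method may differ.

```python
# Illustrative sketch (synthetic data): predict a socio-economic indicator for
# small areas from mobile app usage features, then inspect which app categories
# drive the prediction. Not the authors' actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
app_categories = ["streaming", "banking", "gaming", "news", "social"]
n_areas = 1000

# Synthetic per-area traffic shares for each app category (rows sum to 1).
X = rng.dirichlet(np.ones(len(app_categories)), size=n_areas)
# Synthetic indicator (e.g. median income), loosely tied to two categories.
y = 30_000 + 40_000 * X[:, 1] - 15_000 * X[:, 2] + rng.normal(scale=3_000, size=n_areas)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("R^2 on held-out areas:", round(r2_score(y_test, model.predict(X_test)), 2))
for name, importance in zip(app_categories, model.feature_importances_):
    print(f"{name:10s} importance: {importance:.2f}")
```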