Open Data on GitHub: Unlocking the Potential of AI


Paper by Anthony Cintron Roman, Kevin Xu, Arfon Smith, Jehu Torres Vega, Caleb Robinson, Juan M Lavista Ferres: “GitHub is the world’s largest platform for collaborative software development, with over 100 million users. GitHub is also used extensively for open data collaboration, hosting more than 800 million open data files, totaling 142 terabytes of data. This study highlights the potential of open data on GitHub and demonstrates how it can accelerate AI research. We analyze the existing landscape of open data on GitHub and the patterns of how users share datasets. Our findings show that GitHub is one of the largest hosts of open data in the world and has experienced an accelerated growth of open data assets over the past four years. By examining the open data landscape on GitHub, we aim to empower users and organizations to leverage existing open datasets and improve their discoverability, ultimately contributing to the ongoing AI revolution to help address complex societal issues. We release the three datasets that we have collected to support this analysis as open datasets…(More)”.

How Does Data Access Shape Science?


Paper by Abhishek Nagaraj & Matteo Tranchero: “This study examines the impact of access to confidential administrative data on the rate, direction, and policy relevance of economics research. To study this question, we exploit the progressive geographic expansion of the U.S. Census Bureau’s Federal Statistical Research Data Centers (FSRDCs). FSRDCs boost data diffusion, help empirical researchers publish more articles in top outlets, and increase citation-weighted publications. Besides direct data usage, spillovers to non-adopters also drive this effect. Further, citations to exposed researchers in policy documents increase significantly. Our findings underscore the importance of data access for scientific progress and evidence-based policy formulation…(More)”.

Critical factors influencing information disclosure in public organisations


Paper by Francisca Tejedo-Romero & Joaquim Filipe Ferraz Esteves Araujo: “Open government initiatives around the world and the passage of freedom of information laws are opening public organisations through information disclosure to ensure transparency and encourage citizen participation and engagement. At the municipal level, social, economic, and political factors are found to account for this trend. However, the findings on this issue are inconclusive and may differ from country to country. This paper contributes to this discussion by analysing a unitary country where the same set of laws and rules governs the constituent municipalities. It seeks to identify critical factors that affect the disclosure of municipal information. For this purpose, a longitudinal study was carried out over a period of 4 years using panel data methodology. The main conclusions seem to point to municipalities’ intention to increase the dissemination of information to reduce low levels of voter turnout and increase civic involvement and political participation. Municipalities governed by leftist parties and those that have high indebtedness are most likely to disclose information. Additionally, internet access has created new opportunities for citizens to access information, which exerts pressure for greater dissemination of information by municipalities. These findings are important to practitioners because they indicate the need to improve citizens’ access to the Internet and maintain information disclosure strategies beyond election periods…(More)”.

Towards High-Value Datasets determination for data-driven development: a systematic literature review


Paper by Anastasija Nikiforova, Nina Rizun, Magdalena Ciesielska, Charalampos Alexopoulos, and Andrea Miletič: “The OGD is seen as a political and socio-economic phenomenon that promises to promote civic engagement and stimulate public sector innovations in various areas of public life. To bring the expected benefits, data must be reused and transformed into value-added products or services. This, in turn, sets another precondition for data that are expected to not only be available and comply with open data principles, but also be of value, i.e., of interest for reuse by the end-user. This refers to the notion of ‘high-value dataset’ (HVD), recognized by the European Data Portal as a key trend in the OGD area in 2022. While there is progress in this direction, e.g., the Open Data Directive, incl. identifying 6 key categories, a list of HVDs and arrangements for their publication and re-use, they can be seen as ‘core’ / ‘base’ datasets aimed at increasing interoperability of public sector data with a high priority, contributing to the development of a more mature OGD initiative. Depending on the specifics of a region and country – geographical location, social, environmental, economic issues, cultural characteristics, (under)developed sectors and market specificities, more datasets can be recognized as of high value for a particular country. However, there is no standardized approach to assist chief data officers in this. In this paper, we present a systematic review of existing literature on the HVD determination, which is expected to form an initial knowledge base for this process, incl. used approaches and indicators to determine them, data, stakeholders…(More)”.

For chemists, the AI revolution has yet to happen


Editorial Team at Nature: “Many people are expressing fears that artificial intelligence (AI) has gone too far — or risks doing so. Take Geoffrey Hinton, a prominent figure in AI, who recently resigned from his position at Google, citing the desire to speak out about the technology’s potential risks to society and human well-being.

But against those big-picture concerns, in many areas of science you will hear a different frustration being expressed more quietly: that AI has not yet gone far enough. One of those areas is chemistry, for which machine-learning tools promise a revolution in the way researchers seek and synthesize useful new substances. But a wholesale revolution has yet to happen — because of the lack of data available to feed hungry AI systems.

Any AI system is only as good as the data it is trained on. These systems rely on what are called neural networks, which their developers teach using training data sets that must be large, reliable and free of bias. If chemists want to harness the full potential of generative-AI tools, they need to help to establish such training data sets. More data are needed — both experimental and simulated — including historical data and otherwise obscure knowledge, such as that from unsuccessful experiments. And researchers must ensure that the resulting information is accessible. This task is still very much a work in progress…(More)”.

What do data portals do? Tracing the politics of online devices for making data public


Paper by Jonathan Gray: “The past decade has seen the rise of ‘data portals’ as online devices for making data public. They have been accorded a prominent status in political speeches, policy documents, and official communications as sites of innovation, transparency, accountability, and participation. Drawing on research on data portals around the world, data portal software, and associated infrastructures, this paper explores three approaches for studying the social life of data portals as technopolitical devices: (a) interface analysis, (b) software analysis, and (c) metadata analysis. These three approaches contribute to the study of the social lives of data portals as dynamic, heterogeneous, and contested sites of public sector datafication. They are intended to contribute to critically assessing how participation around public sector datafication is invited and organized with portals, as well as to rethinking and recomposing them…(More)”.

Why Does Open Data Get Underused? A Focus on the Role of (Open) Data Literacy


Paper by Gema Santos-Hermosa et al.: “Open data has been conceptualised as a strategic form of public knowledge. Tightly connected with the developments in open government and open science, the main claim is that access to open data (OD) might be a catalyser of social innovation and citizen empowerment. Nevertheless, the so-called (open) data divide, as a problem connected to the situation of OD usage and engagement, is a concern.

In this chapter, we introduce the OD usage trends, focusing on the role played by (open) data literacy amongst either users or producers: citizens, professionals, and researchers. Indeed, we attempted to cover the problem of OD through a holistic approach including two areas of research and practice: open government data (OGD) and open research data (ORD). After uncovering several factors blocking OD consumption, we point out that more OD is being published (albeit with low usage), and we overview the research on data literacy. While the intentions of stakeholders are driven by many motivations, the abilities that put them in the condition to enhance OD might require further attention. In the end, we focus on several lifelong learning activities supporting open data literacy, uncovering the challenges ahead to unleash the power of OD in society…(More)”.

AI-Ready Open Data


Explainer by Sean Long and Tom Romanoff: “Artificial intelligence and machine learning (AI/ML) have the potential to create applications that tackle societal challenges from human health to climate change. These applications, however, require data to power AI model development and implementation. Government’s vast amount of open data can fill this gap: McKinsey estimates that open data can help unlock $3 trillion to $5 trillion in economic value annually across seven sectors. But for open data to fuel innovations in academia and the private sector, the data must be both easy to find and use. While Data.gov makes it simpler to find the federal government’s open data, researchers still spend up to 80% of their time preparing data into a usable, AI-ready format. As Intel warns, “You’re not AI-ready until your data is.”

In this explainer, the Bipartisan Policy Center provides an overview of existing efforts across the federal government to improve the AI readiness of its open data. We answer the following questions:

  • What is AI-ready data?
  • Why is AI-ready data important to the federal government’s AI agenda?
  • Where is AI-ready data being applied across federal agencies?
  • How could AI-ready data become the federal standard?…(More)”.

Rethinking the impact of open data: A first step towards a European impact assessment for open data


Report for data.europa.eu: “This report is the first in a series of four that aims to establish a standard methodology for open data impact assessments that can be used across Europe. This exercise is key because a consistent definition of the impact of open data does not exist. The lack of a robust, conceptual foundation has made it more difficult for data portals to demonstrate their value through empirical evidence. It also challenges the EU’s ability to understand and compare performance across Member States. Most academic articles that look to explore the impact of data refer to existing open data frameworks, with the open data maturity (ODM) and open data barometer (ODB) ones most frequently represented. These two frameworks distinguish between different kinds of impact, and both mention social, political and economic impacts in particular. The ODM also includes the environmental impact in its framework.

Sometimes, these frameworks diverge from the European Commission’s own recommendations of how best to measure impact, as explained in specific sections of the better regulation guidelines and the better regulation toolbox. They help to answer a critical question for policymakers: do the benefits provided outweigh the costs of assembling and distributing (open) data? Future reports in this series will further explore how to better align existing frameworks, such as the ODM, with these critically important guidelines…(More)”.

Ready, set, share: Researchers brace for new data-sharing rules


Jocelyn Kaiser and Jeffrey Brainard in Science: “…By 2025, new U.S. requirements for data sharing will extend beyond biomedical research to encompass researchers across all scientific disciplines who receive federal research funding. Some funders in the European Union and China have also enacted data-sharing requirements. The new U.S. moves are feeding hopes that a worldwide movement toward increased sharing is in the offing. Supporters think it could speed the pace and reliability of science.

Some scientists may only need to make a few adjustments to comply with the policies. That’s because data sharing is already common in fields such as protein crystallography and astronomy. But in other fields the task could be weighty, because sharing is often an afterthought. For example, a study involving 7750 medical research papers found that just 9% of those published from 2015 to 2020 promised to make their data publicly available, and authors of just 3% actually shared, says lead author Daniel Hamilton of the University of Melbourne, who described the finding at the International Congress on Peer Review and Scientific Publication in September 2022. Even when authors promise to share their data, they often fail to follow through. Out of 21,000 journal articles that included data-sharing plans, a study published in PLOS ONE in 2020 found, fewer than 21% provided links to the repository storing the data.

Journals and funders, too, have a mixed record when it comes to supporting data sharing. Research presented at the September 2022 peer-review congress found only about half of the 110 largest public, corporate, and philanthropic funders of health research around the world recommend or require grantees to share data…

“Health research is the field where the ethical obligation to share data is the highest,” says Aidan Tan, a clinician-researcher at the University of Sydney who led the study. “People volunteer in clinical trials and put themselves at risk to advance medical research and ultimately improve human health.”

Across many fields of science, researchers’ support for sharing data has increased during the past decade, surveys show. But given the potential cost and complexity, many are apprehensive about the NIH policy, and other requirements to follow. “How we get there is pretty messy right now,” says Parker Antin, a developmental biologist and associate vice president for research at the University of Arizona. “I’m really not sure whether the total return will justify the cost. But I don’t know of any other way to find out than trying to do it.”

Science offers this guide as researchers prepare to plunge in…(More)”.