The State of Open Data Policy Repository


The State of Open Data Policy Repository is a collection of recent policy developments surrounding open data, data reuse, and data collaboration around the world. 

A refinement of the compilation of policies launched at the Open Data Policy Summit last year, the State of Open Data Policy Online Repository is an interactive resource that tracks recent legislation, directives, and proposals affecting open data and data collaboration around the world. It captures which data collaboration issues policymakers are currently focused on and where the momentum for data innovation is heading in countries around the world.

Users can filter policies according to region, country, focus, and type of data sharing. The review has so far surfaced approximately 60 examples of recent legislative acts, proposals, directives, and other policy documents, from which the Open Data Policy Lab draws findings about the need to promote more innovative policy frameworks.

This collection shows that, despite increased interest in the third-wave conception of open data, policy development remains nascent. It is primarily concerned with open data repositories at the expense of alternative forms of collaboration. Most policies listed focus on releasing government data, and most nations still lack open data rules or a mechanism to put such policies into practice.

This work reveals a pressing need for institutions to create frameworks that can guide data professionals, since inaction risks both the misuse of data and missed opportunities to use it…(More)”.

Commission defines high-value datasets to be made available for re-use


Press Release: “Today, the Commission has published a list of high-value datasets that public sector bodies will have to make available for re-use, free of charge, within 16 months.

Certain public sector data, such as meteorological or air quality data, are particularly interesting for creators of value-added services and applications and have important benefits for society, the environment and the economy – which is why they should be made available to the public…

The Regulation is set up under the Open Data Directive, which defines six categories of such high-value datasets: geospatial, earth observation and environment, meteorological, statistics, companies and mobility. This thematic range can be extended at a later stage to reflect technological and market developments. The datasets will be available in machine-readable format, via an Application Programming Interface and, where relevant, as bulk download.

The increased availability of data will boost entrepreneurship and result in the creation of new companies. High-value datasets can be an important resource for SMEs to develop new digital products and services, and therefore also an enabler helping them to attract investors. The re-use of datasets such as mobility or geolocalisation of buildings can open business opportunities for the logistics or transport sectors, as well as improve the efficiency of public service delivery, for example by understanding traffic flows to make transport more efficient. Meteorological observation data, radar data, air quality and soil contamination data can also support research and digital innovation as well as better-informed policymaking, especially in the fight against climate change….(More)”. See also: List of specific high-value datasets
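
The press release notes that these datasets must be offered in machine-readable form, through APIs and, where relevant, bulk download. As a rough illustration of what programmatic re-use could look like, the sketch below queries a hypothetical air-quality endpoint with Python's requests library; the URL, query parameters, and field names are placeholders rather than an actual Commission or member-state API.

```python
# A minimal sketch of consuming a high-value dataset exposed via an API.
# The endpoint URL, parameters, and field names are hypothetical placeholders;
# actual datasets are published by the individual public sector bodies.
import requests

API_URL = "https://example-environment-agency.eu/api/v1/air-quality"  # hypothetical

response = requests.get(
    API_URL,
    params={"pollutant": "no2", "city": "Brussels"},  # assumed query parameters
    timeout=30,
)
response.raise_for_status()

# Print each observation; key names are assumptions for illustration.
for record in response.json().get("results", []):
    print(record.get("station"), record.get("timestamp"), record.get("value"))
```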

Studying open government data: Acknowledging practices and politics


Paper by Gijs van Maanen: “Open government and open data are often presented as the Asterix and Obelix of modern government—one cannot discuss one without involving the other. Modern government, in this narrative, should open itself up, be more transparent, and allow the governed to have a say in their governance. The usage of technologies, and especially the communication of governmental data, is then thought to be one of the crucial instruments helping governments achieve these goals. Much open government data research, hence, focuses on the publication of open government data, their reuse, and re-users. Recent research trends, by contrast, divert from this focus on data and emphasize the importance of studying open government data in practice, in interaction with practitioners, while simultaneously paying attention to their political character. This commentary looks more closely at the implications of emphasizing the practical and political dimensions of open government data. It argues that researchers should explicate how and in what way open government data policies present solutions to what kind of problems. Such explications should be based on a detailed empirical analysis of how different actors do or do not do open data. The key question to be continuously asked and answered when studying and implementing open government data is how the solutions openness presents latch onto the problems they aim to solve…(More)”.

ResearchDataGov


ResearchDataGov.org is a product of the federal statistical agencies and units, created in response to the Foundations for Evidence-Based Policymaking Act of 2018. The site is the single portal for discovery of restricted data in the federal statistical system. The agencies have provided detailed descriptions of each data asset. Users can search for data by topic, agency, and keywords. Questions related to the data should be directed to the owning agency, using the contact information on the page that describes the data. In late 2022, users will be able to apply for access to these data using a single-application process built into ResearchDataGov. ResearchDataGov.org is built by and hosted at ICPSR at the University of Michigan, under contract and guidance from the National Center for Science and Engineering Statistics within the National Science Foundation.

The data described in ResearchDataGov.org are owned by and accessed through the agencies and units of the federal statistical system. Data access is determined by the owning or distributing agency and is limited to specific physical or virtual data enclaves. Even though all data assets are listed in a single inventory, they are not necessarily available for use in the same location(s). Multiple data assets accessed in the same location may nevertheless not be usable together, due to disclosure risk and other requirements. Please note the access modality of the data in which you are interested and seek guidance from the owning agency about whether assets can be linked or otherwise used together…(More)”.

A Landscape of Open Science Policies Research


Paper by Alejandra Manco: “This literature review examines how open science policy is approached in different studies. The main findings are that the approach to open science has several aspects: policy framing and its geopolitical dimensions are described as a tool for replicating asymmetries and for epistemic governance. The main geopolitical aspects of open science policies described in the literature are the relations between international, regional, and national policies. Different components of open science are also covered in the literature: open data is much discussed in English-language works, while open access is the main component discussed in Portuguese- and Spanish-language papers. Finally, the relationship between open science policies and science policy more broadly is framed by highlighting the innovation and transparency that open science can bring to it…(More)”

Explore the first Open Science Indicators dataset


Article by Lauren Cadwallader, Lindsay Morton, and Iain Hrynaszkiewicz: “Open Science is on the rise. We can infer as much from the proliferation of Open Access publishing options; the steady upward trend in bioRxiv postings; the periodic rollout of new national, institutional, or funder policies. 

But what do we actually know about the day-to-day realities of Open Science practice? What are the norms? How do they vary across different research subject areas and regions? Are Open Science practices shifting over time? Where might the next opportunity lie and where do barriers to adoption persist? 

To even begin exploring these questions and others like them we need to establish a shared understanding of how we define and measure Open Science practices. We also need to understand the current state of adoption in order to track progress over time. That’s where the Open Science Indicators project comes in. PLOS conceptualized a framework for measuring Open Science practices according to the FAIR principles, and partnered with DataSeer to develop a set of numerical “indicators” linked to specific Open Science characteristics and behaviors observable in published research articles. Our very first dataset, now available for download at Figshare, focuses on three Open Science practices: data sharing, code sharing, and preprint posting…(More)”.
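
As a rough illustration of how such a release might be explored, the sketch below loads the indicators table with pandas and computes adoption rates for the three practices; the file name and column names are assumptions for illustration, so the actual schema documented on Figshare should be checked before use.

```python
# A minimal sketch of exploring the Open Science Indicators dataset with pandas.
# The file name and column names are illustrative assumptions, not the actual schema.
import pandas as pd

df = pd.read_csv("open_science_indicators.csv")  # downloaded from the Figshare record

# Assumed boolean-style indicator columns for the three practices in the release.
indicators = ["data_sharing", "code_sharing", "preprint_posted"]

for col in indicators:
    rate = df[col].mean()  # share of articles exhibiting the practice
    print(f"{col}: {rate:.1%} of articles")

# Example: how practices vary by research field (assumed 'field' column).
print(df.groupby("field")[indicators].mean())
```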

Smart OCR – Advancing the Use of Artificial Intelligence with Open Data


Article by Parth Jain, Abhinay Mannepalli, Raj Parikh, and Jim Samuel: “Optical character recognition (OCR) is growing at a projected compounded annual growth rate (CAGR) of 16%, and is expected to have a value of 39.7 billion USD by 2030, as estimated by Straits Research. There has been a growing interest in OCR technologies over the past decade. Optical character recognition is the technological process for transforming images of typed, handwritten, scanned, or printed texts into machine-encoded and machine-readable texts (Tappert et al., 1990). OCR can be used with a broad range of image or scan formats – for example, these could be in the form of a scanned document such as a .pdf file, a picture of a piece of paper in .png or .jpeg format, or images with embedded text, such as characters on a coffee cup, the title on the cover page of a book, the license number on vehicular plates, or images of code on websites. OCR has proven to be a valuable technological process for tackling the important challenge of transforming non-machine-readable data into machine-readable data. This enables the use of natural language processing and computational methods on information-rich data which were previously largely non-processable. Given the broad array of scanned and image documents in open government data and other open data sources, OCR holds tremendous promise for value generation with open data.

Open data has been defined as “being data that is made freely available for open consumption, at no direct cost to the public, which can be efficiently located, filtered, downloaded, processed, shared, and reused without any significant restrictions on associated derivatives, use, and reuse” (Chidipothu et al., 2022). Large segments of open data contain images, visuals, scans, and other non-machine-readable content. The size and complexity associated with the manual analysis of such content is prohibitive. The most efficient way would be to establish standardized processes for transforming documents into their OCR output versions. Such machine-readable text could then be analyzed using a range of NLP methods. Artificial Intelligence (AI) can be viewed as being a “set of technologies that mimic the functions and expressions of human intelligence, specifically cognition and logic” (Samuel, 2021). OCR was one of the earliest AI technologies implemented. The first ever optical reader to identify handwritten numerals was the advanced reading machine “IBM 1287,” presented at the World Fair in New York in 1965 (Mori et al., 1990). The value of open data is well established – however, the extent of usefulness of open data is dependent on “accessibility, machine readability, quality” and the degree to which data can be processed by using analytical and NLP methods (data.gov, 2022; John et al., 2022)…(More)”
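
As a rough illustration of the OCR-to-NLP pipeline the article describes, the sketch below extracts text from a scanned image with pytesseract and applies a trivial NLP step (term frequencies); it is an illustrative example rather than the authors' implementation, and the file name is a placeholder.

```python
# A minimal sketch of an OCR-to-NLP pipeline: convert an image of text into
# machine-readable text, then apply a simple NLP step (term frequency counts).
from collections import Counter

from PIL import Image
import pytesseract

# Step 1: OCR - convert a scanned page (e.g., from an open data portal) to text.
image = Image.open("scanned_document.png")  # placeholder file name
text = pytesseract.image_to_string(image)

# Step 2: a trivial NLP step - tokenize and count the most frequent terms.
tokens = [t.lower() for t in text.split() if t.isalpha()]
print(Counter(tokens).most_common(10))
```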

Is Facebook’s advertising data accurate enough for use in social science research? Insights from a cross-national online survey


Paper by André Grow et al: “Social scientists increasingly use Facebook’s advertising platform for research, either in the form of conducting digital censuses of the general population, or for recruiting participants for survey research. Both approaches depend on the accuracy of the data that Facebook provides about its users, but little is known about how accurate these data are. We address this gap in a large-scale, cross-national online survey (N = 137,224), in which we compare self-reported and Facebook-classified demographic information (sex, age and region of residence). Our results suggest that Facebook’s advertising platform can be fruitfully used for conducting social science research if additional steps are taken to assess the accuracy of the characteristics under consideration…(More)”.

The Data Liberation Project 


About: “The Data Liberation Project is a new initiative I’m launching today to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest. Vast troves of government data are inaccessible to the people and communities who need them most. The Process:

  • Identify: Through its own research, as well as through consultations with journalists, community groups, government-data experts, and others, the Data Liberation Project aims to identify a large number of datasets worth pursuing.
  • Obtain: The Data Liberation Project plans to use a wide range of methods to obtain the datasets, including via Freedom of Information Act requests, intervening in lawsuits, web-scraping, and advanced document parsing. To improve public knowledge about government data systems, the Data Liberation Project also files FOIA requests for essential metadata, such as database schemas, record layouts, data dictionaries, user guides, and glossaries.
  • Reformat: Many datasets are delivered to journalists and the public in difficult-to-use formats. Some may follow arcane conventions or require proprietary software to access, for instance. The Data Liberation Project will convert these datasets into open formats, and restructure them so that they can be more easily examined (a minimal sketch of this step appears after this list).
  • Clean: The Data Liberation Project will not alter the raw records it receives. But when the messiness of datasets inhibits their usefulness, the project will create secondary, “clean” versions of datasets that fix these problems.
  • Document: Datasets are meaningless without context, and practically useless without documentation. The Data Liberation Project will gather official documentation for each dataset into a central location. It will also fill observed gaps in the documentation through its own research, interviews, and analysis.
  • Disseminate: The Data Liberation Project will not expect reporters and other members of the public simply to stumble upon these datasets. Instead, it will reach out to the newsrooms and communities that stand to benefit most from the data. The project will host hands-on workshops, webinars, and other events to help others to understand and use the data.”…(More)”
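
As a rough illustration of the “Reformat” step described above, the sketch below converts a fixed-width agency extract into an open CSV with pandas; the field positions, column names, and file names are hypothetical stand-ins for a real record layout.

```python
# A minimal sketch of the "Reformat" step: converting a fixed-width government
# extract into an open, easily examined CSV. The column positions and names are
# hypothetical; a real conversion would follow the dataset's actual record layout.
import pandas as pd

colspecs = [(0, 9), (9, 49), (49, 51), (51, 61)]  # assumed field positions
names = ["record_id", "facility_name", "state", "inspection_date"]

df = pd.read_fwf("agency_extract.txt", colspecs=colspecs, names=names)
df["inspection_date"] = pd.to_datetime(df["inspection_date"], errors="coerce")

# Write an open, machine-readable version alongside the untouched raw file.
df.to_csv("agency_extract_clean.csv", index=False)
```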