Commission defines high-value datasets to be made available for re-use


Press Release: “Today, the Commission has published a list of high-value datasets that public sector bodies will have to make available for re-use, free of charge, within 16 months.

Certain public sector data, such as meteorological or air quality data, are particularly interesting for creators of value-added services and applications and have important benefits for society, the environment and the economy – which is why they should be made available to the public…

The Regulation is set up under the Open Data Directive, which defines six categories of such high-value datasets: geospatial, earth observation and environment, meteorological, statistics, companies and mobility. This thematic range can be extended at a later stage to reflect technological and market developments. The datasets will be available in machine-readable format, via an Application Programming Interface and, where relevant, as bulk download.

The increased availability of data will boost entrepreneurship and result in the creation of new companies. High-value datasets can be an important resource for SMEs to develop new digital products and services, and therefore also an enabler helping them to attract investors. The re-use of datasets such as mobility or geolocalisation of buildings can open business opportunities for the logistics or transport sectors, as well as improve the efficiency of public service delivery, for example by understanding traffic flows to make transport more efficient. Meteorological observation data, radar data, air quality and soil contamination data can also support research and digital innovation as well as better-informed policymaking, especially in the fight against climate change….(More)”. See also: List of specific high-value datasets
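
The Regulation envisages programmatic access: the datasets are to be published in machine-readable form, via APIs and, where relevant, as bulk downloads. As a hedged illustration of what such re-use might look like in practice, the sketch below fetches observations from a hypothetical air-quality endpoint; the URL, parameters, and response fields are placeholders, not a real EU or national API.

```python
# Hedged sketch: fetching a high-value dataset from a (hypothetical) open-data API.
# The endpoint, parameters, and response shape below are placeholders for
# illustration, not a real EU or national service.
import requests

API_URL = "https://example.org/opendata/api/v1/air-quality"  # placeholder endpoint

def fetch_air_quality(station_id: str, date: str) -> list[dict]:
    """Request machine-readable observations for one station and one day."""
    response = requests.get(
        API_URL,
        params={"station": station_id, "date": date, "format": "json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["observations"]  # placeholder response field

if __name__ == "__main__":
    records = fetch_air_quality(station_id="BE-BRU-041", date="2023-01-20")
    print(f"Fetched {len(records)} observations")
```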

Studying open government data: Acknowledging practices and politics


Paper by Gijs van Maanen: “Open government and open data are often presented as the Asterix and Obelix of modern government—one cannot discuss one without involving the other. Modern government, in this narrative, should open itself up, be more transparent, and allow the governed to have a say in their governance. The use of technologies, and especially the communication of governmental data, is then thought to be one of the crucial instruments helping governments achieve these goals. Much open government data research hence focuses on the publication of open government data, their reuse, and re-users. Recent research trends, by contrast, diverge from this focus on data and emphasize the importance of studying open government data in practice, in interaction with practitioners, while simultaneously paying attention to their political character. This commentary looks more closely at the implications of emphasizing the practical and political dimensions of open government data. It argues that researchers should explicate how and in what way open government data policies present solutions, and to what kinds of problems. Such explications should be based on a detailed empirical analysis of how different actors do or do not do open data. The key question to be continuously asked and answered when studying and implementing open government data is how the solutions openness presents latch onto the problems they aim to solve…(More)”.

ResearchDataGov


ResearchDataGov.org is a product of the federal statistical agencies and units, created in response to the Foundations of Evidence-based Policymaking Act of 2018. The site is the single portal for discovery of restricted data in the federal statistical system. The agencies have provided detailed descriptions of each data asset. Users can search for data by topic, agency, and keywords. Questions related to the data should be directed to the owning agency, using the contact information on the page that describes the data. In late 2022, users will be able to apply for access to these data using a single-application process built into ResearchDataGov. ResearchDataGov.org is built by and hosted at ICPSR at the University of Michigan, under contract and guidance from the National Center for Science and Engineering Statistics within the National Science Foundation.

The data described in ResearchDataGov.org are owned by and accessed through the agencies and units of the federal statistical system. Data access is determined by the owning or distributing agency and is limited to specific physical or virtual data enclaves. Even though all data assets are listed in a single inventory, they are not necessarily available for use in the same location(s). Multiple data assets accessed in the same location may not be able to be used together due to disclosure risk and other requirements. Please note the access modality of the data in which you are interested and seek guidance from the owning agency about whether assets can be linked or otherwise used together…(More)”.

A Landscape of Open Science Policies Research


Paper by Alejandra Manco: “This literature review examines how different studies approach open science policy. The main findings are that open science is approached from several angles: policy framing and its geopolitical aspects are described as a tool for replicating asymmetries and for epistemic governance. The main geopolitical aspects of open science policies described in the literature are the relations between international, regional, and national policies. Different components of open science are also covered in the literature: open data is much discussed in English-language works, while open access is the main component discussed in Portuguese- and Spanish-language papers. Finally, the relationship between open science policies and science policy more broadly is framed by highlighting the innovation and transparency that open science can bring to it…(More)”

Explore the first Open Science Indicators dataset


Article by Lauren Cadwallader, Lindsay Morton, and Iain Hrynaszkiewicz: “Open Science is on the rise. We can infer as much from the proliferation of Open Access publishing options; the steady upward trend in bioRxiv postings; the periodic rollout of new national, institutional, or funder policies. 

But what do we actually know about the day-to-day realities of Open Science practice? What are the norms? How do they vary across different research subject areas and regions? Are Open Science practices shifting over time? Where might the next opportunity lie and where do barriers to adoption persist? 

To even begin exploring these questions and others like them we need to establish a shared understanding of how we define and measure Open Science practices. We also need to understand the current state of adoption in order to track progress over time. That’s where the Open Science Indicators project comes in. PLOS conceptualized a framework for measuring Open Science practices according to the FAIR principles, and partnered with DataSeer to develop a set of numerical “indicators” linked to specific Open Science characteristics and behaviors observable in published research articles. Our very first dataset, now available for download at Figshare, focuses on three Open Science practices: data sharing, code sharing, and preprint posting…(More)”.
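
As a rough sketch of how such a dataset might be explored once downloaded from Figshare, the snippet below summarizes the three indicator practices by year with pandas; the file name and column names are assumptions for illustration, not the actual schema of the Open Science Indicators release.

```python
# Hedged sketch: exploring the Open Science Indicators dataset with pandas.
# The file name and column names ("data_sharing", "code_sharing",
# "preprint_posted", "year") are assumptions; check the Figshare record
# for the actual schema.
import pandas as pd

df = pd.read_csv("open-science-indicators.csv")  # downloaded from Figshare

# Share of articles per year exhibiting each practice (assumed boolean columns).
practices = ["data_sharing", "code_sharing", "preprint_posted"]
summary = df.groupby("year")[practices].mean().round(3)
print(summary)
```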

Smart OCR – Advancing the Use of Artificial Intelligence with Open Data


Article by Parth Jain, Abhinay Mannepalli, Raj Parikh, and Jim Samuel: “Optical character recognition (OCR) is growing at a projected compound annual growth rate (CAGR) of 16%, and is expected to reach a value of 39.7 billion USD by 2030, as estimated by Straits Research. There has been growing interest in OCR technologies over the past decade. Optical character recognition is the technological process for transforming images of typed, handwritten, scanned, or printed texts into machine-encoded and machine-readable texts (Tappert et al., 1990). OCR can be used with a broad range of image or scan formats – for example, a scanned document such as a .pdf file, a picture of a piece of paper in .png or .jpeg format, or images with embedded text, such as characters on a coffee cup, the title on the cover of a book, the license number on a vehicle plate, or images of code on websites. OCR has proven to be a valuable technological process for tackling the important challenge of transforming non-machine-readable data into machine-readable data. This enables the use of natural language processing and computational methods on information-rich data that were previously largely non-processable. Given the broad array of scanned and image documents in open government data and other open data sources, OCR holds tremendous promise for value generation with open data.

Open data has been defined as “being data that is made freely available for open consumption, at no direct cost to the public, which can be efficiently located, filtered, downloaded, processed, shared, and reused without any significant restrictions on associated derivatives, use, and reuse” (Chidipothu et al., 2022). Large segments of open data contain images, visuals, scans, and other non-machine-readable content. The size and complexity associated with manual analysis of such content are prohibitive. The most efficient approach is to establish standardized processes for transforming documents into their OCR output versions. Such machine-readable text can then be analyzed using a range of NLP methods. Artificial Intelligence (AI) can be viewed as a “set of technologies that mimic the functions and expressions of human intelligence, specifically cognition and logic” (Samuel, 2021). OCR was one of the earliest AI technologies implemented. The first optical reader to identify handwritten numerals was the advanced reading machine “IBM 1287,” presented at the World’s Fair in New York in 1965 (Mori et al., 1990). The value of open data is well established – however, the extent of its usefulness depends on “accessibility, machine readability, quality” and the degree to which data can be processed using analytical and NLP methods (data.gov, 2022; John et al., 2022)…(More)”
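
To make the OCR step concrete, here is a minimal sketch using the open-source Tesseract engine through the pytesseract wrapper; it illustrates the general technique rather than the specific tooling used in the article, and the file name is a placeholder. The resulting plain text can then be passed to standard NLP pipelines, which is the transformation the authors describe.

```python
# Minimal OCR sketch using the open-source Tesseract engine via pytesseract
# (requires the tesseract binary to be installed). Illustrative only; the file
# path is a placeholder, not the pipeline described in the article.
from PIL import Image
import pytesseract

def image_to_text(image_path: str) -> str:
    """Convert a scanned page or photo into machine-readable text."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang="eng")

text = image_to_text("scanned_form.png")  # placeholder file name
print(text[:500])
```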

Is Facebook’s advertising data accurate enough for use in social science research? Insights from a cross-national online survey


Paper by André Grow et al: “Social scientists increasingly use Facebook’s advertising platform for research, either in the form of conducting digital censuses of the general population, or for recruiting participants for survey research. Both approaches depend on the accuracy of the data that Facebook provides about its users, but little is known about how accurate these data are. We address this gap in a large-scale, cross-national online survey (N = 137,224), in which we compare self-reported and Facebook-classified demographic information (sex, age and region of residence). Our results suggest that Facebook’s advertising platform can be fruitfully used for conducting social science research if additional steps are taken to assess the accuracy of the characteristics under consideration…(More)”.
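
A minimal sketch of the kind of accuracy check described above, assuming a table with paired self-reported and platform-classified columns; the file and column names are hypothetical and do not reflect the authors’ data or code.

```python
# Hedged sketch: comparing self-reported and platform-classified attributes.
# Column and file names are assumptions for illustration only.
import pandas as pd

df = pd.read_csv("survey_responses.csv")  # placeholder file

for attribute in ["sex", "age_group", "region"]:
    # Share of respondents whose self-report matches the platform classification.
    match_rate = (df[f"self_{attribute}"] == df[f"fb_{attribute}"]).mean()
    print(f"{attribute}: {match_rate:.1%} agreement")
```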

The Data Liberation Project 


About: “The Data Liberation Project is a new initiative I’m launching today to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest. Vast troves of government data are inaccessible to the people and communities who need them most. The Process:

  • Identify: Through its own research, as well as through consultations with journalists, community groups, government-data experts, and others, the Data Liberation Project aims to identify a large number of datasets worth pursuing.
  • Obtain: The Data Liberation Project plans to use a wide range of methods to obtain the datasets, including via Freedom of Information Act requests, intervening in lawsuits, web-scraping, and advanced document parsing. To improve public knowledge about government data systems, the Data Liberation Project also files FOIA requests for essential metadata, such as database schemas, record layouts, data dictionaries, user guides, and glossaries.
  • Reformat: Many datasets are delivered to journalists and the public in difficult-to-use formats. Some may follow arcane conventions or require proprietary software to access, for instance. The Data Liberation Project will convert these datasets into open formats, and restructure them so that they can be more easily examined (a brief sketch of this step follows the list below).
  • Clean: The Data Liberation Project will not alter the raw records it receives. But when the messiness of datasets inhibits their usefulness, the project will create secondary, “clean” versions of datasets that fix these problems.
  • Document: Datasets are meaningless without context, and practically useless without documentation. The Data Liberation Project will gather official documentation for each dataset into a central location. It will also fill observed gaps in the documentation through its own research, interviews, and analysis.
  • Disseminate: The Data Liberation Project will not expect reporters and other members of the public simply to stumble upon these datasets. Instead, it will reach out to the newsrooms and communities that stand to benefit most from the data. The project will host hands-on workshops, webinars, and other events to help others to understand and use the data…(More)”.
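
As a hedged illustration of the “Reformat” step referenced above, the sketch below converts a fixed-width agency extract into an open CSV with pandas; the column positions and names are placeholders, not any specific agency’s record layout.

```python
# Hedged sketch of the "Reformat" step: converting a fixed-width government
# extract into an open, easily examined CSV. Column positions and names are
# placeholders, not a real agency record layout.
import pandas as pd

# Column layout as it might appear in a record-layout document obtained via FOIA.
colspecs = [(0, 9), (9, 49), (49, 51), (51, 61)]       # placeholder positions
names = ["case_id", "facility_name", "state", "inspection_date"]

df = pd.read_fwf("agency_extract.txt", colspecs=colspecs, names=names, dtype=str)
df.to_csv("agency_extract.csv", index=False)           # open, reusable format
```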

Five-year campaign breaks science’s citation paywall


Article by Dalmeet Singh Chawla: “The more than 60 million scientific-journal papers indexed by Crossref — the database that registers DOIs, or digital object identifiers, for many of the world’s academic publications — now contain reference lists that are free to access and reuse.

The milestone, announced on Twitter on 18 August, is the result of an effort by the Initiative for Open Citations (I4OC), launched in 2017. Open-science advocates have for years campaigned to make papers’ citation data accessible under liberal copyright licences so that they can be studied, and those analyses shared. Free access to citations enables researchers to identify research trends, lets them conduct studies on which areas of research need funding, and helps them to spot when scientists are manipulating citation counts….

The move means that bibliometricians, scientometricians and information scientists will be able to reuse citation data in any way they please under the most liberal copyright licence, called CC0. This, in turn, allows other researchers to build on their work.

Before I4OC, researchers generally had to obtain permission to access data from major scholarly databases such as Web of Science and Scopus, and weren’t able to share the samples.

However, the opening up of Crossref articles’ citations doesn’t mean that all the world’s scholarly content now has open references. Although most major international academic publishers, including Elsevier, Springer Nature (which publishes Nature) and Taylor & Francis, index their papers on Crossref, some do not. These often include regional and non-English-language publications.

I4OC co-founder Dario Taraborelli, who is science programme officer at the Chan Zuckerberg Initiative and based in San Francisco, California, says that the next challenge will be to encourage publishers who don’t already deposit reference data in Crossref to do so….(More)”.
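
As an illustration of what this openness enables, the hedged sketch below pulls a paper’s deposited reference list from the Crossref REST API; the DOI is a placeholder, and the exact response fields should be verified against the live API.

```python
# Hedged sketch: retrieving a paper's now-open reference list from the Crossref
# REST API. The DOI is a placeholder; response fields such as "reference",
# "DOI", and "unstructured" should be checked against api.crossref.org.
import requests

def get_references(doi: str) -> list[dict]:
    """Return the deposited reference list for a DOI, if the publisher shared it."""
    url = f"https://api.crossref.org/works/{doi}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()["message"].get("reference", [])

refs = get_references("10.1000/example-doi")  # placeholder DOI
for ref in refs[:5]:
    print(ref.get("DOI") or ref.get("unstructured", "[no identifier]"))
```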