Smart OCR – Advancing the Use of Artificial Intelligence with Open Data


Article by Parth Jain, Abhinay Mannepalli, Raj Parikh, and Jim Samuel: “Optical character recognition (OCR) is growing at a projected compounded annual growth rate (CAGR) of 16%, and is expected to have a value of 39.7 billion USD by 2030, as estimated by Straits Research. There has been a growing interest in OCR technologies over the past decade. Optical character recognition is the technological process for transforming images of typed, handwritten, scanned, or printed texts into machine-encoded and machine-readable texts (Tappert, et al., 1990). OCR can be used with a broad range of image or scan formats – for example, a scanned document such as a .pdf file, a picture of a piece of paper in .png or .jpeg format, or images with embedded text, such as characters on a coffee cup, the title on the cover page of a book, license numbers on vehicle plates, and images of code on websites. OCR has proven to be a valuable technological process for tackling the important challenge of transforming non-machine-readable data into machine-readable data. This enables the use of natural language processing and computational methods on information-rich data that were previously largely non-processable. Given the broad array of scanned and image documents in open government data and other open data sources, OCR holds tremendous promise for value generation with open data.
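
To make the quoted description concrete, here is a minimal OCR sketch, assuming the open-source Tesseract engine is installed along with the pytesseract and Pillow Python packages; the file name scan.png is a hypothetical placeholder for any scanned page or photo.

```python
# Minimal OCR sketch: turn a scanned image into machine-readable text.
# Assumes Tesseract plus the pytesseract and Pillow packages are installed;
# "scan.png" is a hypothetical input file.
from PIL import Image
import pytesseract

image = Image.open("scan.png")              # load the scanned page or photo
text = pytesseract.image_to_string(image)   # run OCR and return plain text

print(text)  # the extracted text can now feed NLP or other computational methods
```

The same few lines apply whether the input is a page exported from a scanned .pdf or a .png/.jpeg photo of printed text, which is what makes OCR attractive for processing scanned documents in open data collections at scale.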

Open data has been defined as “being data that is made freely available for open consumption, at no direct cost to the public, which can be efficiently located, filtered, downloaded, processed, shared, and reused without any significant restrictions on associated derivatives, use, and reuse” (Chidipothu et al., 2022). Large segments of open data contain images, visuals, scans, and other non-machine-readable content. The size and complexity associated with the manual analysis of such content is prohibitive. The most efficient way would be to establish standardized processes for transforming documents into their OCR output versions. Such machine-readable text could then be analyzed using a range of NLP methods. Artificial Intelligence (AI) can be viewed as being a “set of technologies that mimic the functions and expressions of human intelligence, specifically cognition and logic” (Samuel, 2021). OCR was one of the earliest AI technologies implemented. The first ever optical reader to identify handwritten numerals was the advanced reading machine “IBM 1287,” presented at the World’s Fair in New York in 1965 (Mori, et al., 1990). The value of open data is well established – however, the extent of usefulness of open data is dependent on “accessibility, machine readability, quality” and the degree to which data can be processed by using analytical and NLP methods (data.gov, 2022; John, et al., 2022)…(More)”

Is Facebook’s advertising data accurate enough for use in social science research? Insights from a cross-national online survey


Paper by André Grow et al: “Social scientists increasingly use Facebook’s advertising platform for research, either in the form of conducting digital censuses of the general population, or for recruiting participants for survey research. Both approaches depend on the accuracy of the data that Facebook provides about its users, but little is known about how accurate these data are. We address this gap in a large-scale, cross-national online survey (N = 137,224), in which we compare self-reported and Facebook-classified demographic information (sex, age and region of residence). Our results suggest that Facebook’s advertising platform can be fruitfully used for conducting social science research if additional steps are taken to assess the accuracy of the characteristics under consideration…(More)”.

The Data Liberation Project 


About: “The Data Liberation Project is a new initiative I’m launching today to identify, obtain, reformat, clean, document, publish, and disseminate government datasets of public interest. Vast troves of government data are inaccessible to the people and communities who need them most. The Process:

  • Identify: Through its own research, as well as through consultations with journalists, community groups, government-data experts, and others, the Data Liberation Project aims to identify a large number of datasets worth pursuing.
  • Obtain: The Data Liberation Project plans to use a wide range of methods to obtain the datasets, including via Freedom of Information Act requests, intervening in lawsuits, web-scraping, and advanced document parsing. To improve public knowledge about government data systems, the Data Liberation Project also files FOIA requests for essential metadata, such as database schemas, record layouts, data dictionaries, user guides, and glossaries.
  • Reformat: Many datasets are delivered to journalists and the public in difficult-to-use formats. Some may follow arcane conventions or require proprietary software to access, for instance. The Data Liberation Project will convert these datasets into open formats, and restructure them so that they can be more easily examined (a minimal conversion sketch follows this list).
  • Clean: The Data Liberation Project will not alter the raw records it receives. But when the messiness of datasets inhibits their usefulness, the project will create secondary, “clean” versions of datasets that fix these problems.
  • Document: Datasets are meaningless without context, and practically useless without documentation. The Data Liberation Project will gather official documentation for each dataset into a central location. It will also fill observed gaps in the documentation through its own research, interviews, and analysis.
  • Disseminate: The Data Liberation Project will not expect reporters and other members of the public simply to stumble upon these datasets. Instead, it will reach out to the newsrooms and communities that stand to benefit most from the data. The project will host hands-on workshops, webinars, and other events to help others to understand and use the data.”…(More)”
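
As a concrete illustration of the “Reformat” step above, the sketch below converts a dataset delivered as a spreadsheet into an open CSV file. The file names are hypothetical and the pandas-based approach is an assumption for the example, not the project’s actual tooling.

```python
# Illustrative "reformat" step: convert a proprietary spreadsheet into an open,
# analysis-friendly CSV. File names are hypothetical.
import pandas as pd

# Read a dataset delivered as an Excel workbook (requires an Excel reader
# such as openpyxl to be installed).
df = pd.read_excel("agency_release.xlsx", sheet_name=0)

# Normalize column names so the restructured file is easier to examine.
df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

# Write an open, plain-text CSV that any tool can read.
df.to_csv("agency_release.csv", index=False)
```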

Five-year campaign breaks science’s citation paywall


Article by Dalmeet Singh Chawla: “The more than 60 million scientific-journal papers indexed by Crossref — the database that registers DOIs, or digital object identifiers, for many of the world’s academic publications — now contain reference lists that are free to access and reuse.

The milestone, announced on Twitter on 18 August, is the result of an effort by the Initiative for Open Citations (I4OC), launched in 2017. Open-science advocates have for years campaigned to make papers’ citation data accessible under liberal copyright licences so that they can be studied, and those analyses shared. Free access to citations enables researchers to identify research trends, lets them conduct studies on which areas of research need funding, and helps them to spot when scientists are manipulating citation counts….

The move means that bibliometricians, scientometricians and information scientists will be able to reuse citation data in any way they please under the most liberal copyright licence, called CC0. This, in turn, allows other researchers to build on their work.
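
As one example of what such reuse can look like in practice, the sketch below queries Crossref’s public REST API for the reference list of a single work. The DOI shown is a placeholder, and the reference field is present only when the publisher has deposited references for that paper.

```python
# Fetch the open reference list for one work from the Crossref REST API.
# The DOI below is a placeholder; replace it with a real one.
import requests

doi = "10.1000/example-doi"
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()

work = resp.json()["message"]
references = work.get("reference", [])  # only present if references were deposited

print(f"{len(references)} references found")
for ref in references[:5]:
    # Each entry may carry a DOI, an unstructured citation string, or both.
    print(ref.get("DOI") or ref.get("unstructured", "(no identifier)"))
```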

Before I4OC, researchers generally had to obtain permission to access data from major scholarly databases such as Web of Science and Scopus, and weren’t able to share the samples.

However, the opening up of Crossref articles’ citations doesn’t mean that all the world’s scholarly content now has open references. Although most major international academic publishers, including Elsevier, Springer Nature (which publishes Nature) and Taylor & Francis, index their papers on Crossref, some do not. These often include regional and non-English-language publications.

I4OC co-founder Dario Taraborelli, who is science programme officer at the Chan Zuckerberg Initiative and based in San Francisco, California, says that the next challenge will be to encourage publishers who don’t already deposit reference data in Crossref to do so….(More)”.

Unlocking the Potential of Open 990 Data


Article by Cinthia Schuman Ottinger & Jeff Williams: “As the movement to expand public use of nonprofit data collected by the Internal Revenue Service advances, it’s a good time to review how far the social sector has come and how much work remains to reach the full potential of this treasure trove…Organizations have employed open Form 990 data in numerous ways, including to:

  • Create new tools for donors. For instance, the Nonprofit Aid Visualizer, a partnership between Candid and Vanguard Charitable, uses open 990 data to find communities vulnerable to COVID-19 and help address both their immediate needs and long-term recovery. Another tool, the COVID-19 Urgent Service Provider Support Tool, developed by the consulting firm BCT Partners, uses 990 data to direct donors to service providers that are close to communities most affected by COVID-19.
  • More efficiently prosecute charitable fraud. This includes a campaign by the New York Attorney General’s Office that recovered $1.7 million from sham charities and redirected funds to legitimate groups.
  • Generate groundbreaking findings on fundraising, volunteers, equity, and management. A researcher at Texas Tech University, for example, explored more than a million e-filed 990s to overturn long-held assumptions about the role of cash in fundraising. He found that when nonprofits encourage noncash gifts as opposed to only cash contributions, financial contributions to those organizations increase over time.
  • Shed light on harmful practices that hurt the poor. A large-scale investigative analysis of nonprofit hospitals’ tax forms revealed that 45 percent of them sent a total of $2.7 billion in medical bills to patients whose incomes were likely low enough to qualify for free or discounted care. When this practice was publicly exposed, some hospitals reevaluated their practices and erased unpaid bills for qualifying patients. The expense of mining data like this previously made such research next to impossible.
  • Help donors make more informed giving decisions. In hopes of maximizing contributions to Ukrainian relief efforts, a record number of donors are turning to resources like Charity Navigator, which can now use open Form 990 data to evaluate and rate a large number of charities based on finances, governance, and other factors. At the same time, donors informed by open 990 data can seek more accountability from the organizations they support. For example, anti-corruption researchers scouring open 990 data and other records uncovered donations by Russian oligarchs aligned with President Putin. This pressured US nonprofits that accepted money from the oligarchs to disavow this funding…(More)”.

The wealth of (Open Data) nations? Open government data, country-level institutions and entrepreneurial activity


Paper by Franz Huber, Alan Ponce, Francesco Rentocchini & Thomas Wainwright: “Lately, Open Data (OD) has been promoted by governments around the world as a resource to accelerate innovation within entrepreneurial ventures. However, it remains unclear to what extent OD drives innovative entrepreneurship. This paper sheds light on this open question by providing novel empirical evidence on the relationship between OD publishing and (digital) entrepreneurship at the country level. We draw upon a longitudinal dataset comprising 90 countries observed over the period 2013–2016. We find a significant and positive association between OD publishing and entrepreneurship at the country level. The results also show that the association between OD publishing and entrepreneurship is stronger in countries with high institutional quality. We argue that publishing OD is not sufficient to improve innovative entrepreneurship alone, so states need to move beyond a focus on OD initiatives and promotion, to focus on a broader set of policy initiatives that promote good governance…(More)”.
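
For readers curious about what a country-level analysis of this kind involves, the sketch below shows a simplified fixed-effects panel regression with an interaction between OD publishing and institutional quality. It is an illustrative approximation, not the authors’ specification; the input file and column names (entrepreneurship, od_publishing, inst_quality) are invented for the example.

```python
# Illustrative country-year panel regression (not the paper's exact model).
# Assumes a CSV with columns: country, year, entrepreneurship, od_publishing,
# inst_quality -- all hypothetical names for this sketch.
import pandas as pd
from linearmodels.panel import PanelOLS

df = pd.read_csv("country_panel.csv")      # hypothetical input file
df = df.set_index(["country", "year"])     # entity/time index for panel estimation

# Main effects, their interaction, and country plus year fixed effects.
model = PanelOLS.from_formula(
    "entrepreneurship ~ od_publishing + inst_quality"
    " + od_publishing:inst_quality + EntityEffects + TimeEffects",
    data=df,
)
result = model.fit(cov_type="clustered", cluster_entity=True)
print(result.summary)
```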

A journey toward an open data culture through transformation of shared data into a data resource


Paper by Scott D. Kahn and Anne Koralova: “The transition to open data practices is straightforward albeit surprisingly challenging to implement largely due to cultural and policy issues. A general data sharing framework is presented along with two case studies that highlight these challenges and offer practical solutions that can be adjusted depending on the type of data collected, the country in which the study is initiated, and the prevailing research culture. Embracing the constraints imposed by data privacy considerations, especially for biomedical data, must be emphasized for data outside of the United States until data privacy law(s) are established at the Federal and/or State level…(More).”

Without appropriate metadata, data-sharing mandates are pointless


Article by Mark A. Musen: “Last month, the US government announced that research articles and most underlying data generated with federal funds should be made publicly available without cost, a policy to be implemented by the end of 2025. That’s atop other important moves. The European Union’s programme for science funding, Horizon Europe, already mandates that almost all data be FAIR (that is, findable, accessible, interoperable and reusable). The motivation behind such data-sharing policies is to make data more accessible so others can use them to both verify results and conduct further analyses.

But just getting those data sets online will not bring anticipated benefits: few data sets will really be FAIR, because most will be unfindable. What’s needed are policies and infrastructure to organize metadata.

Imagine having to search for publications on some topic — say, methods for carbon reclamation — but you could use only the article titles (no keywords, abstracts or search terms). That’s essentially the situation for finding data sets. If I wanted to identify all the deposited data related to carbon reclamation, the task would be futile. Current metadata often contain only administrative and organizational information, such as the name of the investigator and the date when the data were acquired.

What’s more, for scientific data to be useful to other researchers, metadata must sensibly and consistently communicate essentials of the experiments — what was measured, and under what conditions. As an investigator who builds technology to assist with data annotation, I find it frustrating that, in the majority of fields, the metadata standards needed to make data FAIR don’t even exist.

Metadata about data sets typically lack experiment-specific descriptors. If present, they’re sparse and idiosyncratic. An investigator searching the Gene Expression Omnibus (GEO), for example, might seek genomic data sets containing information on how a disease or condition manifests itself in young animals or humans. Performing such a search requires knowledge of how the age of individuals is represented — which, in the GEO repository, could be age, AGE, age (after birth), age (years), Age (yr-old) or dozens of other possibilities. (Often, such information is missing from data sets altogether.) Because the metadata are so ad hoc, automated searches fail, and investigators waste enormous amounts of time manually sifting through records to locate relevant data sets, with no guarantee that most (or any) can be found…(More)”.
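
The sketch below illustrates the kind of ad hoc field-name matching that this inconsistency forces on investigators. The variant spellings come from the examples above; the sample records and the matching rule are invented for illustration.

```python
# Illustrative harmonization of inconsistent metadata field names.
# The sample records below are invented; real repository records vary far more.
import re

def find_age(sample_metadata: dict):
    """Return the value of the first field whose name looks like an age annotation."""
    for key, value in sample_metadata.items():
        # Matches 'age', 'AGE', 'age (years)', 'Age (yr-old)', and similar variants.
        if re.match(r"^\s*age\b", key, flags=re.IGNORECASE):
            return value
    return None  # age was never recorded -- a common outcome

records = [
    {"AGE": "6 weeks", "tissue": "liver"},
    {"age (years)": "54", "disease": "asthma"},
    {"strain": "C57BL/6"},  # no age field at all
]

for record in records:
    print(find_age(record))
```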

A User’s Guide to the Periodic Table of Open Data


Guide by Stefaan Verhulst and Andrew Zahuranec: “Leveraging our research on the variables that determine Open Data’s Impact, the Open Data Policy Lab is pleased to announce the publication of a new report designed to assist organizations in implementing the elements of a successful data collaborative: A User’s Guide to The Periodic Table of Open Data.

The User’s Guide is a fillable document designed to empower data stewards and others seeking to improve data access. It can be used as a checklist and tool to weigh different elements based on their context and priorities. By completing the forms (offline/online), you will be able to take a more comprehensive and strategic view of what resources and interventions may be required.

Download and fill out the User’s Guide to operationalize the elements in your data initiative

In conjunction with the release of our User’s Guide, the Open Data Policy Lab is pleased to present a completely reworked version of our Periodic Table of Open Data Elements, first launched in 2016. We sought to categorize the elements that matter in open data initiatives into five categories: problem and demand definition, capacity and culture, governance and standards, personnel and partnerships, and risk mitigation. More information on each can be found in the attached report or in the interactive table below.

Read more about the Periodic Table of Open Data Elements and how you can use it to support your work…(More)”.