Smart OCR – Advancing the Use of Artificial Intelligence with Open Data


Article by Parth Jain, Abhinay Mannepalli, Raj Parikh, and Jim Samuel: “Optical character recognition (OCR) is growing at a projected compounded annual growth rate (CAGR) of 16%, and is expected to have a value of 39.7 billion USD by 2030, as estimated by Straits research. There has been a growing interest in OCR technologies over the past decade. Optical character recognition is the technological process for transforming images of typed, handwritten, scanned, or printed texts into machine-encoded and machine-readable texts (Tappert, et al., 1990). OCR can be used with a broad range of image or scan formats – for example, these could be in the form of a scanned document such as a .pdf file, a picture of a piece of paper in .png or .jpeg format, or images with embedded text, such as characters on a coffee cup, title on the cover page of a book, the license number on vehicular plates, and images of code on websites. OCR has proven to be a valuable technological process for tackling the important challenge of transforming non-machine-readable data into machine readable data. This enables the use of natural language processing and computational methods on information-rich data which were previously largely non-processable. Given the broad array of scanned and image documents in open government data and other open data sources, OCR holds tremendous promise for value generation with open data.

Open data has been defined as “being data that is made freely available for open consumption, at no direct cost to the public, which can be efficiently located, filtered, downloaded, processed, shared, and reused without any significant restrictions on associated derivatives, use, and reuse” (Chidipothu et al., 2022). Large segments of open data contain images, visuals, scans, and other non-machine-readable content. The size and complexity associated with the manual analysis of such content is prohibitive. The most efficient way would be to establish standardized processes for transforming documents into their OCR output versions. Such machine-readable text could then be analyzed using a range of NLP methods. Artificial Intelligence (AI) can be viewed as being a “set of technologies that mimic the functions and expressions of human intelligence, specifically cognition and logic” (Samuel, 2021). OCR was one of the earliest AI technologies implemented. The first ever optical reader to identify handwritten numerals was the advanced reading machine “IBM 1287,” presented at the World Fair in New York in 1965 (Mori, et al., 1990). The value of open data is well established – however, the extent of usefulness of open data is dependent on “accessibility, machine readability, quality” and the degree to which data can be processed by using analytical and NLP methods (data.gov, 2022John, et al., 2022)…(More)”

The Risks of Empowering “Citizen Data Scientists”


Article by Reid Blackman and Tamara Sipes: “New tools are enabling organizations to invite and leverage non-data scientists — say, domain data experts, team members very familiar with the business processes, or heads of various business units — to propel their AI efforts. There are advantages to empowering these internal “citizen data scientists,” but also risks. Organizations considering implementing these tools should take five steps: 1) provide ongoing education, 2) provide visibility into similar use cases throughout the organization, 3) create an expert mentor program, 4) have all projects verified by AI experts, and 5) provide resources for inspiration outside your organization…(More)”.

The rise and fall of peer review


Blog by Adam Mastroianni: “For the last 60 years or so, science has been running an experiment on itself. The experimental design wasn’t great; there was no randomization and no control group. Nobody was in charge, exactly, and nobody was really taking consistent measurements. And yet it was the most massive experiment ever run, and it included every scientist on Earth.

Most of those folks didn’t even realize they were in an experiment. Many of them, including me, weren’t born when the experiment started. If we had noticed what was going on, maybe we would have demanded a basic level of scientific rigor. Maybe nobody objected because the hypothesis seemed so obviously true: science will be better off if we have someone check every paper and reject the ones that don’t pass muster. They called it “peer review.”

This was a massive change. From antiquity to modernity, scientists wrote letters and circulated monographs, and the main barriers stopping them from communicating their findings were the cost of paper, postage, or a printing press, or on rare occasions, the cost of a visit from the Catholic Church. Scientific journals appeared in the 1600s, but they operated more like magazines or newsletters, and their processes of picking articles ranged from “we print whatever we get” to “the editor asks his friend what he thinks” to “the whole society votes.” Sometimes journals couldn’t get enough papers to publish, so editors had to go around begging their friends to submit manuscripts, or fill the space themselves. Scientific publishing remained a hodgepodge for centuries.

(Only one of Einstein’s papers was ever peer-reviewed, by the way, and he was so surprised and upset that he published his paper in a different journal instead.)

That all changed after World War II. Governments poured funding into research, and they convened “peer reviewers” to ensure they weren’t wasting their money on foolish proposals. That funding turned into a deluge of papers, and journals that previously struggled to fill their pages now struggled to pick which articles to print. Reviewing papers before publication, which was “quite rare” until the 1960s, became much more common. Then it became universal.

Now pretty much every journal uses outside experts to vet papers, and papers that don’t please reviewers get rejected. You can still write to your friends about your findings, but hiring committees and grant agencies act as if the only science that exists is the stuff published in peer-reviewed journals. This is the grand experiment we’ve been running for six decades.

The results are in. It failed…(More)”.

The Potentially Adverse Impact of Twitter 2.0 on Scientific and Research Communication


Article by Julia Cohen: “In just over a month after the change in Twitter leadership, there have been significant changes to the social media platform, in its new “Twitter 2.0.” version. For researchers who use Twitter as a primary source of data, including many of the computer scientists at USC’s Information Sciences Institute (ISI), the effects could be debilitating…

Over the years, Twitter has been extremely friendly to researchers, providing and maintaining a robust API (application programming interface) specifically for academic research. The Twitter API for Academic Research allows researchers with specific objectives who are affiliated with an academic institution to gather historical and real-time data sets of tweets, and related metadata, at no cost. Currently, the Twitter API for Academic Research continues to be functional and maintained in Twitter 2.0.

The data obtained from the API provides a means to observe public conversations and understand people’s opinions about societal issues. Luca Luceri, a Postdoctoral Research Associate at ISI called Twitter “a primary platform to observe online discussion tied to political and social issues.” And Twitter touts its API for Academic Research as a way for “academic researchers to use data from the public conversation to study topics as diverse as the conversation on Twitter itself.”

However, if people continue deactivating their Twitter accounts, which appears to be the case, the makeup of the user base will change, with data sets and related studies proportionally affected. This is especially true if the user base evolves in a way that makes it more ideologically homogeneous and less diverse.

According to MIT Technology Review, in the first week after its transition, Twitter may have lost one million users, which translates to a 208% increase in lost accounts. And there’s also the concern that the site could not work as effectively, because of the substantial decrease in the size of the engineering teams. This includes concerns about the durability of the service researchers rely on for data, namely the Twitter API. Jason Baumgartner, founder of Pushshift, a social media data collection, analysis, and archiving platform, said in several recent API requests, his team also saw a significant increase in error rates – in the 25-30% range –when they typically see rates near 1%. Though for now this is anecdotal, it leaves researchers wondering if they will be able to rely on Twitter data for future research.

One example of how the makeup of the less-regulated Twitter 2.0 user base could significantly be altered is if marginalized groups leave Twitter at a higher rate than the general user base, e.g. due to increased hate speech. Keith Burghardt, a Computer Scientist at ISI who studies hate speech online said, “It’s not that an underregulated social media changes people’s opinions, but it just makes people much more vocal. So you will probably see a lot more content that is hateful.” In fact, a study by Montclair State University found that hate speech on Twitter skyrocketed in the week after the acquisition of Twitter….(More)”.

Language and the Rise of the Algorithm


Book by Jeffrey M. Binder: “Bringing together the histories of mathematics, computer science, and linguistic thought, Language and the Rise of the Algorithm reveals how recent developments in artificial intelligence are reopening an issue that troubled mathematicians well before the computer age: How do you draw the line between computational rules and the complexities of making systems comprehensible to people? By attending to this question, we come to see that the modern idea of the algorithm is implicated in a long history of attempts to maintain a disciplinary boundary separating technical knowledge from the languages people speak day to day.
 
Here Jeffrey M. Binder offers a compelling tour of four visions of universal computation that addressed this issue in very different ways: G. W. Leibniz’s calculus ratiocinator; a universal algebra scheme Nicolas de Condorcet designed during the French Revolution; George Boole’s nineteenth-century logic system; and the early programming language ALGOL, short for algorithmic language. These episodes show that symbolic computation has repeatedly become entangled in debates about the nature of communication. Machine learning, in its increasing dependence on words, erodes the line between technical and everyday language, revealing the urgent stakes underlying this boundary.
 
The idea of the algorithm is a levee holding back the social complexity of language, and it is about to break. This book is about the flood that inspired its construction…(More)”.

Research Methods in Deliberative Democracy


Book edited by Selen A. Ercan et al: “… brings together a wide range of methods used in the study of deliberative democracy. It offers thirty-one different methods that scholars use for theorizing, measuring, exploring, or applying deliberative democracy. Each chapter presents one method by explaining its utility in deliberative democracy research and providing guidance on its application by drawing on examples from previous studies. The book hopes to inspire scholars to undertake methodologically robust, intellectually creative, and politically relevant research. It fills a significant gap in a rapidly growing field of research by assembling diverse methods and thereby expanding the range of methodological choices available to students, scholars, and practitioners of deliberative democracy…(More)”.

The Wireless Body


Article by Jeremy Greene: “Nearly half the US adult population will pass out at some point in their lives. Doctors call this “syncope,” and it is bread-and-butter practice for any emergency room or urgent care clinic. While most cases are benign—a symptom of dehydration or mistimed medication—syncope can also be a sign of something gone terribly wrong. It may be a symptom of a heart attack, a blood clot in the lungs, an embolus to the arteries supplying the brain, or a life-threatening arrhythmia. After a series of tests ruling out the worst, most patients go home without incident. Many of them also go home with a Holter monitor. 

The Holter monitor is a device about the size of a pack of cards that records the electrical activity of the heart over the course of a day or more. Since its invention more than half a century ago, it has become such a common object in clinical medicine that few pause to consider its origins. But, as the makers of new Wi-Fi and cloud-enabled devices, smartphone apps, and other “wearable” technologies claim to be revolutionizing the world of preventive health care, there is much to learn from the history of this older instrument of medical surveillance…(More)”.

Beyond Measure: The Hidden History of Measurement from Cubits to Quantum Constants


Book by James Vincent: “From the cubit to the kilogram, the humble inch to the speed of light, measurement is a powerful tool that humans invented to make sense of the world. In this revelatory work of science and social history, James Vincent dives into its hidden world, taking readers from ancient Egypt, where measuring the annual depth of the Nile was an essential task, to the intellectual origins of the metric system in the French Revolution, and from the surprisingly animated rivalry between metric and imperial, to our current age of the “quantified self.” At every turn, Vincent is keenly attuned to the political consequences of measurement, exploring how it has also been used as a tool for oppression and control.

Beyond Measure reveals how measurement is not only deeply entwined with our experience of the world, but also how its history encompasses and shapes the human quest for knowledge…(More)”.

Science in Negotiation


Book by Jessica Espey on “The Role of Scientific Evidence in Shaping the United Nations Sustainable Development Goals, 2012-2015”: “This book explores the role of scientific evidence within United Nations (UN) deliberation by examining the negotiation of the Sustainable Development Goals (SDGs), endorsed by Member States in 2015. Using the SDGs as a case study, this book addresses a key gap in our understanding of the role of evidence in contemporary international policy-making. It is structured around three overarching questions: (1) how does scientific evidence influence multilateral policy development within the UN General Assembly? (2) how did evidence shape the goals and targets that constitute the SDGs?; and (3) how did institutional arrangements and non-state actor engagements mediate the evidence-to-policy process in the development of the SDGs? The ultimate intention is to tease out lessons on global policy-making and to understand the influence of different evidence inputs and institutional factors in shaping outcomes.

To understand the value afforded to scientific evidence within multilateral deliberation, a conceptual framework is provided drawing upon literature from policy studies and political science, including recent theories of evidence-informed policy-making and new institutionalism. It posits that the success or failure of evidence informing global political processes rests upon the representation and access of scientific stakeholders, levels of community organisation, the framing and presentation of evidence, and time, including the duration over which evidence and key conceptual ideas are presented. Cutting across the discussion is the fundamental question of whose evidence counts and how expertise is defined? The framework is tested with specific reference to three themes that were prominent during the SDG negotiation process; public health (articulated in SDG 3), urban sustainability (articulated in SDG 11), and data and information systems (which were a cross-cutting theme of the dialogue). Within each, scientific communities had specific demands and through an exploration of key literature, including evidence inputs and UN documentation, as well as through key informant interviews, the translation of these scientific ideas into policy priorities is uncovered…(More)”.

Operationalizing Digital Self Determination


Paper by Stefaan G. Verhulst: “We live in an era of datafication, one in which life is increasingly quantified and transformed into intelligence for private or public benefit. When used responsibly, this offers new opportunities for public good. However, three key forms of asymmetry currently limit this potential, especially for already vulnerable and marginalized groups: data asymmetries, information asymmetries, and agency asymmetries. These asymmetries limit human potential, both in a practical and psychological sense, leading to feelings of disempowerment and eroding public trust in technology. Existing methods to limit asymmetries (e.g., consent) as well as some alternatives under consideration (data ownership, collective ownership, personal information management systems) have limitations to adequately address the challenges at hand. A new principle and practice of digital self-determination (DSD) is therefore required.
DSD is based on existing concepts of self-determination, as articulated in sources as varied as Kantian philosophy and the 1966 International Covenant on Economic, Social and Cultural Rights. Updated for the digital age, DSD contains several key characteristics, including the fact that it has both an individual and collective dimension; is designed to especially benefit vulnerable and marginalized groups; and is context-specific (yet also enforceable). Operationalizing DSD in this (and other) contexts so as to maximize the potential of data while limiting its harms requires a number of steps. In particular, a responsible operationalization of DSD would consider four key prongs or categories of action: processes, people and organizations, policies, and products and technologies…(More)”.