Algorithmic thinking in the public interest: navigating technical, legal, and ethical hurdles to web scraping in the social sciences


Paper by Alex Luscombe, Kevin Dick & Kevin Walby: “Web scraping, defined as the automated extraction of information online, is an increasingly important means of producing data in the social sciences. We contribute to emerging social science literature on computational methods by elaborating on web scraping as a means of automated access to information. We begin by situating the practice of web scraping in context, providing an overview of how it works and how it compares to other methods in the social sciences. Next, we assess the benefits and challenges of scraping as a technique of information production. In terms of benefits, we highlight how scraping can help researchers answer new questions, supersede limits in official data, overcome access hurdles, and reinvigorate the values of sharing, openness, and trust in the social sciences. In terms of challenges, we discuss three: technical, legal, and ethical. By adopting “algorithmic thinking in the public interest” as a way of navigating these hurdles, researchers can improve the state of access to information on the Internet while also contributing to scholarly discussions about the legality and ethics of web scraping. Example software accompanying this article is available within the supplementary materials…(More)”.
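The paper’s own example software lives in its supplementary materials; as a rough illustration of the basic technique the authors describe, a minimal scraper might look like the sketch below. The URL, CSS selector, and function name are hypothetical placeholders, and requests with BeautifulSoup is just one common Python toolchain among several.

```python
# Minimal web-scraping sketch (illustrative only; not the paper's supplementary software).
# The URL and CSS selector below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def scrape_titles(url: str) -> list[str]:
    """Download a page and return the text of elements matching a selector."""
    response = requests.get(url, headers={"User-Agent": "research-scraper/0.1"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # The selector depends entirely on the target site's HTML structure.
    return [node.get_text(strip=True) for node in soup.select("h2.article-title")]

if __name__ == "__main__":
    for title in scrape_titles("https://example.org/press-releases"):
        print(title)
```

A scraper written in the spirit of “algorithmic thinking in the public interest” would also throttle its requests and respect the target site’s robots.txt and terms of service.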

The power of words and networks


Introduction to Special Issue by A. Fronzetti Colladon, P. Gloor, and D. F. Iezzi: “According to Freud, “words were originally magic and to this day words have retained much of their ancient magical power”. Through words, behaviors are transformed and problems are solved. The way we use words reveals our intentions, goals, and values. Novel tools for text analysis help us understand the magical power of words. This power is multiplied if it is combined with the study of social networks, i.e. with the analysis of relationships among social units. This special issue of the International Journal of Information Management, entitled “Combining Social Network Analysis and Text Mining: from Theory to Practice”, includes heterogeneous and innovative research at the nexus of text mining and social network analysis. It aims to enrich work at the intersection of these fields, which still lags behind in theoretical, empirical, and methodological foundations. The nine articles accepted for inclusion in this special issue all present methods and tools that have business applications. They are summarized in this editorial introduction….(More)”.

Shape: The Hidden Geometry of Information, Biology, Strategy, Democracy, and Everything Else


Book by Jordan Ellenberg: “How should a democracy choose its representatives? How can you stop a pandemic from sweeping the world? How do computers learn to play Go, and why is learning Go so much easier for them than learning to read a sentence? Can ancient Greek proportions predict the stock market? (Sorry, no.) What should your kids learn in school if they really want to learn to think? All these are questions about geometry. For real.

If you’re like most people, geometry is a sterile and dimly remembered exercise you gladly left behind in the dust of ninth grade, along with your braces and active romantic interest in pop singers. If you recall any of it, it’s plodding through a series of minuscule steps only to prove some fact about triangles that was obvious to you in the first place. That’s not geometry. Okay, it is geometry, but only a tiny part, which has as much to do with geometry in all its flush modern richness as conjugating a verb has to do with a great novel.

Shape reveals the geometry underneath some of the most important scientific, political, and philosophical problems we face. Geometry asks: Where are things? Which things are near each other? How can you get from one thing to another thing? Those are important questions. The word “geometry” comes from the Greek for “measuring the world.” If anything, that’s an undersell. Geometry doesn’t just measure the world—it explains it. Shape shows us how….(More)”.

Need Public Policy for Human Gene Editing, Heatwaves, or Asteroids? Try Thinking Like a Citizen


Article by Nicholas Weller, Michelle Sullivan Govani, and Mahmud Farooque: “In a ballroom at the Arizona Science Center one afternoon in 2017, more than 70 Phoenix residents—students, teachers, nurses, and retirees—gathered around tables to participate in a public forum about how cities can respond to extreme weather such as heat waves. Each table was covered in colorful printouts with a large laminated poster resembling a board game. Milling between the tables were decisionmakers from local government and the state. All were taking part in a deliberative process called participatory technology assessment, or pTA, designed to break down the walls between “experts” and citizens to gain insights into public policy dilemmas involving science, technology, and uncertainty.

Foreshadowing their varied viewpoints and experiences, participants prepared differently for the “extreme weather” of the heavily air conditioned ballroom, with some gripping cardigans around their shoulders while others were comfortable in tank tops. Extreme heat is something all the participants were familiar with—Phoenix is one of the hottest cities in the country—but not everyone understood the unequal way that heat and related deaths affect different parts of the Valley of the Sun. Though a handful of the participants might have called themselves environmentalists, most were not regular town-hall goers or political activists. Instead, they represented a diverse cross section of people in Phoenix. All had applied to attend—motivated by a small stipend, the opportunity to have their voice heard, or a bit of both.

Unlike typical town hall setups, where a few bold participants tend to dominate questioning and decisionmakers often respond by being defensive or vague, pTA gatherings are deliberately organized to encourage broad participation and conversation. To help people engage with the topic, the meeting was divided into subgroups to examine the story of Heattown, a fictionalized name for a real but anonymized community contending with the health, environmental, and economic impacts of heat waves. Then each group began a guided discussion of the different characters living in Heattown, vulnerabilities of the emergency-response and infrastructure systems, and strategies for dealing with those vulnerabilities….(More)”.

Artificial intelligence (AI) has become one of the most impactful technologies of the twenty-first century


Lynne Parker at the AI.gov website: “Artificial intelligence (AI) has become one of the most impactful technologies of the twenty-first century.  Nearly every sector of the economy and society has been affected by the capabilities and potential of AI.  AI is enabling farmers to grow food more efficiently, medical researchers to better understand and treat COVID-19, scientists to develop new materials, transportation professionals to deliver more goods faster and with less energy, weather forecasters to more accurately predict the tracks of hurricanes, and national security protectors to better defend our Nation.

At the same time, AI has raised important societal concerns.  What is the impact of AI on the changing nature of work?  How can we ensure that AI is used appropriately, and does not result in unfair discrimination or bias?  How can we guard against uses of AI that infringe upon human rights and democratic principles?

These dual perspectives on AI have led to the concept of “trustworthy AI”.  Trustworthy AI is AI that is designed, developed, and used in a manner that is lawful, fair, unbiased, accurate, reliable, effective, safe, secure, resilient, understandable, and with processes in place to regularly monitor and evaluate the AI system’s performance and outcomes.

Achieving trustworthy AI requires an all-of-government and all-of-Nation approach, combining the efforts of industry, academia, government, and civil society.  The Federal government is doing its part through a national strategy, called the National AI Initiative Act of 2020 (NAIIA).  The National AI Initiative (NAII) builds upon several years of impactful AI policy actions, many of which were outcomes from EO 13859 on Maintaining American Leadership in AI.

Six key pillars define the Nation’s AI strategy:

  • prioritizing AI research and development;
  • strengthening AI research infrastructure;
  • advancing trustworthy AI through technical standards and governance;
  • training an AI-ready workforce;
  • promoting international AI engagement; and
  • leveraging trustworthy AI for government and national security.

Coordinating all of these efforts is the National AI Initiative Office, established by the NAIIA to coordinate and support the NAII.  This Office serves as the central point of contact for exchanging technical and programmatic information on AI activities at Federal departments and agencies, as well as related Initiative activities in industry, academia, nonprofit organizations, professional societies, State and tribal governments, and others.

The AI.gov website provides a portal for exploring in more depth the many AI actions, initiatives, strategies, programs, reports, and related efforts across the Federal government.  It serves as a resource for those who want to learn more about how to take full advantage of the opportunities of AI, and to learn how the Federal government is advancing the design, development, and use of trustworthy AI….(More)”

Citizen science is booming during the pandemic


Sigal Samuel at Vox: “…The pandemic has driven a huge increase in participation in citizen science, where people without specialized training collect data out in the world or perform simple analyses of data online to help out scientists.

Stuck at home with time on their hands, millions of amateurs around the world are gathering information on everything from birds to plants to Covid-19 at the request of institutional researchers. And while quarantine is mostly a nightmare for us, it’s been a great accelerant for science.

Early in the pandemic, a firehose of data started gushing forth on citizen science platforms like Zooniverse and SciStarter, where scientists ask the public to analyze their data online. It’s a form of crowdsourcing that has the added bonus of giving volunteers a real sense of community; each project has a discussion forum where participants can pose questions to each other (and often to the scientists behind the projects) and forge friendly connections.

“There’s a wonderful project called Rainfall Rescue that’s transcribing historical weather records. It’s a climate change project to understand how weather has changed over the past few centuries,” Laura Trouille, vice president of citizen science at the Adler Planetarium in Chicago and co-lead of Zooniverse, told me. “They uploaded a dataset of 10,000 weather logs that needed transcribing — and that was completed in one day!”

Some Zooniverse projects, like Snapshot Safari, ask participants to classify animals in images from wildlife cameras. That project saw classifications go from 25,000 to 200,000 per day in the initial days of lockdown. And across all its projects, Zooniverse reported that 200,000 participants contributed more than 5 million classifications of images in one week alone — the equivalent of 48 years of research. Although participation has slowed a bit since the spring, it’s still four times what it was pre-pandemic.

Many people are particularly eager to help tackle Covid-19, and scientists have harnessed their energy. Carnegie Mellon University’s Roni Rosenfeld set up a platform where volunteers can help artificial intelligence predict the spread of the coronavirus, even if they know nothing about AI. Researchers at the University of Washington invited people to contribute to Covid-19 drug discovery using a computer game called Foldit; they experimented with designing proteins that could attach to the virus that causes Covid-19 and prevent it from entering cells….(More)”.

The fight against fake-paper factories that churn out sham science


Holly Else & Richard Van Noorden at Nature: “When Laura Fisher noticed striking similarities between research papers submitted to RSC Advances, she grew suspicious. None of the papers had authors or institutions in common, but their charts and titles looked alarmingly similar, says Fisher, the executive editor at the journal. “I was determined to try to get to the bottom of what was going on.”

A year later, in January 2021, Fisher retracted 68 papers from the journal, and editors at two other Royal Society of Chemistry (RSC) titles retracted one each over similar suspicions; 15 are still under investigation. Fisher had found what seemed to be the products of paper mills: companies that churn out fake scientific manuscripts to order. All the papers came from authors at Chinese hospitals. The journals’ publisher, the RSC in London, announced in a statement that it had been the victim of what it believed to be “the systemic production of falsified research”.

What was surprising about this was not the paper-mill activity itself: research-integrity sleuths have repeatedly warned that some scientists buy papers from third-party firms to help their careers. Rather, it was extraordinary that a publisher had publicly announced something that journals generally keep quiet about. “We believe that it is a paper mill, so we want to be open and transparent,” Fisher says.

The RSC wasn’t alone, its statement added: “We are one of a number of publishers to have been affected by such activity.” Since last January, journals have retracted at least 370 papers that have been publicly linked to paper mills, an analysis by Nature has found, and many more retractions are expected to follow.

Much of this literature cleaning has come about because, last year, outside sleuths publicly flagged papers that they think came from paper mills owing to their suspiciously similar features. Collectively, the lists of flagged papers total more than 1,000 studies, the analysis shows. Editors are so concerned by the issue that last September, the Committee on Publication Ethics (COPE), a publisher-advisory body in London, held a forum dedicated to discussing “systematic manipulation of the publishing process via paper mills”. Their guest speaker was Elisabeth Bik, a research-integrity analyst in California known for her skill in spotting duplicated images in papers, and one of the sleuths who posts their concerns about paper mills online….(More)”.

Lawmakers’ use of scientific evidence can be improved


Paper by D. Max Crowley et al: “This study is an experimental trial that demonstrates the potential for formal outreach strategies to change congressional use of research. Our results show that collaboration between policy and research communities can change policymakers’ value of science and result in legislation that appears to be more inclusive of research evidence. The findings of this study also demonstrated changes in researchers’ knowledge and motivation to engage with policymakers as well as their actual policy engagement behavior. Together, the observed changes in both policymakers and researchers randomized to receive an intervention for supporting legislative use of research evidence (i.e., the Research-to-Policy Collaboration model) provide support for the underlying theories around the social nature of research translation and evidence use….(More)”.

The speed of science


Essay by Saloni Dattani & Nathaniel Bechhofer: “The 21st century has seen some phenomenal advances in our ability to make scientific discoveries. Scientists have developed new technology to build vaccines swiftly, new algorithms to predict the structure of proteins accurately, new equipment to sequence DNA rapidly, and new engineering solutions to harvest energy efficiently. But in many fields of science, reliable knowledge and progress advance staggeringly slowly. What slows it down? And what can we learn from individual fields of science to pick up the pace across the board – without compromising on quality?

By and large, scientific research is published in journals in the form of papers – static documents that do not update with new data or new methods. Instead of sharing the data and the code that produces their results, most scientists simply publish a textual description of their research in online publications. These publications are usually hidden behind paywalls, making it harder for outsiders to verify their authenticity.

On the occasions when a reader spots a discrepancy in the data or an error in the methods, they must scrupulously read the intricate details of a study’s method and cross-check the statistics manually. When scientists don’t openly share the data behind their results, the task becomes even harder. The process of error correction – from scientists publishing a paper, to readers spotting errors, to having the paper corrected or retracted – can take years, assuming those errors are spotted at all.

When scientists reference previous research, they cite entire papers, not specific results or values from them. And although there is evidence that scientists hold back from citing papers once they have been retracted, the problem is compounded over time – consider, for example, a researcher who cites a study that itself derives its data or assumptions from prior research that has been disputed, corrected or retracted. The longer it takes to sift through the science, to identify which results are accurate, the longer it takes to gather an understanding of scientific knowledge.

What makes the problem even more challenging is that flaws in a study are not necessarily mathematical errors. In many situations, researchers make fairly arbitrary decisions as to how they collect their data, which methods they apply to analyse them, and which results they report – altogether leaving readers blind to the impact of these decisions on the results.

This murkiness can result in what is known as p-hacking: when researchers selectively apply arbitrary methods in order to achieve a particular result. For example, in a study that compares the well-being of overweight people to that of underweight people, researchers may find that certain cut-offs of weight (or certain subgroups in their sample) provide the result they’re looking for, while others don’t. And they may decide to only publish the particular methods that provided that result…(More)”.
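The weight cut-off example can be made concrete with a small simulation (a hypothetical sketch, not from the essay): generate data with no true relationship, sweep many arbitrary cut-offs, and report only the smallest p-value.

```python
# Illustration of p-hacking: with no true effect, searching over arbitrary
# cut-offs and reporting only the "best" one still produces small p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
weight = rng.normal(75, 15, n)       # simulated body weight (kg)
wellbeing = rng.normal(50, 10, n)    # simulated well-being score, independent of weight

best_p, best_cutoff = 1.0, None
for cutoff in range(55, 96, 5):      # sweep arbitrary "overweight" thresholds
    heavy = wellbeing[weight >= cutoff]
    light = wellbeing[weight < cutoff]
    _, p = stats.ttest_ind(heavy, light)
    if p < best_p:
        best_p, best_cutoff = p, cutoff

# Reporting only this cherry-picked comparison overstates the evidence:
print(f"Selectively reported: cutoff = {best_cutoff} kg, p = {best_p:.3f}")
```

Pre-registering the cut-off, or reporting every comparison that was run, is the standard remedy, because the minimum p-value across many looks is not a valid test of any single hypothesis.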

The Mathematics of How Connections Become Global


Kelsey Houston-Edwards at Scientific American: “When you hit “send” on a text message, it is easy to imagine that the note will travel directly from your phone to your friend’s. In fact, it typically goes on a long journey through a cellular network or the Internet, both of which rely on centralized infrastructure that can be damaged by natural disasters or shut down by repressive governments. For fear of state surveillance or interference, tech-savvy protesters in Hong Kong avoided the Internet by using software such as FireChat and Bridgefy to send messages directly between nearby phones.

These apps let a missive hop silently from one phone to the next, eventually connecting the sender to the receiver—the only users capable of viewing the message. The collections of linked phones, known as mesh networks or mobile ad hoc networks, enable a flexible and decentralized mode of communication. But for any two phones to communicate, they need to be linked via a chain of other phones. How many people scattered throughout Hong Kong need to be connected via the same mesh network before we can be confident that crosstown communication is possible?

[Figure: Mesh network in action: when cell-phone ranges overlap, a linked chain of connections is established.]

A branch of mathematics called percolation theory offers a surprising answer: just a few people can make all the difference. As users join a new network, isolated pockets of connected phones slowly emerge. But full east-to-west or north-to-south communication appears all of a sudden as the density of users passes a critical and sharp threshold. Scientists describe such a rapid change in a network’s connectivity as a phase transition—the same concept used to explain abrupt changes in the state of a material such as the melting of ice or the boiling of water.

[Figure: A phase transition in a mesh network: the density of users suddenly passes a critical threshold.]

Percolation theory examines the consequences of randomly creating or removing links in such networks, which mathematicians conceive of as a collection of nodes (represented by points) linked by “edges” (lines). Each node represents an object such as a phone or a person, and the edges represent a specific relation between two of them. The fundamental insight of percolation theory, which dates back to the 1950s, is that as the number of links in a network gradually increases, a global cluster of connected nodes will suddenly emerge….(More)”.
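The sharp emergence of that global cluster is easy to see in a short simulation. The sketch below uses Python’s networkx (an assumed tool, not mentioned in the article) to build Erdős-Rényi random graphs and track how much of the network the largest component covers as the link probability crosses the classic 1/n threshold.

```python
# Sketch of the percolation phase transition: as random links are added,
# the largest connected cluster abruptly grows to span most of the network.
import networkx as nx

def largest_cluster_fraction(n: int, p: float, seed: int = 0) -> float:
    """Fraction of nodes in the biggest component of an Erdos-Renyi random graph."""
    g = nx.erdos_renyi_graph(n, p, seed=seed)
    return max(len(c) for c in nx.connected_components(g)) / n

n = 2000
# The giant component emerges around p = 1/n (the percolation threshold).
for multiple in (0.5, 0.8, 1.0, 1.2, 2.0, 4.0):
    frac = largest_cluster_fraction(n, multiple / n)
    print(f"p = {multiple:.1f}/n -> largest cluster covers {frac:.1%} of nodes")
```

Reading each node as a phone and each edge as two phones within radio range, the jump in that fraction is a rough proxy for the point at which crosstown messaging over a mesh network becomes likely.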