China’s fake science industry: how ‘paper mills’ threaten progress


Article by Eleanor Olcott, Clive Cookson and Alan Smith at the Financial Times: “…Over the past two decades, Chinese researchers have become some of the world’s most prolific publishers of scientific papers. The Institute for Scientific Information, a US-based research analysis organisation, calculated that China produced 3.7mn papers in 2021 — 23 per cent of global output — and just behind the 4.4mn total from the US.

At the same time, China has been climbing the ranks of the number of times a paper is cited by other authors, a metric used to judge output quality. Last year, China surpassed the US for the first time in the number of most cited papers, according to Japan’s National Institute of Science and Technology Policy, although that figure was flattered by multiple references to Chinese research that first sequenced the Covid-19 virus genome.

The soaring output has sparked concern in western capitals. Chinese advances in high-profile fields such as quantum technology, genomics and space science, as well as Beijing’s surprise hypersonic missile test two years ago, have amplified the view that China is marching towards its goal of achieving global hegemony in science and technology.

That concern is a part of a wider breakdown of trust in some quarters between western institutions and Chinese ones, with some universities introducing background checks on Chinese academics amid fears of intellectual property theft.

But experts say that China’s impressive output masks systemic inefficiencies and an underbelly of low-quality and fraudulent research. Academics complain about the crushing pressure to publish to gain prized positions at research universities…(More)”.

Machine Learning as a Tool for Hypothesis Generation


Paper by Jens Ludwig & Sendhil Mullainathan: “While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a systematic procedure to generate novel hypotheses about human behavior, which uses the capacity of machine learning algorithms to notice patterns people might not. We illustrate the procedure with a concrete application: judge decisions about who to jail. We begin with a striking fact: The defendant’s face alone matters greatly for the judge’s jailing decision. In fact, an algorithm given only the pixels in the defendant’s mugshot accounts for up to half of the predictable variation. We develop a procedure that allows human subjects to interact with this black-box algorithm to produce hypotheses about what in the face influences judge decisions. The procedure generates hypotheses that are both interpretable and novel: They are not explained by demographics (e.g. race) or existing psychology research; nor are they already known (even if tacitly) to people or even experts. Though these results are specific, our procedure is general. It provides a way to produce novel, interpretable hypotheses from any high-dimensional dataset (e.g. cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series). A central tenet of our paper is that hypothesis generation is in and of itself a valuable activity, and we hope this encourages future work in this largely “pre-scientific” stage of science…(More)”.
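The paper’s headline fact, that raw pixels alone predict a meaningful share of jailing decisions, can be pictured with a toy sketch like the one below. This is not the authors’ procedure; the data are random placeholders and the simple logistic model stands in for whatever image model they actually used.

```python
# Toy sketch: predict a binary decision from flattened image pixels.
# The arrays are synthetic placeholders, not real mugshots or outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, height, width = 2_000, 32, 32
X = rng.random((n_images, height * width))   # flattened grayscale pixels
y = rng.binomial(1, 0.4, n_images)           # 1 = detained, 0 = released (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

# Out-of-sample discrimination: with real data, this is where a
# "pixels alone carry predictive signal" result would show up.
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```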

Collaborative Advantage: Creating Global Commons for Science, Technology, and Innovation


Essay by Leonard Lynn and Hal Salzman: “…We argue that abandoning this techno-nationalist approach and instead investing in systems of global innovation commons, modeled on successful past experiences, and developing new principles and policies for collaborative STI could bring substantially greater benefits—not only for the world, but specifically for the United States. Key to this effort will be creating systems of governance that enable nations to contribute to the commons and to benefit from its innovations, while also allowing each country substantial freedom of action…

The competitive and insular tone of contemporary discourse about STI stands in contrast to our era’s most urgent challenges, which are global in scale: the COVID-19 pandemic, climate change, and governance of complex emerging technologies such as gene editing and artificial intelligence. These global challenges, we believe, require resources, scientific understanding, and know-how that can best be developed through common resource pools to enable both global scale and rapid dissemination. Moreover, aside from moral or ethical considerations about sharing such innovations, the reality of current globalization means that solutions—such as pandemic vaccines—must spread beyond national borders to fully benefit the world. Consequently, each separate national interest will be better served by collaboratively building up the global stocks of STI as public goods. Global scientific commons could be vital in addressing these challenges, but will require new frameworks for governance that are fair and attractive to many nations while also enabling them to act individually.

A valuable perspective on the governance of common pool resources (CPR) can be found in the work that Nobel laureate Elinor Ostrom did with her colleagues beginning in the 1950s. Ostrom, a political scientist, studied how communities that must share common resources—water, fisheries, or grazing land—use trust, cooperation, and collective deliberation to manage those resources over the long term. Before Ostrom’s work, many economists believed that shared resource systems were inherently unsustainable because individuals acting in their own self-interest would ultimately undermine the good of the group, often described as “the tragedy of the commons.” Instead, Ostrom demonstrated that communities can create durable “practical algorithms” for sharing pooled resources, whether that be irrigation in Nepal or lobster fishing in Maine…(More)”.

The Statistics That Come Out of Nowhere


Article by Ray Fisman, Andrew Gelman, and Matthew C. Stephenson: “This winter, the university where one of us works sent out an email urging employees to wear a hat on particularly cold days because “most body heat is lost through the top of the head.” Many people we know have childhood memories of a specific figure for how much heat you lose through your head—perhaps 50 percent or, by some accounts, 80 percent. But neither figure is scientific: One is flawed, and the other is patently wrong. A 2004 New York Times column debunking the claim traced its origin to a U.S. military study from the 1950s in which people dressed in neck-high Arctic-survival suits were sent out into the cold. Participants lost about half of their heat through the only part of their body that was exposed to the elements. Exaggeration by generations of parents got us up to 80 percent. (According to a hypothermia expert cited by the Times, a more accurate figure is 10 percent.)

This rather trivial piece of medical folklore is an example of a more serious problem: Through endless repetition, numbers of dubious origin take on the veneer of scientific fact, in many cases in the context of vital public-policy debates. Unreliable numbers are always just an internet search away, and serious people and institutions depend on and repeat seemingly precise quantitative measurements that turn out to have no reliable support…(More)”.

The big idea: should governments run more experiments?


Article by Stian Westlake: “…Conceived in haste in the early days of the pandemic, Recovery (which stands for Randomised Evaluation of Covid-19 Therapy) sought to find drugs to help treat people seriously ill with the novel disease. It brought together epidemiologists, statisticians and health workers to test a range of promising existing drugs at massive scale across the NHS.

The secret of Recovery’s success is that it was a series of large, fast, randomised experiments, designed to be as easy as possible for doctors and nurses to administer in the midst of a medical emergency. And it worked wonders: within three months, it had demonstrated that dexamethasone, a cheap and widely available steroid, reduced Covid deaths by a fifth to a third. In the months that followed, Recovery identified four more effective drugs, and along the way showed that various popular treatments, including hydroxychloroquine, President Trump’s tonic of choice, were useless. All in all, it is thought that Recovery saved a million lives around the world, and it’s still going.
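To make the scale of that result concrete, here is a minimal sketch of the arithmetic behind a two-arm trial estimate; the counts are invented for illustration and are not the Recovery data.

```python
# Minimal sketch of a two-arm randomised trial estimate (invented counts,
# not the Recovery data): risk ratio with a large-sample 95% CI.
import math

deaths_treat, n_treat = 95, 1_000    # treatment arm (e.g. a steroid)
deaths_ctrl,  n_ctrl  = 130, 1_000   # usual-care arm

risk_ratio = (deaths_treat / n_treat) / (deaths_ctrl / n_ctrl)
se_log_rr = math.sqrt(1/deaths_treat - 1/n_treat + 1/deaths_ctrl - 1/n_ctrl)
lo, hi = (math.exp(math.log(risk_ratio) + s * 1.96 * se_log_rr) for s in (-1, 1))

print(f"risk ratio {risk_ratio:.2f} (95% CI {lo:.2f}-{hi:.2f}), "
      f"i.e. roughly a {1 - risk_ratio:.0%} reduction in deaths")
```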

But Recovery’s incredible success should prompt us to ask a more challenging question: why don’t we do this more often? The question of which drugs to use was far from the only unknown we had to navigate in the early days of the pandemic. Consider the decision to delay second doses of the vaccine, when to close schools, or the right regime for Covid testing. In each case, the UK took a calculated risk and hoped for the best. But as the Royal Statistical Society pointed out at the time, it would have been cheap and quick to undertake trials so we could know for sure what the right choice was, and then double down on it.

There is a growing movement to apply randomised trials not just in healthcare but in other things government does…(More)”.

When Ideology Drives Social Science


Article by Michael Jindra and Arthur Sakamoto: “Last summer in these pages, Mordechai Levy-Eichel and Daniel Scheinerman uncovered a major flaw in Richard Jean So’s Redlining Culture: A Data History of Racial Inequality and Postwar Fiction, one that rendered the book’s conclusion null and void. Unfortunately, what they found was not an isolated incident. In complex areas like the study of racial inequality, a fundamentalism has taken hold that discourages sound methodology and the use of reliable evidence about the roots of social problems.

We are not talking about mere differences in interpretation of results, which are common. We are talking about mistakes so clear that they should cause research to be seriously questioned or even disregarded. A great deal of research — we will focus on examinations of Asian American class mobility — rigs its statistical methods in order to arrive at ideologically preferred conclusions.

Most sophisticated quantitative work in sociology involves multivariate research, often in a search for causes of social problems. This work might ask how a particular independent variable (e.g., education level) “causes” an outcome or dependent variable (e.g., income). Or it could study the reverse: How does parental income influence children’s education?

Human behavior is too complicated to be explained by only one variable, so social scientists typically try to “control” for various causes simultaneously. If you are trying to test for a particular cause, you want to isolate that cause and hold all other possible causes constant. One can control for a given variable using what is called multiple regression, a statistical tool that parcels out the separate net effects of several variables simultaneously.

If you want to determine whether income causes better education outcomes, you’d want to compare, for instance, only people from two-parent families, since family status might be another causal factor. You’d also want to isolate the effect of family status by comparing people with similar incomes. And so on for other variables.
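As a rough illustration of the “control for other causes” logic described above, here is a minimal sketch of a multiple regression on simulated data; the variables and effect sizes are invented, not drawn from the studies the article discusses.

```python
# Minimal sketch of multiple regression on simulated data. The variables
# (parental income, family structure, child's schooling) and coefficients
# are invented, purely to show how regression holds other causes constant.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000

two_parent = rng.binomial(1, 0.7, n)                   # family structure (0/1)
income = 30 + 20 * two_parent + rng.normal(0, 10, n)   # parental income ($000s)
education = 10 + 0.05 * income + 0.5 * two_parent + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([income, two_parent]))
fit = sm.OLS(education, X).fit()
print(fit.summary(xname=["const", "income", "two_parent"]))
# The "income" coefficient is its estimated net effect with family structure
# held constant; dropping two_parent from X would bias it upward, which is
# the omitted-variable problem the article turns to next.
```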

The problem is that there are potentially so many variables that a researcher inevitably leaves some out…(More)”.

Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good


Report by National Academies of Sciences, Engineering, and Medicine: “Historically, the U.S. national data infrastructure has relied on the operations of the federal statistical system and the data assets that it holds. Throughout the 20th century, federal statistical agencies aggregated survey responses of households and businesses to produce information about the nation and diverse subpopulations. The statistics created from such surveys provide most of what people know about the well-being of society, including health, education, employment, safety, housing, and food security. The surveys also contribute to an infrastructure for empirical social- and economic-sciences research. Research using survey-response data, with strict privacy protections, led to important discoveries about the causes and consequences of societal challenges and also informed policymakers. Like other infrastructure, people can easily take these essential statistics for granted. Only when they are threatened do people recognize the need to protect them…(More)”.

Ten lessons for data sharing with a data commons


Article by Robert L. Grossman: “…Lesson 1. Build a commons for a specific community with a specific set of research challenges

Although a few data repositories that serve the general scientific community have proved successful, data commons that target a specific user community have generally been the most successful. The first lesson is to build a data commons for a specific research community that is struggling to answer specific research challenges with data. As a consequence, a data commons is a partnership between the data scientists developing and supporting the commons and the disciplinary scientists with the research challenges.

Lesson 2. Successful commons curate and harmonize the data

Successful commons curate and harmonize the data and produce data products of broad interest to the community. It’s time-consuming, expensive, and labor-intensive to curate and harmonize data, but much of the value of a data commons is centralizing this work so that it can be done once instead of many times by each group that needs the data. These days, it is very easy to think of a data commons as a platform containing data, not spend the time curating or harmonizing it, and then be surprised that the data in the commons is not widely used and its impact is not as high as expected.

Lesson 3. It’s ultimately about the data and its value to generate new research discoveries

However important a study, few scientists will try to replicate it once it has been published. Instead, data is usually accessed if it can lead to a new high-impact paper. For this reason, data commons play two different but related roles. First, they preserve data for reproducible science; this accounts for a small fraction of data access but plays a critical role. Second, data commons make data available for new high-value science.

Lesson 4. Reduce barriers to access to increase usage

A useful rule of thumb is that every barrier to data access cuts down access by a factor of 10. Common barriers that reduce use of a commons include: registration vs no registration; open access vs controlled access; click-through agreements vs signing of data usage agreements and approval by data access committees; license restrictions on the use of the data vs no license restrictions…(More)”.
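Taken literally, that rule of thumb compounds quickly; a back-of-the-envelope sketch (the baseline audience figure is invented):

```python
# Back-of-the-envelope illustration of the "each barrier cuts access by ~10x"
# rule of thumb. The baseline audience figure is invented.
potential_users = 100_000
for n_barriers in range(4):
    print(f"{n_barriers} barrier(s): ~{potential_users / 10 ** n_barriers:,.0f} users")
# Three stacked barriers (registration, controlled access, signed agreements)
# would leave roughly a thousandth of the potential audience.
```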

Researchers scramble as Twitter plans to end free data access


Article by Heidi Ledford: “Akin Ünver has been using Twitter data for years. He investigates some of the biggest issues in social science, including political polarization, fake news and online extremism. But earlier this month, he had to set aside time to focus on a pressing emergency: helping relief efforts in Turkey and Syria after the devastating earthquake on 6 February.

Aid workers in the region have been racing to rescue people trapped by debris and to provide health care and supplies to those displaced by the tragedy. Twitter has been invaluable for collecting real-time data and generating crucial maps to direct the response, says Ünver, a computational social scientist at Özyeğin University in Istanbul.

So when he heard that Twitter was about to end its policy of providing free access to its application programming interface (API) — a pivotal set of rules that allows people to extract and process large amounts of data from the platform — he was dismayed. “Couldn’t come at a worse time,” he tweeted. “Most analysts and programmers that are building apps and functions for Turkey earthquake aid and relief, and are literally saving lives, are reliant on Twitter API.”…

Twitter has long offered academics free access to its API, an unusual approach that has been instrumental in the rise of computational approaches to studying social media. So when the company announced on 2 February that it would end that free access in a matter of days, it sent the field into a tailspin. “Thousands of research projects running over more than a decade would not be possible if the API wasn’t free,” says Patty Kostkova, who specializes in digital health studies at University College London…(More)”.
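For context on what is being withdrawn, the kind of programmatic access researchers built on looks roughly like this: a minimal sketch using the third-party tweepy library against Twitter’s v2 recent-search endpoint, with a placeholder token and query, and no guarantee it runs under whatever access tier Twitter offers now.

```python
# Minimal sketch of API-based data collection with the third-party tweepy
# library (Twitter API v2 recent search). The bearer token and query are
# placeholders; access depends on the tier Twitter currently offers.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

response = client.search_recent_tweets(
    query="deprem (yardım OR enkaz) -is:retweet",  # Turkish: earthquake, help, rubble
    tweet_fields=["created_at", "geo"],
    max_results=100,
)

for tweet in response.data or []:
    print(tweet.created_at, tweet.text[:80])
```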

Managing Intellectual Property Rights in Citizen Science: A Guide for Researchers and Citizen Scientists


Report by Teresa Scassa & Haewon Chung: “IP issues arise in citizen science in a variety of different ways. Indeed, the more broadly the concept of citizen science is cast, the more diverse the potential IP interests. Some community-based projects, for example, may well involve the sharing of traditional knowledge, whereas open innovation projects are ones that are most likely to raise patent issues and to do so in a context where commercialization is a project goal. Trademark issues may also arise, particularly where a project gains a certain degree of renown. In this study we touch on issues of patenting and commercialization; however, we also recognize that most citizen science projects do not have commercialization as an objective, and have IP issues that flow predominantly from copyright law. This guide navigates these issues topically and points the reader towards further research and law in this area should they wish to gain an even more comprehensive understanding of the nuances. It accompanies a prior study conducted by the same authors that created a Typology of Citizen Science Projects from an Intellectual Property Perspective…(More)”.