Data Stewardship on the map: A study of tasks and roles in Dutch research institutes

Report by Verheul, Ingeborg et al: “Good research requires good data stewardship. Data stewardship encompasses all the different tasks and responsibilities that relate to caring for data during the various phases of the whole research life cycle. The basic assumption is that the researcher himself/herself is primarily responsible for all data.

However, the researcher does need professional support to achieve this. To that end, diverse supportive data stewardship roles and functions have evolved in recent years. Often they have developed over the course of time.

Their functional implementation depends largely on their place in the organization. This comes as no surprise when one considers that data stewardship consists of many facets that are traditionally assigned to different departments. Researchers regularly take on data stewardship tasks as well, not only for themselves but also in a wider context for a research group. This data stewardship work often remains unnoticed….(More)”.

The death of the literature review and the rise of the dynamic knowledge map

Gorgi Krlev at LSE Impact Blog: “Literature reviews are a core part of academic research that are loathed by some and loved by others. The LSE Impact Blog recently presented two proposals on how to deal with the issues raised by literature reviews: Richard P. Phelps argues, due to their numerous flaws, we should simply get rid of them as a requirement in scholarly articles. In contrast, Arnaud Vaganay proposes, despite their flaws, we can save them by means of standardization that would make them more robust. Here, I put forward an alternative that strikes a balance between the two: Let’s build databases that help systemize academic research. There are examples of such databases in evidence-based health-care, why not replicate those examples more widely?

The seed of the thought underlying my proposition of building dynamic knowledge maps in the social sciences and humanities was planted in 2014. I was attending a talk within Oxford’s evidence-based healthcare programme. Jon Brassey, the main speaker of the event and founder of the TRIP database, was explaining his life goal: making systematic reviews and meta-analyses in healthcare research redundant! His argument was that a database containing all available research on treatment of a symptom, migraine for instance, would be able to summarize and display meta-effects within seconds, whereas a thorough meta-analysis would require weeks, if not months, if done by a conventional research team.

Although still imperfect, TRIP has made significant progress in realizing this vision. The most recent addition to the database are “evidence maps” that visualize what we know about effective treatments. Evidence maps compare alternative treatments based on all available studies. They indicate effectiveness of a treatment, the “size” of evidence underscoring the claim and the risk of bias contained in the underlying studies. Here and below is an example based on 943 studies, as of today, dealing with effective treatment of migraine, indicating aggregated study size and risk of bias.

Source: TRIP database

There have been heated debates about the value and relevance of academic research (propositions have centred on intensifying research on global challenges or harnessing data for policy impact), its rigor (for example reproducibility), and the speed of knowledge production, including the “glacial pace of academic publishing”. Literature reviews, for the reasons laid out by Phelps and Vaganay, suffer from imperfections that make them: time consuming, potentially incomplete or misleading, erratic, selective, and ultimately blurry rather than insightful. As a result, conducting literature reviews is arguably not an effective use of research time and only adds to wider inefficiencies in research….(More)”.

New Report Examines Reproducibility and Replicability in Science, Recommends Ways to Improve Transparency and Rigor in Research

National Academies of Sciences: “While computational reproducibility in scientific research is generally expected when the original data and code are available, lack of ability to replicate a previous study — or obtain consistent results looking at the same scientific question but with different data — is more nuanced and occasionally can aid in the process of scientific discovery, says a new congressionally mandated report from the National Academies of Sciences, Engineering, and Medicine.  Reproducibility and Replicability in Science recommends ways that researchers, academic institutions, journals, and funders should help strengthen rigor and transparency in order to improve the reproducibility and replicability of scientific research.

Defining Reproducibility and Replicability

The terms “reproducibility” and “replicability” are often used interchangeably, but the report uses each term to refer to a separate concept.  Reproducibility means obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.  Replicability means obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.   

Reproducing research involves using the original data and code, while replicating research involves new data collection and similar methods used in previous studies, the report says.  Even when a study was rigorously conducted according to best practices, correctly analyzed, and transparently reported, it may fail to be replicated. 

“Being able to reproduce the computational results of another researcher starting with the same data and replicating a previous study to test its results facilitate the self-correcting nature of science, and are often cited as hallmarks of good science,” said Harvey Fineberg, president of the Gordon and Betty Moore Foundation and chair of the committee that conducted the study.  “However, factors such as lack of transparency of reporting, lack of appropriate training, and methodological errors can prevent researchers from being able to reproduce or replicate a study.  Research funders, journals, academic institutions, policymakers, and scientists themselves each have a role to play in improving reproducibility and replicability by ensuring that scientists adhere to the highest standards of practice, understand and express the uncertainty inherent in their conclusions, and continue to strengthen the interconnected web of scientific knowledge — the principal driver of progress in the modern world.”….(More)”.

As Surveys Falter Big Data Polling Narrows Our Societal Understanding

Kalev Leetaru at Forbes: “One of the most talked-about stories in the world of polling and survey research in recent years has been the gradual death of survey response rates and the reliability of those insights….

The online world’s perceived anonymity has offered some degree of reprieve in which online polls and surveys have often bested traditional approaches in assessing views towards society’s most controversial issues. Yet, here as well increasing public understanding of phishing and online safety are ever more problematic.

The answer has been the rise of “big data” analysis of society’s digital exhaust to fill in the gaps….

Is it truly the same answer though?

Constructing and conducting a well-designed survey means being able to ask the public exactly the questions of interest. Most importantly, it entails being able to ensure representative demographics of respondents.

An online-only poll is unlikely to accurately capture the perspectives of the three quarters of the earth’s population that the digital revolution has left behind. Even within the US, social media platforms are extraordinarily skewed.

The far greater problem is that society’s data exhaust is rarely a perfect match for the questions of greatest interest to policymakers and public.

Cellphone mobility records can offer an exquisitely detailed look at how the people of a city go about their daily lives, but beneath all that blinding light are the invisible members of society not deemed valuable to advertisers and thus not counted. Even for the urban society members whose phones are their ever-present companions, mobility data only goes so far. It can tell us that occupants of a particular part of the city during the workday spend their evenings in a particular part of the city, allowing us to understand their work/life balance, but it offers few insights into their political leanings.

One of the greatest challenges of today’s “big data” surveying is that it requires us to narrow our gaze to only those questions which can be easily answered from the data at hand.

Much as AI’s crisis of bias comes from the field’s steadfast refusal to pay for quality data, settling for highly biased free data, so too has “big data” surveying limited itself largely to datasets it can freely and easily acquire.

The result is that with traditional survey research, we are free to ask the precise questions we are most interested in. With data exhaust research, we must imperfectly shoehorn our questions into the few available metrics. With sufficient creativity it is typically possible to find some way of proxying the given question, but the resulting proxies may be highly unstable, with little understanding of when and where they may fail.

Much like how the early rise of the cluster computing era caused “big data” researchers to limit the questions they asked of their data to just those they could fit into a set of tiny machines, so too has the era of data exhaust surveying forced us to greatly restrict our understanding of society.

Most dangerously, however, big data surveying implicitly means we are measuring only the portion of society our vast commercial surveillance state cares about.

In short, we are only able to measure those deemed of greatest interest to advertisers and thus the most monetizable.

Putting this all together, the decline of traditional survey research has led to the rise of “big data” analysis of society’s data exhaust. Instead of giving us an unprecedented new view into the heartbeat of daily life, this reliance on the unintended output of our digital lives has forced researchers to greatly narrow the questions they can explore and severely skews them to the most “monetizable” portions of society.

In the end, the shift of societal understanding from precision surveys to the big data revolution has led not to an incredible new understanding of what makes us tick, but rather a far smaller, less precise and less accurate view than ever before, just as our need to understand ourselves has never been greater….(More)”.

Computational Social Science of Disasters: Opportunities and Challenges

Paper by Annetta Burger, Talha Oz, William G. Kennedy and Andrew T. Crooks: “Disaster events and their economic impacts are trending, and climate projection studies suggest that the risks of disaster will continue to increase in the near future. Despite the broad and increasing social effects of these events, the empirical basis of disaster research is often weak, partially due to the natural paucity of observed data. At the same time, some of the early research regarding social responses to disasters have become outdated as social, cultural, and political norms have changed. The digital revolution, the open data trend, and the advancements in data science provide new opportunities for social science disaster research.

We introduce the term computational social science of disasters (CSSD), which can be formally defined as the systematic study of the social behavioral dynamics of disasters utilizing computational methods. In this paper, we discuss and showcase the opportunities and the challenges in this new approach to disaster research.

Following a brief review of the fields that relate to CSSD, namely traditional social sciences of disasters, computational social science, and crisis informatics, we examine how advances in Internet technologies offer a new lens through which to study disasters. By identifying gaps in the literature, we show how this new field could address ways to advance our understanding of the social and behavioral aspects of disasters in a digitally connected world. In doing so, our goal is to bridge the gap between data science and the social sciences of disasters in rapidly changing environments….(More)”.

Group decisions: When more information isn’t necessarily better

News Release from the Santa Fe Institute: “In nature, group decisions are often a matter of life or death. At first glance, the way certain groups of animals like minnows branch off into smaller sub-groups might seem counterproductive to their survival. After all, information about, say, where to find some tasty fish roe or which waters harbor more of their predators, would flow more freely and seem to benefit more minnows if the school of fish behaved as a whole. However, new research published in Philosophical Transactions of the Royal Society B sheds light on the complexity of collective decision-making and uncovers new insights into the benefits of the internal structure of animal groups.

In their paper, Albert Kao, a Baird Scholar and Omidyar Fellow at the Santa Fe Institute, and Iain Couzin, Director of the Max Planck Institute for Ornithology and Chair of Biodiversity and Collective Behavior at the University of Konstanz, simulate the information-sharing patterns of animals that prefer to interact with certain individuals over others. The authors’ modeling of such animal groups upends previously held assumptions about internal group structure and improves upon our understanding of the influence of group organization and environment on both the collective decision-making process and its accuracy.

Modular — or cliquey — group structure isolates the flow of communication between individuals, so that only certain animals are privy to certain pieces of information. “A feature of modular structure is that there’s always information loss,” says Kao, “but the effect of that information loss on accuracy depends on the environment.”

In simple environments, the impact of these modular groups is detrimental to accuracy, but when animals face many different sources of information, the effect is actually the opposite. “Surprisingly,” says Kao, “in complex environments, the information loss even helps accuracy in a lot of situations.” More information, in this case, is not necessarily better.

“Modular structure can have a profound — and unexpected — impact on the collective intelligence of groups,” says Couzin. “This may indeed be one of the reasons that we see internal structure in so many group-living species, from schooling fish and flocking birds to wild primate groups.”

Potentially, these new observations could be applied to many different kinds of social networks, from the migration patterns of birds to the navigation of social media landscapes to the organization of new companies, deepening our grasp of complex organization and collective behavior….(More)”.

(The paper, “Modular structure within groups causes information loss but can improve decision accuracy,” is part of a theme issue in the Philosophical Transactions of the Royal Society B entitled “Liquid Brains, Solid Brains: How distributed cognitive architectures process information.” The issue was inspired by a Santa Fe Institute working group and edited by Ricard Solé (Universitat Pompeu Fabra), Melanie Moses (University of New Mexico), and Stephanie Forrest (Arizona State University).)

Finding Wisdom in Politically Polarized Crowds

Eamon Duede at Nature Research: “We were seeing that the consumption of ideas seemed deeply related to political alignment, and because our group (Knowledge Lab) is concerned with understanding the social dynamics involved in production of ideas, we began wondering whether and to what extent the political alignment of individuals contributes to a group’s ability to produce knowledge. A Wikipedia article is full of smuggled content and worked into a narrative by a diverse team of editors. Because those articles constitute knowledge, we were curious to know whether political polarization within those teams had an effect on the quality of that production. So, we decided to braid both strands of research together and look at the way in which individual political alignments and the polarization of the teams they form affect the quality of the work that is produced collaboratively on Wikipedia.

To answer this question, we turned not to the article itself, but the immense history of articles on Wikipedia. Every edit to every article, no matter how insignificant, is documented and saved in Wikipedia’s astonishingly massive archives. And every edit to every article, no matter how insignificant, is evaluated for its relevance or validity by the vast community of editors, both robotic and human. Remarkable teamwork has gone into producing the encyclopedia. Some people edit randomly, simply cleaning typos, adding citations, or contributing graffiti and vandalism (I’ve experimented with this, and it gets painted over very quickly, no matter where you put it). Yet, many people are genuinely purposeful in their work, and contribute specifically to topics on which they have both interest and knowledge. They tend and grow a handful of articles or a few broad topics like gardeners. We walked through the histories of these gardens, looking back at who made contributions here and there, how much they contributed, and where. We thought that editors who make frequent contributions to pages associated with American liberalism would hold left leaning opinions, and for conservatism opinions on the right. This was a controversial hypothesis, and many in the Wikipedia community felt that perhaps the opposite would be true, with liberals correcting conservative pages and conservatives kindly returning the favor, like weeding or applying pesticide. But a survey we conducted of active Wikipedia editors found that building a function over the relative number of bits they contributed to liberal versus conservative pages predicted more than a third of the probability that they identified as such and voted accordingly.

Following this validation, we assigned a political alignment score to hundreds of thousands of editors by looking at where they make contributions, and then examined the polarization within teams of editors that produced hundreds of thousands of Wikipedia articles in the broad topic areas of politics, social issues, and science. We found that when most members of a team have the same political alignment, whether conservative, liberal, or “independent”, the quality of the Wikipedia pages they produce is not as strong as those of teams with polarized compositions of editors (Shi et al. 2019).

The United States Senate is increasingly polarized, but largely balanced in its polarization. If the Senate was trying to write a Wikipedia article, would they produce a high quality article? If they are doing so on Wikipedia, following norms of civility and balance inscribed within Wikipedia’s policies and guidelines, committed to the production of knowledge rather than self-promotion, then the answer is probably “yes”. That is a surprising finding. We think that the reason for this is that the policies of Wikipedia work to suppress the kind of rhetoric and sophistry common in everyday discourse, not to mention toxic language and name calling. Wikipedia’s policies are intolerant of discussion that could distort balanced consideration of the edit and topic under consideration, and, given that these policies shut down discourse that could bias proposed edits, teams with polarized viewpoints have to spend significantly more time discussing and debating the content that is up for consideration for inclusion in an article. These diverse viewpoints seem to bring out points and arguments between team members that sharpen and refine the quality of the content they can collectively agree to. With assumptions and norms of respect and civility, political polarization can be powerful and generative….(More)”

Innovation Meets Citizen Science

Caroline Nickerson at SciStarter: “Citizen science has been around as long as science, but innovative approaches are opening doors to more and deeper forms of public participation.

Below, our editors spotlight a few projects that feature new approaches, novel research, or low-cost instruments. …

Colony B: Unravel the secrets of microscopic life! Colony B is a mobile gaming app developed at McGill University that enables you to contribute to research on microbes. Collect microbes and grow your colony in a fast-paced puzzle game that advances important scientific research.

AirCasting: AirCasting is an open-source, end-to-end solution for collecting, displaying, and sharing health and environmental data using your smartphone. The platform consists of wearable sensors, including a palm-sized air quality monitor called the AirBeam, that detect and report changes in your environment. (Android only.)

LingoBoingo: Getting computers to understand language requires large amounts of linguistic data and “correct” answers to language tasks (what researchers call “gold standard annotations”). Simply by playing language games online, you can help archive languages and create the linguistic data used by researchers to improve language technologies. These games are in English, French, and a new “multi-lingual” category.

TreeSnap: Help our nation’s trees and protect human health in the process. Invasive diseases and pests threaten the health of America’s forests. With the TreeSnap app, you can record the location and health of particular tree species–those unharmed by diseases that have wiped out other species. Scientists then use the collected information to locate candidates for genetic sequencing and breeding programs. Tag trees you find in your community, on your property, or out in the wild to help scientists understand forest health….(More)”.

Advancing Computational Biology and Bioinformatics Research Through Open Innovation Competitions

HBS Working Paper by Andrea Blasco et al: “Open data science and algorithm development competitions offer a unique avenue for rapid discovery of better computational strategies. We highlight three examples in computational biology and bioinformatics research where the use of competitions has yielded significant performance gains over established algorithms. These include algorithms for antibody clustering, imputing gene expression data, and querying the Connectivity Map (CMap). Performance gains are evaluated quantitatively using realistic, albeit sanitized, data sets. The solutions produced through these competitions are then examined with respect to their utility and the prospects for implementation in the field. We present the decision process and competition design considerations that lead to these successful outcomes as a model for researchers who want to use competitions and non-domain crowds as collaborators to further their research….(More)”.

Progression of the Inevitable

Kevin Kelly at Technium: “…The procession of technological discoveries is inevitable. When the conditions are right — when the necessary web of supporting technology needed for every invention is established — then the next adjacent technological step will emerge as if on cue. If inventor X does not produce it, inventor Y will. The invention of the microphone, the laser, the transistor, the steam turbine, the waterwheel, and the discoveries of oxygen, DNA, and Boolean logic, were all inevitable in roughly the period they appeared. However the particular form of the microphone, its exact circuit, or the specific design of the laser, or the particular materials of the transistor, or the dimensions of the steam turbine, or the peculiar notation of the formula, or the specifics of any invention are not inevitable. Rather they will vary quite widely due to the personality of their finder, the resources at hand, the culture of society they are born into, the economics funding the discovery, and the influence of luck and chance. An incandescent light bulb based on a coil of carbonized bamboo filament heated within a vacuum bulb is not inevitable, but “the electric incandescent light bulb” is. The concept of “the electric incandescent light bulb” abstracted from all the details that can vary while still producing the result — luminance from electricity, for instance  —  is ordained by the technium’s trajectory. We know this because “the electric incandescent light bulb” was invented, re-invented, co-invented, or “first invented” dozens of times. In their book “Edison’s Electric Light: Biography of an Invention”, Robert Friedel and Paul Israel list 23 inventors of incandescent bulbs prior to Edison. It might be fairer to say that Edison was the very last “first” inventor of the electric light.


Three independently invented electric light bulbs: Edison’s, Swan’s, and Maxim’s.

Any claim of inevitability is difficult to prove. Convincing proof requires re-running a progression more than once and showing that the outcome is the same each time. That no matter what perturbations thrown at the system, it yields an identical result. To claim that the large-scale trajectory of the technium is inevitable would mean demonstrating that if we re-ran history, the same abstracted inventions would arise again, and in roughly the same relative order.  Without a time machine, there’ll be no indisputable proof, but we do have three types of evidence that suggest that the paths of technologies are inevitable. They are 1) that quantifiable trajectories of progress don’t waver despite attempts to shift them (see my Moore’s Law); 2) that in ancient times when transcontinental communication was slow or null, we find independent timelines of technology in different continents converging upon a set order; and 3) the fact that most inventions and discoveries have been made independently by more than one person….(More)”.