The 9 Pitfalls of Data Science


Book by Gary Smith and Jay Cordes: “Data science has never had more influence on the world. Large companies are now seeing the benefit of employing data scientists to interpret the vast amounts of data that now exist. However, the field is so new and is evolving so rapidly that the analysis produced can be haphazard at best. 

The 9 Pitfalls of Data Science shows us real-world examples of what can go wrong. Written to be an entertaining read, this invaluable guide investigates the all too common mistakes of data scientists – who can be plagued by lazy thinking, whims, hunches, and prejudices – and indicates how they have been at the root of many disasters, including the Great Recession. 

Gary Smith and Jay Cordes emphasise how scientific rigor and critical thinking skills are indispensable in this age of Big Data, as machines often find meaningless patterns that can lead to dangerous false conclusions. The 9 Pitfalls of Data Science is loaded with entertaining tales of both grand successes and epic failures in interpreting data. These cautionary tales will not only help data scientists be more effective, but also help the public distinguish between good and bad data science….(More)”.

Introduction to Decision Intelligence


Blog post by Cassie Kozyrkov: “…Decision intelligence is a new academic discipline concerned with all aspects of selecting between options. It brings together the best of applied data science, social science, and managerial science into a unified field that helps people use data to improve their lives, their businesses, and the world around them. It’s a vital science for the AI era, covering the skills needed to lead AI projects responsibly and design objectives, metrics, and safety-nets for automation at scale.

Let’s take a tour of its basic terminology and concepts. The sections are designed to be friendly to skim-reading (and skip-reading too, that’s where you skip the boring bits… and sometimes skip the act of reading entirely).

What’s a decision?

Data are beautiful, but it’s decisions that are important. It’s through our decisions — our actions — that we affect the world around us.

We define the word “decision” to mean any selection between options by any entity, so the conversation is broader than MBA-style dilemmas (like whether to open a branch of your business in London).

In this terminology, labeling a photo as cat versus not-cat is a decision executed by a computer system, while figuring out whether to launch that system is a decision taken thoughtfully by the human leader (I hope!) in charge of the project.

What’s a decision-maker?

In our parlance, a “decision-maker” is not that stakeholder or investor who swoops in to veto the machinations of the project team, but rather the person who is responsible for decision architecture and context framing. In other words, a creator of meticulously-phrased objectives as opposed to their destroyer.

What’s decision-making?

Decision-making is a word that is used differently by different disciplines, so it can refer to:

  • taking an action when there were alternative options (in this sense it’s possible to talk about decision-making by a computer or a lizard).
  • performing the function of a (human) decision-maker, part of which is taking responsibility for decisions. Even though a computer system can execute a decision, it will not be called a decision-maker because it does not bear responsibility for its outputs — that responsibility rests squarely on the shoulders of the humans who created it.

Decision intelligence taxonomy

One way to approach learning about decision intelligence is to break it along traditional lines into its quantitative aspects (largely overlapping with applied data science) and qualitative aspects (developed primarily by researchers in the social and managerial sciences)….(More)”.


Trust and Mistrust in Americans’ Views of Scientific Experts


Report by the Pew Research Center: “In an era when science and politics often appear to collide, public confidence in scientists is on the upswing, and six-in-ten Americans say scientists should play an active role in policy debates about scientific issues, according to a new Pew Research Center survey.

The survey finds public confidence in scientists on par with confidence in the military. It also exceeds the levels of public confidence in other groups and institutions, including the media, business leaders and elected officials.

At the same time, Americans are divided along party lines in terms of how they view the value and objectivity of scientists and their ability to act in the public interest. And, while political divides do not carry over to views of all scientists and scientific issues, there are particularly sizable gaps between Democrats and Republicans when it comes to trust in scientists whose work is related to the environment.

Higher levels of familiarity with the work of scientists are associated with more positive and more trusting views of scientists regarding their competence, credibility and commitment to the public, the survey shows….(More)”.

Innovation Beyond Technology: Science for Society and Interdisciplinary Approaches


Book edited by Sébastien Lechevalier: “The major purpose of this book is to clarify the importance of non-technological factors in innovation to cope with contemporary complex societal issues while critically reconsidering the relations between science, technology, innovation (STI), and society. For a few decades now, innovation—mainly derived from technological advancement—has been considered a driving force of economic and societal development and prosperity.

With that in mind, the following questions are dealt with in this book: What are the non-technological sources of innovation? What can the progress of STI bring to humankind? What roles will society be expected to play in the new model of innovation? The authors argue that the majority of so-called technological innovations are actually socio-technical innovations, requiring huge resources for financing activities, adapting regulations, designing adequate policy frames, and shaping new uses and new users while having the appropriate interaction with society.

This book gathers multi- and trans-disciplinary approaches in innovation that go beyond technology and take into account the inter-relations with social and human phenomena. Illustrated by carefully chosen examples and based on broad and well-informed analyses, it is highly recommended to readers who seek an in-depth and up-to-date integrated overview of innovation in its non-technological dimensions….(More)”.

For academics, what matters more: journal prestige or readership?


Katie Langin at Science: “With more than 30,000 academic journals now in circulation, academics can have a hard time figuring out where to submit their work for publication. The decision is made all the more difficult by the sky-high pressure of today’s academic environment—including working toward tenure and trying to secure funding, which can depend on a researcher’s publication record. So, what does a researcher prioritize?

According to a new study posted on the bioRxiv preprint server, faculty members say they care most about whether the journal is read by the people they most want to reach—but they think their colleagues care most about journal prestige. Perhaps unsurprisingly, prestige also held more sway for untenured faculty members than for their tenured colleagues.

“I think that it is about the security that comes with being later in your career,” says study co-author Juan Pablo Alperin, an assistant professor in the publishing program at Simon Fraser University in Vancouver, Canada. “It means you can stop worrying so much about the specifics of what is being valued; there’s a lot less at stake.”

According to a different preprint that Alperin and his colleagues posted on PeerJ in April, 40% of research-intensive universities in the United States and Canada explicitly mention that journal impact factors can be considered in promotion and tenure decisions. More likely do so unofficially, with faculty members using journal names on a CV as a kind of shorthand for how “good” a candidate’s publication record is. “You can’t ignore the fact that journal impact factor is a reality that gets looked at,” Alperin says. But some argue that journal prestige and impact factor are overemphasized and harm science, and that academics should focus on the quality of individual work rather than journal-wide metrics. 

In the new study, only 31% of the 338 faculty members who were surveyed—all from U.S. and Canadian institutions and from a variety of disciplines, including 38% in the life and physical sciences and math—said that journal prestige was “very important” to them when deciding where to submit a manuscript. The highest priority was journal readership, which half said was very important. Fewer respondents felt that publication costs (24%) and open access (10%) deserved the highest importance rating.

But, when those same faculty members were asked to assess how their colleagues make the same decision, journal prestige shot to the top of the list, with 43% of faculty members saying that it was very important to their peers when deciding where to submit a manuscript. Only 30% of faculty members thought the same thing about journal readership—a drop of 20 percentage points compared with how faculty members assessed their own motivations….(More)”.

The Hidden Costs of Automated Thinking


Jonathan Zittrain in The New Yorker: “Like many medications, the wakefulness drug modafinil, which is marketed under the trade name Provigil, comes with a small, tightly folded paper pamphlet. For the most part, its contents—lists of instructions and precautions, a diagram of the drug’s molecular structure—make for anodyne reading. The subsection called “Mechanism of Action,” however, contains a sentence that might induce sleeplessness by itself: “The mechanism(s) through which modafinil promotes wakefulness is unknown.”

Provigil isn’t uniquely mysterious. Many drugs receive regulatory approval, and are widely prescribed, even though no one knows exactly how they work. This mystery is built into the process of drug discovery, which often proceeds by trial and error. Each year, any number of new substances are tested in cultured cells or animals; the best and safest of those are tried out in people. In some cases, the success of a drug promptly inspires new research that ends up explaining how it works—but not always. Aspirin was discovered in 1897, and yet no one convincingly explained how it worked until 1995. The same phenomenon exists elsewhere in medicine. Deep-brain stimulation involves the implantation of electrodes in the brains of people who suffer from specific movement disorders, such as Parkinson’s disease; it’s been in widespread use for more than twenty years, and some think it should be employed for other purposes, including general cognitive enhancement. No one can say how it works.

This approach to discovery—answers first, explanations later—accrues what I call intellectual debt. It’s possible to discover what works without knowing why it works, and then to put that insight to use immediately, assuming that the underlying mechanism will be figured out later. In some cases, we pay off this intellectual debt quickly. But, in others, we let it compound, relying, for decades, on knowledge that’s not fully known.

In the past, intellectual debt has been confined to a few areas amenable to trial-and-error discovery, such as medicine. But that may be changing, as new techniques in artificial intelligence—specifically, machine learning—increase our collective intellectual credit line. Machine-learning systems work by identifying patterns in oceans of data. Using those patterns, they hazard answers to fuzzy, open-ended questions. Provide a neural network with labelled pictures of cats and other, non-feline objects, and it will learn to distinguish cats from everything else; give it access to medical records, and it can attempt to predict a new hospital patient’s likelihood of dying. And yet, most machine-learning systems don’t uncover causal mechanisms. They are statistical-correlation engines. They can’t explain why they think some patients are more likely to die, because they don’t “think” in any colloquial sense of the word—they only answer. As we begin to integrate their insights into our lives, we will, collectively, begin to rack up more and more intellectual debt….(More)”.
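The idea that such systems answer without explaining can be shown in miniature. The sketch below is a hypothetical illustration, not any real system: a one-nearest-neighbor classifier, one of the simplest statistical-correlation engines, labels new inputs purely by similarity to past examples. The feature values and labels are invented for the example.

```python
import math

# Hypothetical labeled data: (feature vector, label) pairs.
# A real system would use image pixels or medical records; these
# toy numbers just stand in for learned patterns.
labeled = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.1, 0.2), "not-cat"),
    ((0.2, 0.1), "not-cat"),
]

def predict(x):
    """Answer by proximity to past examples -- no causal mechanism involved."""
    nearest = min(labeled, key=lambda rec: math.dist(x, rec[0]))
    return nearest[1]

print(predict((0.85, 0.75)))  # -> cat
print(predict((0.15, 0.25)))  # -> not-cat
# The model can answer, but it cannot say *why*: it has only matched patterns.
```

Ask this "model" to justify a prediction and there is nothing to point to beyond distances in feature space: the answer comes first, and any explanation is a debt left outstanding.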

The plan to mine the world’s research papers


Priyanka Pulla in Nature: “Carl Malamud is on a crusade to liberate information locked up behind paywalls — and his campaigns have scored many victories. He has spent decades publishing copyrighted legal documents, from building codes to court records, and then arguing that such texts represent public-domain law that ought to be available to any citizen online. Sometimes, he has won those arguments in court. Now, the 60-year-old American technologist is turning his sights on a new objective: freeing paywalled scientific literature. And he thinks he has a legal way to do it.

Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi. “This is not every journal article ever written, but it’s a lot,” Malamud says. It’s comparable to the size of the core collection in the Web of Science database, for instance. Malamud and his JNU collaborator, bioinformatician Andrew Lynn, call their facility the JNU data depot.

No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.

The unprecedented project is generating much excitement because it could, for the first time, open up vast swathes of the paywalled literature for easy computerized analysis. Dozens of research groups already mine papers to build databases of genes and chemicals, map associations between proteins and diseases, and generate useful scientific hypotheses. But publishers control — and often limit — the speed and scope of such projects, which typically confine themselves to abstracts, not full text. Researchers in India, the United States and the United Kingdom are already making plans to use the JNU store instead. Malamud and Lynn have held workshops at Indian government laboratories and universities to explain the idea. “We bring in professors and explain what we are doing. They get all excited and they say, ‘Oh gosh, this is wonderful’,” says Malamud.

But the depot’s legal status isn’t yet clear. Malamud, who contacted several intellectual-property (IP) lawyers before starting work on the depot, hopes to avoid a lawsuit. “Our position is that what we are doing is perfectly legal,” he says. For the moment, he is proceeding with caution: the JNU data depot is air-gapped, meaning that no one can access it from the Internet. Users have to physically visit the facility, and only researchers who want to mine for non-commercial purposes are currently allowed in. Malamud says his team does plan to allow remote access in the future. “The hope is to do this slowly and deliberately. We are not throwing this open right away,” he says….(More)”.

The Lives and After Lives of Data


Paper by Christine L. Borgman: “The most elusive term in data science is ‘data.’ While often treated as objects to be computed upon, data is a theory-laden concept with a long history. Data exist within knowledge infrastructures that govern how they are created, managed, and interpreted. By comparing models of data life cycles, implicit assumptions about data become apparent. In linear models, data pass through stages from beginning to end of life, which suggest that data can be recreated as needed. Cyclical models, in which data flow in a virtuous circle of uses and reuses, are better suited for irreplaceable observational data that may retain value indefinitely. In astronomy, for example, observations from one generation of telescopes may become calibration and modeling data for the next generation, whether digital sky surveys or glass plates. The value and reusability of data can be enhanced through investments in knowledge infrastructures, especially digital curation and preservation. Determining what data to keep, why, how, and for how long, is the challenge of our day…(More)”.

An archeological space oddity


Nick Paumgarten at the New Yorker: “…Parcak is a pioneer in the use of remote sensing, via satellite, to find and map potential locations that would otherwise be invisible to us. Variations in the chemical composition of the earth reveal the ghost shadows of ancient walls and citadels, watercourses and planting fields. The nifty kid-friendly name for all this is “archeology from space,” which also happens to be the title of Parcak’s new book. That’s a bit of a misnomer, because, technically, the satellites in question are in the mid-troposphere, and also the archeology still happens on, or under, the ground. In spite of the whiz-bang abracadabra of the multispectral imagery, Parcak is, at heart, a shovel bum…..Another estimate of Parcak’s, based on satellite data: there are roughly fifty million unmapped archeological sites around the world. Many, if not most, will be gone or corrupted by 2040, she says, the threats being not just looting but urban development, illegal construction, and climate change. In 2016, Parcak won the TED Prize, a grant of a million dollars; she used it to launch a project called GlobalXplorer, a crowdsourcing platform, by which citizen Indiana Joneses can scrutinize satellite maps and identify potential new sites, adding these to a database without publicly revealing the coördinates. The idea is to deploy more eyeballs (and, ultimately, more benevolent shovel bums) in the race against carbon and greed….(More)”.

The war to free science


Brian Resnick and Julia Belluz at Vox: “The 27,500 scientists who work for the University of California generate 10 percent of all the academic research papers published in the United States.

Their university recently put them in a strange position: Sometime this year, these scientists will not be able to directly access much of the world’s published research they’re not involved in.

That’s because in February, the UC system — one of the country’s largest academic institutions, encompassing Berkeley, Los Angeles, Davis, and several other campuses — dropped its nearly $11 million annual subscription to Elsevier, the world’s largest publisher of academic journals.

On the face of it, this seemed like an odd move. Why cut off students and researchers from academic research?

In fact, it was a principled stance that may herald a revolution in the way science is shared around the world.

The University of California decided it doesn’t want scientific knowledge locked behind paywalls, and thinks the cost of academic publishing has gotten out of control.

Elsevier owns around 3,000 academic journals, and its articles account for some 18 percent of all the world’s research output. “They’re a monopolist, and they act like a monopolist,” says Jeffrey MacKie-Mason, head of the campus libraries at UC Berkeley and co-chair of the team that negotiated with the publisher. Elsevier makes huge profits on its journals, generating billions of dollars a year for its parent company RELX.

This is a story about more than subscription fees. It’s about how a private industry has come to dominate the institutions of science, and how librarians, academics, and even pirates are trying to regain control.

The University of California is not the only institution fighting back. “There are thousands of Davids in this story,” says University of California Davis librarian MacKenzie Smith, who, like so many other librarians around the world, has been pushing for more open access to science. “But only a few big Goliaths.”…(More)”.