Science Isn’t Broken


Christie Aschwanden at FiveThirtyEight: “Yet even in the face of overwhelming evidence, it’s hard to let go of a cherished idea, especially one a scientist has built a career on developing. And so, as anyone who’s ever tried to correct a falsehood on the Internet knows, the truth doesn’t always win, at least not initially, because we process new evidence through the lens of what we already believe. Confirmation bias can blind us to the facts; we are quick to make up our minds and slow to change them in the face of new evidence.

A few years ago, Ioannidis and some colleagues searched the scientific literature for references to two well-known epidemiological studies suggesting that vitamin E supplements might protect against cardiovascular disease. These studies were followed by several large randomized clinical trials that showed no benefit from vitamin E and one meta-analysis finding that at high doses, vitamin E actually increased the risk of death.

Human fallibilities send the scientific process hurtling in fits, starts and misdirections instead of in a straight line from question to truth.

Despite the contradictory evidence from more rigorous trials, the first studies continued to be cited and defended in the literature. Shaky claims about beta carotene’s ability to reduce cancer risk and estrogen’s role in staving off dementia also persisted, even after they’d been overturned by more definitive studies. Once an idea becomes fixed, it’s difficult to remove from the conventional wisdom.

Sometimes scientific ideas persist beyond the evidence because the stories we tell about them feel true and confirm what we already believe. It’s natural to think about possible explanations for scientific results — this is how we put them in context and ascertain how plausible they are. The problem comes when we fall so in love with these explanations that we reject the evidence refuting them.

The media is often accused of hyping studies, but scientists are prone to overstating their results too.

Take, for instance, the breakfast study. Published in 2013, it examined whether breakfast eaters weigh less than those who skip the morning meal and if breakfast could protect against obesity. Obesity researcher Andrew Brown and his colleagues found that despite more than 90 mentions of this hypothesis in published media and journals, the evidence for breakfast’s effect on body weight was tenuous and circumstantial. Yet researchers in the field seemed blind to these shortcomings, overstating the evidence and using causative language to describe associations between breakfast and obesity. The human brain is primed to find causality even where it doesn’t exist, and scientists are not immune.

As a society, our stories about how science works are also prone to error. The standard way of thinking about the scientific method is: ask a question, do a study, get an answer. But this notion is vastly oversimplified. A more common path to truth looks like this: ask a question, do a study, get a partial or ambiguous answer, then do another study, and then do another to keep testing potential hypotheses and homing in on a more complete answer. Human fallibilities send the scientific process hurtling in fits, starts and misdirections instead of in a straight line from question to truth.

Media accounts of science tend to gloss over the nuance, and it’s easy to understand why. For one thing, reporters and editors who cover science don’t always have training on how to interpret studies. And headlines that read “weak, unreplicated study finds tenuous link between certain vegetables and cancer risk” don’t fly off the newsstands or bring in the clicks as fast as ones that scream “foods that fight cancer!”

People often joke about the herky-jerky nature of science and health headlines in the media — coffee is good for you one day, bad the next — but that back and forth embodies exactly what the scientific process is all about. It’s hard to measure the impact of diet on health, Nosek told me. “That variation [in results] occurs because science is hard.” Isolating how coffee affects health requires lots of studies and lots of evidence, and only over time and in the course of many, many studies does the evidence start to narrow to a conclusion that’s defensible. “The variation in findings should not be seen as a threat,” Nosek said. “It means that scientists are working on a hard problem.”

The scientific method is the most rigorous path to knowledge, but it’s also messy and tough. Science deserves respect exactly because it is difficult — not because it gets everything correct on the first try. The uncertainty inherent in science doesn’t mean that we can’t use it to make important policies or decisions. It just means that we should remain cautious and adopt a mindset that’s open to changing course if new data arises. We should make the best decisions we can with the current evidence and take care not to lose sight of its strength and degree of certainty. It’s no accident that every good paper includes the phrase “more study is needed” — there is always more to learn….(More)”

Can big databases be kept both anonymous and useful?


The Economist: “….The anonymisation of a data record typically means the removal from it of personally identifiable information. Names, obviously. But also phone numbers, addresses and various intimate details like dates of birth. Such a record is then deemed safe for release to researchers, and even to the public, to make of it what they will. Many people volunteer information, for example to medical trials, on the understanding that this will happen.

But the ability to compare databases threatens to make a mockery of such protections. Participants in genomics projects, promised anonymity in exchange for their DNA, have been identified by simple comparison with electoral rolls and other publicly available information. The health records of a governor of Massachusetts were plucked from a database, again supposedly anonymous, of state-employee hospital visits using the same trick. Reporters sifting through a public database of web searches were able to correlate them in order to track down one, rather embarrassed, woman who had been idly searching for single men. And so on.

Each of these headline-generating stories creates a demand for more controls. But that, in turn, deals a blow to the idea of open data—that the electronic “data exhaust” people exhale more or less every time they do anything in the modern world is actually useful stuff which, were it freely available for analysis, might make that world a better place.

Of cake, and eating it

Modern cars, for example, record in their computers much about how, when and where the vehicle has been used. Comparing the records of many vehicles, says Viktor Mayer-Schönberger of the Oxford Internet Institute, could provide a solid basis for, say, spotting dangerous stretches of road. Similarly, an opening of health records, particularly in a country like Britain, which has a national health service, and cross-fertilising them with other personal data, might help reveal the multifarious causes of diseases like Alzheimer’s.

This is a true dilemma. People want both perfect privacy and all the benefits of openness. But they cannot have both. The stripping of a few details as the only means of assuring anonymity, in a world choked with data exhaust, cannot work. Poorly anonymised data are only part of the problem. What may be worse is that there is no standard for anonymisation. Every American state, for example, has its own prescription for what constitutes an adequate standard.

Worse still, devising a comprehensive standard may be impossible. Paul Ohm of Georgetown University, in Washington, DC, thinks that this is partly because the availability of new data constantly shifts the goalposts. “If we could pick an industry standard today, it would be obsolete in short order,” he says. Some data, such as those about medical conditions, are more sensitive than others. Some data sets provide great precision in time or place, others merely a year or a postcode. Each set presents its own dangers and requirements.

Fortunately, there are a few easy fixes. Thanks in part to the headlines, many now agree that public release of anonymised data is a bad move. Data could instead be released piecemeal, or kept in-house and accessible by researchers through a question-and-answer mechanism. Or some users could be granted access to raw data, but only in strictly controlled conditions.

All these approaches, though, are anathema to the open-data movement, because they limit the scope of studies. “If we’re making it so hard to share that only a few have access,” says Tim Althoff, a data scientist at Stanford University, “that has profound implications for science, for people being able to replicate and advance your work.”

Purely legal approaches might mitigate that. Data might come with what have been called “downstream contractual obligations”, outlining what can be done with a given data set and holding any onward recipients to the same standards. One perhaps draconian idea, suggested by Daniel Barth-Jones, an epidemiologist at Columbia University, in New York, is to make it illegal even to attempt re-identification….(More).”

Content Volatility of Scientific Topics in Wikipedia: A Cautionary Tale


Paper by Wilson AM and Likens GE at PLOS: “Wikipedia has quickly become one of the most frequently accessed encyclopedic references, despite the ease with which content can be changed and the potential for ‘edit wars’ surrounding controversial topics. Little is known about how this potential for controversy affects the accuracy and stability of information on scientific topics, especially those with associated political controversy. Here we present an analysis of the Wikipedia edit histories for seven scientific articles and show that topics we consider politically but not scientifically “controversial” (such as evolution and global warming) experience more frequent edits with more words changed per day than pages we consider “noncontroversial” (such as the standard model in physics or heliocentrism). For example, over the period we analyzed, the global warming page was edited on average (geometric mean ±SD) 1.9±2.7 times resulting in 110.9±10.3 words changed per day, while the standard model in physics was only edited 0.2±1.4 times resulting in 9.4±5.0 words changed per day. The high rate of change observed in these pages makes it difficult for experts to monitor accuracy and contribute time-consuming corrections, to the possible detriment of scientific accuracy. As our society turns to Wikipedia as a primary source of scientific information, it is vital we read it critically and with the understanding that the content is dynamic and vulnerable to vandalism and other shenanigans….(More)”

5 Tips for Designing a Data for Good Initiative


Mitul Desai at Mastercard Center for Inclusive Growth: “The transformative impact of data on development projects, captured in the hashtag #DATARevolution, offers the social and private sectors alike a rallying point to enlist data in the service of high-impact development initiatives.

To help organizations design initiatives that are authentic to their identity and capabilities, we’re sharing what’s necessary to navigate the deeply interconnected organizational, technical and ethical aspects of creating a Data for Good initiative.

1) Define the need

At the center of a Data for Good initiative are the individual beneficiaries you are seeking to serve. This is foundation on which the “Good” of Data for Good rests.

Understanding the data and expertise needed to better serve such individuals will bring into focus the areas where your organization can contribute and the partners you might engage. As we’ve covered in past posts, collaboration between agents who bring different layers of expertise to Data for Good projects is a powerful formula for change….

2) Understand what data can make a difference

Think about what kind of data can tell a story that’s relevant to your mission. Claudia Perlich of Dstillery says: “The question is first and foremost, what decision do I have to make and which data can tell me something about that decision.” This great introduction to what different kinds of data are relevant in different settings can give you concrete examples.

3) Get the right tools for the job

By one estimate, some 90% of business-relevant data are unstructured or semi-structured (think texts, tweets, images, audio) as opposed to structured data like numbers that easily fit into the lines of a spreadsheet. Perlich notes that while it’s more challenging to mine this unstructured data, they can yield especially powerful insights with the right tools—which thankfully aren’t that hard to identify…..

4) Build a case that moves your organization

“While our programs are designed to serve organizations no matter what their capacity, we do find that an organization’s clarity around mission and commitment to using data to drive decision-making are two factors that can make or break a project,” says Jake Porway, founder and executive director of DataKind, a New York-based data science nonprofit that helps organizations develop Data for Good initiatives…..

5) Make technology serve people-centric ethics

The two most critical ethical factors to consider are informed consent and privacy—both require engaging the community you wish to serve as individual actors….

“Employ data-privacy walls, mask the data from the point of collection and encrypt the data you store. Ensure that appropriate technical and organizational safeguards are in place to verify that the data can’t be used to identify individuals or target demographics in a way that could harm them,” recommends Quid’s Pedraza. To understand the technology of data encryption and masking, check out this post. (More)”

President Obama Signs Executive Order Making Presidential Innovation Fellows Program Permanent


White House Press Release: “My hope is this continues to encourage a culture of public service among our innovators, and tech entrepreneurs, so that we can keep building a government that’s as modern, as innovative, and as engaging as our incredible tech sector is.  To all the Fellows who’ve served so far – thank you.  I encourage all Americans with bold ideas to apply.  And I can’t wait to see what those future classes will accomplish on behalf of the American people.” –- President Barack Obama

Today, President Obama signed an executive order that makes the Presidential Innovation Fellows Program a permanent part of the Federal government going forward. The program brings executives, entrepreneurs, technologists, and other innovators into government, and teams them up with Federal employees to improve programs that serve more than 150 million Americans.

The Presidential Innovation Fellows Program is built on four key principles:

  • Recruit the best our nation has to offer: Fellows include entrepreneurs, startup founders, and innovators with experience at large technology companies and startups, each of whom leverage their proven skills and technical expertise to create huge value for the public.
  • Partner with innovators inside government: Working as teams, the Presidential Innovation Fellows and their partners across the government create products and services that are responsive, user-friendly, and help to improve the way the Federal government interacts with the American people.
  • Deploy proven private sector strategies: Fellows leverage best practices from the private sector to deliver better, more effective programs and policies across the Federal government.
  • Focus on some of the Nation’s biggest and most pressing challenges: Projects focus on topics such as improving access to education, fueling job creation and the economy, and expanding the public’s ability to access their personal health data.

Additional Details on Today’s Announcements

The Executive Order formally establishes the Presidential Innovation Fellows Program within the General Services Administration (GSA), where it will continue to serve departments and agencies throughout the Executive Branch. The Presidential Innovation Fellow Program will be administered by a Director and guided by a newly-established Advisory Board. The Director will outline steps for the selection, hiring, and deployment of Fellows within government….

Fellows have partnered with leaders at more than 25 government agencies, delivering impressive results in months, not years, driving extraordinary work and innovative solutions in areas such as health care; open data and data science; crowd-sourcing initiatives; education; veterans affairs; jobs and the economy; and disaster response and recovery. Examples of projects include:

Open Data

When government acts as a platform, entrepreneurs, startups, and the private sector can build value-added services and tools on top of federal datasets supported by federal policies. Taking this approach, Fellows and agency stakeholders have supported the creation of new products and services focused on education, health, the environment, and social justice. As a result of their efforts and the agencies they have worked with:….

Jobs and the Economy

Fellows continue to work on solutions that will give the government better access to innovative tools and services. This is also helping small and medium-sized companies create jobs and compete for Federal government contracts….

Digital Government

The Presidential Innovation Fellows Program is a part of the Administration’s strategy to create lasting change across the Federal Government by improving how it uses technology. The Fellows played a part in launching 18F within the General Services Administration (GSA) and the U.S. Digital Services (USDS) team within the Office of Management and Budget….

Supporting Our Veterans

  • …Built a one-stop shop for finding employment opportunities. The Veterans Employment Center was developed by a team of Fellows working with the Department of Veterans Affairs in connection with the First Lady’s Joining Forces Initiative and the Department of Labor. This is the first interagency website connecting Veterans, transitioning Servicemembers, and their spouses to meaningful employment opportunities. The portal has resulted in cost savings of over $27 million to the Department of Veterans Affairs.

Education

  • …More than 1,900 superintendents pledged to more effectively leverage education technology in their schools. Fellows working at the Department of Education helped develop the idea of Future Ready, which later informed the creation of the Future Ready District Pledge. The Future Ready District Pledge is designed to set out a roadmap to achieve successful personalized digital learning for every student and to commit districts to move as quickly as possible towards our shared vision of preparing students for success. Following the President’s announcement of this effort in 2014, more than 1,900 superintendents have signed this pledge, representing 14 million students.

Health and Patient Care

  • More than 150 million Americans are able to access their health records online. Multiple rounds of Fellows have worked with the Department of Health and Human Services (HHS) and the Department of Veterans Affairs (VA) to expand the reach of theBlue Button Initiative. As a result, patients are able to access their electronic health records to make more informed decisions about their own health care. The Blue Button Initiative has received more than 600 commitments from organizations to advance health information access efforts across the country and has expanded into other efforts that support health care system interoperability….

Disaster Response and Recovery

  • Communities are piloting crowdsourcing tools to assess damage after disasters. Fellows developed the GeoQ platform with FEMA and the National Geospatial-Intelligence Agency that crowdsources photos of disaster-affected areas to assess damage over large regions.  This information helps the Federal government better allocate critical response and recovery efforts following a disaster and allows local governments to use geospatial information in their communities…. (More)

The Last Mile: Creating Social and Economic Value from Behavioral Insights


New book by Dilip Soman: “Most organizations spend much of their effort on the start of the value creation process: namely, creating a strategy, developing new products or services, and analyzing the market. They pay a lot less attention to the end: the crucial “last mile” where consumers come to their website, store, or sales representatives and make a choice.

In The Last Mile, Dilip Soman shows how to use insights from behavioral science in order to close that gap. Beginning with an introduction to the last mile problem and the concept of choice architecture, the book takes a deep dive into the psychology of choice, money, and time. It explains how to construct behavioral experiments and understand the data on preferences that they provide. Finally, it provides a range of practical tools with which to overcome common last mile difficulties.

The Last Mile helps lay readers not only to understand behavioral science, but to apply its lessons to their own organizations’ last mile problems, whether they work in business, government, or the nonprofit sector. Appealing to anyone who was fascinated by Dan Ariely’s Predictably Irrational, Richard Thaler and Cass Sunstein’s Nudge, or Daniel Kahneman’s Thinking, Fast and Slow but was not sure how those insights could be practically used, The Last Mile is full of solid, practical advice on how to put the lessons of behavioral science to work….(More)”

Big data algorithms can discriminate, and it’s not clear what to do about it


 at the Conversation“This program had absolutely nothing to do with race…but multi-variable equations.”

That’s what Brett Goldstein, a former policeman for the Chicago Police Department (CPD) and current Urban Science Fellow at the University of Chicago’s School for Public Policy, said about a predictive policing algorithm he deployed at the CPD in 2010. His algorithm tells police where to look for criminals based on where people have been arrested previously. It’s a “heat map” of Chicago, and the CPD claims it helps them allocate resources more effectively.

Chicago police also recently collaborated with Miles Wernick, a professor of electrical engineering at Illinois Institute of Technology, to algorithmically generate a “heat list” of 400 individuals it claims have thehighest chance of committing a violent crime. In response to criticism, Wernick said the algorithm does not use “any racial, neighborhood, or other such information” and that the approach is “unbiased” and “quantitative.” By deferring decisions to poorly understood algorithms, industry professionals effectively shed accountability for any negative effects of their code.

But do these algorithms discriminate, treating low-income and black neighborhoods and their inhabitants unfairly? It’s the kind of question many researchers are starting to ask as more and more industries use algorithms to make decisions. It’s true that an algorithm itself is quantitative – it boils down to a sequence of arithmetic steps for solving a problem. The danger is that these algorithms, which are trained on data produced by people, may reflect the biases in that data, perpetuating structural racism and negative biases about minority groups.

There are a lot of challenges to figuring out whether an algorithm embodies bias. First and foremost, many practitioners and “computer experts” still don’t publicly admit that algorithms can easily discriminate.More and more evidence supports that not only is this possible, but it’s happening already. The law is unclear on the legality of biased algorithms, and even algorithms researchers don’t precisely understand what it means for an algorithm to discriminate….

While researchers clearly understand the theoretical dangers of algorithmic discrimination, it’s difficult to cleanly measure the scope of the issue in practice. No company or public institution is willing to publicize its data and algorithms for fear of being labeled racist or sexist, or maybe worse, having a great algorithm stolen by a competitor.

Even when the Chicago Police Department was hit with a Freedom of Information Act request, they did not release their algorithms or heat list, claiming a credible threat to police officers and the people on the list. This makes it difficult for researchers to identify problems and potentially provide solutions.

Legal hurdles

Existing discrimination law in the United States isn’t helping. At best, it’s unclear on how it applies to algorithms; at worst, it’s a mess. Solon Barocas, a postdoc at Princeton, and Andrew Selbst, a law clerk for the Third Circuit US Court of Appeals, argued together that US hiring law fails to address claims about discriminatory algorithms in hiring.

The crux of the argument is called the “business necessity” defense, in which the employer argues that a practice that has a discriminatory effect is justified by being directly related to job performance….(More)”

Mining Administrative Data to Spur Urban Revitalization


New paper by Ben Green presented at the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: “After decades of urban investment dominated by sprawl and outward growth, municipal governments in the United States are responsible for the upkeep of urban neighborhoods that have not received sufficient resources or maintenance in many years. One of city governments’ biggest challenges is to revitalize decaying neighborhoods given only limited resources. In this paper, we apply data science techniques to administrative data to help the City of Memphis, Tennessee improve distressed neighborhoods. We develop new methods to efficiently identify homes in need of rehabilitation and to predict the impacts of potential investments on neighborhoods. Our analyses allow Memphis to design neighborhood-improvement strategies that generate greater impacts on communities. Since our work uses data that most US cities already collect, our models and methods are highly portable and inexpensive to implement. We also discuss the challenges we encountered while analyzing government data and deploying our tools, and highlight important steps to improve future data-driven efforts in urban policy….(More)”

Push, Pull, and Spill: A Transdisciplinary Case Study in Municipal Open Government


New paper by Jan Whittington et al: “Cities hold considerable information, including details about the daily lives of residents and employees, maps of critical infrastructure, and records of the officials’ internal deliberations. Cities are beginning to realize that this data has economic and other value: If done wisely, the responsible release of city information can also release greater efficiency and innovation in the public and private sector. New services are cropping up that leverage open city data to great effect.

Meanwhile, activist groups and individual residents are placing increasing pressure on state and local government to be more transparent and accountable, even as others sound an alarm over the privacy issues that inevitably attend greater data promiscuity. This takes the form of political pressure to release more information, as well as increased requests for information under the many public records acts across the country.

The result of these forces is that cities are beginning to open their data as never before. It turns out there is surprisingly little research to date into the important and growing area of municipal open data. This article is among the first sustained, cross-disciplinary assessments of an open municipal government system. We are a team of researchers in law, computer science, information science, and urban studies. We have worked hand-in-hand with the City of Seattle, Washington for the better part of a year to understand its current procedures from each disciplinary perspective. Based on this empirical work, we generate a set of recommendations to help the city manage risk latent in opening its data….(More)”

Algorithms and Bias


Q. and A. With Cynthia Dwork in the New York Times: “Algorithms have become one of the most powerful arbiters in our lives. They make decisions about the news we read, the jobs we get, the people we meet, the schools we attend and the ads we see.

Yet there is growing evidence that algorithms and other types of software can discriminate. The people who write them incorporate their biases, and algorithms often learn from human behavior, so they reflect the biases we hold. For instance, research has shown that ad-targeting algorithms have shown ads for high-paying jobs to men but not women, and ads for high-interest loans to people in low-income neighborhoods.

Cynthia Dwork, a computer scientist at Microsoft Research in Silicon Valley, is one of the leading thinkers on these issues. In an Upshot interview, which has been edited, she discussed how algorithms learn to discriminate, who’s responsible when they do, and the trade-offs between fairness and privacy.

Q: Some people have argued that algorithms eliminate discriminationbecause they make decisions based on data, free of human bias. Others say algorithms reflect and perpetuate human biases. What do you think?

A: Algorithms do not automatically eliminate bias. Suppose a university, with admission and rejection records dating back for decades and faced with growing numbers of applicants, decides to use a machine learning algorithm that, using the historical records, identifies candidates who are more likely to be admitted. Historical biases in the training data will be learned by the algorithm, and past discrimination will lead to future discrimination.

Q: Are there examples of that happening?

A: A famous example of a system that has wrestled with bias is the resident matching program that matches graduating medical students with residency programs at hospitals. The matching could be slanted to maximize the happiness of the residency programs, or to maximize the happiness of the medical students. Prior to 1997, the match was mostly about the happiness of the programs.

This changed in 1997 in response to “a crisis of confidence concerning whether the matching algorithm was unreasonably favorable to employers at the expense of applicants, and whether applicants could ‘game the system,’ ” according to a paper by Alvin Roth and Elliott Peranson published in The American Economic Review.

Q: You have studied both privacy and algorithm design, and co-wrote a paper, “Fairness Through Awareness,” that came to some surprising conclusions about discriminatory algorithms and people’s privacy. Could you summarize those?

A: “Fairness Through Awareness” makes the observation that sometimes, in order to be fair, it is important to make use of sensitive information while carrying out the classification task. This may be a little counterintuitive: The instinct might be to hide information that could be the basis of discrimination….

Q: The law protects certain groups from discrimination. Is it possible to teach an algorithm to do the same?

A: This is a relatively new problem area in computer science, and there are grounds for optimism — for example, resources from the Fairness, Accountability and Transparency in Machine Learning workshop, which considers the role that machines play in consequential decisions in areas like employment, health care and policing. This is an exciting and valuable area for research. …(More)”