Data at the Speed of Life


Marc Gunther at The Chronicle of Philanthropy: “Can pregnant women in Zambia be persuaded to deliver their babies in hospitals or clinics rather than at home? How much are villagers in Cambodia willing to pay for a simple latrine? What qualities predict success for a small-scale entrepreneur who advises farmers?

Governments, foundations, and nonprofits that want to help the world’s poor regularly face questions like these. Answers are elusive. While an estimated $135 billion in government aid and another $15 billion in charitable giving flow annually to developing countries, surprisingly few projects benefit from rigorous evaluations. Those that do get scrutinized in academic studies often don’t see the results for years, long after the projects have ended.

IDinsight puts data-driven research on speed. Its goal is to produce useful, low-cost research results fast enough that nonprofits can use them to make midcourse corrections to their programs….

IDinsight calls this kind of research “decision-focused evaluation,” which sets it apart from traditional monitoring and evaluation (M&E) and academic research. M&E, experts say, is mostly about accountability and outputs — how many training sessions were held, how much food was distributed, and so on. Usually, it occurs after a program is complete. Academic studies are typically shaped by researchers’ desire to break new ground and publish on topics of broad interest. The IDinsight approach aims instead “for contemporaneous decision-making rather than for publication in the American Economic Review,” says Ruth Levine, who directs the global development program at the William and Flora Hewlett Foundation.

A decade ago, Ms. Levine and William Savedoff, a senior fellow at the Center for Global Development, wrote an influential paper entitled “When Will We Ever Learn? Improving Lives Through Impact Evaluation.” They lamented that an “absence of evidence” for the effectiveness of global development programs “not only wastes money but denies poor people crucial support to improve their lives.”

Since then, impact evaluation has come a “huge distance,” Ms. Levine says….

Actually, others are. Innovations for Poverty Action recently created the Goldilocks Initiative to do what it calls “right fit” evaluations leading to better policy and programs, according to Thoai Ngo, who leads the effort. Its first clients include GiveDirectly, which facilitates cash transfers to the extreme poor, and Splash, a water charity….

All this focus on data has generated pushback. Many nonprofits don’t have the resources to do rigorous research, according to Debra Allcock Tyler, chief executive at Directory of Social Change, a British charity that provides training, data, and other resources for social enterprises.

“A great deal of the time, data is pointless,” Allcock Tyler said last year at a London seminar on data and nonprofits. “Very often it is dangerous and can be used against us, and sometimes it takes away precious resources from other things that we might more usefully do.”

A bigger problem may be that the accumulation of knowledge does not necessarily lead to better policies or practices.

“People often trust their experience more than a systematic review,” says Ms. Levine of the Hewlett Foundation. IDinsight’s Esther Wang agrees. “A lot of our frustration is looking at the development world and asking why are we not accountable for the money that we are spending,” she says. “That’s a waste that none of us really feels is justifiable.”…(More)”

Bridging data gaps for policymaking: crowdsourcing and big data for development


 for the DevPolicyBlog: “…By far the biggest innovation in data collection is the ability to access and analyse (in a meaningful way) user-generated data. This is data that is generated from forums, blogs, and social networking sites, where users purposefully contribute information and content in a public way, but also from everyday activities that inadvertently or passively provide data to those that are able to collect it.

User-generated data can help identify user views and behaviour to inform policy in a timely way rather than just relying on traditional data collection techniques (census, household surveys, stakeholder forums, focus groups, etc.), which are often cumbersome, very costly, untimely, and in many cases require some form of approval or support by government.

It might seem at first that user-generated data has limited usefulness in a development context due to the importance of the internet in generating this data combined with limited internet availability in many places. However, U-Report is one example of being able to access user-generated data independent of the internet.

U-Report was initiated by UNICEF Uganda in 2011 and is a free SMS-based platform where Ugandans are able to register as “U-Reporters” and on a weekly basis give their views on topical issues (mostly related to health, education, and access to social services) or participate in opinion polls. As an example, Figure 1 shows the result from a U-Report poll on whether polio vaccinators came to U-Reporter houses to immunise all children under 5 in Uganda, broken down by districts. Presently, there are more than 300,000 U-Reporters in Uganda and more than one million U-Reporters across 24 countries that now have U-Report. As an indication of its potential impact on policymaking, UNICEF claims that every Member of Parliament in Uganda is signed up to receive U-Report statistics.

Figure 1: U-Report Uganda poll results


U-Report and other platforms such as Ushahidi (which supports, for example, I PAID A BRIBE, Watertracker, election monitoring, and crowdmapping) facilitate crowdsourcing of data where users contribute data for a specific purpose. In contrast, “big data” is a broader concept because the purpose of using the data is generally independent of the reasons why the data was generated in the first place.

Big data for development is a new phrase that we will probably hear a lot more (see here [pdf] and here). The United Nations Global Pulse, for example, supports a number of innovation labs which work on projects that aim to discover new ways in which data can help better decision-making. Many forms of “big data” are unstructured (free-form and text-based rather than table- or spreadsheet-based) and so a number of analytical techniques are required to make sense of the data before it can be used.

Measures of Twitter activity, for example, can be a real-time indicator of food price crises in Indonesia [pdf] (see Figure 2 below which shows the relationship between food-related tweet volume and food inflation: note that the large volume of tweets in the grey highlighted area is associated with policy debate on cutting the fuel subsidy rate) or provide a better understanding of the drivers of immunisation awareness. In these examples, researchers “text-mine” Twitter feeds by extracting tweets related to topics of interest and categorising text based on measures of sentiment (positive, negative, anger, joy, confusion, etc.) to better understand opinions and how they relate to the topic of interest. For example, Figure 3 shows the sentiment of tweets related to vaccination in Kenya over time and the dates of important vaccination-related events.
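The text-mining step described above (filter tweets by topic, then bucket them by sentiment) can be sketched roughly as follows. This is only a minimal illustration, not the method used in the studies cited: the keyword set, the sentiment word lists, the function names, and the sample tweets are all hypothetical, and real projects would typically use much larger lexicons or trained classifiers.

```python
import re
from collections import Counter

# Hypothetical topic keywords and sentiment lexicon, for illustration only.
TOPIC_KEYWORDS = {"vaccine", "vaccination", "immunisation", "polio"}
SENTIMENT_LEXICON = {
    "positive": {"safe", "protect", "grateful", "good"},
    "negative": {"afraid", "dangerous", "refuse", "bad"},
    "confusion": {"unsure", "confused", "why"},
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_on_topic(tokens):
    """A tweet counts as on-topic if it contains any topic keyword."""
    return any(token in TOPIC_KEYWORDS for token in tokens)


def classify(tokens):
    """Return the sentiment label with the most lexicon hits, or 'neutral'."""
    counts = Counter()
    for label, words in SENTIMENT_LEXICON.items():
        counts[label] = sum(1 for token in tokens if token in words)
    label, hits = counts.most_common(1)[0]
    return label if hits > 0 else "neutral"


def summarise(tweets):
    """Count on-topic tweets per sentiment label; off-topic tweets are skipped."""
    tally = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        if is_on_topic(tokens):
            tally[classify(tokens)] += 1
    return dict(tally)


# Made-up sample tweets, not real data.
sample = [
    "The polio vaccine is safe and will protect my children",
    "I am afraid the vaccination is dangerous",
    "Why is there another immunisation campaign? I am confused",
    "Traffic in Nairobi is bad today",
]
print(summarise(sample))
```

Grouping the same tallies by date would give the kind of sentiment-over-time series that can be set against the dates of known events.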

Figure 2: Plot of monthly food-related tweet volume and official food price statistics


Figure 3: Sentiment of vaccine-related tweets in Kenya


Another big data example is the use of mobile phone usage data to monitor the movement of populations in Senegal in 2013. The data can help to identify changes in the mobility patterns of vulnerable population groups and thereby provide an early warning system to inform humanitarian response efforts.

The development of mobile banking too offers the potential for the generation of a staggering amount of data relevant for development research and informing policy decisions. However, it also highlights the public good nature of data collected by public and private sector institutions and the reliance that researchers have on them to access the data. Building trust and a reputation for being able to manage privacy and commercial issues will be a major challenge for researchers in this regard….(More)”

Due Diligence? We need an app for that


Ken Banks at kiwanja.net: “The ubiquity of mobile phones, the reach of the Internet, the sheer number of problems facing the planet, competitions and challenges galore, pots of money and strong media interest in tech-for-good projects have today created the perfect storm. Not a day goes by without the release of an app hoping to solve something, and the fact so many people are building so many apps to fix so many problems can only be a good thing. Right?

The only problem is this. It’s become impossible to tell good from bad, even real from fake. It’s something of a Wild West out there. So it was no surprise to see this happening recently. Quoting The Guardian:

An app which purported to offer aid to refugees lost in the Mediterranean has been pulled from Apple’s App Store after it was revealed as a fake. The I Sea app, which also won a Bronze medal at the Cannes Lions conference on Monday night, presented itself as a tool to help report refugees lost at sea, using real-time satellite footage to identify boats in trouble and highlighting their location to the Malta-based Migrant Offshore Aid Station (Moas), which would provide help.

In fact, the app did nothing of the sort. Rather than presenting real-time satellite footage – a difficult and expensive task – it instead simply shows a portion of a static, unchanging image. And while it claims to show the weather in the southern Mediterranean, that too isn’t that accurate: it’s for Western Libya.

The worry isn’t only that someone would decide to build a fake app which ‘tackles’ such an emotive subject, but the fact that this particular app won an award and received favourable press. Wired, Mashable, the Evening Standard and Reuters all spoke positively about it. Did no-one check that it did what it said it did?

This whole episode reminds me of something Joel Selanikio wrote in his contributing chapter to two books I’ve recently edited and published. In his chapters, which touch on his work on the Magpi data collection tool in addition to some of the challenges facing the tech-for-development community, Joel wrote:

In going over our user activity logs for the online Magpi app, I quickly realised that no-one from any of our funding organisations was listed. Apparently no-one who was paying us had ever seen our working software! This didn’t seem to make sense. Who would pay for software without ever looking at it? And if our funders hadn’t seen the software, what information were they using when they decided whether to fund us each year?

…The sheer number of apps available that claim to solve all manner of problems may seem encouraging on the surface – 1,500 (and counting) to help refugees might be a case in point – but how many are useful? How many are being used? How many solve a problem? And how many are real?

Due diligence? Maybe it’s time we had an app for that…(More)”

Directory of crowdsourcing websites


Directory by Donelle McKinley: “…Here is just a selection of websites for crowdsourcing cultural heritage. Websites are actively crowdsourcing unless indicated with an asterisk…The directory is organized by the type of crowdsourcing process involved, using the typology for crowdsourcing in the humanities developed by Dunn & Hedges (2012). In their study they explain that, “a process is a sequence of tasks, through which an output is produced by operating on an asset”. For example, the Your Paintings Tagger website is for the process of tagging, which is an editorial task. The assets being tagged are images, and the output of the project is metadata, which makes the images easier to discover, retrieve and curate.

Transcription

Alexander Research Library, Wanganui Library* (NZ) Transcription of index cards from 1840 to 2002.

Ancient Lives*, University of Oxford (UK) Transcription of papyri from Greco-Roman Egypt.

AnnoTate, Tate Britain (UK) Transcription of artists’ diaries, letters and sketchbooks.

Decoding the Civil War, The Huntington Library, Abraham Lincoln Presidential Library and Museum & North Carolina State University (USA). Transcription and decoding of Civil War telegrams from the Thomas T. Eckert Papers.

DIY History, University of Iowa Libraries (USA) Transcription of historical documents.

Emigrant City, New York Public Library (USA) Transcription of handwritten mortgage and bond ledgers from the Emigrant Savings Bank records.

Field Notes of Laurence M. Klauber, San Diego Natural History Museum (USA) Transcription of field notes by the celebrated herpetologist.

Notes from Nature Transcription of natural history museum records.

Measuring the ANZACs, Archives New Zealand and Auckland War Memorial Museum (NZ). Transcription of first-hand accounts of NZ soldiers in WW1.

Old Weather (UK) Transcription of Royal Navy ships’ logs from the early twentieth century.

Scattered Seeds, Heritage Collections, Dunedin Public Libraries (NZ) Transcription of index cards for Dunedin newspapers 1851-1993.

Shakespeare’s World, Folger Shakespeare Library (USA) & Oxford University Press (UK). Transcription of handwritten documents by Shakespeare’s contemporaries. Identification of words that have yet to be recorded in the authoritative Oxford English Dictionary.

Smithsonian Digital Volunteers Transcription Center (USA) Transcription of multiple collections.

Transcribe Bentham, University College London (UK) Transcription of historical manuscripts by philosopher and reformer Jeremy Bentham.

What’s on the menu? New York Public Library (USA) Transcription of historical restaurant menus. …

(Full Directory).

Transforming governance: how can technology help reshape democracy?


Research Briefing by Matt Leighninger: “Around the world, people are asking how we can make democracy work in new and better ways. We are frustrated by political systems in which voting is the only legitimate political act, concerned that many republics don’t have the strength or appeal to withstand authoritarian figures, and disillusioned by the inability of many countries to address the fundamental challenges of health, education and economic development.

We can no longer assume that the countries of the global North have ‘advanced’ democracies, and that the nations of the global South simply need to catch up. Citizens of these older democracies have increasingly lost faith in their political institutions; Northerners cherish their human rights and free elections, but are clearly looking for something more. Meanwhile, in the global South, new regimes based on a similar formula of rights and elections have proven fragile and difficult to sustain. And in Brazil, India and other Southern countries, participatory budgeting and other valuable democratic innovations have emerged. The stage is set for a more equitable, global conversation about what we mean by democracy.

How can we adjust our democratic formulas so that they are more sustainable, powerful, fulfilling – and, well, democratic? Some of the parts of this equation may come from the development of online tools and platforms that help people to engage with their governments, with organisations and institutions, and with each other. Often referred to collectively as ‘civic technology’ or ‘civic tech’, these tools can help us map public problems, help citizens generate solutions, gather input for government, coordinate volunteer efforts, and help neighbours remain connected. If we want to create democracies in which citizens have meaningful roles in shaping public decisions and solving public problems, we should be asking a number of questions about civic tech, including:

  • How can online tools best support new forms of democracy?
  • What are the examples of how this has happened?
  • What are some variables to consider in comparing these examples?
  • How can we learn from each other as we move forward?

This background note has been developed to help democratic innovators explore these questions and examine how their work can provide answers….(More)”

Soon Your City Will Know Everything About You


Currently, the biggest users of these sensor arrays are in cities, where city governments use them to collect large amounts of policy-relevant data. In Los Angeles, the crowdsourced traffic and navigation app Waze collects data that helps residents navigate the city’s choked highway networks. In Chicago, an ambitious program makes public data available to startups eager to build apps for residents. The city’s 49th ward has been experimenting with participatory budgeting and online voting to take the pulse of the community on policy issues. Chicago has also been developing the “Array of Things,” a network of sensors that track, among other things, the urban conditions that affect bronchitis.

Edmonton uses the cloud to track the condition of playground equipment. And a growing number of countries have purpose-built smart cities, like South Korea’s high tech utopia city of Songdo, where pervasive sensor networks and ubiquitous computing generate immense amounts of civic data for public services.

The drive for smart cities isn’t restricted to the developed world. Rio de Janeiro coordinates the information flows of 30 different city agencies. In Beijing and Da Nang (Vietnam), mobile phone data is actively tracked in the name of real-time traffic management. Urban sensor networks, in other words, are also developing in countries with few legal protections governing the usage of data.

These services are promising and useful. But you don’t have to look far to see why the Internet of Things has serious privacy implications. Public data is used for “predictive policing” in at least 75 cities across the U.S., including New York City, where critics maintain that using social media or traffic data to help officers evaluate probable cause is a form of digital stop-and-frisk. In Los Angeles, the security firm Palantir scoops up publicly generated data on car movements, merges it with license plate information collected by the city’s traffic cameras, and sells analytics back to the city so that police officers can decide whether or not to search a car. In Chicago, concern is growing about discriminatory profiling because so much information is collected and managed by the police department — an agency with a poor reputation for handling data in consistent and sensitive ways. In 2015, video surveillance of the police shooting Laquan McDonald outside a Burger King was erased by a police employee who ironically did not know his activities were being digitally recorded by cameras inside the restaurant.

Since most national governments have bungled privacy policy, cities — which have a reputation for being better with administrative innovations — will need to fill this gap. A few countries, such as Canada and the U.K., have independent “privacy commissioners” who are responsible for advocating for the public when bureaucracies must decide how to use or give out data. It is pretty clear that cities need such advocates too.

What would Urban Privacy Commissioners do? They would teach the public — and other government staff — about how policy algorithms work. They would evaluate the political context in which city agencies make big data investments. They would help a city negotiate contracts that protect residents’ privacy while providing effective analysis to policy makers and ensuring that open data is consistently serving the public good….(more)”.

Improving patient care by bridging the divide between doctors and data scientists


 at the Conversation: “While wonderful new medical discoveries and innovations are in the news every day, doctors struggle daily with using information and techniques available right now while carefully adopting new concepts and treatments. As a practicing doctor, I deal with uncertainties and unanswered clinical questions all the time….At the moment, a report from the National Academy of Medicine tells us, most doctors base most of their everyday decisions on guidelines from (sometimes biased) expert opinions or small clinical trials. It would be better if they were from multicenter, large, randomized controlled studies, with tightly controlled conditions ensuring the results are as reliable as possible. However, those are expensive and difficult to perform, and even then often exclude a number of important patient groups on the basis of age, disease and sociological factors.

Part of the problem is that health records are traditionally kept on paper, making them hard to analyze en masse. As a result, most of what medical professionals might have learned from experiences was lost – or at least was inaccessible to another doctor meeting with a similar patient.

A digital system would collect and store as much clinical data as possible from as many patients as possible. It could then use information from the past – such as blood pressure, blood sugar levels, heart rate and other measurements of patients’ body functions – to guide future doctors to the best diagnosis and treatment of similar patients.

Industrial giants such as Google, IBM, SAP and Hewlett-Packard have also recognized the potential for this kind of approach, and are now working on how to leverage population data for the precise medical care of individuals.

Collaborating on data and medicine

At the Laboratory of Computational Physiology at the Massachusetts Institute of Technology, we have begun to collect large amounts of detailed patient data in the Medical Information Mart in Intensive Care (MIMIC). It is a database containing information from 60,000 patient admissions to the intensive care units of the Beth Israel Deaconess Medical Center, a Boston teaching hospital affiliated with Harvard Medical School. The data in MIMIC has been meticulously scoured so individual patients cannot be recognized, and is freely shared online with the research community.

But the database itself is not enough. We bring together front-line clinicians (such as nurses, pharmacists and doctors) to identify questions they want to investigate, and data scientists to conduct the appropriate analyses of the MIMIC records. This gives caregivers and patients the best individualized treatment options in the absence of a randomized controlled trial.

Bringing data analysis to the world

At the same time we are working to bring these data-enabled systems to assist with medical decisions to countries with limited health care resources, where research is considered an expensive luxury. Often these countries have few or no medical records – even on paper – to analyze. We can help them collect health data digitally, creating the potential to significantly improve medical care for their populations.

This task is the focus of Sana, a collection of technical, medical and community experts from across the globe that is also based in our group at MIT. Sana has designed a digital health information system specifically for use by health providers and patients in rural and underserved areas.

At its core is an open-source system that uses cellphones – common even in poor and rural nations – to collect, transmit and store all sorts of medical data. It can handle not only basic patient data such as height and weight, but also photos and X-rays, ultrasound videos, and electrical signals from a patient’s brain (EEG) and heart (ECG).

Partnering with universities and health organizations, Sana organizes training sessions (which we call “bootcamps”) and collaborative workshops (called “hackathons”) to connect nurses, doctors and community health workers at the front lines of care with technology experts in or near their communities. In 2015, we held bootcamps and hackathons in Colombia, Uganda, Greece and Mexico. The bootcamps teach students in technical fields like computer science and engineering how to design and develop health apps that can run on cellphones. Immediately following the bootcamp, the medical providers join the group and the hackathon begins…At the end of the day, though, the purpose is not the apps….(More)

How to implement “open innovation” in city government


Victor Mulas at the Worldbank: “City officials are facing increasingly complex challenges. As urbanization rates grow, cities face higher demand for services from a larger and more densely distributed population. On the other hand, rapid changes in the global economy are affecting cities that struggle to adapt to these changes, often resulting in economic depression and population drain.

“Open innovation” is the latest buzz word circulating in forums on how to address the increased volume and complexity of challenges for cities and governments in general.

But, what is open innovation?

Traditionally, public services were designed and implemented by a group of public officials. Open innovation allows us to design these services with multiple actors, including those who stand to benefit from the services, resulting in more targeted and better tailored services, often implemented through partnership with these stakeholders. Open innovation allows cities to be more productive in providing services while addressing increased demand and higher complexity of services to be delivered.

New York, Barcelona, Amsterdam and many other cities have been experimenting with this concept, introducing challenges for entrepreneurs to address common problems or inviting stakeholders to co-create new services. Open innovation has gone from being a “buzzword” to another tool in the city officials’ toolbox.

However, even cities that embrace open innovation are still struggling to implement it beyond a few specific areas. This is understandable, as introducing open innovation in practice requires a new way of doing things for city governments, which tend to be complex and bureaucratic organizations.

Counting on an engaged mayor is not enough to bring about this kind of transformation. Changing the behavior of city officials requires their buy-in; it can’t be done top-down.

We have been introducing open innovation to cities and governments for the last three years in Chile, Colombia, Egypt and Mozambique. We have addressed specific challenges and iteratively designed and tested a systematic methodology to introduce open innovation in government through both top-down and bottom-up approaches. We have tested this methodology in Colombia (Cali, Barranquilla and Manizales) and Chile (metropolitan area of Gran Concepción). We have identified “internal champions” (i.e., government officials who advocate the new methodology), and external stakeholders organized in an “innovation hub” that provides long-term sustainability and scalability of interventions. We believe that this methodology is easily applicable beyond cities to other government entities at the regional and national levels. …To understand how the methodology works in practice, we describe in this report the process and its results in its application in the city area of Gran Concepción, in Chile. For this activity, the urban transport sector was selected and the targets of intervention were the regional and municipal government departments in charge of urban transport in the area of Gran Concepción. The activity in Chile resulted in a threefold impact:

  1. It catalyzed the adoption of the bottom-up smart city model following this new methodology throughout Chile; and
  2. It expanded the implementation and mainstreaming of the methodologies developed and tested through this activity in other World Bank projects.

More information about this activity in Chile can be found on the Smart City Gran Concepcion webpage…(More)”

Building Data Responsibility into Humanitarian Action


Stefaan Verhulst at The GovLab: “Next Monday, May 23rd, governments, non-profit organizations and citizen groups will gather in Istanbul at the first World Humanitarian Summit. A range of important issues will be on the agenda, not least of which is the refugee crisis confronting the Middle East and Europe. Also on the agenda will be an issue of growing importance and relevance, even if it does not generate front-page headlines: the increasing potential (and use) of data in the humanitarian context.

To explore this topic, a new paper, “Building Data Responsibility into Humanitarian Action,” is being released today, and will be presented tomorrow at the Understanding Risk Forum. This paper is the result of a collaboration between the United Nations Office for the Coordination of Humanitarian Affairs (OCHA), The GovLab (NYU Tandon School of Engineering), the Harvard Humanitarian Initiative, and Leiden University Centre for Innovation. It seeks to identify the potential benefits and risks of using data in the humanitarian context, and begins to outline an initial framework for the responsible use of data in humanitarian settings.

Both anecdotal and more rigorously researched evidence points to the growing use of data to address a variety of humanitarian crises. The paper discusses a number of data risk case studies, including the use of call data to fight malaria in Africa; satellite imagery to identify security threats on the border between Sudan and South Sudan; and transaction data to increase the efficiency of food delivery in Lebanon. These early examples (along with a few others discussed in the paper) have begun to show the opportunities offered by data and information. More importantly, they also help us better understand the risks, including and especially those posed to privacy and security.

One of the broader goals of the paper is to integrate the specific and the theoretical, in the process building a bridge between the deep, contextual knowledge offered by initiatives like those discussed above and the broader needs of the humanitarian community. To that end, the paper builds on its discussion of case studies to begin establishing a framework for the responsible use of data in humanitarian contexts. It identifies four “Minimum Humanitarian standards for the Responsible use of Data” and four “Characteristics of Humanitarian Organizations that use Data Responsibly.” Together, these eight attributes can serve as a roadmap or blueprint for humanitarian groups seeking to use data. In addition, the paper also provides a four-step practical guide for a data responsibility framework (see also earlier blog)….(More)” Full Paper: Building Data Responsibility into Humanitarian Action

Big data’s ‘streetlight effect’: where and how we look affects what we see


 at the Conversation: “Big data offers us a window on the world. But large and easily available datasets may not show us the world we live in. For instance, epidemiological models of the recent Ebola epidemic in West Africa using big data consistently overestimated the risk of the disease’s spread and underestimated the local initiatives that played a critical role in controlling the outbreak.

Researchers are rightly excited about the possibilities offered by the availability of enormous amounts of computerized data. But there’s reason to stand back for a minute to consider what exactly this treasure trove of information really offers. Ethnographers like me use a cross-cultural approach when we collect our data because family, marriage and household mean different things in different contexts. This approach informs how I think about big data.

We’ve all heard the joke about the drunk who is asked why he is searching for his lost wallet under the streetlight, rather than where he thinks he dropped it. “Because the light is better here,” he said.

This “streetlight effect” is the tendency of researchers to study what is easy to study. I use this story in my course on Research Design and Ethnographic Methods to explain why so much research on disparities in educational outcomes is done in classrooms and not in students’ homes. Children are much easier to study at school than in their homes, even though many studies show that knowing what happens outside the classroom is important. Nevertheless, schools will continue to be the focus of most research because they generate big data and homes don’t.

The streetlight effect is one factor that prevents big data studies from being useful in the real world – especially studies analyzing easily available user-generated data from the Internet. Researchers assume that this data offers a window into reality. It doesn’t necessarily.

Looking at WEIRDOs

Based on the number of tweets following Hurricane Sandy, for example, it might seem as if the storm hit Manhattan the hardest, not the New Jersey shore. Another example: the since-retired Google Flu Trends, which in 2013 tracked online searches relating to flu symptoms to predict doctor visits, but gave estimates twice as high as reports from the Centers for Disease Control and Prevention. Without checking facts on the ground, researchers may fool themselves into thinking that their big data models accurately represent the world they aim to study.

The problem is similar to the “WEIRD” issue in many research studies. Harvard professor Joseph Henrich and colleagues have shown that findings based on research conducted with undergraduates at American universities – whom they describe as “some of the most psychologically unusual people on Earth” – apply only to that population and cannot be used to make any claims about other human populations, including other Americans. Unlike the typical research subject in psychology studies, they argue, most people in the world are not from Western, Educated, Industrialized, Rich and Democratic societies, i.e., WEIRD.

Twitter users are also atypical compared with the rest of humanity, giving rise to what our postdoctoral researcher Sarah Laborde has dubbed the “WEIRDO” problem of data analytics: most people are not Western, Educated, Industrialized, Rich, Democratic and Online.

Context is critical

Understanding the differences between the vast majority of humanity and that small subset of people whose activities are captured in big data sets is critical to correct analysis of the data. Considering the context and meaning of data – not just the data itself – is a key feature of ethnographic research, argues Michael Agar, who has written extensively about how ethnographers come to understand the world….(More: https://theconversation.com/big-datas-streetlight-effect-where-and-how-we-look-affects-what-we-see-58122)”