‘Big data’ was supposed to fix education. It didn’t. It’s time for ‘small data’


Pasi Sahlberg and Jonathan Hasak in the Washington Post: “One thing that distinguishes schools in the United States from schools around the world is how data walls, which typically reflect standardized test results, decorate hallways and teacher lounges. Green, yellow, and red colors indicate levels of performance of students and classrooms. For serious reformers, this is the type of transparency that reveals more data about schools and is seen as part of the solution to how to conduct effective school improvement. These data sets, however, often don’t spark insight about teaching and learning in classrooms; they are based on analytics and statistics, not on emotions and relationships that drive learning in schools. They also report outputs and outcomes, not the impacts of learning on the lives and minds of learners….

If you are a leader of any modern education system, you probably care a lot about collecting, analyzing, storing, and communicating massive amounts of information about your schools, teachers, and students based on these data sets. This information is “big data,” a term that first appeared around 2000 and refers to data sets so large and complex that conventional data processing applications cannot handle them. Two decades ago, the data that education management systems processed were input factors of the education system, such as student enrollments, teacher characteristics, or education expenditures, handled by an education department’s statistical officers. Today, however, big data covers a range of indicators about teaching and learning processes, and increasingly reports on student achievement trends over time.

With the outpouring of data, international organizations continue to build regional and global data banks. Whether it’s the United Nations, the World Bank, the European Commission, or the Organization for Economic Cooperation and Development, today’s international reformers are collecting and handling more data about human development than before. Beyond government agencies, there are global education and consulting enterprises like Pearson and McKinsey that see business opportunities in big data markets.

Among the best known today is the OECD’s Program for International Student Assessment (PISA), which measures reading, mathematical, and scientific literacy of 15-year-olds around the world. OECD now also administers an Education GPS, or a global positioning system, that aims to tell policymakers where their education systems place in a global grid and how to move to desired destinations. OECD has clearly become a world leader in the big data movement in education.

Despite all this new information and the benefits that come with it, there are clear handicaps in how big data has been used in education reform. In fact, pundits and policymakers often forget that big data, at best, only reveals correlations between variables in education, not causality. As any introductory statistics course will tell you, correlation does not imply causation….

We believe it is becoming evident that big data alone won’t be able to fix education systems. Decision-makers need to gain a better understanding of what good teaching is and how it leads to better learning in schools. This is where information about the details, relationships and narratives in schools becomes important. These are what Martin Lindstrom calls “small data”: small clues that uncover huge trends. In education, these small clues are often hidden in the invisible fabric of schools. Understanding this fabric must become a priority for improving education.
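To make the correlation-versus-causation point concrete, here is a small simulation; the variables and numbers are purely illustrative. Two school measures can correlate strongly simply because both track a third factor (here, a stand-in for neighborhood income), with neither causing the other:

```python
import random

random.seed(0)

# Hypothetical illustration: "funding" and "scores" both track a third
# factor (income), so they correlate strongly with no direct causal link.
n = 1000
income = [random.gauss(0, 1) for _ in range(n)]
funding = [x + random.gauss(0, 0.5) for x in income]  # driven by income
scores = [x + random.gauss(0, 0.5) for x in income]   # also driven by income

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Strong correlation, zero direct causation between funding and scores.
print(round(correlation(funding, scores), 2))
```

A data wall built from these two columns would show a tight relationship; intervening on one would nonetheless do nothing to the other.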

To be sure, there is no one right way to gather small data in education. Perhaps the most important next step is to realize the limitations of current big-data-driven policies and practices. Too strong a reliance on externally collected data can mislead policymaking. Here are examples of what small data looks like in practice:

  • It reduces census-based national student assessments to the necessary minimum and transfers the saved resources to enhancing the quality of formative assessments in schools and teacher education in alternative assessment methods. Evidence shows that formative and other school-based assessments are much more likely to improve the quality of education than conventional standardized tests.
  • It strengthens the collective autonomy of schools by giving teachers more independence from bureaucracy and investing in teamwork in schools. This would enhance social capital, which has proved critical to building trust within education and enhancing student learning.
  • It empowers students by involving them in assessing and reflecting on their own learning and then incorporating that information into collective human judgment about teaching and learning (supported by national big data). Because there are different ways students can be smart in schools, no single way of measuring student achievement will reveal success. Students’ voices about their own growth may be the tiny clues that uncover important trends in improving learning.

W. Edwards Deming once said that “without data you are just another person with an opinion.” But Deming couldn’t have imagined the size and speed of the data systems we have today….(More)”

Critics allege big data can be discriminatory, but is it really bias?


Pradip Sigdyal at CNBC: “…The often cited case of big data discrimination points to research conducted a few years ago by Latanya Sweeney, who heads the Data Privacy Lab at Harvard University.

The case involves Google ad results when searching for certain kinds of names on the internet. In her research, Sweeney found that distinctive-sounding names often associated with blacks showed up with a disproportionately higher number of arrest-record ads than white-sounding names, roughly 18 percent more of the time. Google has since fixed the issue, although it never publicly stated what it did to correct the problem.
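For readers curious how such a disparity is typically tested, here is a hedged sketch of a two-proportion z-test. The counts below are invented for illustration and are not Sweeney’s actual data:

```python
from math import sqrt, erf

# Illustrative two-proportion z-test for ad-delivery disparity.
# These counts are hypothetical, not Sweeney's actual study data.
group_a_ads, group_a_total = 435, 1000  # arrest ads shown / searches, group A
group_b_ads, group_b_total = 255, 1000  # arrest ads shown / searches, group B

p1 = group_a_ads / group_a_total
p2 = group_b_ads / group_b_total
# Pooled proportion and standard error under the null of no difference.
p = (group_a_ads + group_b_ads) / (group_a_total + group_b_total)
se = sqrt(p * (1 - p) * (1 / group_a_total + 1 / group_b_total))
z = (p1 - p2) / se
# Two-sided p-value via the normal CDF expressed with erf.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(f"difference = {p1 - p2:.1%}, z = {z:.2f}, p = {p_value:.2g}")
```

With an 18-point gap on samples of this size, the disparity is far outside what chance alone would produce, which is the statistical core of the discrimination claim.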

The proliferation of big data in the last few years has seen other allegations of improper use and bias. These allegations run the gamut, from online price discrimination and the consequences of geographic targeting to the controversial use of crime-predicting technology by law enforcement, and the lack of sufficiently representative data samples used in some public works decisions.

The benefits of big data need to be balanced with the risks of applying modern technologies to societal issues. Yet data advocates believe that the democratization of data has, in essence, given power to the people to effect change by transferring ‘tribal knowledge’ from experts to data-savvy practitioners.

Big data is here to stay

According to some advocates, the problem is not so much that ‘big data discriminates’, but that data professionals risk misinterpreting the findings at the heart of data mining and statistical learning. They add that the benefits far outweigh the concerns.

“In my academic research and industry consulting, I have seen tremendous benefits accruing to firms, organizations and consumers alike from the use of data-driven decision-making, data science, and business analytics,” said Anindya Ghose, director of the Center for Business Analytics at New York University’s Stern School of Business.

“To be perfectly honest, I do not at all understand these big-data cynics who engage in fear mongering about the implications of data analytics,” Ghose said.

“Here is my message to the cynics and those who keep cautioning us: ‘Deal with it, big data analytics is here to stay forever’.”…(More)”

OSoMe: The IUNI observatory on social media


Clayton A. Davis et al. in a PeerJ Preprint:  “The study of social phenomena is becoming increasingly reliant on big data from online social networks. Broad access to social media data, however, requires software development skills that not all researchers possess. Here we present the IUNI Observatory on Social Media, an open analytics platform designed to facilitate computational social science. The system leverages a historical, ongoing collection of over 70 billion public messages from Twitter. We illustrate a number of interactive open-source tools to retrieve, visualize, and analyze derived data from this collection. The Observatory, now available at osome.iuni.iu.edu, is the result of a large, six-year collaborative effort coordinated by the Indiana University Network Science Institute.”…(More)”

A Framework for Understanding Data Risk


Sarah Telford and Stefaan G. Verhulst at Understanding Risk Forum: “….In creating the policy, OCHA partnered with the NYU Governance Lab (GovLab) and Leiden University to understand the policy and privacy landscape, best practices of partner organizations, and how to assess the data it manages in terms of potential harm to people.

We seek to share our findings with the UR community to get feedback and start a conversation around the risks of using certain types of data in humanitarian and development efforts.

What is High-Risk Data?

High-risk data is generally understood as data that includes attributes about individuals. This is commonly referred to as PII, or personally identifiable information. Data can also create risk when it identifies communities or demographics within a group and ties them to a place (i.e., women of a certain age group in a specific location). The risk comes when this type of data is collected and shared without proper authorization from the individual or the organization acting as the data steward, or when the data is used for purposes other than what was initially stated during collection.

The potential harms of inappropriately collecting, storing or sharing personal data can affect individuals and communities that may feel exploited or vulnerable as the result of how data is used. This became apparent during the Ebola outbreak of 2014, when a number of data projects were implemented without appropriate risk management measures. One notable example was the collection and use of aggregated call data records (CDRs) to monitor the spread of Ebola, which not only had limited success in controlling the virus, but also compromised the personal information of those in Ebola-affected countries. (See Ebola: A Big Data Disaster).

A Data-Risk Framework

Regardless of an organization’s data requirements, it is useful to think through the potential risks and harms for its collection, storage and use. Together with the Harvard Humanitarian Initiative, we have set up a four-step data risk process that includes doing an assessment and inventory, understanding risks and harms, and taking measures to counter them.

  1. Assessment – The first step is to understand the context within which the data is being generated and shared. Key questions include: What is the anticipated benefit of using the data? Who has access to the data? What constitutes actionable information for a potential perpetrator? What could trigger the data being used inappropriately?
  2. Data Inventory – The second step is to take inventory of the data and how it is being stored. Key questions include: Where is the data – is it stored locally or hosted by a third party? Where could the data be housed later? Who might gain access to the data in the future? How will we know – is data access being monitored?
  3. Risks and Harms – The next step is to identify potential ways in which risk might materialize. Thinking through various risk-producing scenarios will help prepare staff for incidents. Examples of risks include: your organization’s data being correlated with other data sources to expose individuals; your organization’s raw data being publicly released; and/or your organization’s data system being maliciously breached.
  4. Counter-Measures – The final step is to determine what measures would prevent risk from materializing. Methods and tools include developing data-handling policies, implementing access controls to the data, and training staff on how to use data responsibly….(More)
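The four steps above lend themselves to a simple reusable checklist. This is a minimal sketch; the step names follow the post, while the condensed questions and the `review` helper are our own illustrative additions:

```python
# The four-step framework encoded as a checklist; questions are paraphrased
# from the post, and the review() helper is an illustrative addition.
DATA_RISK_FRAMEWORK = {
    "1. Assessment": [
        "What is the anticipated benefit of using the data?",
        "Who has access to the data?",
        "What constitutes actionable information for a potential perpetrator?",
    ],
    "2. Data Inventory": [
        "Where is the data stored: locally or with a third party?",
        "Who might gain access to the data in the future?",
        "Is data access being monitored?",
    ],
    "3. Risks and Harms": [
        "Could the data be correlated with other sources to expose individuals?",
        "What happens if the raw data is publicly released?",
        "What happens if the data system is maliciously breached?",
    ],
    "4. Counter-Measures": [
        "Are data handling policies in place?",
        "Are access controls implemented?",
        "Are staff trained in responsible data use?",
    ],
}

def review(answers):
    """Return, per step, the framework questions a project has not yet answered."""
    return {
        step: [q for q in questions if not answers.get(q)]
        for step, questions in DATA_RISK_FRAMEWORK.items()
    }

open_items = review({"Who has access to the data?": "Field staff and HQ only"})
print(sum(len(v) for v in open_items.values()), "questions still open")
```

Walking a project through `review` before collection starts is one lightweight way to make the framework operational rather than aspirational.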

Big Risks, Big Opportunities: the Intersection of Big Data and Civil Rights


Latest White House report on Big Data charts pathways for fairness and opportunity but also cautions against re-encoding bias and discrimination into algorithmic systems: ” Advertisements tailored to reflect previous purchasing decisions; targeted job postings based on your degree and social networks; reams of data informing predictions around college admissions and financial aid. Need a loan? There’s an app for that.

As technology advances and our economic, social, and civic lives become increasingly digital, we are faced with ethical questions of great consequence. Big data and associated technologies create enormous new opportunities to revisit assumptions and instead make data-driven decisions. Properly harnessed, big data can be a tool for overcoming longstanding bias and rooting out discrimination.

The era of big data is also full of risk. The algorithmic systems that turn data into information are not infallible—they rely on the imperfect inputs, logic, probability, and people who design them. Predictors of success can become barriers to entry; careful marketing can be rooted in stereotype. Without deliberate care, these innovations can easily hardwire discrimination, reinforce bias, and mask opportunity.

Because technological innovation presents both great opportunity and great risk, the White House has released several reports on “big data” intended to prompt conversation and advance these important issues. The topics of previous reports on data analytics included privacy, prices in the marketplace, and consumer protection laws. Today, we are announcing the latest report on big data, one centered on algorithmic systems, opportunity, and civil rights.

The first big data report warned of “the potential of encoding discrimination in automated decisions”—that is, discrimination may “be the inadvertent outcome of the way big data technologies are structured and used.” A commitment to understanding these risks and harnessing technology for good prompted us to specifically examine the intersection between big data and civil rights.

Using case studies on credit lending, employment, higher education, and criminal justice, the report we are releasing today illustrates how big data techniques can be used to detect bias and prevent discrimination. It also demonstrates the risks involved, particularly how technologies can deliberately or inadvertently perpetuate, exacerbate, or mask discrimination.

The purpose of the report is not to offer remedies to the issues it raises, but rather to identify these issues and prompt conversation, research—and action—among technologists, academics, policy makers, and citizens, alike.

The report includes a number of recommendations for advancing work in this nascent field of data and ethics. These include investing in research, broadening and diversifying technical leadership, cross-training, and expanded literacy on data discrimination, bolstering accountability, and creating standards for use within both the government and the private sector. It also calls on computer and data science programs and professionals to promote fairness and opportunity as part of an overall commitment to the responsible and ethical use of data.

Big data is here to stay; the question is how it will be used: to advance civil rights and opportunity, or to undermine them….(More)”

In the future, Big Data will make actual voting obsolete


Robert Epstein at Quartz: “Because I conduct research on how the Internet affects elections, journalists have lately been asking me about the primaries. Here are the two most common questions I’ve been getting:

  • Do Google’s search rankings affect how people vote?
  • How well does Google Trends predict the winner of each primary?

My answer to the first question is: Probably, but no one knows for sure. From research I have been conducting in recent years with Ronald E. Robertson, my associate at the American Institute for Behavioral Research and Technology, on the Search Engine Manipulation Effect (SEME, pronounced “seem”), we know that when high-ranking search results make one candidate look better than another, an enormous number of votes will be driven toward the higher-ranked candidate—up to 80% of undecided voters in some demographic groups. This is partly because we have all learned to trust high-ranked search results, but it is mainly because we are lazy; search engine users generally click on just the top one or two items.

Because no one actually tracks search rankings, however—they are ephemeral and personalized, after all, which makes them virtually impossible to track—and because no whistleblowers have yet come forward from any of the search engine companies, we cannot know for sure whether search rankings are consistently favoring one candidate or another. This means we also cannot know for sure how search rankings are affecting elections. We know the power they have to do so, but that’s it.

As for the question about Google Trends, for a while I was giving a mindless, common-sense answer: Well, I said, Google Trends tells you about search activity, and if lots more people are searching for “Donald Trump” than for “Ted Cruz” just before a primary, then more people will probably vote for Trump.

When you run the numbers, search activity seems to be a pretty good predictor of voting. On primary day in New Hampshire this year, search traffic on Google Trends was highest for Trump, followed by John Kasich, then Cruz—and so went the vote. But careful studies of the predictive power of search activity have actually gotten mixed results. A 2011 study by researchers at Wellesley College in Massachusetts, for example, found that Google Trends was a poor predictor of the outcomes of the 2008 and 2010 elections.
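A toy version of this check asks whether relative search volume orders the candidates the same way the vote did. In the sketch below, the vote shares roughly follow the reported 2016 New Hampshire result, while the search-share figures are invented for illustration:

```python
# Toy check of the Trends-as-predictor idea: does relative search volume
# rank candidates in the same order as the vote? Vote shares roughly follow
# the reported 2016 New Hampshire result; search shares are invented.
search_share = {"Donald Trump": 0.49, "John Kasich": 0.28, "Ted Cruz": 0.23}
vote_share = {"Donald Trump": 0.35, "John Kasich": 0.16, "Ted Cruz": 0.12}

def rank(shares):
    """Candidates ordered from highest to lowest share."""
    return sorted(shares, key=shares.get, reverse=True)

print("search order:", rank(search_share))
print("vote order:  ", rank(vote_share))
print("orders agree:", rank(search_share) == rank(vote_share))
```

An ordinal match across three candidates is weak evidence, of course, which is consistent with the mixed results that careful studies of search activity report.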

So much for Trends. But then I got to thinking: Why are we struggling so hard to figure out how to use Trends or tweets or shares to predict elections when Google actually knows exactly how we are going to vote? Impossible, you say? Think again….

This leaves us with two questions, one small and practical and the other big and weird.

The small, practical question is: How is Google using those numbers? Might they be sharing them with their preferred presidential candidate, for example? That is not unlawful, after all, and Google executives have taken a hands-on role in past presidential campaigns. The Wall Street Journal reported, for example, that Eric Schmidt, head of Google at that time, was personally overseeing Barack Obama’s programming team at his campaign headquarters the night before the 2012 election.

And the big, weird question is: Why are we even bothering to vote? Voting is such a hassle—the parking, the lines, the ID checks. Maybe we should all stay home and just let Google announce the winners….(More)”

Data innovation: where to start? With the road less taken


Giulio Quaggiotto at Nesta: “Over the past decade we’ve seen an explosion in the amount of data we create, with more being captured about our lives than ever before. As an industry, the public sector creates an enormous amount of information – from census data to tax data to health data. When it comes to use of the data however, despite many initiatives trying to promote open and big data for public policy as well as evidence-based policymaking, we feel there is still a long way to go.

Why is that? Data initiatives are often created under the assumption that if data is available, people (whether citizens or governments) will use it. But this hasn’t necessarily proven to be the case, and this approach neglects analysis of power and an understanding of the political dynamics at play around data (particularly when data is seen as an output rather than input).

Many data activities are also informed by the ‘extractive industry’ paradigm: citizens and frontline workers are seen as passive ‘data producers’ who hand over their information for it to be analysed and mined behind closed doors by ‘the experts’.

Given budget constraints facing many local and central governments, even well-intentioned initiatives often take an incremental, passive-transparency approach (i.e. let’s open the data first, then see what happens), or they adopt a ‘supply/demand’ metaphor for data provision and usage….

As a response to these issues, this blog series will explore the hypothesis that putting the question of citizen and government agency – rather than openness, volume or availability – at the centre of data initiatives has the potential to unleash greater, potentially more disruptive innovation and to focus efforts (ultimately leading to cost savings).

Our argument will be that data innovation initiatives should be informed by the principles that:

  • People closer to the problem are the best positioned to provide additional context to the data and potentially act on solutions (hence the importance of “thick data”).

  • Citizens are active agents rather than passive providers of ‘digital traces’.

  • Governments are both users and providers of data.

  • We should ask at every step of the way how can we empower communities and frontline workers to take better decisions over time, and how can we use data to enhance the decision making of every actor in the system (from government to the private sector, from private citizens to social enterprises) in their role of changing things for the better… (More)

Ethical Reasoning in Big Data


Book edited by Jeff Collmann and Sorin Adam Matei: “This book springs from a multidisciplinary, multi-organizational, and multi-sector conversation about the privacy and ethical implications of research in human affairs using big data. The need to cultivate and enlist the public’s trust in the abilities of particular scientists and scientific institutions constitutes one of this book’s major themes. The advent of the Internet, the mass digitization of research information, and social media brought about, among many other things, the ability to harvest – sometimes implicitly – a wealth of human genomic, biological, behavioral, economic, political, and social data for the purposes of scientific research as well as commerce, government affairs, and social interaction. What type of ethical dilemmas did such changes generate? How should scientists collect, manipulate, and disseminate this information? The effects of this revolution and its ethical implications are wide-ranging.

This book includes the opinions of myriad investigators, practitioners, and stakeholders in big data on human beings who also routinely reflect on the privacy and ethical issues of this phenomenon. Dedicated to the practice of ethical reasoning and reflection in action, the book offers a range of observations, lessons learned, reasoning tools, and suggestions for institutional practice to promote responsible big data research on human affairs. It caters to a broad audience of educators, researchers, and practitioners. Educators can use the volume in courses related to big data handling and processing. Researchers can use it for designing new methods of collecting, processing, and disseminating big data, whether in raw form or as analysis results. Lastly, practitioners can use it to steer future tools or procedures for handling big data. As this topic represents an area of great interest that still remains largely undeveloped, this book is sure to attract significant interest by filling an obvious gap in currently available literature. …(More)”

Addressing the ‘doctrine gap’: professionalising the use of Information Communication Technologies in humanitarian action


Nathaniel A. Raymond and Casey S. Harrity at HPN: “This generation of humanitarian actors will be defined by the actions they take in response to the challenges and opportunities of the digital revolution. At this critical moment in the history of humanitarian action, success depends on humanitarians recognising that the use of information communication technologies (ICTs) must become a core competency for humanitarian action. Treated in the past as a boutique sub-area of humanitarian practice, the central role that they now play has made the collection, analysis and dissemination of data derived from ICTs and other sources a basic skill required of humanitarians in the twenty-first century. ICT use must now be seen as an essential competence with critical implications for the efficiency and effectiveness of humanitarian response.

Practice in search of a doctrine

ICT use for humanitarian response runs the gamut from satellite imagery to drone deployment; to tablet and smartphone use; to crowd mapping and aggregation of big data. Humanitarian actors applying these technologies include front-line responders in NGOs and the UN but also, increasingly, volunteers and the private sector. The rapid diversification of available technologies as well as the increase in actors utilising them for humanitarian purposes means that the use of these technologies has far outpaced the ethical and technical guidance available to practitioners. Technology adoption by humanitarian actors prior to the creation of standards for how and how not to apply a specific tool has created a largely undiscussed and unaddressed ‘doctrine gap’.

Examples of this gap are, unfortunately, many. One such is the mass collection of personally identifiable cell phone data by humanitarian actors as part of phone surveys and cash transfer programmes. Although initial best practice and lessons learned have been developed for this method of data collection, no common inter-agency standards exist, nor are there comprehensive ethical frameworks for what data should be retained and for how long, and what data should be anonymised or not collected in the first place…(More)”

Open Data Supply: Enriching the usability of information


Report by Phoensight: “With the emergence of increasing computational power, high cloud storage capacity and big data comes an eager anticipation of one of the biggest IT transformations of our society today.

Open data has an instrumental role to play in our digital revolution by creating unprecedented opportunities for governments and businesses to leverage off previously unavailable information to strengthen their analytics and decision making for new client experiences. Whilst virtually every business recognises the value of data and the importance of the analytics built on it, the ability to realise the potential for maximising revenue and cost savings is not straightforward. The discovery of valuable insights often involves the acquisition of new data and an understanding of it. As we move towards an increasing supply of open data, technological and other entrepreneurs will look to better utilise government information for improved productivity.

This report uses a data-centric approach to examine the usability of information by considering ways in which open data could better facilitate data-driven innovations and further boost our economy. It assesses the state of open data today and suggests ways in which data providers could supply open data to optimise its use. A number of useful measures of information usability such as accessibility, quantity, quality and openness are presented which together contribute to the Open Data Usability Index (ODUI). For the first time, a comprehensive assessment of open data usability has been developed and is expected to be a critical step in taking the open data agenda to the next level.
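As a sketch of how a composite index like the ODUI might combine the four measures the report names, here is a minimal weighted-average illustration. The equal weighting and the example scores are assumptions of ours, not the report’s actual methodology:

```python
# Hedged sketch of a composite usability index over the four measures named
# in the report. Equal weights and the example scores are illustrative
# assumptions, not the ODUI's actual methodology.
MEASURES = ("accessibility", "quantity", "quality", "openness")

def usability_index(scores, weights=None):
    """Weighted average of per-measure scores in [0, 1]; result is in [0, 1]."""
    if weights is None:
        weights = {m: 1.0 / len(MEASURES) for m in MEASURES}
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[m] * weights[m] for m in MEASURES)

# Hypothetical scores for one dataset:
dataset = {"accessibility": 0.9, "quantity": 0.6, "quality": 0.7, "openness": 0.8}
print(round(usability_index(dataset), 3))
```

Separating the scores from the weights makes explicit a key design choice in any such index: the ranking of countries can shift when the weights change, even with identical underlying measurements.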

With over two million government datasets assessed against the open data usability framework and models developed to link entire countries’ datasets to key industry sectors, never before has such an extensive analysis been undertaken. Government open data across Australia, Canada, Singapore, the United Kingdom and the United States reveal that most countries have the capacity for improvements in their information usability. It was found that for 2015 the United Kingdom led the way followed by Canada, Singapore, the United States and Australia. The global potential of government open data is expected to reach 20 exabytes by 2020, provided governments are able to release as much data as possible within legislative constraints….(More)”