Data for Policy: Data Science and Big Data in the Public Sector


Innar Liiv at OXPOL: “How can big data and data science help policy-making? This question has recently gained increasing attention. Both the European Commission and the White House have endorsed the use of data for evidence-based policy making.

Still, a gap remains between theory and practice. In this blog post, I make a number of recommendations for systematic development paths.

RESEARCH TRENDS SHAPING DATA FOR POLICY

‘Data for policy’ as an academic field is still in its infancy. A typology of the field’s foci and research areas is summarised in the figure below.

 

[Figure: a typology of ‘data for policy’ research foci and areas]

 

Besides the ‘data for policy’ community, there are two important research trends shaping the field: 1) computational social science; and 2) the emergence of politicised social bots.

Computational social science (CSS) is a new interdisciplinary research trend in social science that seeks to translate advances in big data and data science into research methodologies for understanding, explaining and predicting underlying social phenomena.

Social science has a long tradition of using computational and agent-based modelling approaches (e.g. Schelling’s Model of Segregation), but the new challenge is to feed real-life, and sometimes even real-time, information into those systems to gain rapid insights into the validity of research hypotheses.

For example, one could use mobile phone call records to assess the acculturation processes of different communities. Such a project would involve translating different acculturation theories into computational models, researching the ethical and legal issues inherent in using mobile phone data, and developing a vision for generating policy recommendations and new research hypotheses from the analysis.
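
To make the idea of such a computational model concrete, here is a minimal Schelling-style segregation sketch in Python. The grid size, empty ratio, tolerance threshold and update rule are illustrative assumptions rather than parameters from any study mentioned above; in the spirit of computational social science, the random initial grid could be replaced by suitably aggregated real-world data.

```python
# Minimal, illustrative Schelling-style segregation sketch.
# Grid size, empty ratio, tolerance and step count are arbitrary assumptions.
import random

SIZE, EMPTY_RATIO, TOLERANCE, STEPS = 20, 0.2, 0.4, 50

def make_grid():
    """Random grid of two agent types ('A'/'B') and empty cells (None)."""
    cells = []
    for _ in range(SIZE * SIZE):
        r = random.random()
        cells.append(None if r < EMPTY_RATIO else ("A" if r < (1 + EMPTY_RATIO) / 2 else "B"))
    return [cells[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]

def unhappy(grid, x, y):
    """An agent is unhappy if too few of its neighbours share its type."""
    me = grid[x][y]
    if me is None:
        return False
    same = other = 0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == dy == 0:
                continue
            neighbour = grid[(x + dx) % SIZE][(y + dy) % SIZE]
            if neighbour is None:
                continue
            same += neighbour == me
            other += neighbour != me
    return (same + other) > 0 and same / (same + other) < TOLERANCE

def step(grid):
    """Move every unhappy agent to a randomly chosen empty cell."""
    empties = [(x, y) for x in range(SIZE) for y in range(SIZE) if grid[x][y] is None]
    for x in range(SIZE):
        for y in range(SIZE):
            if unhappy(grid, x, y) and empties:
                ex, ey = empties.pop(random.randrange(len(empties)))
                grid[ex][ey], grid[x][y] = grid[x][y], None
                empties.append((x, y))

grid = make_grid()
for _ in range(STEPS):
    step(grid)
print(sum(cell is not None for row in grid for cell in row), "agents settled after", STEPS, "steps")
```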

Politicised social bots are also beginning to make their mark. In 2011, DARPA solicited research proposals dealing with social media in strategic communication. The term ‘political bot’ was not used, but the expected results left no doubt about the goals…

The next wave of e-government innovation will be about analytics and predictive models.  Taking advantage of their potential for social impact will require a solid foundation of e-government infrastructure.

The most important questions going forward are as follows:

  • What are the relevant new data sources?
  • How can we use them?
  • What should we do with the information? Who cares? Which political decisions need faster information from novel sources? Do we need faster information? Does it come with unanticipated risks?

These questions barely scratch the surface, because the complex interplay between general advancements in computational social science and satellite topics like political bots will have an enormous impact on research and on the use of data for policy. But it’s an important start….(More)”

Questioning Big Data: Crowdsourcing crisis data towards an inclusive humanitarian response


Femke Mulder, Julie Ferguson, Peter Groenewegen, Kees Boersma, and Jeroen Wolbers in Big Data and Society: “The aim of this paper is to critically explore whether crowdsourced Big Data enables an inclusive humanitarian response at times of crisis. We argue that all data, including Big Data, are socially constructed artefacts that reflect the contexts and processes of their creation. To support our argument, we qualitatively analysed the process of ‘Big Data making’ that occurred by way of crowdsourcing through open data platforms, in the context of two specific humanitarian crises, namely the 2010 earthquake in Haiti and the 2015 earthquake in Nepal. We show that the process of creating Big Data from local and global sources of knowledge entails the transformation of information as it moves from one distinct group of contributors to the next. The implication of this transformation is that locally based, affected people and often the original ‘crowd’ are excluded from the information flow, and from the interpretation process of crowdsourced crisis knowledge, as used by formal responding organizations, and are marginalized in their ability to benefit from Big Data in support of their own means. Our paper contributes a critical perspective to the debate on participatory Big Data, by explaining the process of in- and exclusion during data making, towards more responsive humanitarian relief….(More)”.

How Big Data Analytics is Changing Legal Ethics


Renee Knake at Bloomberg Law: “Big data analytics are changing how lawyers find clients, conduct legal research and discovery, draft contracts and court papers, manage billing and performance, predict the outcome of a matter, select juries, and more. Ninety percent of corporate legal departments, law firms, and government lawyers note that data analytics are applied in their organizations, albeit in limited ways, according to a 2015 survey. The Legal Services Corporation, the largest funder of civil legal aid for low-income individuals in the United States, recommended in 2012 that all states collect and assess data on case progress/outcomes to improve the delivery of legal services. Lawyers across all sectors of the market increasingly recognize how big data tools can enhance their work.

A growing literature advocates for businesses and governmental bodies to adopt data ethics policies, and many have done so. It is not uncommon to find data-use policies prominently displayed on company or government websites, or required as part of a click-through consent before gaining access to a mobile app or webpage. Data ethics guidelines can help avoid controversies, especially when analytics are used in potentially manipulative or exploitive ways. Consider, for example, Target’s data analytics that uncovered a teen’s pregnancy before her father did, or Orbitz’s data analytics that offered pricier hotels to Mac users. These are just two of numerous examples in recent years where companies faced criticism for how they used data analytics.

While some law firms and legal services organizations follow data-use policies or codes of conduct, many do not. Perhaps this is because the legal profession was not transformed as early or rapidly as other industries, or because until now, big data in legal was largely limited to e-discovery, where the data use is confined to the litigation and is subject to judicial oversight. Another reason may be that lawyers believe their rules of professional conduct provide sufficient guidance and protection. Unlike other industries, lawyers are governed by a special code of ethical obligations to clients, the justice system, and the public. In most states, this code is based in part upon the American Bar Association (ABA) Model Rules of Professional Conduct, though rules often vary from jurisdiction to jurisdiction. Several of the Model Rules are relevant to big data use. That said, the Model Rules are insufficient for addressing a number of fundamental ethical concerns.

At the moment, legal ethics for big data analytics is at best an incomplete mix of professional conduct rules and informal policies adopted by some, but not all law practices. Given the increasing prevalence of data analytics in legal services, lawyers and law students should be familiar not only with the relevant professional conduct rules, but also the ethical questions left unanswered. Listed below is a brief summary of both, followed by a proposed legal ethics agenda for data analytics. …

Questions Unanswered by Lawyer Ethics Rules 

Access/Ownership. Who owns the original data — the individual source or the holder of the pooled information? Who owns the insights drawn from its analysis? Who should receive access to the data compilation and the results?

Anonymity/Identity. Should all personally identifiable or sensitive information be removed from the data? What protections are necessary to respect individual autonomy? How should individuals be able to control and shape their electronic identity?

Consent. Should individuals affirmatively consent to use of their personal data? Or is it sufficient to provide notice, perhaps with an opt-out provision?

Privacy/Security. Should privacy be protected beyond the professional obligation of client confidentiality? How should data be secured? The ABA called upon private and public sector lawyers to implement cyber-security policies, including data use, in a 2012 resolution and produced a cyber-security handbook in 2013.

Process. How involved should lawyers be in the process of data collection and analysis? In the context of e-discovery, for example, a lawyer is expected to understand how documents are collected, produced, and preserved, or to work with a specialist. Should a similar level of knowledge be required for all forms of data analytics use?

Purpose. Why was the data first collected from individuals? What is the purpose for the current use? Is there a significant divergence between the original and secondary purposes? If so, is it necessary for the individuals to consent to the secondary purpose? How will unintended consequences be addressed?

Source. What is the source of the data? Did the lawyer collect it directly from clients, or is the lawyer relying upon a third-party source? Client-based data is, of course, subject to the lawyer’s professional conduct rules. Data from any source should be trustworthy, reasonable, timely, complete, and verifiable….(More)”

Why Zika, Malaria and Ebola should fear analytics


Frédéric Pivetta at Real Impact Analytics: “Big data is a hot business topic. It turns out to be an equally hot topic for the non profit sector now that we know the vital role analytics can play in addressing public health issues and reaching sustainable development goals.

Big players like IBM just announced they will help fight Zika by analyzing social media, transportation and weather data, among other indicators. Telecom data takes it further by helping to predict the spread of disease, identifying isolated and fragile communities and prioritizing the actions of aid workers.

The power of telecom data

Human mobility contributes significantly to epidemic transmission into new regions. However, there are gaps in understanding human mobility due to the limited and often outdated data available from travel records. In some countries, these are collected by health officials in the hospitals or in occasional surveys.

Telecom data, constantly updated and covering a large portion of the population, is rich in terms of mobility insights. But there are other benefits:

  • it’s recorded automatically (in the Call Detail Records, or CDRs), so that we avoid data collection and response bias.
  • it contains localization and time information, which is great for understanding human mobility.
  • it contains info on connectivity between people, which helps in understanding social networks.
  • it contains info on phone spending, which allows tracking of socio-economic indicators.

Aggregated and anonymized, mobile telecom data fills the public data gap without raising privacy concerns. Mixing it with other public data sources results in a very precise and reliable view of human mobility patterns, which is key to preventing the spread of epidemics.
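
A minimal sketch of what "aggregated and anonymized" could look like in practice is shown below. The field names, the salted-hash pseudonymisation and the tower-level aggregation are illustrative assumptions made for this post, not the schema or pipeline of any real operator.

```python
# Illustrative sketch only: field names and the anonymisation step are
# assumptions for this example, not a real operator's CDR schema.
import hashlib
from collections import Counter
from dataclasses import dataclass

@dataclass
class CallRecord:
    subscriber_id: str   # raw identifier as received from the operator
    cell_tower: str      # tower that handled the call (proxy for location)
    timestamp: str       # ISO 8601 time of the call
    duration_s: int      # call duration in seconds

def pseudonymise(subscriber_id: str, salt: str = "example-salt") -> str:
    """Replace the raw identifier with a salted hash before any analysis."""
    return hashlib.sha256((salt + subscriber_id).encode()).hexdigest()[:12]

def tower_activity(records: list) -> Counter:
    """Aggregate call counts per tower, dropping individual identifiers."""
    return Counter(r.cell_tower for r in records)

records = [
    CallRecord("subscriber-001", "tower-17", "2016-03-01T22:15:00", 120),
    CallRecord("subscriber-002", "tower-17", "2016-03-01T08:40:00", 45),
    CallRecord("subscriber-001", "tower-03", "2016-03-02T09:05:00", 300),
]
records = [CallRecord(pseudonymise(r.subscriber_id), r.cell_tower, r.timestamp, r.duration_s)
           for r in records]
print(tower_activity(records))   # Counter({'tower-17': 2, 'tower-03': 1})
```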

Using telecom data to map epidemic risk flows

So how does it work? As in any other big data application, the challenge is to build the right predictive model, allowing decision-makers to take the most appropriate actions. In the case of epidemic transmission, the methodology typically includes five steps (a compressed code sketch follows the list):

  • Identify mobility patterns relevant for each particular disease. For example, short-term trips for fast-spreading diseases like Ebola, or overnight trips for diseases like Malaria, which is spread by mosquitoes that are active only at night. Such patterns can be deduced from the CDRs: we can find the home location of each user by looking at the most active night-time tower, and then track calls to identify short- or long-term trips. Aggregating data per origin-destination pair is useful as we look at intercity or interregional transmission flows. And it protects the privacy of individuals, as no one can be singled out from the aggregated data.
  • Get data on epidemic incidence, typically from local organisations like national healthcare systems or, in case of emergency, from NGOs or dedicated emergency teams. This data should be aggregated at the same level of granularity as the CDRs.
  • Knowing how many travelers go from one place to another, for how long, and the disease incidence at origin and destination, build an epidemiological model that accounts for the mode and speed of transmission of the particular disease.
  • With an import/export scoring model, map epidemic risk flows and flag areas that are at risk of becoming the new hotspots because of human travel.
  • On that basis, prioritize and monitor public health measures, focusing on restraining mobility to and from hotspots. Mapping risk also allows launching prevention campaigns in the right places and setting up the necessary infrastructure on time. Eventually, the tool reduces public health risks and helps stem the epidemic.
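
A compressed sketch of steps 1–4 is shown below, under strongly simplifying assumptions: "home" is taken to be each user's most active night-time tower, and the risk score of an origin-destination pair is simply travellers multiplied by incidence at the origin. The data and the scoring rule are placeholders for illustration, not the model actually used by Real Impact Analytics.

```python
# Compressed, illustrative sketch of the CDR-based workflow described above.
# The home-location rule and the risk score are simplifying assumptions,
# not the model actually used by the analysts quoted in the post.
from collections import Counter, defaultdict

# (user, tower, hour) tuples standing in for anonymised call detail records.
cdrs = [
    ("u1", "north", 23), ("u1", "north", 1), ("u1", "south", 14),
    ("u2", "south", 22), ("u2", "south", 2), ("u2", "north", 11),
]

def home_towers(cdrs):
    """Step 1a: take each user's most active night-time tower as 'home'."""
    nightly = defaultdict(Counter)
    for user, tower, hour in cdrs:
        if hour >= 21 or hour <= 5:
            nightly[user][tower] += 1
    return {user: towers.most_common(1)[0][0] for user, towers in nightly.items()}

def od_flows(cdrs, homes):
    """Step 1b: aggregate appearances away from home into origin-destination flows."""
    flows = Counter()
    for user, tower, hour in cdrs:
        home = homes.get(user)
        if home and tower != home:
            flows[(home, tower)] += 1
    return flows

# Step 2: disease incidence per region, from health authorities (made-up numbers).
incidence = {"north": 0.02, "south": 0.15}

def risk_flows(flows, incidence):
    """Steps 3-4: placeholder import/export score = travellers x incidence at origin."""
    return {od: n * incidence.get(od[0], 0.0) for od, n in flows.items()}

homes = home_towers(cdrs)
print(risk_flows(od_flows(cdrs, homes), incidence))
# {('north', 'south'): 0.02, ('south', 'north'): 0.15}
```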

That kind of application works in a variety of epidemiological contexts, including Zika, Ebola, Malaria, Influenza or Tuberculosis. No doubt the global boom of mobile data will prove extraordinarily helpful in fighting these fierce enemies….(More)”

The Ethics of Biomedical Big Data


Book edited by Mittelstadt, Brent Daniel, and Floridi, Luciano: “This book presents cutting edge research on the new ethical challenges posed by biomedical Big Data technologies and practices. ‘Biomedical Big Data’ refers to the analysis of aggregated, very large datasets to improve medical knowledge and clinical care. The book describes the ethical problems posed by aggregation of biomedical datasets and re-use/re-purposing of data, in areas such as privacy, consent, professionalism, power relationships, and ethical governance of Big Data platforms. Approaches and methods are discussed that can be used to address these problems to achieve the appropriate balance between the social goods of biomedical Big Data research and the safety and privacy of individuals. Seventeen original contributions analyse the ethical, social and related policy implications of the analysis and curation of biomedical Big Data, written by leading experts in the areas of biomedical research, medical and technology ethics, privacy, governance and data protection. The book advances our understanding of the ethical conundrums posed by biomedical Big Data, and shows how practitioners and policy-makers can address these issues going forward….(More)”

Through the looking glass: Harnessing big data to respond to violent extremism


Michele Piercey, Carolyn Forbes, and Hasan Davulcu at Devex: “People think and say all sorts of things that they would never actually do. One of the biggest challenges in countering violent extremism is not only figuring out which people hold radical views, but who is most likely to join and act on behalf of violent extremist organizations. Determining who is likely to become violent is key to designing and evaluating more targeted interventions, but it has proven to be extremely difficult.

There are few recognized tools for assessing perceptions and beliefs, such as whether community sentiment about violent extremist organizations is more or less favorable, or which narratives and counternarratives resonate with vulnerable populations.

Program designers and monitoring and evaluation staff often rely on perception surveying to assess attitudinal changes that CVE programs try to achieve, but there are limitations to this method. Security and logistical challenges to collecting perception data in a conflict-affected community can make it difficult to get a representative sample, while ensuring the safety of enumerators and respondents. And given the sensitivity of the subject matter, respondents may be reluctant to express their actual beliefs to an outsider (that is, social desirability bias can affect data reliability).

The rise of smartphone technology and social media uptake among the burgeoning youth populations of many conflict-affected countries presents a new opportunity to understand what people believe from a safer distance, lessening the associated risks and data defects. Seeing an opportunity in the growing mass of online public data, the marketing industry has pioneered tools to “scrape” and aggregate the data to help companies paint a clearer picture of consumer behavior and perceptions of brands and products.

These developments present a critical question for CVE programs: Could similar tools be developed that would analyze online public data to identify who is being influenced by which extremist narratives and influences, learn which messages go viral, and distinguish groups and individuals who simply hold radical views from those who support or carry out violence?

Using data to track radicalization

Seeking to answer this question, researchers at Arizona State University’s Center for the Study of Religion and Conflict, Cornell University’s Social Dynamics Laboratory, and Carnegie Mellon’s Center for Computational Analysis of Social and Organizational Systems have been developing a wide variety of data analytics tools. ASU’s LookingGlass tool, for example, maps networks of perception, belief, and influence online. ASU and Chemonics International are now piloting the tool on a CVE program in Libya.

Drawn from the humanities and social and computational sciences, LookingGlass retrieves, categorizes, and analyzes vast amounts of data from across the internet to map the spread of extremist and counter-extremist influence online. The tool displays what people think about their political situation, governments and extremist groups, and tracks changes in these perceptions over time and in response to events. It also lets users visualize how groups emerge, interact, coalesce, and fragment in relation to emerging issues and events and evaluates “information cascades” to assess what causes extremist messages to go viral on social media and what causes them to die out.
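
To illustrate what modelling an "information cascade" can involve, here is a generic independent-cascade simulation in Python. It is emphatically not LookingGlass's actual method; the follower graph and the re-share probability are invented for the example.

```python
# A generic independent-cascade simulation, included only to illustrate what
# "information cascade" modelling can look like; it is not LookingGlass's
# method, and the network and probability below are invented.
import random

# Directed follower graph: who can pass a message on to whom (assumed).
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}
SHARE_PROB = 0.3   # assumed per-edge probability of re-sharing a message

def cascade(graph, seeds, p=SHARE_PROB):
    """Simulate one cascade: each newly reached node gets one chance to pass the message to each neighbour."""
    reached, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for nbr in graph.get(node, []):
            if nbr not in reached and random.random() < p:
                reached.add(nbr)
                frontier.append(nbr)
    return reached

# Estimate the expected reach of a message seeded at "a" over many runs.
runs = [len(cascade(graph, {"a"})) for _ in range(1000)]
print(sum(runs) / len(runs))   # average cascade size
```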

By assessing the relative influence and expressed beliefs of diverse groups over time and in critical locations, LookingGlass represents an advanced capability for providing real-time contextual information about the ideological drivers of violent and counter-violent extremist movements online.

For CVE planners, LookingGlass can map social movements in relation to specific countries and regions. Indonesia, for example, has been the site of numerous violent movements and events. A relatively young democracy, the country’s complex political environment encompasses numerous groups seeking radical change across a wide spectrum of social and political issues….(More)”

 

Make Algorithms Accountable


Julia Angwin in The New York Times: “Algorithms are ubiquitous in our lives. They map out the best route to our destination and help us find new music based on what we listen to now. But they are also being employed to inform fundamental decisions about our lives.

Companies use them to sort through stacks of résumés from job seekers. Credit agencies use them to determine our credit scores. And the criminal justice system is increasingly using algorithms to predict a defendant’s future criminality.
Those computer-generated criminal “risk scores” were at the center of a recent Wisconsin Supreme Court decision that set the first significant limits on the use of risk algorithms in sentencing.
The court ruled that while judges could use these risk scores, the scores could not be a “determinative” factor in whether a defendant was jailed or placed on probation. And, most important, the court stipulated that a presentence report submitted to the judge must include a warning about the limits of the algorithm’s accuracy.

This warning requirement is an important milestone in the debate over how our data-driven society should hold decision-making software accountable. But advocates for big data due process argue that much more must be done to assure the appropriateness and accuracy of algorithm results.

An algorithm is a procedure or set of instructions often used by a computer to solve a problem. Many algorithms are secret. In Wisconsin, for instance, the risk-score formula was developed by a private company and has never been publicly disclosed because it is considered proprietary. This secrecy has made it difficult for lawyers to challenge a result.

The credit score is the lone algorithm in which consumers have a legal right to examine and challenge the underlying data used to generate it. In 1970, President Richard M. Nixon signed the Fair Credit Reporting Act. It gave people the right to see the data in their credit reports and to challenge and delete data that was inaccurate.

For most other algorithms, people are expected to read fine-print privacy policies, in the hopes of determining whether their data might be used against them in a way that they wouldn’t expect.

“We urgently need more due process with the algorithmic systems influencing our lives,” says Kate Crawford, a principal researcher at Microsoft Research who has called for big data due process requirements. “If you are given a score that jeopardizes your ability to get a job, housing or education, you should have the right to see that data, know how it was generated, and be able to correct errors and contest the decision.”

The European Union has recently adopted a due process requirement for data-driven decisions based “solely on automated processing” that “significantly affect” citizens. The new rules, which are set to go into effect in May 2018, give European Union citizens the right to obtain an explanation of automated decisions and to challenge those decisions. However, since the European regulations apply only to situations that don’t involve human judgment “such as automatic refusal of an online credit application or e-recruiting practices without any human intervention,” they are likely to affect a narrow class of automated decisions. …More recently, the White House has suggested that algorithm makers police themselves. In a recent report, the administration called for automated decision-making tools to be tested for fairness, and for the development of “algorithmic auditing.”
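
As a hint of what "algorithmic auditing" might involve in practice, the sketch below computes one common fairness check: the ratio of favourable-outcome rates between two groups (the so-called 80% rule). The data, group labels and threshold are illustrative assumptions and are not drawn from the White House report.

```python
# One simple check an "algorithmic audit" might include: the ratio of
# favourable-outcome rates between two groups (the "80% rule" heuristic).
# The data, group labels, and threshold are illustrative assumptions only.
def favourable_rate(decisions, group):
    """Share of applicants in `group` who received a favourable decision."""
    in_group = [d for d in decisions if d["group"] == group]
    return sum(d["approved"] for d in in_group) / len(in_group)

def disparate_impact_ratio(decisions, protected, reference):
    """Ratio of favourable rates; values well below 1.0 suggest the tool merits scrutiny."""
    return favourable_rate(decisions, protected) / favourable_rate(decisions, reference)

decisions = [
    {"group": "x", "approved": True}, {"group": "x", "approved": False},
    {"group": "x", "approved": False}, {"group": "y", "approved": True},
    {"group": "y", "approved": True}, {"group": "y", "approved": False},
]
ratio = disparate_impact_ratio(decisions, protected="x", reference="y")
print(round(ratio, 2))   # 0.5 here; the common rule of thumb flags ratios below 0.8
```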

But algorithmic auditing is not yet common. In 2014, Eric H. Holder Jr., then the attorney general, called for the United States Sentencing Commission to study whether risk assessments used in sentencing were reinforcing unjust disparities in the criminal justice system. No study was done….(More)”

Open Data for Developing Economies


Scan of the literature by Andrew Young, Stefaan Verhulst, and Juliet McMurren: This edition of the GovLab Selected Readings was developed as part of the Open Data for Developing Economies research project (in collaboration with WebFoundation, USAID and fhi360). Special thanks to Maurice McNaughton, Francois van Schalkwyk, Fernando Perini, Michael Canares and David Opoku for their input on an early draft. Please contact Stefaan Verhulst (stefaan@thegovlab.org) for any additional input or suggestions.

Open data is increasingly seen as a tool for economic and social development. Across sectors and regions, policymakers, NGOs, researchers and practitioners are exploring the potential of open data to improve government effectiveness, create new economic opportunity, empower citizens and solve public problems in developing economies. Open data for development does not exist in a vacuum – rather it is a phenomenon that is relevant to and studied from different vantage points including Data4Development (D4D), Open Government, the United Nations’ Sustainable Development Goals (SDGs), and Open Development. The below-selected readings provide a view of the current research and practice on the use of open data for development and its relationship to related interventions.

Selected Reading List (in alphabetical order)

  • Open Data and Open Development…
  • Open Data and Developing Countries (National Case Studies)….(More)”

Metric Power


Book by David Beer: This book examines the powerful and intensifying role that metrics play in ordering and shaping our everyday lives. Focusing upon the interconnections between measurement, circulation and possibility, the author explores the interwoven relations between power and metrics. He draws upon a wide range of interdisciplinary resources to place these metrics within their broader historical, political and social contexts. More specifically, he illuminates the various ways that metrics implicate our lives – from our work, to our consumption and our leisure, through to our bodily routines and the financial and organisational structures that surround us. Unravelling the power dynamics that underpin and reside within the so-called big data revolution, he develops the central concept of Metric Power along with a set of conceptual resources for thinking critically about the powerful role played by metrics in the social world today….(More)”

Matchmaker, matchmaker make me a mortgage: What policymakers can learn from dating websites


Angelina Carvalho, Chiranjit Chakraborty and Georgia Latsi at Bank Underground: “Policy makers have access to more and more detailed datasets. These can be joined together to give an unprecedentedly rich description of the economy. But the data are often noisy and individual entries are not uniquely identifiable. This leads to a trade-off: very strict matching criteria may result in a limited and biased sample; making them too loose risks inaccurate data. The problem gets worse when joining large datasets as the potential number of matches increases exponentially. Even with today’s astonishing computer power, we need efficient techniques. In this post we describe a bipartite matching algorithm on such big data to deal with these issues. Similar algorithms are often used in online dating, closely modelled on the stable marriage problem.

The home-mover problem

The housing market matters and affects almost everything that central banks care about. We want to know why, when and how people move home. And a lot do move: one in nine UK households in 2013/4 according to the Office for National Statistics (ONS). Fortunately, it is also a market that we have an increasing amount of information about. We are going to illustrate the use of the matching algorithm in the context of identifying the characteristics of these movers and the mortgages that many of them took out.

A Potential Solution

The FCA’s Product Sales Data (PSD) on owner-occupied mortgage lending contains loan level product, borrower and property characteristics for all loans originated in the UK since Q2 2005. This dataset captures the attributes of each loan at the point of origination but does not follow the borrowers afterwards. Hence, it does not meaningfully capture whether a loan was transferred to another property or closed for some reason. Also, there is no unique borrower identifier, which is why we cannot easily monitor whether a borrower repaid their old mortgage and took out a new one against another property.

However, the dataset identify whether a borrower is a first time buyer or a home-mover, together with other information. Even though we do not have information before 2005, we can still try to use this dataset to identify some of the owners’ moving patterns. We try to find from where a home-mover may have moved (origination point) and who moved in to his/her vacant property. If we can successfully track the movers, it will also help us to remove corresponding old mortgages to calculate the stock of mortgages from our flow data. A previous Bank Underground post showed how probabilistic record linkage techniques can be used to join related datasets that do not have unique common identifiers.  We have used bipartite graph matching techniques here to extend those ideas….(More)”
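
A minimal sketch of the stable-marriage idea applied to the home-mover problem might look like the following: movers "propose" to vacated properties in order of a similarity score, and each vacancy keeps its best proposer. The records, the similarity function and the preference rule are illustrative assumptions; the Bank's actual bipartite matching criteria are not reproduced here.

```python
# Minimal Gale-Shapley-style sketch of the stable-matching idea alluded to above.
# The records, similarity score and preference rule are illustrative assumptions,
# not the Bank's actual matching criteria.
def similarity(mover, vacancy):
    """Toy score: prefer vacancies in the mover's previous area and of similar loan size."""
    score = 10 if mover["prev_area"] == vacancy["area"] else 0
    return score - abs(mover["prev_loan"] - vacancy["loan"]) / 10_000

def stable_match(movers, vacancies):
    """Movers propose to vacancies in order of similarity; each vacancy keeps its best proposer."""
    prefs = {m["id"]: sorted(vacancies, key=lambda v: -similarity(m, v)) for m in movers}
    next_choice = {m["id"]: 0 for m in movers}   # index of the next vacancy each mover will try
    engaged, free = {}, list(movers)
    while free:
        mover = free.pop()
        if next_choice[mover["id"]] >= len(vacancies):
            continue                              # mover exhausted all options, stays unmatched
        vacancy = prefs[mover["id"]][next_choice[mover["id"]]]
        next_choice[mover["id"]] += 1
        current = engaged.get(vacancy["id"])
        if current is None:
            engaged[vacancy["id"]] = mover        # vacancy was unclaimed
        elif similarity(mover, vacancy) > similarity(current, vacancy):
            engaged[vacancy["id"]] = mover        # better match displaces the current one
            free.append(current)
        else:
            free.append(mover)                    # rejected; will try its next choice
    return {vid: m["id"] for vid, m in engaged.items()}

movers = [
    {"id": "m1", "prev_area": "SW1", "prev_loan": 250_000},
    {"id": "m2", "prev_area": "M1", "prev_loan": 120_000},
]
vacancies = [
    {"id": "v1", "area": "SW1", "loan": 240_000},
    {"id": "v2", "area": "M1", "loan": 130_000},
]
print(stable_match(movers, vacancies))   # {'v2': 'm2', 'v1': 'm1'}
```

In a real application of this idea, the "preferences" would presumably be built from loan and borrower characteristics in the PSD, and the matching would run over millions of records rather than two.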