What does Big Data mean to public affairs research?


Ines Mergel, R. Karl Rethemeyer, and Kimberley R. Isett at LSE’s The Impact Blog: “…Big Data promises access to vast amounts of real-time information from public and private sources that should allow insights into behavioral preferences, policy options, and methods for public service improvement. In the private sector, marketing preferences can be aligned with customer insights gleaned from Big Data. In the public sector however, government agencies are less responsive and agile in their real-time interactions by design – instead using time for deliberation to respond to broader public goods. The responsiveness Big Data promises is a virtue in the private sector but could be a vice in the public.

Moreover, we raise several important concerns with respect to relying on Big Data as a decision and policymaking tool. While in the abstract Big Data is comprehensive and complete, in practice today's version of Big Data has several features that should give public sector practitioners and scholars pause. First, most of what we think of as Big Data is really 'digital exhaust' – that is, data collected for purposes other than public sector operations or research. Data sets that might be publicly available from social networking sites such as Facebook or Twitter were designed for purely technical reasons. The degree to which this data lines up conceptually and operationally with public sector questions is purely coincidental. Use of digital exhaust for purposes not previously envisioned can go awry. A good example is Google's attempt to predict the flu based on search terms.

Second, we believe there are ethical issues that may arise when researchers use data that was created as a byproduct of citizens’ interactions with each other or with a government social media account. Citizens are not able to understand or control how their data is used and have not given consent for storage and re-use of their data. We believe that research institutions need to examine their institutional review board processes to help researchers and their subjects understand important privacy issues that may arise. Too often it is possible to infer individual-level insights about private citizens from a combination of data points and thus predict their behaviors or choices.

Lastly, Big Data can only represent those that spend some part of their life online. Yet we know that certain segments of society opt in to life online (by using social media or network-connected devices), opt out (either knowingly or passively), or lack the resources to participate at all. The demography of the internet matters. For instance, researchers tend to use Twitter data because its API allows data collection for research purposes, but many forget that Twitter users are not representative of the overall population. Instead, as a recent Pew Social Media 2016 update shows, only 24% of all online adults use Twitter. Internet participation generally is biased in terms of age, educational attainment, and income – all of which correlate with gender, race, and ethnicity. We believe therefore that predictive insights are potentially biased toward certain parts of the population, making generalisations highly problematic at this time….(More)”

Improving Services—At What Cost? Examining the Ethics of Twitter Research


Case study by Sara Mannheimer, Scott W. H. Young and Doralyn Rossmann: “As social media use has become widespread, academic and corporate researchers have identified social networking services as sources of detailed information about people’s viewpoints and behaviors. Social media users share thoughts, have conversations, and build communities in open, online spaces, and researchers analyze social media data for a variety of purposes—from tracking the spread of disease (Lampos & Cristianini, 2010) to conducting market research (Patino, Pitta, & Quinones, 2012; Hornikx & Hendriks, 2015) to forecasting elections (Tumasjan et al., 2010). Twitter in particular has emerged as a leading platform for social media research, partly because user data from non-private Twitter accounts is openly accessible via an application programming interface (API). This case study describes research conducted by Montana State University (MSU) librarians to analyze the MSU Library’s Twitter community, and the ethical questions that we encountered over the course of the research. The case study will walk through our Twitter research at the MSU Library, and then suggest discussion questions to frame an ethical conversation surrounding social media research. We offer a number of areas of ethical inquiry that we recommend be engaged with as a cohesive whole….(More)”.

Making Open Data more evidence-based


Essay by Stefaan G. Verhulst and Danny Lämmerhirt: “…To realize its potential there is a need for more evidence on the full life cycle of open data – within and across settings and sectors….

In particular, three substantive areas were identified that could benefit from interdisciplinary and comparative research:

Demand and use: First, many expressed a need to become smarter about the demand and use-side of open data. Much of the focus, given the nascent nature of many initiatives around the world, has been on the supply-side of open data. Yet to be more responsive and sustainable, more insight needs to be gained into the demand and/or user needs.

Conversations repeatedly emphasized that we should differentiate between open data demand and use. Open data demand and use can be analyzed from multiple directions: 1) top-down, starting from a data provider, to intermediaries, to the end users and/or audiences; or 2) bottom-up, studying the data demands articulated by individuals (for instance, through FOIA requests), and how these demands can be taken up by intermediaries and open data providers to change what is being provided as open data.

Research should scrutinize each stage (provision, intermediation, use and demand) on its own, but also examine the interactions between stages (for instance, how may open data demand inform data supply, and how does data supply influence intermediation and use?)….

Informing data supply and infrastructure: Second, we heard on numerous occasions a call upon researchers and domain experts to help in identifying “key data” and inform the government data infrastructure needed to provide them. Principle 1 of the International Open Data Charter states that governments should provide key data “open by default”, yet the question remains how to identify “key” data (e.g., would that mean data relevant to society at large?).

Which governments (and other public institutions) should be expected to provide key data and which information do we need to better understand government’s role in providing key data? How can we evaluate progress around publishing these data coherently if countries organize the capture, collection, and publication of this data differently?…

Impact: In addition to those two focus areas – covering the supply and demand side – there was also a call to become more sophisticated about impact. Too often impact gets confused with outputs, or even activities. Given the embryonic and iterative nature of many open data efforts, signals of impact are limited and often preliminary. In addition, different types of impact (such as enhancing transparency versus generating innovation and economic growth) require different indicators and methods. At the same time, to allow for regular evaluations of what works and why, there is a need for common assessment methods that can generate comparative and directional insights….

Research Networking: Several researchers identified a need for better exchange and collaboration among the research community. This would make it possible to tackle the research questions and challenges listed above, as well as to identify gaps in existing knowledge, to develop common research methods and frameworks, and to learn from each other. Key questions posed included: how to nurture and facilitate networking among researchers and (topical) experts from different disciplines, focusing on different issues or using different methods? How are different sub-networks related to or disconnected from each other (for instance, how connected are the data4development, freedom of information, or civic tech research communities)? In addition, an interesting discussion emerged around how researchers can also network more with those who are part of their respective universe of analysis – potentially generating some kind of participatory research design….(More)”

A decentralized web would give power back to the people online


at TechCrunch: “…The original purpose of the web and internet, if you recall, was to build a common neural network which everyone can participate in equally for the betterment of humanity. Fortunately, there is an emerging movement to bring the web back to this vision and it even involves some of the key figures from the birth of the web. It’s called the Decentralised Web or Web 3.0, and it describes an emerging trend to build services on the internet which do not depend on any single “central” organisation to function.

So what happened to the initial dream of the web? Much of the altruism faded during the first dot-com bubble, as people realised that an easy way to create value on top of this neutral fabric was to build centralised services which gather, trap and monetise information.

Search Engines (e.g. Google), Social Networks (e.g. Facebook) and Chat Apps (e.g. WhatsApp) have grown huge by providing centralised services on the internet. For example, Facebook’s future vision of the internet is to provide access only to the subset of centralised services it endorses (Internet.org and Free Basics).

Meanwhile, it disables fundamental internet freedoms such as the ability to link to content via a URL (forcing you to share content only within Facebook) or the ability for search engines to index its contents (other than the Facebook search function).

The Decentralised Web envisions a future world where services such as communication, currency, publishing, social networking, search, archiving etc. are provided not by centralised services owned by single organisations, but by technologies which are powered by the people: their own community. Their users.

The core idea of decentralisation is that the operation of a service is not blindly trusted to any single omnipotent company. Instead, responsibility for the service is shared: perhaps by running across multiple federated servers, or perhaps running across client-side apps in an entirely “distributed” peer-to-peer model.

Even though the community may be “byzantine” and not have any reason to trust or depend on each other, the rules that describe the decentralised service’s behaviour are designed to force participants to act fairly in order to participate at all, relying heavily on cryptographic techniques such as Merkle trees and digital signatures to allow participants to hold each other accountable.
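The accountability role of Merkle trees mentioned above can be sketched in a few lines: hash every item, then hash adjacent pairs level by level until a single root remains. Anyone holding the root can detect tampering with any leaf without trusting a central server. The snippet below is a toy construction for illustration, not any particular protocol's tree.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the Merkle root of a list of byte strings.

    Each leaf is hashed, then adjacent pairs are hashed together
    level by level until a single root remains. Changing any leaf
    changes the root, which is how participants hold each other
    accountable without a trusted central server."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"alice->bob:5", b"bob->carol:2", b"carol->dave:1"])
print(root.hex())
```

A participant who receives the root from one peer and a leaf plus its sibling hashes from another can verify membership with only a logarithmic number of hash operations.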

There are three fundamental areas that the Decentralised Web necessarily champions: privacy, data portability and security.

  • Privacy: Decentralisation forces an increased focus on data privacy. Data is distributed across the network and end-to-end encryption technologies are critical for ensuring that only authorized users can read and write. Access to the data itself is entirely controlled algorithmically by the network as opposed to more centralized networks where typically the owner of that network has full access to data, facilitating customer profiling and ad targeting.
  • Data Portability: In a decentralized environment, users own their data and choose with whom they share this data. Moreover they retain control of it when they leave a given service provider (assuming the service even has the concept of service providers). This is important. If I want to move from General Motors to BMW today, why should I not be able to take my driving records with me? The same applies to chat platform history or health records.
  • Security: Finally, we live in a world of increased security threats. In a centralized environment, the bigger the silo, the bigger the honeypot is to attract bad actors. Decentralized environments are safer by their general nature against being hacked, infiltrated, acquired, bankrupted or otherwise compromised as they have been built to exist under public scrutiny from the outset….(More)”

Crowdsourcing investigative journalism


Convoca in Peru: “…collaborative effort is the essence of Convoca. We are a team of journalists and programmers who work with professionals from different disciplines and generations to expose facts that are hidden by networks of power and affect the lives of citizens. We rely on partnerships to publish high-impact findings from Peru, where the Amazon covers almost 60% of the country amid oil and mineral exploitation and criminal activities such as illegal logging, illegal mining and human trafficking. Fifty percent of social conflicts have their epicenter in natural resource extraction areas, where the populations and communities with the highest poverty rates live.

Over one year and seven months, Convoca has uncovered facts of public relevance, such as patterns of corruption and secrecy, by networking with journalists from Latin America and the world. The series of reports with the BRIO platform revealed the cost overruns of highways and public works in Latin American countries built by Brazilian companies financed by the National Bank of Economic and Social Development (BNDES), now under investigation in the region’s most notorious corruption scandal, ‘Lava Jato’. This research won the 2016 Journalistic Excellence Award granted by the Inter American Press Association (SIP). On a global scale, we dove into 11.5 million files of the ‘Panama Papers’ with more than a hundred media outlets and organizations led by the International Consortium of Investigative Journalists (ICIJ), exposing the world of tax havens where companies and public figures hide their fortunes.

Our work on extractive industries, ‘Excesses Unpunished’, won the world’s most important data journalism prize, the Data Journalism Awards 2016, and is a finalist for the Gabriel García Márquez Award, which recognizes the best journalism in Latin America. We invite you to be the voice of this effort to keep publishing new reports that allow citizens to make better decisions about their destinies and compel groups of power to come clean about their activities and fulfill their commitments. So join ConBoca: The Power of Citizens Call, our first fundraising campaign alongside our readers. We believe that journalism is a public service….(More)”

Social Machines: The Coming Collision of Artificial Intelligence, Social Networking, and Humanity


Book by James Hendler and Alice Mulvehill: “Will your next doctor be a human being—or a machine? Will you have a choice? If you do, what should you know before making it?

This book introduces the reader to the pitfalls and promises of artificial intelligence in its modern incarnation and the growing trend of systems to “reach off the Web” into the real world. The convergence of AI, social networking, and modern computing is creating an historic inflection point in the partnership between human beings and machines with potentially profound impacts on the future not only of computing but of our world.

AI experts and researchers James Hendler and Alice Mulvehill explore the social implications of AI systems in the context of a close examination of the technologies that make them possible. The authors critically evaluate the utopian claims and dystopian counterclaims of prognosticators. Social Machines: The Coming Collision of Artificial Intelligence, Social Networking, and Humanity is your richly illustrated field guide to the future of your machine-mediated relationships with other human beings and with increasingly intelligent machines.

What you’ll learn

• What the concept of a social machine is and how the activities of non-programmers are contributing to machine intelligence

• How modern artificial intelligence technologies, such as Watson, are evolving and how they process knowledge from both carefully produced information (such as Wikipedia or journal articles) and from big data collections

• The fundamentals of neuromorphic computing

• The fundamentals of knowledge graph search and linked data as well as the basic technology concepts that underlie networking applications such as Facebook and Twitter

• How the change in attitudes towards cooperative work on the Web, especially in the younger demographic, is critical to the future of Web applications…(More)”

How to advance open data research: Towards an understanding of demand, users, and key data


Danny Lämmerhirt and Stefaan Verhulst at IODC blog: “…Lord Kelvin’s famous quote “If you can not measure it, you can not improve it” equally applies to open data. Without more evidence of how open data contributes to meeting users’ needs and addressing societal challenges, efforts and policies toward releasing and using more data may be misinformed and based upon untested assumptions.

When done well, assessments, metrics, and audits can guide both (local) data providers and users to understand, reflect upon, and change how open data is designed. What we measure and how we measure is therefore decisive to advance open data.

Back in 2014, the Web Foundation and the GovLab at NYU brought together open data assessment experts from Open Knowledge, the Organisation for Economic Co-operation and Development, the United Nations, Canada’s International Development Research Centre, and elsewhere to explore the development of common methods and frameworks for the study of open data. It resulted in a draft template or framework for measuring open data. Despite the increased awareness of the need for more evidence-based open data approaches, open data assessment methods have advanced only slowly since 2014. At the same time, governments publish more of their data openly, and more civil society groups, civil servants, and entrepreneurs employ open data to manifold ends: the broader public may detect environmental issues and advocate for policy changes, neighbourhood projects employ data to enable marginalized communities to participate in urban planning, public institutions may enhance their information exchange, and entrepreneurs embed open data in new business models.

In 2015, the International Open Data Conference roadmap made the following recommendations on how to improve the way we assess and measure open data.

  1. Reviewing and refining the Common Assessment Methods for Open Data framework. This framework lays out four areas of inquiry: context of open data, the data published, use practices and users, as well as the impact of opening data.
  2. Developing a catalogue of assessment methods to monitor progress against the International Open Data Charter (based on the Common Assessment Methods for Open Data).
  3. Networking researchers to exchange common methods and metrics. This helps to build methodologies that are reproducible and increase credibility and impact of research.
  4. Developing sectoral assessments.

In short, the IODC called for refining our assessment criteria and metrics by connecting researchers, and applying the assessments to specific areas. It is hard to tell how much progress has been made in answering these recommendations, but there is a sense among researchers and practitioners that the first two goals are yet to be fully addressed.

Instead we have seen various disparate, yet well meaning, efforts to enhance the understanding of the release and impact of open data. A working group was created to measure progress on the International Open Data Charter, which provides governments with principles for implementing open data policies. While this working group compiled a list of studies and their methodologies, it did not (yet) deepen the common framework of definitions and criteria to assess and measure the implementation of the Charter.

In addition, there is an increase in sector- and case-specific studies that are often more descriptive and context-specific in nature, yet do respond to the need for examples that illustrate the value proposition of open data.

As such, there seems to be a disconnect between top-level frameworks and on-the-ground research, preventing the sharing of common methods and distilling replicable experiences about what works and what does not….(More)”

Scholarpedia


About: “Scholarpedia is a peer-reviewed open-access encyclopedia written and maintained by scholarly experts from around the world. Scholarpedia is inspired by Wikipedia and aims to complement it by providing in-depth scholarly treatments of academic topics.

Scholarpedia and Wikipedia are alike in many respects:

  • both allow anyone to propose revisions to almost any article
  • both are “wikis” and use the familiar MediaWiki software designed for Wikipedia
  • both allow considerable freedom within each article’s “Talk” pages
  • both are committed to the goal of making the world’s knowledge freely available to all

Nonetheless, Scholarpedia is best understood by how it is unlike most wikis, differences arising from Scholarpedia’s academic origins, goals, and audience. The most significant is Scholarpedia’s process of peer-reviewed publication: all articles in Scholarpedia are either in the process of being written by a team of authors, or have already been published and are subject to expert curation….(More)”

How Twitter gives scientists a window into human happiness and health


 at the Conversation: “Since its public launch 10 years ago, Twitter has been used as a social networking platform among friends, an instant messaging service for smartphone users and a promotional tool for corporations and politicians.

But it’s also been an invaluable source of data for researchers and scientists – like myself – who want to study how humans feel and function within complex social systems.

By analyzing tweets, we’ve been able to observe and collect data on the social interactions of millions of people “in the wild,” outside of controlled laboratory experiments.

It’s enabled us to develop tools for monitoring the collective emotions of large populations, find the happiest places in the United States and much more.

So how, exactly, did Twitter become such a unique resource for computational social scientists? And what has it allowed us to discover?

Twitter’s biggest gift to researchers

On July 15, 2006, Twittr (as it was then known) publicly launched as a “mobile service that helps groups of friends bounce random thoughts around with SMS.” The ability to send free 140-character group texts drove many early adopters (myself included) to use the platform.

With time, the number of users exploded: from 20 million in 2009 to 200 million in 2012 and 310 million today. Rather than communicating directly with friends, users would simply tell their followers how they felt, respond to news positively or negatively, or crack jokes.

For researchers, Twitter’s biggest gift has been the provision of large quantities of open data. Twitter was one of the first major social networks to provide data samples through something called Application Programming Interfaces (APIs), which enable researchers to query Twitter for specific types of tweets (e.g., tweets that contain certain words), as well as information on users.
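The researcher-facing side of such an API can be sketched with a toy keyword filter. The real APIs return JSON payloads over HTTP; to keep the sketch self-contained it runs the same kind of query over an in-memory sample, and the `user` and `text` field names are assumptions modeled loosely on the classic tweet payload rather than any current API schema.

```python
def search_tweets(tweets, keywords):
    """Return tweets whose text contains any of the given keywords
    (case-insensitive) -- the same kind of filter a researcher would
    express as a query against a search or streaming API."""
    keywords = [k.lower() for k in keywords]
    return [t for t in tweets
            if any(k in t["text"].lower() for k in keywords)]

# A small in-memory stand-in for an API response.
sample = [
    {"user": "a", "text": "Feeling great about the election results!"},
    {"user": "b", "text": "Traffic is terrible this morning."},
    {"user": "c", "text": "Election day turnout looks huge."},
]

hits = search_tweets(sample, ["election"])
print(len(hits))  # 2
```

In practice a researcher would page or stream results from the live endpoint and store them for later analysis; the filtering logic stays the same.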

This led to an explosion of research projects exploiting this data. Today, a Google Scholar search for “Twitter” produces six million hits, compared with five million for “Facebook.” The difference is especially striking given that Facebook has roughly five times as many users as Twitter (and is two years older).

Twitter’s generous data policy undoubtedly led to some excellent free publicity for the company, as interesting scientific studies got picked up by the mainstream media.

Studying happiness and health

With traditional census data slow and expensive to collect, open data feeds like Twitter have the potential to provide a real-time window to see changes in large populations.

The University of Vermont’s Computational Story Lab was founded in 2006 and studies problems across applied mathematics, sociology and physics. Since 2008, the Story Lab has collected billions of tweets through Twitter’s “Gardenhose” feed, an API that streams a random sample of 10 percent of all public tweets in real time.

I spent three years at the Computational Story Lab and was lucky to be a part of many interesting studies using this data. For example, we developed a hedonometer that measures the happiness of the Twittersphere in real time. By focusing on geolocated tweets sent from smartphones, we were able to map the happiest places in the United States. Perhaps unsurprisingly, we found Hawaii to be the happiest state and wine-growing Napa the happiest city for 2013.
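The hedonometer idea itself is simple to sketch: score each word of a text against a happiness lexicon and average. The real instrument uses the crowd-rated labMT list of roughly 10,000 words; the handful of scores below are illustrative stand-ins, not the published values.

```python
# A toy fragment of a word-happiness lexicon. The actual hedonometer
# uses thousands of words rated from 1 (sad) to 9 (happy) by survey
# respondents; these particular scores are placeholders.
happiness = {
    "happy": 8.3, "love": 8.4, "beach": 7.0,
    "rain": 4.0, "traffic": 3.2, "hate": 2.2,
}

def text_happiness(text):
    """Average the happiness scores of recognized words in a text.
    Returns None when no scored words appear."""
    scores = [happiness[w] for w in text.lower().split() if w in happiness]
    return sum(scores) / len(scores) if scores else None

print(text_happiness("love the beach"))    # (8.4 + 7.0) / 2 = 7.7
print(text_happiness("stuck in traffic"))  # 3.2
```

Averaging this score over millions of geolocated tweets per region is what turns a word list into a population-level happiness map.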

A map of 13 million geolocated U.S. tweets from 2013, colored by happiness, with red indicating happiness and blue indicating sadness. PLOS ONE, Author provided

These studies had deeper applications: Correlating Twitter word usage with demographics helped us understand underlying socioeconomic patterns in cities. For example, we could link word usage with health factors like obesity, so we built a lexicocalorimeter to measure the “caloric content” of social media posts. Tweets from a particular region that mentioned high-calorie foods increased the “caloric content” of that region, while tweets that mentioned exercise activities decreased our metric. We found that this simple measure correlates with other health and well-being metrics. In other words, tweets were able to give us a snapshot, at a specific moment in time, of the overall health of a city or region.
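The lexicocalorimeter logic can be sketched the same way: food words contribute "calories in", activity words contribute "calories out", and a region's score is the balance over its tweets. The calorie values below are placeholders, not the ones derived in the published study.

```python
# Illustrative calorie values. The published lexicocalorimeter derives
# these from nutrition and exercise databases; these numbers are
# placeholders for the sketch.
calories_in = {"pizza": 285, "donut": 250, "salad": 100}
calories_out = {"running": 300, "cycling": 250, "walking": 150}

def caloric_balance(tweets):
    """Net 'caloric content' of a set of tweets: total calories from
    food mentions minus total calories from exercise mentions."""
    balance = 0
    for text in tweets:
        for w in text.lower().split():
            balance += calories_in.get(w, 0)
            balance -= calories_out.get(w, 0)
    return balance

region = ["grabbing pizza and a donut", "went running this morning"]
print(caloric_balance(region))  # 285 + 250 - 300 = 235
```

Computed per city or region, such a balance is the kind of simple measure that can then be correlated with other health and well-being metrics.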

Using the richness of Twitter data, we’ve also been able to see people’s daily movement patterns in unprecedented detail. Understanding human mobility patterns, in turn, has the capacity to transform disease modeling, opening up the new field of digital epidemiology….(More)”

Bridging data gaps for policymaking: crowdsourcing and big data for development


 for the DevPolicyBlog: “…By far the biggest innovation in data collection is the ability to access and analyse (in a meaningful way) user-generated data. This is data that is generated from forums, blogs, and social networking sites, where users purposefully contribute information and content in a public way, but also from everyday activities that inadvertently or passively provide data to those that are able to collect it.

User-generated data can help identify user views and behaviour to inform policy in a timely way rather than just relying on traditional data collection techniques (census, household surveys, stakeholder forums, focus groups, etc.), which are often cumbersome, very costly, untimely, and in many cases require some form of approval or support by government.

It might seem at first that user-generated data has limited usefulness in a development context due to the importance of the internet in generating this data combined with limited internet availability in many places. However, U-Report is one example of being able to access user-generated data independent of the internet.

U-Report was initiated by UNICEF Uganda in 2011 and is a free SMS-based platform where Ugandans are able to register as “U-Reporters” and on a weekly basis give their views on topical issues (mostly related to health, education, and access to social services) or participate in opinion polls. As an example, Figure 1 shows the result from a U-Report poll on whether polio vaccinators came to U-Reporter houses to immunise all children under 5 in Uganda, broken down by districts. Presently, there are more than 300,000 U-Reporters in Uganda and more than one million U-Reporters across 24 countries that now have U-Report. As an indication of its potential impact on policymaking, UNICEF claims that every Member of Parliament in Uganda is signed up to receive U-Report statistics.

Figure 1: U-Report Uganda poll results


U-Report and other platforms such as Ushahidi (which supports, for example, I PAID A BRIBE, Watertracker, election monitoring, and crowdmapping) facilitate crowdsourcing of data where users contribute data for a specific purpose. In contrast, “big data” is a broader concept because the purpose of using the data is generally independent of the reasons why the data was generated in the first place.

Big data for development is a new phrase that we will probably hear a lot more (see here [pdf] and here). The United Nations Global Pulse, for example, supports a number of innovation labs which work on projects that aim to discover new ways in which data can help better decision-making. Many forms of “big data” are unstructured (free-form and text-based rather than table- or spreadsheet-based) and so a number of analytical techniques are required to make sense of the data before it can be used.

Measures of Twitter activity, for example, can be a real-time indicator of food price crises in Indonesia [pdf] (see Figure 2 below which shows the relationship between food-related tweet volume and food inflation: note that the large volume of tweets in the grey highlighted area is associated with policy debate on cutting the fuel subsidy rate) or provide a better understanding of the drivers of immunisation awareness. In these examples, researchers “text-mine” Twitter feeds by extracting tweets related to topics of interest and categorising text based on measures of sentiment (positive, negative, anger, joy, confusion, etc.) to better understand opinions and how they relate to the topic of interest. For example, Figure 3 shows the sentiment of tweets related to vaccination in Kenya over time and the dates of important vaccination related events.
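That text-mining step (filter tweets by topic, then label each by counting lexicon words) can be sketched as follows. The sentiment word lists here are illustrative; the studies cited above rely on curated, validated lexicons.

```python
# A toy sentiment lexicon; real studies use curated resources with
# crowd- or expert-rated words. These entries are illustrative only.
sentiment_words = {
    "positive": {"safe", "effective", "protect", "good"},
    "negative": {"dangerous", "fear", "bad", "risky"},
}

def mine_topic(tweets, topic):
    """Keep tweets mentioning the topic, then label each by counting
    positive vs. negative lexicon words."""
    labeled = []
    for text in tweets:
        words = set(text.lower().split())
        if topic not in words:
            continue
        pos = len(words & sentiment_words["positive"])
        neg = len(words & sentiment_words["negative"])
        label = ("positive" if pos > neg
                 else "negative" if neg > pos
                 else "neutral")
        labeled.append((text, label))
    return labeled

tweets = [
    "vaccination is safe and effective",
    "worried about vaccination side effects",
    "vaccination drives fear in some communities",
]
for text, label in mine_topic(tweets, "vaccination"):
    print(label, "-", text)
```

Aggregating such labels over time is what produces sentiment curves like those plotted against vaccination events or food price announcements.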

Figure 2: Plot of monthly food-related tweet volume and official food price statistics


Figure 3: Sentiment of vaccine related tweets in Kenya


Another big data example is the use of mobile phone data to monitor the movement of populations in Senegal in 2013. The data can help to identify changes in the mobility patterns of vulnerable population groups and thereby provide an early warning system to inform humanitarian response efforts.

The development of mobile banking too offers the potential for the generation of a staggering amount of data relevant for development research and informing policy decisions. However, it also highlights the public good nature of data collected by public and private sector institutions and the reliance that researchers have on them to access the data. Building trust and a reputation for being able to manage privacy and commercial issues will be a major challenge for researchers in this regard….(More)”