Five Headlines from a Big Month for the Data Revolution


Sarah T. Lucas at Post2015.org: “If the history of the data revolution were written today, it would include three major dates. May 2013, when the High Level Panel on the Post-2015 Development Agenda first coined the phrase “data revolution.” November 2014, when the UN Secretary-General’s Independent Expert Advisory Group (IEAG) set a vision for it. And April 2015, when five headliner stories pushed the data revolution from great idea to a concrete roadmap for action.

The April 2015 Data Revolution Headlines

1. The African Data Consensus puts Africa in the lead on bringing the data revolution to the regional level. The Africa Data Consensus (ADC) envisions “a profound shift in the way that data is harnessed to impact on development decision-making, with a particular emphasis on building a culture of usage.” The ADC finds consensus across 15 “data communities”—ranging from open data to official statistics to geospatial data—and is endorsed by Africa’s ministers of finance. The ADC gets top billing in my book, as the first contribution that truly reflects a large diversity of voices and creates a political hook for action. (Stay tuned for a blog from my colleague Rachel Quint on the ADC).

2. The Sustainable Development Solutions Network (SDSN) gets our minds (and wallets) around the data needed to measure the SDGs. The SDSN Needs Assessment for SDG Monitoring and Statistical Capacity Development maps the investments needed to improve official statistics. My favorite parts are the clear typology of data (see pg. 12), and that the authors are very open about the methods, assumptions, and leaps of faith they had to take in the costing exercise. They also start an important discussion about how advances in information and communications technology, satellite imagery, and other new technologies have the potential to expand coverage, increase analytic capacity, and reduce the cost of data systems.

3. The Overseas Development Institute (ODI) calls on us to find the “missing millions.” ODI’s The Data Revolution: Finding the Missing Millions presents the stark reality of data gaps and what they mean for understanding and addressing development challenges. The authors highlight that even that most fundamental of measures—of poverty levels—could be understated by as much as a quarter. And that’s just the beginning. The report also pushes us to think beyond the costs of data, and focus on how much good data can save. With examples of data lowering the cost of doing government business, the authors remind us to think about data as an investment with real economic and social returns.

4. Paris21 offers a roadmap for putting national statistical offices (NSOs) at the heart of the data revolution. Paris21’s Roadmap for a Country-Led Data Revolution does not mince words. It calls on the data revolution to “turn a vicious cycle of [NSO] underperformance and inadequate resources into a virtuous one where increased demand leads to improved performance and an increase in resources and capacity.” It makes the case for why NSOs are central and need more support, while also pushing them to modernize, innovate, and open up. The roadmap gets my vote for best design. This ain’t your grandfather’s statistics report!

5. The Cartagena Data Festival features real-live data heroes and fosters new partnerships. The Festival featured data innovators (such as terra-i using satellite data to track deforestation), NSOs on the leading edge of modernization and reform (such as Colombia and the Philippines), traditional actors using old data in new ways (such as the Inter-American Development Bank’s fantastic energy database), groups focused on citizen-generated data (such as The Data Shift and UN My World), private firms working with big data for social good (such as Telefónica), and many others—all reminding us that the data revolution is well underway and will not be stopped. Most importantly, it brought these actors together in one place. You could see the sparks flying as folks learned from each other and hatched plans together. The Festival gets my vote for best conference of a lifetime, with the perfect blend of substantive sessions, intense debate, learning, inspiration, new connections, and a lot of fun. (Stay tuned for a post from my colleague Kristen Stelljes and me for more on Cartagena).

This month full of headlines leaves no room for doubt—momentum is building fast on the data revolution. And just in time.

With the Financing for Development (FFD) conference in Addis Ababa in July, the agreement of the Sustainable Development Goals in New York in September, and the Climate Summit in Paris in December, this is a big political year for global development. Data revolutionaries must seize this moment to push past vision, past roadmaps, to actual action and results…(More)”

Data Fusion Heralds City Attractiveness Ranking


Emerging Technology From the arXiv: “The ability of any city to attract visitors is an important metric for town planners, businesses based on tourism, traffic planners, residents, and so on. And there are increasingly varied ways of measuring this thanks to the growing volumes of city-related data generated by social media and location-based services.

So it’s only natural that researchers would like to draw these data sets together to see what kind of insight they can get from this form of data fusion.

And so it has turned out thanks to the work of Stanislav Sobolevsky at MIT and a few buddies. These guys have fused three wildly different data sets related to the attractiveness of a city, which allows them to rank these places, to understand why people visit them, and to see what they do when they get there.

The work focuses exclusively on cities in Spain using data that is relatively straightforward to gather. The first data set consists of the number of credit and debit card transactions carried out by visitors to cities throughout Spain during 2011. This includes each card’s country of origin, which allows Sobolevsky and co to count only those transactions made by foreign visitors—a total of 17 million anonymized transactions from 8.6 million foreign visitors from 175 different countries.

The second data set consists of over 3.5 million photos and videos taken in Spain and posted to Flickr by people living in other countries. These pictures were taken between 2005 and 2014 by 16,000 visitors from 112 countries.

The last data set consists of around 700,000 geotagged tweets posted in Spain during 2012. These were posted by 16,000 foreign visitors from 112 countries.

Finally, the team defined a city’s attractiveness, at least for the purposes of this study, as the total number of pictures, tweets and card transactions that took place within it…
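That definition of attractiveness can be sketched in a few lines. The city names and counts below are invented for illustration (they are not the study’s figures); the point is only that the fused score is a simple sum of the three per-city counts, which can then be ranked.

```python
# Hypothetical sketch of the attractiveness metric described above:
# a city's score is the sum of foreign visitors' card transactions,
# Flickr photos, and geotagged tweets recorded there.

def attractiveness(transactions, photos, tweets):
    """Combine the three per-city counts into a single fused score."""
    return transactions + photos + tweets

# toy per-city counts: (card transactions, Flickr photos, tweets)
cities = {
    "Barcelona": (120_000, 45_000, 15_000),
    "Madrid": (90_000, 30_000, 12_000),
    "Valencia": (20_000, 8_000, 3_000),
}

# rank cities by their fused score, most attractive first
ranking = sorted(cities, key=lambda c: attractiveness(*cities[c]), reverse=True)
print(ranking)  # → ['Barcelona', 'Madrid', 'Valencia']
```

A real analysis would normalize the three signals (17 million transactions dwarf 700,000 tweets) before summing, but the ranking idea is the same.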

That’s interesting work that shows how the fusion of big data sets can provide insights into the way people use cities. It has its limitations, of course. The study does not address the reasons why people find cities attractive and what draws them there in the first place. For example, are they there for tourism, for business, or for some other reason? That would require more specialized data.

But it does provide a general picture of attractiveness that could be a start for more detailed analyses. As such, this work is just a small part of a new science of cities based on big data, but one that shows how much is becoming possible with just a little number crunching.

Ref: arxiv.org/abs/1504.06003: Scaling of city attractiveness for foreign visitors through big data of human economic and social media activity”

How Not to Drown in Numbers


Seth Stephens-Davidowitz in the New York Times: “BIG data will save the world. How often have we heard that over the past couple of years? We’re pretty sure both of us have said something similar dozens of times in the past few months.

If you’re trying to build a self-driving car or detect whether a picture has a cat in it, big data is amazing. But here’s a secret: If you’re trying to make important decisions about your health, wealth or happiness, big data is not enough.

The problem is this: The things we can measure are never exactly what we care about. Just trying to get a single, easy-to-measure number higher and higher (or lower and lower) doesn’t actually help us make the right choice. For this reason, the key question isn’t “What did I measure?” but “What did I miss?”

So what can big data do to help us make big decisions? One of us, Alex, is a data scientist at Facebook. The other, Seth, is a former data scientist at Google. There is a special sauce necessary to making big data work: surveys and the judgment of humans — two seemingly old-fashioned approaches that we will call small data….(More)”

Urban Data Games: creating smart citizens for smart cities


Paper by Wolff, Annika; Kortuem, Gerd and Cavero, Jose: “A bottom-up approach to smart cities places citizens in an active role of contributing, analysing and interpreting data in pursuit of tackling local urban challenges and building a more sustainable future city. This vision can only be realised if citizens have sufficient data literacy skills and experience of large, complex, messy, ever expanding data sets. Schools typically focus on teaching data handling skills using small, personally collected data sets obtained through scientific experimentation, leading to a gap between what is being taught and what will be needed as big data and analytics become more prevalent. This paper proposes an approach to teaching data literacy in the context of urban innovation tasks, using the idea of Urban Data Games. These are supported by a set of training data and resources that will be used in school trials for exploring the problems people have when dealing with large data and trialling novel approaches for teaching data literacy….(More)”

A map for Big Data research in Digital Humanities


Article by Frederic Kaplan in Frontiers: “This article is an attempt to represent Big Data research in Digital Humanities as a structured research field. A division in three concentric areas of study is presented. Challenges in the first circle – focusing on the processing and interpretations of large cultural datasets – can be organized linearly following the data processing pipeline. Challenges in the second circle – concerning digital culture at large – can be structured around the different relations linking massive datasets, large communities, collective discourses, global actors and the software medium. Challenges in the third circle – dealing with the experience of big data – can be described within a continuous space of possible interfaces organized around three poles: immersion, abstraction and language. By identifying research challenges in all these domains, the article illustrates how this initial cartography could be helpful to organize the exploration of the various dimensions of Big Data Digital Humanities research….(More)”

How Data Mining could have prevented Tunisia’s Terror attack in Bardo Museum


Wassim Zoghlami at Medium: “…Data mining is the process of posing queries and extracting useful patterns or trends, often previously unknown, from large amounts of data using various techniques such as those from pattern recognition and machine learning. Lately there has been great interest in leveraging data mining for counter-terrorism applications.

Using data on more than 50,000 ISIS-connected Twitter accounts, I was able to establish an understanding of some of the factors that determine how often ISIS attacks occur, what different types of terror strikes are used in which geopolitical situations, and many other criteria, through graphs of the frequency of hashtag usage and the frequency of particular groups of words used in the tweets.

A simple data mining project on some of the repetitive hashtags and sequences of words typically used by ISIS militants in their tweets yielded surprising results. The results show a rise of some keywords in the tweets starting from March 15, three days before the Bardo Museum attack.

Some of the common frequent keywords and hashtags that showed an unusual peak starting March 15, three days before the attack:

#طواغيت تونس : Tyrants of Tunisia = a reference to the military

بشرى تونس : Good news for Tunisia.

قريبا تونس : Soon in Tunisia.

#إفريقية_للإعلام : The head of social media of Afriqiyah

#غزوة_تونس : The foray of Tunis…

Big Data and Data Mining should be used for national security intelligence

The Tunisian national security services have to leverage big data to predict such attacks as the volume of digital data grows. One of the challenges facing data mining techniques is that, to carry out effective data mining and extract useful information for counterterrorism and national security, we need to gather all kinds of information about individuals. However, this information could be a threat to individuals’ privacy and civil liberties…(More)”

Health Big Data in the Commercial Context


CDT Press Release: “This paper is the third in a series of three, each of which explores health big data in a different context. The first — on health big data in the government context — is available here, and the second — on health big data in the clinical context — is available here.

Consumers are increasingly using mobile phone apps and wearable devices to generate and share data on health and wellness. They are using personal health record tools to access and copy health records and move them to third party platforms. They are sharing health information on social networking sites. They leave digital health footprints when they conduct online searches for health information. The health data created, accessed, and shared by consumers using these and many other tools can range from detailed clinical information, such as downloads from an implantable device and details about medication regimens, to data about weight, caloric intake, and exercise logged with a smart phone app.

These developments offer a wealth of opportunities for health care and personal wellness. However, privacy questions arise due to the volume and sensitivity of health data generated by consumer-focused apps, devices, and platforms, including the potential analytics uses that can be made of such data.

Many of the privacy issues that face traditional health care entities in the big data era also apply to app developers, wearable device manufacturers, and other entities not part of the traditional health care ecosystem. These include questions of data minimization, retention, and secondary use. Notice and consent pose challenges, especially given the limits of presenting notices on mobile device screens, and the fact that consumer devices may be bought and used without consultation with a health care professional. Security is a critical issue as well.

However, the privacy and security provisions of the Health Insurance Portability and Accountability Act (HIPAA) do not apply to most app developers, device manufacturers or others in the consumer health space. This has benefits for innovation, as innovators would otherwise have to struggle with the complicated HIPAA rules. However, the current vacuum also leaves innovators without clear guidance on how to appropriately and effectively protect consumers’ health data. Given the promise of health apps, consumer devices, and consumer-facing services, and given the sensitivity of the data that they collect and share, it is important to provide such guidance….

As the source of privacy guidelines, we look to the framework provided by the Fair Information Practice Principles (FIPPs) and explore how it could be applied in an age of big data to patient-generated data. The FIPPs have influenced to varying degrees most modern data privacy regimes. While some have questioned the continued validity of the FIPPs in the current era of mass data collection and analysis, we consider here how the flexibility and rigor of the FIPPs provide an organizing framework for responsible data governance, promoting innovation, efficiency, and knowledge production while also protecting privacy. Rather than proposing an entirely new framework for big data, which could be years in the making at best, using the FIPPs would seem the best approach in promoting responsible big data practices. Applying the FIPPs could also help synchronize practices between the traditional health sector and emerging consumer products….(More)”

Big Other: Surveillance Capitalism and the Prospects of an Information Civilization


New paper by Shoshana Zuboff in the Journal of Information Technology: “This article describes an emergent logic of accumulation in the networked sphere, ‘surveillance capitalism,’ and considers its implications for ‘information civilization.’ Google is to surveillance capitalism what General Motors was to managerial capitalism. Therefore the institutionalizing practices and operational assumptions of Google Inc. are the primary lens for this analysis as they are rendered in two recent articles authored by Google Chief Economist Hal Varian. Varian asserts four uses that follow from computer-mediated transactions: ‘data extraction and analysis,’ ‘new contractual forms due to better monitoring,’ ‘personalization and customization,’ and ‘continuous experiments.’ An examination of the nature and consequences of these uses sheds light on the implicit logic of surveillance capitalism and the global architecture of computer mediation upon which it depends. This architecture produces a distributed and largely uncontested new expression of power that I christen: ‘Big Other.’ It is constituted by unexpected and often illegible mechanisms of extraction, commodification, and control that effectively exile persons from their own behavior while producing new markets of behavioral prediction and modification. Surveillance capitalism challenges democratic norms and departs in key ways from the centuries-long evolution of market capitalism….(More)”

The big medical data miss: challenges in establishing an open medical resource


Eric J. Topol in Nature: “I call for an international open medical resource to provide a database for every individual’s genomic, metabolomic, microbiomic, epigenomic and clinical information. This resource is needed in order to facilitate genetic diagnoses and transform medical care.

“We are each, in effect, one-person clinical trials”

Laurie Becklund was a noted journalist who died in February 2015 at age 66 from breast cancer. Soon thereafter, the Los Angeles Times published her op-ed entitled “As I lay dying” (Ref. 1). She lamented, “We are each, in effect, one-person clinical trials. Yet the knowledge generated from those trials will die with us because there is no comprehensive database of metastatic breast cancer patients, their characteristics and what treatments did and didn’t help them”. She went on to assert that, in the era of big data, the lack of such a resource is “criminal”, and she is absolutely right….

Around the same time as this important op-ed, the MIT Technology Review published their issue entitled “10 Breakthrough Technologies 2015” and on the list was the “Internet of DNA” (Ref. 2). While we are often reminded that the world we live in is becoming the “Internet of Things”, I have not seen this terminology applied to DNA before. The article on the “Internet of DNA” decried, “the unfolding calamity in genomics is that a great deal of life-saving information, though already collected, is inaccessible”. It called for a global network of millions of genomes and cited the Matchmaker Exchange as a frontrunner. For this international initiative, a growing number of research and clinical teams have come together to pool and exchange phenotypic and genotypic data for individual patients with rare disorders, in order to share this information and assist in the molecular diagnosis of individuals with rare diseases….

an Internet of DNA — or what I have referred to as a massive, open, online medicine resource (MOOM) — would help to quickly identify the genetic cause of the disorder (Ref. 4) and, in the process of doing so, precious guidance for prevention, if necessary, would become available for such families who are currently left in the lurch as to their risk of suddenly dying.

So why aren’t such MOOMs being assembled? ….

There has also been much discussion related to privacy concerns that patients might be unwilling to participate in a massive medical information resource. However, multiple global consumer surveys have shown that more than 80% of individuals are ready to share their medical data provided that they are anonymized and their privacy maximally assured (Ref. 4). Indeed, just 24 hours into Apple’s ResearchKit initiative, a smartphone-based medical research programme, there were tens of thousands of patients with Parkinson disease, asthma or heart disease who had signed on. Some individuals are even willing to be “open source” — that is, to make their genetic and clinical data fully available with free access online, without any assurance of privacy. This willingness is seen by the participants in the recently launched Open Humans initiative. Along with the Personal Genome Project, Go Viral and American Gut have joined in this initiative. Still, studies suggest that most individuals would only agree to be medical research participants if their identities would not be attainable. Unfortunately, to date, little has been done to protect individual medical privacy, for which there are both promising new data protection technological approaches (Ref. 4) and the need for additional governmental legislation.

This leaves us with perhaps the major obstacle that is holding back the development of MOOMs — researchers. Even with big, team-science research projects pulling together hundreds of investigators and institutions throughout the world, such as the Global Alliance for Genomics and Health (GA4GH), the data obtained clinically are just as Laurie Becklund asserted in her op-ed — “one-person clinical trials” (Ref. 1). While undertaking the construction of a MOOM is a huge endeavour, there is little motivation for researchers to take on this task, as it currently offers no academic credit and has no funding source. But the transformative potential of MOOMs to improve medical care is extraordinary. Rather than having the knowledge die with each of us, the time has come to take down the walls of academic medical centres and health-care systems around the world, and create a global medical knowledge resource that leverages each individual’s information to help one another…(More)”

A New Source of Data for Public Health Surveillance: Facebook Likes


Paper by Steven Gittelman et al in the Journal of Medical Internet Research: “The development of the Internet and the explosion of social media have provided many new opportunities for health surveillance. The use of the Internet for personal health and participatory health research has exploded, largely due to the availability of online resources and health care information technology applications [1-8]. These online developments, plus a demand for more timely, widely available, and cost-effective data, have led to new ways epidemiological data are collected, such as digital disease surveillance and Internet surveys [8-25]. Over the past 2 decades, Internet technology has been used to identify disease outbreaks, track the spread of infectious disease, monitor self-care practices among those with chronic conditions, and to assess, respond, and evaluate natural and artificial disasters at a population level [6,8,11,12,14,15,17,22,26-28]. Use of these modern communication tools for public health surveillance has proven to be less costly and more timely than traditional population surveillance modes (eg, mail surveys, telephone surveys, and face-to-face household surveys).

The Internet has spawned several sources of big data, such as Facebook [29], Twitter [30], Instagram [31], Tumblr [32], Google [33], and Amazon [34]. These online communication channels and marketplaces provide a wealth of passively collected data that may be mined for purposes of public health, such as sociodemographic characteristics, lifestyle behaviors, and social and cultural constructs. Moreover, researchers have demonstrated that these digital data sources can be used to predict otherwise unavailable information, such as sociodemographic characteristics among anonymous Internet users [35-38]. For example, Goel et al [36] found no difference by demographic characteristics in the usage of social media and email. However, the frequency with which individuals accessed the Web for news, health care, and research was a predictor of gender, race/ethnicity, and educational attainment, potentially providing useful targeting information based on ethnicity and income [36]. Integrating these big data sources into the practice of public health surveillance is vital to move the field of epidemiology into the 21st century as called for in the 2012 US “Big Data Research and Development Initiative” [19,39].
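The prediction idea above (inferring a demographic label from how often users access news, health, or research content) can be illustrated with a deliberately simple sketch. This is not the method of Goel et al or Gittelman et al: it is a nearest-centroid classifier on invented access-frequency data, meant only to show how passively collected behavioral features can carry demographic signal.

```python
# Illustrative nearest-centroid classifier on invented behavioral features
# (weekly access frequencies for news, health, and research content).
from math import dist

# toy training data: feature vectors per (hypothetical) demographic group
train = {
    "group_a": [(9.0, 1.0, 6.0), (8.0, 2.0, 7.0)],
    "group_b": [(2.0, 6.0, 1.0), (1.0, 7.0, 2.0)],
}

def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return tuple(sum(xs) / len(xs) for xs in zip(*points))

centroids = {label: centroid(pts) for label, pts in train.items()}

def predict(features):
    """Assign the label whose centroid is closest (Euclidean) to the features."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))

print(predict((7.5, 1.5, 5.0)))  # → group_a
```

Real studies of this kind use far richer feature sets and models, but the core move is the same: behavioral traces stand in for survey questions that were never asked.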

Understanding how big data can be used to predict lifestyle behavior and health-related data is a step toward the use of these electronic data sources for epidemiologic needs…(More)”