Imagining Data Without Division


Thomas Lin in Quanta Magazine: “As science dives into an ocean of data, the demands of large-scale interdisciplinary collaborations are growing increasingly acute…Seven years ago, when David Schimel was asked to design an ambitious data project called the National Ecological Observatory Network, it was little more than a National Science Foundation grant. There was no formal organization, no employees, no detailed science plan. Emboldened by advances in remote sensing, data storage and computing power, NEON sought answers to the biggest question in ecology: How do global climate change, land use and biodiversity influence natural and managed ecosystems and the biosphere as a whole?…
For projects like NEON, interpreting the data is a complicated business. Early on, the team realized that its data, while mid-size compared with the largest physics and biology projects, would be big in complexity. “NEON’s contribution to big data is not in its volume,” said Steve Berukoff, the project’s assistant director for data products. “It’s in the heterogeneity and spatial and temporal distribution of data.”
Unlike the roughly 20 critical measurements in climate science or the vast but relatively structured data in particle physics, NEON will have more than 500 quantities to keep track of, from temperature, soil and water measurements to insect, bird, mammal and microbial samples to remote sensing and aerial imaging. Much of the data is highly unstructured and difficult to parse — for example, taxonomic names and behavioral observations, which are sometimes subject to debate and revision.
And, as daunting as the looming data crush appears from a technical perspective, some of the greatest challenges are wholly nontechnical. Many researchers say the big science projects and analytical tools of the future can succeed only with the right mix of science, statistics, computer science, pure mathematics and deft leadership. In the big data age of distributed computing — in which enormously complex tasks are divided across a network of computers — the question remains: How should distributed science be conducted across a network of researchers?
Part of the adjustment involves embracing “open science” practices, including open-source platforms and data analysis tools, data sharing and open access to scientific publications, said Chris Mattmann, 32, who helped develop a precursor to Hadoop, a popular open-source data analysis framework that is used by tech giants like Yahoo, Amazon and Apple and that NEON is exploring. Without developing shared tools to analyze big, messy data sets, Mattmann said, each new project or lab will squander precious time and resources reinventing the same tools. Likewise, sharing data and published results will obviate redundant research.
To this end, international representatives from the newly formed Research Data Alliance met this month in Washington to map out their plans for a global open data infrastructure.”

Digital Participation – The Case of the Italian 'Dialogue with Citizens'


New paper by Gianluca Sgueo presented at Democracy and Technology – Europe in Tension from the 19th to the 21st Century – Sorbonne Paris, 2013: “This paper focuses on the initiative named “Dialogue With Citizens” that the Italian Government introduced in 2012. The Dialogue was an entirely web-based experiment of participatory democracy aimed at, first, informing citizens through documents and in-depth analysis and, second, designed for answering to their questions and requests. During the year and half of life of the initiative roughly 90.000 people wrote (approximately 5000 messages/month). Additionally, almost 200.000 participated in a number of public online consultations that the government launched in concomitance with the adoption of crucial decisions (i.e. the spending review national program).
From the analysis of this experiment of participatory democracy three questions can be raised. (1) How can a public institution maximize the profits of participation and minimize its costs? (2) How can public administrations manage the (growing) expectations of the citizens once they become accustomed to participation? (3) Is online participatory democracy going to develop further, and why?
In order to fully answer such questions, the paper proceeds as follows: it will initially provide a general overview of online public participation both at the central and the local level. It will then discuss the “Dialogue with Citizens” and a selected number of online public consultations led by the Italian government in 2012. The conclusions will develop a theoretical framework for reflection on the peculiarities and problems of the web-participation.”

Mobile phone data are a treasure-trove for development


Paul van der Boor and Amy Wesolowski in SciDevNet: “Each of us generates streams of digital information — a digital ‘exhaust trail’ that provides real-time information to guide decisions that affect our lives. For example, Google informs us about traffic by using both its ‘My Location’ feature on mobile phones and third-party databases to aggregate location data. BBVA, one of Spain’s largest banks, analyses transactions such as credit card payments as well as ATM withdrawals to find out when and where peak spending occurs. This type of data harvest is of great value. But, often, there is so much data that its owners lack the know-how to process it and fail to realise its potential value to policymakers.
Meanwhile, many countries, particularly in the developing world, have a dearth of information. In resource-poor nations, the public sector often lives in an analogue world where piles of paper impede operations and policymakers are hindered by uncertainty about their own strengths and capabilities. Nonetheless, mobile phones have quickly pervaded the lives of even the poorest: 75 per cent of the world’s 5.5 billion mobile subscriptions are in emerging markets. These people are also generating digital trails of anything from their movements to mobile phone top-up patterns. It may seem that putting this information to use would take vast analytical capacity. But using relatively simple methods, researchers can analyse existing mobile phone data, especially in poor countries, to improve decision-making.
Think of existing, available data as low-hanging fruit that we — two graduate students — could analyse in less than a month. This is not a test of data-scientist prowess, but more a way of saying that anyone could do it.
There are three areas that should be ‘low-hanging fruit’ in terms of their potential to dramatically improve decision-making in information-poor countries: coupling healthcare data with mobile phone data to predict disease outbreaks; using mobile phone money transactions and top-up data to assess economic growth; and predicting travel patterns after a natural disaster using historical movement patterns from mobile phone data to design robust response programmes.
Another possibility is using call-data records to analyse urban movement to identify traffic congestion points. Nationally, this can be used to prioritise infrastructure projects such as road expansion and bridge building.
The information that these analyses could provide would be lifesaving — not just informative or revenue-increasing, like much of this work currently performed in developed countries.
But some work of high social value is being done. For example, different teams of European and US researchers are trying to estimate the links between mobile phone use and regional economic development. They are using various techniques, such as merging night-time satellite imagery from NASA with mobile phone data to create behavioural fingerprints. They have found that this may be a cost-effective way to understand a country’s economic activity and, potentially, guide government spending.
Another example is given by researchers (including one of this article’s authors) who have analysed call-data records from subscribers in Kenya to understand malaria transmission within the country and design better strategies for its elimination. [1]
In this study, published in Science, the location data of the mobile phones of more than 14 million Kenyan subscribers was combined with national malaria prevalence data. After identifying the sources and sinks of malaria parasites and overlaying these with phone movements, analysis was used to identify likely transmission corridors. UK scientists later used similar methods to create different epidemic scenarios for the Côte d’Ivoire.”
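The source-and-sink idea in the malaria example can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in the Science study: the function name `net_parasite_export`, the region labels and all numbers are hypothetical. It weights trips between regions by the prevalence at the origin, then labels a region a net source if it exports more expected infections than it imports.

```python
from collections import defaultdict


def net_parasite_export(trips, prevalence):
    """Return {region: expected infected trips exported minus imported}.

    trips:      {(origin, destination): number of trips}  (hypothetical input)
    prevalence: {region: fraction of people infected}      (hypothetical input)
    A positive value suggests a net source, a negative value a net sink.
    """
    net = defaultdict(float)
    for (origin, destination), count in trips.items():
        # Assumption: travellers carry infection at their origin's prevalence.
        carried = count * prevalence[origin]
        net[origin] += carried
        net[destination] -= carried
    return dict(net)


if __name__ == "__main__":
    # Made-up figures for two regions, purely for illustration.
    trips = {("A", "B"): 1000, ("B", "A"): 1000}
    prevalence = {"A": 0.20, "B": 0.01}
    for region, value in net_parasite_export(trips, prevalence).items():
        role = "source" if value > 0 else "sink"
        print(region, role, round(value, 1))
```

A real analysis would need far more than this, for example trip durations, seasonality and uncertainty in the prevalence estimates.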

Three Paradoxes of Big Data


New Paper by Neil M. Richards and Jonathan H. King in the Stanford Law Review Online: “Big data is all the rage. Its proponents tout the use of sophisticated analytics to mine large data sets for insight as the solution to many of our society’s problems. These big data evangelists insist that data-driven decisionmaking can now give us better predictions in areas ranging from college admissions to dating to hiring to medicine to national security and crime prevention. But much of the rhetoric of big data contains no meaningful analysis of its potential perils, only the promise. We don’t deny that big data holds substantial potential for the future, and that large dataset analysis has important uses today. But we would like to sound a cautionary note and pause to consider big data’s potential more critically. In particular, we want to highlight three paradoxes in the current rhetoric about big data to help move us toward a more complete understanding of the big data picture. First, while big data pervasively collects all manner of private information, the operations of big data itself are almost entirely shrouded in legal and commercial secrecy. We call this the Transparency Paradox. Second, though big data evangelists talk in terms of miraculous outcomes, this rhetoric ignores the fact that big data seeks to identify at the expense of individual and collective identity. We call this the Identity Paradox. And third, the rhetoric of big data is characterized by its power to transform society, but big data has power effects of its own, which privilege large government and corporate entities at the expense of ordinary individuals. We call this the Power Paradox. Recognizing the paradoxes of big data, which show its perils alongside its potential, will help us to better understand this revolution. It may also allow us to craft solutions to produce a revolution that will be as good as its evangelists predict.”

Vint Cerf: Freedom and the Social Contract


Vinton G. Cerf in the Communications of the ACM: “The last several weeks (as of this writing) have been filled with disclosures of intelligence practices in the U.S. and elsewhere. Edward Snowden’s unauthorized release of highly classified information has stirred a great deal of debate about national security and the means used to preserve it.
In the midst of all this, I looked to Jean-Jacques Rousseau’s well-known 18th-century writings on the Social Contract (Du Contrat Social, Ou Principes du Droit Politique) for insight. Distilled and interpreted through my perspective, I took away several notions. One is that in a society, to achieve a degree of safety and stability, we as individuals give up some absolute freedom of action to what Rousseau called the sovereign will of the people. He did not equate this to government, which he argued was distinct and derived its power from the sovereign people.
I think it may be fair to say that most of us would not want to live in a society that had no limits to individual behavior. In such a society, there would be no limit to the potential harm an individual could visit upon others. In exchange for some measure of stability and safety, we voluntarily give up absolute freedom in exchange for the rule of law. In Rousseau’s terms, however, the laws must come from the sovereign people, not from the government. We approximate this in most modern societies creating representative government using public elections to populate the key parts of the government.”

(Appropriate) Big Data for Climate Resilience?


Amy Luers at the Stanford Social Innovation Review: “The answer to whether big data can help communities build resilience to climate change is yes—there are huge opportunities, but there are also risks.

Opportunities

  • Feedback: Strong negative feedback is core to resilience. A simple example is our body’s response to heat stress—sweating, which is a natural feedback to cool down our body. In social systems, feedbacks are also critical for maintaining functions under stress. For example, communication by affected communities after a hurricane provides feedback for how and where organizations and individuals can provide help. While this kind of feedback used to rely completely on traditional communication channels, now crowdsourcing and data mining projects, such as Ushahidi and Twitter Earthquake detector, enable faster and more-targeted relief.
  • Diversity: Big data is enhancing diversity in a number of ways. Consider public health systems. Health officials are increasingly relying on digital detection methods, such as Google Flu Trends or Flu Near You, to augment and diversify traditional disease surveillance.
  • Self-Organization: A central characteristic of resilient communities is the ability to self-organize. This characteristic must exist within a community (see the National Research Council Resilience Report), not something you can impose on it. However, social media and related data-mining tools (InfoAmazonia, Healthmap) can enhance situational awareness and facilitate collective action by helping people identify others with common interests, communicate with them, and coordinate efforts.

Risks

  • Eroding trust: Trust is well established as a core feature of community resilience. Yet the NSA PRISM escapade made it clear that big data projects are raising privacy concerns and possibly eroding trust. And it is not just an issue in government. For example, Target analyzes shopping patterns and can fairly accurately guess if someone in your family is pregnant (which is awkward if they know your daughter is pregnant before you do). When our trust in government, business, and communities weakens, it can decrease a society’s resilience to climate stress.
  • Mistaking correlation for causation: Data mining seeks meaning in patterns that are completely independent of theory (suggesting to some that theory is dead). This approach can lead to erroneous conclusions when correlation is mistakenly taken for causation. For example, one study demonstrated that data mining techniques could show a strong (however spurious) correlation between the changes in the S&P 500 stock index and butter production in Bangladesh. While interesting, a decision support system based on this correlation would likely prove misleading.
  • Failing to see the big picture: One of the biggest challenges with big data mining for building climate resilience is its overemphasis on the hyper-local and hyper-now. While this hyper-local, hyper-now information may be critical for business decisions, without a broader understanding of the longer-term and more-systemic dynamism of social and biophysical systems, big data provides no ability to understand future trends or anticipate vulnerabilities. We must not let our obsession with the here and now divert us from slower-changing variables such as declining groundwater, loss of biodiversity, and melting ice caps—all of which may silently define our future. A related challenge is the fact that big data mining tends to overlook the most vulnerable populations. We must not let the lure of the big data microscope on the “well-to-do” populations of the world make us blind to the less well-off populations within cities and communities that have more limited access to smart phones and the Internet.”
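The correlation-versus-causation risk listed above is easy to reproduce. The sketch below is an illustration under stated assumptions and is not the S&P 500 study itself: it generates one random "target" series, then searches many other random series, all independent of it by construction, for the strongest correlation. The function names, the random-walk model and the parameter values are choices made here for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, steps):
    """Cumulative sum of independent Gaussian steps."""
    level, path = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def best_spurious_match(n_candidates=1000, steps=100, seed=0):
    """Strongest correlation between a target walk and unrelated walks."""
    rng = random.Random(seed)
    target = random_walk(rng, steps)
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is generated independently of the target.
        r = pearson(target, random_walk(rng, steps))
        if abs(r) > abs(best):
            best = r
    return best


if __name__ == "__main__":
    print(f"strongest correlation found: {best_spurious_match():+.2f}")
```

Because the search keeps only the best of many candidates, it will typically report a strong correlation even though no series has any causal link to the target. That is the trap a theory-free decision support system can fall into.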

Frontiers in Massive Data Analysis


New report from the National Academy of Sciences: “Data mining of massive data sets is transforming the way we think about crisis response, marketing, entertainment, cybersecurity and national intelligence. Collections of documents, images, videos, and networks are being thought of not merely as bit strings to be stored, indexed, and retrieved, but as potential sources of discovery and knowledge, requiring sophisticated analysis techniques that go far beyond classical indexing and keyword counting, aiming to find relational and semantic interpretations of the phenomena underlying the data.
Frontiers in Massive Data Analysis examines the frontier of analyzing massive amounts of data, whether in a static database or streaming through a system. Data at that scale–terabytes and petabytes–is increasingly common in science (e.g., particle physics, remote sensing, genomics), Internet commerce, business analytics, national security, communications, and elsewhere. The tools that work to infer knowledge from data at smaller scales do not necessarily work, or work well, at such massive scale. New tools, skills, and approaches are necessary, and this report identifies many of them, plus promising research directions to explore. Frontiers in Massive Data Analysis discusses pitfalls in trying to infer knowledge from massive data, and it characterizes seven major classes of computation that are common in the analysis of massive data. Overall, this report illustrates the cross-disciplinary knowledge–from computer science, statistics, machine learning, and application disciplines–that must be brought to bear to make useful inferences from massive data.”

Political Scientists Acknowledge Need to Make Stronger Case for Their Field


Beth McMurtrie in The Chronicle of Higher Education: “Back in March, Congress limited federal support for political-science research by the National Science Foundation to projects that promote national security or American economic interests. That decision was a victory for Sen. Tom Coburn, a Republican from Oklahoma who has long aimed to eliminate all NSF grants for political science, arguing that unlike the hard sciences it rarely produces concrete benefits to society.
Congress’s action has led to soul searching within the discipline about how effective academics have been in conveying the value of their work to the public. It has also revived a longstanding debate among political scientists about the shift toward more statistically sophisticated, mathematically esoteric research, and its usefulness outside of academe. Those discussions were out front at the annual conference of the American Political Science Association, held here last week.
Rogers M. Smith, a political-science professor at the University of Pennsylvania, was one of 13 members of a panel that discussed the controversy over NSF money for political-science studies. He put the problem bluntly: “We need to make a better case for ourselves.”
Few on the panel, in fact, seemed to think that political science had done a good job on that front. The association has created a task force—led by Arthur Lupia, a political-science professor at the University of Michigan at Ann Arbor—to improve public perceptions of political science’s value. He said his colleagues could learn from organizations like the American Association for the Advancement of Science, which holds special sessions for the news media at its annual conference to explain the work of its members to the public.”

White House: "We Want Your Input on Building a More Open Government"


Nick Sinai at the White House Blog: “…We are proud of this progress, but recognize that there is always more we can do to build a more efficient, effective, and accountable government. In that spirit, the Obama Administration has committed to develop a second National Action Plan on Open Government: “NAP 2.0.”
In order to develop a Plan with the most creative and ambitious solutions, we need all-hands-on-deck. That’s why we are asking for your input on what should be in the NAP 2.0:

  1. How can we better encourage and enable the public to participate in government and increase public integrity? For example, in the first National Action Plan, we required Federal enforcement agencies to make publicly available compliance information easily accessible, downloadable and searchable online – helping the public to hold the government and regulated entities accountable.
  • What other kinds of government information should be made more available to help inform decisions in your communities or in your lives?
  • How would you like to be able to interact with Federal agencies making decisions which impact where you live?
  • How can the Federal government better ensure broad feedback and public participation when considering a new policy?
  2. The American people must be able to trust that their Government is doing everything in its power to stop wasteful practices and earn a high return on every tax dollar that is spent. How can the government better manage public resources?
  • What suggestions do you have to help the government achieve savings while also improving the way that government operates?
  • What suggestions do you have to improve transparency in government spending?
  3. The American people deserve a Government that is responsive to their needs, makes information readily accessible, and leverages Federal resources to help foster innovation both in the public and private sector. How can the government more effectively work in collaboration with the public to improve services?
  • What are your suggestions for ways the government can better serve you when you are seeking information or help in trying to receive benefits?
  • In the past few years, the government has promoted the use of “grand challenges,” ambitious yet achievable goals to solve problems of national priority, and incentive prizes, where the government identifies challenging problems and provides prizes and awards to the best solutions submitted by the public. Are there areas of public services that you think could be especially benefited by a grand challenge or incentive prize?
  • What information or data could the government make more accessible to help you start or improve your business?

Please think about these questions and send your thoughts to opengov@ostp.gov by September 23. We will post a summary of your submissions online in the future.”

How Mechanical Turkers Crowdsourced a Huge Lexicon of Links Between Words and Emotion


The Physics arXiv Blog: “Sentiment analysis on the social web depends on how a person’s state of mind is expressed in words. Now a new database of the links between words and emotions could provide a better foundation for this kind of analysis.


One of the buzzphrases associated with the social web is sentiment analysis. This is the ability to determine a person’s opinion or state of mind by analysing the words they post on Twitter, Facebook or some other medium.
Much has been promised with this method—the ability to measure satisfaction with politicians, movies and products; the ability to better manage customer relations; the ability to create dialogue for emotion-aware games; the ability to measure the flow of emotion in novels; and so on.
The idea is to entirely automate this process—to analyse the firehose of words produced by social websites using advanced data mining techniques to gauge sentiment on a vast scale.
But all this depends on how well we understand the emotion and polarity (whether negative or positive) that people associate with each word or combinations of words.
Today, Saif Mohammad and Peter Turney at the National Research Council Canada in Ottawa unveil a huge database of words and their associated emotions and polarity, which they have assembled quickly and inexpensively using Amazon’s crowdsourcing Mechanical Turk website. They say this crowdsourcing mechanism makes it possible to increase the size and quality of the database quickly and easily…. The result is a comprehensive word-emotion lexicon for over 10,000 words or two-word phrases which they call EmoLex….
The bottom line is that sentiment analysis can only ever be as good as the database on which it relies. With EmoLex, analysts have a new tool for their box of tricks.”
Ref: arxiv.org/abs/1308.6297: Crowdsourcing a Word-Emotion Association Lexicon
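
To make the dependence on the lexicon concrete, here is a minimal sketch of lexicon-based emotion counting. It is an illustration only: the entries in `TOY_LEXICON` are made up for this example and are not taken from EmoLex, and the function name and tokenisation rule are assumptions. It handles single words only, whereas the lexicon described above also covers two-word phrases.

```python
import re
from collections import Counter

# Toy, made-up entries for illustration only; NOT taken from EmoLex.
TOY_LEXICON = {
    "happy": {"joy"},
    "delighted": {"joy"},
    "furious": {"anger"},
    "afraid": {"fear"},
    "awful": {"anger", "sadness"},
}


def emotion_counts(text, lexicon=TOY_LEXICON):
    """Count the emotions associated with each known word in the text."""
    # Naive tokenisation: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        # Words missing from the lexicon contribute nothing.
        counts.update(lexicon.get(word, ()))
    return counts


if __name__ == "__main__":
    print(emotion_counts("I was happy, then furious; the ending was awful."))
```

Any word the lexicon does not cover is silently ignored, which is why the size and quality of the underlying database set the ceiling for this kind of analysis.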