The Value of Personal Data


The Digital Enlightenment Yearbook 2013 is dedicated this year to Personal Data: “The value of personal data has traditionally been understood in ethical terms as a safeguard for personality rights such as human dignity and privacy. However, we have entered an era where personal data are mined, traded and monetized in the process of creating added value – often in terms of free services including efficient search, support for social networking and personalized communications. This volume investigates whether the economic value of personal data can be realized without compromising privacy, fairness and contextual integrity. It brings scholars and scientists from the disciplines of computer science, law and social science together with policymakers, engineers and entrepreneurs with practical experience of implementing personal data management.
The resulting collection will be of interest to anyone concerned about privacy in our digital age, especially those working in the field of personal information management, whether academics, policymakers, or those working in the private sector.”

Using Big Data to Ask Big Questions


Chase Davis in the SOURCE: “First, let’s dispense with the buzzwords. Big Data isn’t what you think it is: Every federal campaign contribution over the last 30-plus years amounts to several tens of millions of records. That’s not Big. Neither is a dataset of 50 million Medicare records. Or even 260 gigabytes of files related to offshore tax havens—at least not when Google counts its data in exabytes. No, the stuff we analyze in pursuit of journalism and app-building is downright tiny by comparison.
But you know what? That’s ok. Because while super-smart Silicon Valley PhDs are busy helping Facebook crunch through petabytes of user data, they’re also throwing off intellectual exhaust that we can benefit from in the journalism and civic data communities. Most notably: the ability to ask Big Questions.
Most of us who analyze public data for fun and profit are familiar with small questions. They’re focused, incisive, and often have the kind of black-and-white, definitive answers that end up in news stories: How much money did Barack Obama raise in 2012? Is the murder rate in my town going up or down?
Big Questions, on the other hand, are speculative, exploratory, and systemic. As the name implies, they are also answered at scale: Rather than distilling a small slice of a dataset into a concrete answer, Big Questions look at entire datasets and reveal small questions you wouldn’t have thought to ask.
Can we track individual campaign donor behavior over decades, and what does that tell us about their influence in politics? Which neighborhoods in my city are experiencing spikes in crime this week, and are police changing patrols accordingly?
Or, by way of example, how often do interest groups propose cookie-cutter bills in state legislatures?

Looking at Legislation

Even if you don’t follow politics, you probably won’t be shocked to learn that lawmakers don’t always write their own bills. In fact, interest groups sometimes write them word-for-word.
Sometimes those groups even try to push their bills in multiple states. The conservative American Legislative Exchange Council has gotten some press, but liberal groups, social and business interests, and even sororities and fraternities have done it too.
On its face, something about elected officials signing their names to cookie-cutter bills runs head-first against people’s ideal of deliberative Democracy—hence, it tends to make news. Those can be great stories, but they’re often limited in scope to a particular bill, politician, or interest group. They’re based on small questions.
Data science lets us expand our scope. Rather than focusing on one bill, or one interest group, or one state, why not ask: How many model bills were introduced in all 50 states, period, by anyone, during the last legislative session? No matter what they’re about. No matter who introduced them. No matter where they were introduced.
Now that’s a Big Question. And with some basic data science, it’s not particularly hard to answer—at least at a superficial level.

Analyze All the Things!

Just for kicks, I tried building a system to answer this question earlier this year. It was intended as an example, so I tried to choose methods that would make intuitive sense. But it also makes liberal use of techniques applied often to Big Data analysis: k-means clustering, matrices, graphs, and the like.
If you want to follow along, the code is here….
To make exploration a little easier, my code represents similar bills in graph space, shown at the top of this article. Each dot (known as a node) represents a bill. And a line connecting two bills (known as an edge) means they were sufficiently similar, according to my criteria (a cosine similarity of 0.75 or above). Thrown into visualization software like Gephi, it’s easy to click around the clusters and see what pops out. So what do we find?
There are 375 clusters in total. Because of the limitations of our data, many of them represent vague, subject-specific bills that just happen to have similar titles even though the legislation itself is probably very different (think things like “Budget Bill” and “Campaign Finance Reform”). This is where having full bill text would come in handy.
But mixed in with those bills are a handful of interesting nuggets. Several bills that appear to be modeled after legislation by the National Conference of Insurance Legislators appear in multiple states, among them: a bill related to limited lines travel insurance; another related to unclaimed insurance benefits; and one related to certificates of insurance.”
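
The graph construction described in the excerpt (bills as nodes, an edge wherever cosine similarity is 0.75 or above, clusters read off the resulting graph) can be sketched roughly as follows. This is a minimal illustration, not Davis's code: the bill titles are invented, the vectors are plain word counts (the article does not say how text was weighted), and a "cluster" is assumed here to mean a connected group of two or more bills. Only the 0.75 threshold comes from the source.

```python
import math
import re
from collections import Counter
from itertools import combinations

THRESHOLD = 0.75  # similarity cutoff quoted in the article


def vectorize(title):
    """Turn a bill title into a bag-of-words count vector (an assumption)."""
    return Counter(re.findall(r"[a-z]+", title.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(titles, threshold=THRESHOLD):
    """Return index pairs (i, j) of titles whose similarity meets the cutoff."""
    vectors = [vectorize(t) for t in titles]
    return [
        (i, j)
        for i, j in combinations(range(len(titles)), 2)
        if cosine_similarity(vectors[i], vectors[j]) >= threshold
    ]


def clusters(n_nodes, edges):
    """Group nodes into connected components; keep those with 2+ bills."""
    neighbors = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, groups = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if len(group) > 1:
            groups.append(sorted(group))
    return groups


# Hypothetical titles, invented for illustration only.
titles = [
    "An act relating to limited lines travel insurance",
    "An act relating to limited lines travel insurance producers",
    "An act concerning unclaimed life insurance benefits",
    "An act establishing a state budget",
]
edges = build_edges(titles)
print(edges)
print(clusters(len(titles), edges))
```

With these made-up titles, only the two travel insurance bills are linked. The other pairs share a few generic words ("an", "act", "insurance") and fall well below the cutoff, which also hints at why title-only data produces the vague matches the excerpt mentions.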

Commons at the Intersection of Peer Production, Citizen Science, and Big Data: Galaxy Zoo


New paper by Michael J. Madison: “The knowledge commons research framework is applied to a case of commons governance grounded in research in modern astronomy. The case, Galaxy Zoo, is a leading example of at least three different contemporary phenomena. In the first place Galaxy Zoo is a global citizen science project, in which volunteer non-scientists have been recruited to participate in large-scale data analysis via the Internet. In the second place Galaxy Zoo is a highly successful example of peer production, sometimes known colloquially as crowdsourcing, by which data are gathered, supplied, and/or analyzed by very large numbers of anonymous and pseudonymous contributors to an enterprise that is centrally coordinated or managed. In the third place Galaxy Zoo is a highly visible example of data-intensive science, sometimes referred to as e-science or Big Data science, by which scientific researchers develop methods to grapple with the massive volumes of digital data now available to them via modern sensing and imaging technologies. This chapter synthesizes these three perspectives on Galaxy Zoo via the knowledge commons framework.”

Defining Open Data


Open Knowledge Foundation Blog: “Open data is data that can be freely used, shared and built-on by anyone, anywhere, for any purpose. This is the summary of the full Open Definition which the Open Knowledge Foundation created in 2005 to provide both a succinct explanation and a detailed definition of open data.
As the open data movement grows, and even more governments and organisations sign up to open data, it becomes ever more important that there is a clear and agreed definition for what “open data” means if we are to realise the full benefits of openness, and avoid the risks of creating incompatibility between projects and splintering the community.

Open can apply to information from any source and about any topic. Anyone can release their data under an open licence for free use by and benefit to the public. Although we may think mostly about government and public sector bodies releasing public information such as budgets or maps, or researchers sharing their results data and publications, any organisation can open information (corporations, universities, NGOs, startups, charities, community groups and individuals).

There is open information in transport, science, products, education, sustainability, maps, legislation, libraries, economics, culture, development, business, design, finance …. So the explanation of what open means applies to all of these information sources and types. Open may also apply both to data – big data and small data – or to content, like images, text and music!
So here we set out clearly what open means, and why this agreed definition is vital for us to collaborate, share and scale as open data and open content grow and reach new communities.

What is Open?

The full Open Definition provides a precise definition of what open data is. There are 2 important elements to openness:

  • Legal openness: you must be allowed to get the data legally, to build on it, and to share it. Legal openness is usually provided by applying an appropriate (open) license which allows for free access to and reuse of the data, or by placing data into the public domain.
  • Technical openness: there should be no technical barriers to using that data. For example, providing data as printouts on paper (or as tables in PDF documents) makes the information extremely difficult to work with. So the Open Definition has various requirements for “technical openness,” such as requiring that data be machine readable and available in bulk.”…

The Science Behind Using Online Communities To Change Behavior


Sean D. Young in TechCrunch: “Although social media and online communities might have been developed for people to connect and share information, recent research shows that these technologies are really helpful in changing behaviors. My colleagues and I in the medical school, for instance, created online communities designed to improve health by getting people to do things, such as test for HIV, stop using methamphetamines, and just de-stress and relax. We don’t handpick people to join because we think they’ll love the technology; that’s not how science works. We invite them because the technology is relevant to them — they’re engaging in drugs, sex and other behaviors that might put themselves and others at risk. It’s our job to create the communities in a way that engages them enough to want to stay and participate. Yes, we do offer to pay them $30 to complete an hour-long survey, but then they are free to collect their money and never talk to us again. But for some reason, they stay in the group and decide to be actively engaged with strangers.
So how do we create online communities that keep people engaged and change their behaviors? Our starting point is to understand and address their psychological needs….
Throughout our research, we find that newly created online communities can change people’s behaviors by addressing the following psychological needs:
The Need to Trust. Sharing our thoughts, experiences, and difficulties with others makes us feel closer to others and increases our trust. When we trust people, we’re more open-minded, more willing to learn, and more willing to change our behavior. In our studies, we found that sharing personal information (even something as small as describing what you did today) can help increase trust and change behavior.
The Need to Fit In. Most of us inherently strive to fit in. Social norms, or other people’s attitudes and behaviors, heavily influence our own attitudes and behaviors. Each time a new online community or group forms, it creates its own set of social norms and expectations for how people should behave. Most people are willing to change their attitudes and/or behavior to fit these group norms and fit in with the community.
The Need for Self-Worth. When people feel good about themselves, they are more open to change and feel empowered to be able to change their behavior. When an online community is designed to have people support and care for each other, they can help to increase self-esteem.
The Need to Be Rewarded for Good Behavior. Anyone who has trained a puppy knows that you can get him to keep sitting as long as you keep the treats flowing to reward him, but if you want to wean him off the treats and really train him then you’ll need to begin spacing out the treats to make them less predictable. Well, people aren’t that different from animals in that way and can be trained with reinforcements too. For example, “liking” people’s communications when they immediately join a network, and then progressively spacing out the time that their posts are liked (psychologists call this variable reinforcement) can be incorporated onto social network platforms to encourage them to keep posting content. Eventually, these behaviors become habits.
The Need to Feel Empowered. While increasing self-esteem makes people feel good about themselves, increasing empowerment helps them know they have the ability to change. Creating a sense of empowerment is one of the most powerful predictors of whether people will change their behavior. Belonging to a network of people who are changing their own behaviors, support our needs, and are confident in our changing our behavior empowers us and gives us the ability to change our behavior.”

Imagining Data Without Division


Thomas Lin in Quanta Magazine: “As science dives into an ocean of data, the demands of large-scale interdisciplinary collaborations are growing increasingly acute…Seven years ago, when David Schimel was asked to design an ambitious data project called the National Ecological Observatory Network, it was little more than a National Science Foundation grant. There was no formal organization, no employees, no detailed science plan. Emboldened by advances in remote sensing, data storage and computing power, NEON sought answers to the biggest question in ecology: How do global climate change, land use and biodiversity influence natural and managed ecosystems and the biosphere as a whole?…
For projects like NEON, interpreting the data is a complicated business. Early on, the team realized that its data, while mid-size compared with the largest physics and biology projects, would be big in complexity. “NEON’s contribution to big data is not in its volume,” said Steve Berukoff, the project’s assistant director for data products. “It’s in the heterogeneity and spatial and temporal distribution of data.”
Unlike the roughly 20 critical measurements in climate science or the vast but relatively structured data in particle physics, NEON will have more than 500 quantities to keep track of, from temperature, soil and water measurements to insect, bird, mammal and microbial samples to remote sensing and aerial imaging. Much of the data is highly unstructured and difficult to parse — for example, taxonomic names and behavioral observations, which are sometimes subject to debate and revision.
And, as daunting as the looming data crush appears from a technical perspective, some of the greatest challenges are wholly nontechnical. Many researchers say the big science projects and analytical tools of the future can succeed only with the right mix of science, statistics, computer science, pure mathematics and deft leadership. In the big data age of distributed computing — in which enormously complex tasks are divided across a network of computers — the question remains: How should distributed science be conducted across a network of researchers?
Part of the adjustment involves embracing “open science” practices, including open-source platforms and data analysis tools, data sharing and open access to scientific publications, said Chris Mattmann, 32, who helped develop a precursor to Hadoop, a popular open-source data analysis framework that is used by tech giants like Yahoo, Amazon and Apple and that NEON is exploring. Without developing shared tools to analyze big, messy data sets, Mattmann said, each new project or lab will squander precious time and resources reinventing the same tools. Likewise, sharing data and published results will obviate redundant research.
To this end, international representatives from the newly formed Research Data Alliance met this month in Washington to map out their plans for a global open data infrastructure.”

Using Participatory Crowdsourcing in South Africa to Create a Safer Living Environment


New Paper by Bhaveer Bhana, Stephen Flowerday, and Aharon Satt in the International Journal of Distributed Sensor Networks: “The increase in urbanisation is making the management of city resources a difficult task. Data collected through observations (utilising humans as sensors) of the city surroundings can be used to improve decision making in terms of managing these resources. However, the data collected must be of a certain quality in order to ensure that effective and efficient decisions are made. This study is focused on the improvement of emergency and non-emergency services (city resources) through the use of participatory crowdsourcing (humans as sensors) as a data collection method (collect public safety data), utilising voice technology in the form of an interactive voice response (IVR) system.
The study illustrates how participatory crowdsourcing (specifically humans as sensors) can be used as a Smart City initiative focusing on public safety by illustrating what is required to contribute to the Smart City, and developing a roadmap in the form of a model to assist decision making when selecting an optimal crowdsourcing initiative. Public safety data quality criteria were developed to assess and identify the problems affecting data quality.
This study is guided by design science methodology and applies three driving theories: the Data Information Knowledge Action Result (DIKAR) model, the characteristics of a Smart City, and a credible Data Quality Framework. Four critical success factors were developed to ensure high quality public safety data is collected through participatory crowdsourcing utilising voice technologies.”

Mobile phone data are a treasure-trove for development


Paul van der Boor and Amy Wesolowski in SciDevNet: “Each of us generates streams of digital information — a digital ‘exhaust trail’ that provides real-time information to guide decisions that affect our lives. For example, Google informs us about traffic by using both its ‘My Location’ feature on mobile phones and third-party databases to aggregate location data. BBVA, one of Spain’s largest banks, analyses transactions such as credit card payments as well as ATM withdrawals to find out when and where peak spending occurs. This type of data harvest is of great value. But, often, there is so much data that its owners lack the know-how to process it and fail to realise its potential value to policymakers.
Meanwhile, many countries, particularly in the developing world, have a dearth of information. In resource-poor nations, the public sector often lives in an analogue world where piles of paper impede operations and policymakers are hindered by uncertainty about their own strengths and capabilities. Nonetheless, mobile phones have quickly pervaded the lives of even the poorest: 75 per cent of the world’s 5.5 billion mobile subscriptions are in emerging markets. These people are also generating digital trails of anything from their movements to mobile phone top-up patterns. It may seem that putting this information to use would take vast analytical capacity. But using relatively simple methods, researchers can analyse existing mobile phone data, especially in poor countries, to improve decision-making.
Think of existing, available data as low-hanging fruit that we — two graduate students — could analyse in less than a month. This is not a test of data-scientist prowess, but more a way of saying that anyone could do it.
There are three areas that should be ‘low-hanging fruit’ in terms of their potential to dramatically improve decision-making in information-poor countries: coupling healthcare data with mobile phone data to predict disease outbreaks; using mobile phone money transactions and top-up data to assess economic growth; and predicting travel patterns after a natural disaster using historical movement patterns from mobile phone data to design robust response programmes.
Another possibility is using call-data records to analyse urban movement to identify traffic congestion points. Nationally, this can be used to prioritise infrastructure projects such as road expansion and bridge building.
The information that these analyses could provide would be lifesaving — not just informative or revenue-increasing, like much of this work currently performed in developed countries.
But some work of high social value is being done. For example, different teams of European and US researchers are trying to estimate the links between mobile phone use and regional economic development. They are using various techniques, such as merging night-time satellite imagery from NASA with mobile phone data to create behavioural fingerprints. They have found that this may be a cost-effective way to understand a country’s economic activity and, potentially, guide government spending.
Another example is given by researchers (including one of this article’s authors) who have analysed call-data records from subscribers in Kenya to understand malaria transmission within the country and design better strategies for its elimination. [1]
In this study, published in Science, the location data of the mobile phones of more than 14 million Kenyan subscribers was combined with national malaria prevalence data. After identifying the sources and sinks of malaria parasites and overlaying these with phone movements, analysis was used to identify likely transmission corridors. UK scientists later used similar methods to create different epidemic scenarios for the Côte d’Ivoire.”

Prizes and Productivity: How Winning the Fields Medal Affects Scientific Output


New NBER working paper by George J. Borjas and Kirk B. Doran: “Knowledge generation is key to economic growth, and scientific prizes are designed to encourage it. But how does winning a prestigious prize affect future output? We compare the productivity of Fields medalists (winners of the top mathematics prize) to that of similarly brilliant contenders. The two groups have similar publication rates until the award year, after which the winners’ productivity declines. The medalists begin to “play the field,” studying unfamiliar topics at the expense of writing papers. It appears that tournaments can have large post-prize effects on the effort allocation of knowledge producers.”

The Contours of Crowd Capability


New paper by Prashant Shukla and John Prpić: “The existence of dispersed knowledge has been a subject of inquiry for more than six decades. Despite the longevity of this rich research tradition, the “knowledge problem” has remained largely unresolved both in research and practice, and remains “the central theoretical problem of all social science”. However, in the 21st century, organizations are presented with opportunities through technology to potentially benefit from the dispersed knowledge problem to some extent. One such opportunity is represented by the recent emergence of a variety of crowd-engaging information systems (IS).
In this vein, Crowdsourcing is being widely studied in numerous contexts, and the knowledge generated from these IS phenomena is well-documented. At the same time, other organizations are leveraging dispersed knowledge by putting in place IS-applications such as Prediction Markets to gather large sample-size forecasts from within and without the organization. Similarly, we are also observing many organizations using IS-tools such as “Wikis” to access the knowledge of dispersed populations within the boundaries of the organization. Further still, other organizations are applying gamification techniques to accumulate Citizen Science knowledge from the public at large through IS.
Among these seemingly disparate phenomena, a complex ecology of crowd-engaging IS has emerged, involving millions of people all around the world generating knowledge for organizations through IS. However, despite the obvious scale and reach of this emerging crowd-engagement paradigm, there are no examples of research (as far as we know), that systematically compares and contrasts a large variety of these existing crowd-engaging IS-tools in one work. Understanding this current state of affairs, we seek to address this significant research void by comparing and contrasting a number of the crowd-engaging forms of IS currently available for organizational use.

To achieve this goal, we employ the Theory of Crowd Capital as a lens to systematically structure our investigation of crowd-engaging IS. Employing this parsimonious lens, we first explain how Crowd Capital is generated through Crowd Capability in organizations. Taking this conceptual platform as a point of departure, in Section 3, we offer an array of examples of IS currently in use in modern practice to generate Crowd Capital. We compare and contrast these emerging IS techniques using the Crowd Capability construct, therein highlighting some important choices that organizations face when entering the crowd-engagement fray. This comparison, which we term “The Contours of Crowd Capability”, can be used by decision-makers and researchers alike, to differentiate among the many extant methods of Crowd Capital generation. At the same time, our comparison also illustrates some important differences to be found in the internal organizational processes that accompany each form of crowd-engaging IS. In section 4, we conclude with a discussion of the limitations of our work.”