big data

Five myths about big data

Curated on August 17, 2013August 3, 2018 by Stefaan Verhulst

Samuel Arbesman, senior scholar at the Ewing Marion Kauffman Foundation and the author of “The Half-Life of Facts” in the Washington Post: “Big data holds the promise of harnessing huge amounts of information to help us better understand the world. But when talking about big data, there’s a tendency to fall into hyperbole. It is what compels contrarians to write such tweets as “Big Data, n.: the belief that any sufficiently large pile of s— contains a pony.” Let’s deflate the hype.
1. “Big data” has a clear definition.
The term “big data” has been in circulation since at least the 1990s, when it is believed to have originated in Silicon Valley. IBM offers a seemingly simple definition: Big data is characterized by the four V’s of volume, variety, velocity and veracity. But the term is thrown around so often, in so many contexts — science, marketing, politics, sports — that its meaning has become vague and ambiguous….
2. Big data is new.
By many accounts, big data exploded onto the scene quite recently. “If wonks were fashionistas, big data would be this season’s hot new color,” a Reuters report quipped last year. In a May 2011 report, the McKinsey Global Institute declared big data “the next frontier for innovation, competition, and productivity.”
It’s true that today we can mine massive amounts of data — textual, social, scientific and otherwise — using complex algorithms and computer power. But big data has been around for a long time. It’s just that exhaustive datasets were more exhausting to compile and study in the days when “computer” meant a person who performed calculations….
3. Big data is revolutionary.
In their new book, “Big Data: A Revolution That Will Transform How We Live, Work, and Think,”Viktor Mayer-Schonberger and Kenneth Cukier compare “the current data deluge” to the transformation brought about by the Gutenberg printing press.
If you want more precise advertising directed toward you, then yes, big data is revolutionary. Generally, though, it’s likely to have a modest and gradual impact on our lives….
4. Bigger data is better.
In science, some admittedly mind-blowing big-data analyses are being done. In business, companies are being told to “embrace big data before your competitors do.” But big data is not automatically better.
Really big datasets can be a mess. Unless researchers and analysts can reduce the number of variables and make the data more manageable, they get quantity without a whole lot of quality. Give me some quality medium data over bad big data any day…
5. Big data means the end of scientific theories.
Chris Anderson argued in a 2008 Wired essay that big data renders the scientific method obsolete: Throw enough data at an advanced machine-learning technique, and all the correlations and relationships will simply jump out. We’ll understand everything.
But you can’t just go fishing for correlations and hope they will explain the world. If you’re not careful, you’ll end up with spurious correlations. Even more important, to contend with the “why” of things, we still need ideas, hypotheses and theories. If you don’t have good questions, your results can be silly and meaningless.
Having more data won’t substitute for thinking hard, recognizing anomalies and exploring deep truths.”

Big data, crowdsourcing and machine learning tackle Parkinson’s

Curated on August 2, 2013August 3, 2018 by Stefaan Verhulst

Successful Workingplace: “Parkinson’s is a very tough disease to fight. People suffering from the disease often have significant tremors that keep them from being able to create accurate records of their daily challenges. Without this information, doctors are unable to fine tune drug dosages and other treatment regimens that can significantly improve the lives of sufferers.
It was a perfect catch-22 situation until recently, when the Michael J. Fox Foundation announced that LIONsolver, a company specializing in machine learning software, was able to differentiate Parkinson’s patients from healthy individuals and to also show the trend in symptoms of the disease over time.…
To set up the competition, the Foundation worked with Kaggle, an organization that specializes in crowdsourced big data analysis competitions. The use of crowdsourcing as a way to get to the heart of very difficult Big Data problems works by allowing people the world over from a myriad of backgrounds and with diverse experiences to devote time on personally chosen challenges where they can bring the most value. It’s a genius idea for bringing some of the scarcest resources together with the most intractable problems.”

Orwell is drowning in data: the volume problem

Curated on July 31, 2013August 15, 2018 by Stefaan Verhulst

Dom Shaw in OpenDemocracy: “During World War II, whilst Bletchley Park laboured in the front line of code breaking, the British Government was employing vast numbers of female operatives to monitor and report on telephone, mail and telegraph communications in and out of the country.
The biggest problem, of course, was volume. Without even the most primitive algorithm to detect key phrases that later were to cause such paranoia amongst the sixties and seventies counterculture, causing a whole generation of drug users to use a wholly unnecessary set of telephone synonyms for their desired substance, the army of women stationed in exchanges around the country was driven to report everything and then pass it on up to those whose job it was to analyse such content for significance.
Orwell’s vision of Big Brother’s omniscience was based upon the same model – vast armies of Winston Smiths monitoring data to ensure discipline and control. He saw a culture of betrayal where every citizen was held accountable for their fellow citizens’ political and moral conformity.
Up until the US Government’s Big Data Research and Development Initiative [12] and the NSA development of the Prism programme [13], the fault lines always lay in the technology used to collate or collect and the inefficiency or competing interests of the corporate systems and processes that interpreted the information. Not for the first time, the bureaucracy was the citizen’s best bulwark against intrusion.
Now that the algorithms have become more complex and the technology tilted towards passive surveillance through automation, the volume problem becomes less of an obstacle….
The technology for obtaining this information, and indeed the administration of it, is handled by corporations. The Government, driven by the creed that suggests private companies are better administrators than civil servants, has auctioned off the job to a dozen or more favoured corporate giants who are, as always, beholden not only to their shareholders, but to their patrons within the government itself….
The only problem the state had was managing the scale of the information gleaned from so many people in so many forms. Not any more. The volume problem has been overcome.”

Big data + politics = open data: The case of health care data in England

Curated on July 27, 2013August 3, 2018 by Stefaan Verhulst

New Paper in Policy & Internet: “There is a great deal of enthusiasm about the prospects for Big Data held in health care systems around the world. Health care appears to offer the ideal combination of circumstances for its exploitation, with a need to improve productivity on the one hand and the availability of data that can be used to identify opportunities for improvement on the other. The enthusiasm rests on two assumptions. First, that the data sets held by hospitals and other organizations, and the technological infrastructure needed for their acquisition, storage, and manipulation, are up to the task. Second, that organizations outside health care systems will be able to access detailed datasets. We argue that both assumptions can be challenged. The article uses the example of the National Health Service in England to identify data, technology, and information governance challenges. The public acceptability of third party access to detailed health care datasets is, at best, unclear.”

5 Big Data Projects That Could Impact Your Life

Curated on July 25, 2013August 3, 2018 by Stefaan Verhulst

Mashable: “We reached out to a few organizations using information, both hand- and algorithm-collected, to create helpful tools for their communities. This is only a small sample of what’s out there — plenty more pop up each day, and as more information becomes public, the trend will only grow….
1. Transit Time NYC
Transit Time NYC, an interactive map developed by WNYC, lets New Yorkers click a spot in any of the city’s five boroughs for an estimate of subway or train travel times. To create it, WNYC lead developer Steve Melendez broke the city into 2,930 hexagons, then pulled data from open source itinerary platform OpenTripPlanner — the Wikipedia of mapping software — and coupled it with the MTA’s publicly downloadable subway schedule….
2. Twitter’s ‘Topography of Tweets
In a blog post, Twitter unveiled a new data visualization map that displays billions of geotagged tweets in a 3D landscape format. The purpose is to display, topographically, which parts of certain cities most people are tweeting from…
3. Homicide Watch D.C.
Homicide Watch D.C. is a community-driven data site that aims to cover every murder in the District of Columbia. It’s sorted by “suspect” and “victim” profiles, where it breaks down each person’s name, age, gender and race, as well as original articles reported by Homicide Watch staff…
4. Falling Fruit
Can you find a hidden apple tree along your daily bike commute? Falling Fruit can.
The website highlights overlooked or hidden edibles in urban areas across the world. By collecting public information from the U.S. Department of Agriculture, municipal tree inventories, foraging maps and street tree databases, the site has created a network of 615 types of edibles in more than 570,000 locations. The purpose is to remind urban dwellers that agriculture does exist within city boundaries — it’s just more difficult to find….
5. AIDSvu
AIDSVu is an interactive map that illustrates the prevalence of HIV in the United States. The data is pulled from the U.S. Center for Disease Control’s national HIV surveillance reports, which are collected at both state and county levels each year…”

Metadata Liberation Movement

Curated on July 24, 2013October 9, 2018 by Stefaan Verhulst

Holman Jenkins in the Wall Street Journal: “The biggest problem, then, with metadata surveillance may simply be that the wrong agencies are in charge of it. One particular reason why this matters is that the potential of metadata surveillance might actually be quite large but is being squandered by secret agencies whose narrow interest is only looking for terrorists….
“Big data” is only as good as the algorithms used to find out things worth finding out. The efficacy and refinement of big-data techniques are advanced by repetition, by giving more chances to find something worth knowing. Bringing metadata out of its black box wouldn’t only be a way to improve public trust in what government is doing. It would be a way to get more real value for society out of techniques that are being squandered on a fairly minor threat.
Bringing metadata out of the black box would open up new worlds of possibility—from anticipating traffic jams to locating missing persons after a disaster. It would also create an opportunity to make big data more consistent with the constitutional prohibition of unwarranted search and seizure. In the first instance, with the computer withholding identifying details of the individuals involved, any red flag could be examined by a law-enforcement officer to see, based on accumulated experience, whether the indication is of interest.
If so, a warrant could be obtained to expose the identities involved. If not, the record could immediately be expunged. All this could take place in a reasonably aboveboard, legal fashion, open to inspection in court when and if charges are brought or—this would be a good idea—a court is informed of investigations that led to no action.
Our guess is that big data techniques would pop up way too many false positives at first, and only considerable learning and practice would allow such techniques to become a useful tool. At the same time, bringing metadata surveillance out of the shadows would help the Googles, Verizons and Facebooks defend themselves from a wholly unwarranted suspicion that user privacy is somehow better protected by French or British or (heavens) Chinese companies from their own governments than U.S. data is from the U.S. government.
Most of all, it would allow these techniques to be put to work on solving problems that are actual problems for most Americans, which terrorism isn’t.”

The Little Data Book on Information and Communication Technology 2013

Curated on July 19, 2013November 7, 2018 by Stefaan Verhulst

The World Bank: “This new addition to the Little Data Book series presents at-a-glance tables for more than 200 economies showing the most recent national data on key indicators of information and communications technology (ICT), including access, quality, affordability, efficiency, sustainability and applications.”

Metrics for Government Reform

Curated on July 17, 2013October 10, 2018 by Stefaan Verhulst

Geoff Mulgan: “How do you measure a programme of government reform? What counts as evidence that it’s working or not? I’ve been asked this question many times, so this very brief note suggests some simple answers – mainly prompted by seeing a few writings on this question which I thought confused some basic points.”
Any type of reform programme will combine elements at very different levels. These may include:

A new device – for example, adjusting the wording in an official letter or a call centre script to see what impact this has on such things as tax compliance.
A new kind of action – for example a new way of teaching maths in schools, treating patients with diabetes, handling prison leavers.
A new kind of policy – for example opening up planning processes to more local input; making welfare payments more conditional.
A new strategy – for example a scheme to cut carbon in cities, combining retrofitting of housing with promoting bicycle use; or a strategy for public health.
A new approach to strategy – for example making more use of foresight, scenarios or big data.
A new approach to governance – for example bringing hitherto excluded groups into political debate and decision-making.

This rough list hopefully shows just how different these levels are in their nature. Generally as we go down the list the following things rise: