Eight (No, Nine!) Problems With Big Data


Gary Marcus and Ernest Davis in the New York Times: “BIG data is suddenly everywhere. Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers. Whether we’re talking about analyzing zillions of Google search queries to predict flu outbreaks, or zillions of phone records to detect signs of terrorist activity, or zillions of airline stats to find the best time to buy plane tickets, big data is on the case. By combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem — crime, public health, the evolution of grammar, the perils of dating — just by crunching the numbers.

Or so its champions allege. “In the next two decades,” the journalist Patrick Tucker writes in the latest big data manifesto, “The Naked Future,” “we will be able to predict huge areas of the future with far greater accuracy than ever before in human history, including events long thought to be beyond the realm of human inference.” Statistical correlations have never sounded so good.

Is big data really all it’s cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google’s search engine to the I.B.M. “Jeopardy!” champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can — and can’t — do.

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.

Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text. Even Google’s celebrated search engine, rightly seen as a big data success story, is not immune to “Google bombing” and “spamdexing,” wily techniques for artificially elevating website search placement.

Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately and more quickly than the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.

As a recent article in the journal Science explained, one major contributing cause of the failures of Google Flu Trends may have been that the Google search engine itself constantly changes, such that patterns in data collected at one time do not necessarily apply to data collected at another time. As the statistician Kaiser Fung has noted, collections of big data that rely on web hits often merge data that was collected in different ways and with different purposes — sometimes to ill effect. It can be risky to draw conclusions from data sets of this kind.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts from different languages — for example, the same Wikipedia entry in two different languages — to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except for the fact that with some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their “historical importance” or “cultural contributions,” based on data drawn from Wikipedia. One is the book “Who’s Bigger? Where Historical Figures Really Rank,” by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.

Both efforts get many things right — Jesus, Lincoln and Shakespeare were surely important people — but both also make some egregious errors. “Who’s Bigger?” claims that Francis Scott Key was the 19th most important poet in history; Pantheon has claimed that Nostradamus was the 20th most important writer in history, well ahead of Jane Austen (78th) and George Eliot (380th). Worse, both projects suggest a misleading degree of scientific precision with evaluations that are inherently vague, or even meaningless. Big data can reduce anything to a single number, but you shouldn’t be fooled by the appearance of exactitude.

FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like “in a row”). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as “dumbed-down escapist fare” that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate “dumbed-down escapist fare” into German and then back into English: out comes the incoherent “scaled-flight fare.” That is a long way from what Mr. Lowe intended — and from big data’s aspirations for translation.

Wait, we almost forgot one last problem: the hype….

Brainlike Computers, Learning From Experience


The New York Times: “Computers have entered the age when they are able to learn from their own mistakes, a development that is about to turn the digital world on its head.

The first commercial version of the new kind of computer chip is scheduled to be released in 2014. Not only can it automate tasks that now require painstaking programming — for example, moving a robot’s arm smoothly and efficiently — but it can also sidestep and even tolerate errors, potentially making the term “computer crash” obsolete.

The new computing approach, already in use by some large technology companies, is based on the biological nervous system, specifically on how neurons react to stimuli and connect with other neurons to interpret information. It allows computers to absorb new information while carrying out a task, and adjust what they do based on the changing signals.

In coming years, the approach will make possible a new generation of artificial intelligence systems that will perform some functions that humans do with ease: see, speak, listen, navigate, manipulate and control. That can hold enormous consequences for tasks like facial and speech recognition, navigation and planning, which are still in elementary stages and rely heavily on human programming.

Designers say the computing style can clear the way for robots that can safely walk and drive in the physical world, though a thinking or conscious computer, a staple of science fiction, is still far off on the digital horizon.

“We’re moving from engineering computing systems to something that has many of the characteristics of biological computing,” said Larry Smarr, an astrophysicist who directs the California Institute for Telecommunications and Information Technology, one of many research centers devoted to developing these new kinds of computer circuits.

Conventional computers are limited by what they have been programmed to do. Computer vision systems, for example, only “recognize” objects that can be identified by the statistics-oriented algorithms programmed into them. An algorithm is like a recipe, a set of step-by-step instructions to perform a calculation.

But last year, Google researchers were able to get a machine-learning algorithm, known as a neural network, to perform an identification task without supervision. The network scanned a database of 10 million images, and in doing so trained itself to recognize cats.

In June, the company said it had used those neural network techniques to develop a new search service to help customers find specific photos more accurately.

The new approach, used in both hardware and software, is being driven by the explosion of scientific knowledge about the brain. Kwabena Boahen, a computer scientist who leads Stanford’s Brains in Silicon research program, said that is also its limitation, as scientists are far from fully understanding how brains function.”

Ten thoughts for the future


The Economist: “CASSANDRA has decided to revisit her fellow forecasters Thomas Malnight and Tracey Keys to find out what their predictions are for 2014. Once again they have produced a collection of trends for the year ahead, in their “Global Trends Report”.
The possibilities of mind control seem alarming ( point 6) as do the  implications of growing income inequality (point 10). Cassandra also hopes that “unemployability” and “unemployerability”, as discussed in point 9, are contested next year (on both linguistic and social fronts).
Nevertheless, the forecasts make for intriguing reading and highlights appear below.
 1. From social everything to being smart socially
Social technologies are everywhere, but these vast repositories of digital “stuff” bury the exceptional among the unimportant. It’s time to get socially smart. Users are moving to niche networks to bring back the community feel and intelligence to social interactions. Businesses need to get smarter about extracting and delivering value from big data including challenging business models. For social networks, mobile is the great leveller. Competition for attention with other apps will intensify the battle to own key assets from identity to news sharing, demanding radical reinvention.
2. Information security: The genie is out of the bottle
Thought your information was safe? Think again. The information security genie is out of the bottle as cyber-surveillance and data mining by public and private organizations increases – and don’t forget criminal networks and whistleblowers. It will be increasingly hard to tell friend from foe in cyberspace as networks build artificial intelligence to decipher your emotions and smart cities track your every move. Big brother is here: Protecting identity, information and societies will be a priority for all.
3. Who needs shops anyway?
Retailers are facing a digitally driven perfect storm. Connectivity, rising consumer influence, time scarcity, mobile payments, and the internet of things, are changing where, when and how we shop – if smart machines have not already done the job. Add the sharing economy, driven by younger generations where experience and sustainable consumption are more important than ownership, and traditional retail models break down. The future of shops will be increasingly defined by experiential spaces offering personalized service, integrated online and offline value propositions, and pop-up stores to satisfy demands for immediacy and surprise.
4. Redistributing the industrial revolution
Complex, global value chains are being redistributed by new technologies, labour market shifts and connectivity. Small-scale manufacturing, including 3D and soon 4D printing, and shifting production economics are moving production closer to markets and enabling mass customization – not just by companies but by the tech-enabled maker movement which is going mainstream. Rising labour costs in developing markets, high unemployment in developed markets, global access to online talent and knowledge, plus advances in robotics mean reshoring of production to developed markets will increase. Mobility, flexibility and networks will define the future industrial landscape.
5. Hubonomics: The new face of globalization
As production and consumption become more distributed, hubs will characterize the next wave of “globalization.” They will specialize to support the needs of growing regional trade, emerging city states, on-line communities of choice, and the next generation of flexible workers and entrepreneurs. Underpinning these hubs will be global knowledge networks and new business and governance models based on hubonomics™, that leverage global assets and hub strengths to deliver local value.
6. Sci-Fi is here: Making the impossible possible
Cross-disciplinary approaches and visionary entrepreneurs are driving scientific breakthroughs that could change not just our lives and work but our bodies and intelligence. Labs worldwide are opening up the vast possibilities of mind control and artificial intelligence, shape-shifting materials and self-organizing nanobots, cyborgs and enhanced humans, space exploration, and high-speed, intelligent transportation. Expect great debate around the ethics, financing, and distribution of public and private benefits of these advances – and the challenge of translating breakthroughs into replicable benefits.
7. Growing pains: Transforming markets and generations
The BRICS are succumbing to Newton’s law of gravitation: Brazil’s lost it, India’s losing it, China’s paying the price for growth, Russia’s failing to make a superpower come-back, and South Africa’s economy is in disarray. In other developing markets currencies have tumbled, Arab Spring governments are still in turmoil and social unrest is increasing along with the number of failing states. But the BRICS & Beyond growth engine is far from dead. Rather it is experiencing growing pains which demand significant shifts in governance, financial systems, education and economic policies to catch up. The likely transformers will be younger generations who aspire to greater freedom and quality of life than their parents.
8. Panic versus denial: The resource gap grows, the global risks rise – but who is listening?
The complex nexus of food, water, energy and climate change presents huge global economic, environmental and societal challenges – heating up the battle to access new resources from the Arctic to fracking. Risks are growing, even as multilateral action stalls. It’s a crisis of morals, governance, and above all marketing and media, pitting crisis deniers against those who recognize the threats but are communicating panic versus reasoned solutions. Expect more debate and calls for responsible capitalism – those that are listening will be taking action at multiple levels in society and business.
9. Fighting unemployability and unemployerability
Companies are desperate for talented workers – yet unemployment rates remain high. Polarization towards higher and lower skill levels is squeezing mid-level jobs, even as employers complain that education systems are not preparing students for the jobs of the future. Fighting unemployability is driving new government-business partnerships worldwide, and will remain a critical issue given massive youth unemployment. Employers must also focus on organizational unemployerability – not being able to attract and retain desired talent – as new generations demand exciting and meaningful work where they can make an impact. If they can’t find it, they will quickly move on or swell the growing ranks of young entrepreneurs.
10. Surviving in a bipolar world: From expecting consistency to embracing ambiguity
Life is not fair, nor is it predictable.  Income inequality is growing. Intolerance and nationalism are rising but interdependence is the currency of a connected world. Pressure on leaders to deliver results today is intense but so too is the need for fundamental change to succeed in the long term. The contradictions of leadership and life are increasing faster than our ability to reconcile the often polarized perspectives and values each embodies. Increasingly, they are driving irrational acts of leadership (think the US debt ceiling), geopolitical, social and religious tensions, and individual acts of violence. Surviving in this world will demand stronger, responsible leadership comfortable with and capable of embracing ambiguity and uncertainty, as opposed to expecting consistency and predictability.”

Global Collective Intelligence in Technological Societies


Paper by Juan Carlos Piedra Calderón and Javier Rainer in the International Journal of Artificial Intelligence and Interactive Multimedia: “The big influence of Information and Communication Technologies (ICT), especially in area of construction of Technological Societies has generated big
social changes. That is visible in the way of relating to people in different environments. These changes have the possibility to expand the frontiers of knowledge through sharing and cooperation. That has meaning the inherently creation of a new form of Collaborative Knowledge. The potential of this Collaborative Knowledge has been given through ICT in combination with Artificial Intelligence processes, from where is obtained a Collective Knowledge. When this kind of knowledge is shared, it gives the place to the Global Collective Intelligence”.

Seven Principles for Big Data and Resilience Projects


PopTech & Rockefeler Bellagio Fellows: “The following is a draft “Code of Conduct” that seeks to provide guidance on best practices for resilience building projects that leverage Big Data and Advanced Computing. These seven core principles serve to guide data projects to ensure they are socially just, encourage local wealth- & skill-creation, require informed consent, and be maintainable over long timeframes. This document is a work in progress, so we very much welcome feedback. Our aim is not to enforce these principles on others but rather to hold ourselves accountable and in the process encourage others to do the same. Initial versions of this draft were written during the 2013 PopTech & Rockefeller Foundation workshop in Bellagio, August 2013.
Open Source Data Tools – Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.). Open source, hackable tools are generative, and building generative capacity is an important element of resilience….
Transparent Data Infrastructure – Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with the infrastructure. Data infrastructure should strive for built-in documentation, be extensive and provide easy access. Data is only as useful to the data scientist as her/his understanding of its collection is correct…
Develop and Maintain Local Skills – Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills. The key and most constraint ingredient to effective data solutions remains human skill/knowledge and needs to be retained locally. In doing so, consider cultural issues and language. Catalyze the next generation of data scientists and generate new required skills in the cities where the data is being collected…
Local Data Ownership – Use Creative Commons and licenses that state that data is not to be used for commercial purposes. The community directly owns the data it generates, along with the learning algorithms (machine learning classifiers) and derivatives. Strong data protection protocols need to be in place to protect identities and personally identifying information…
Ethical Data Sharing – Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt in approach should be the preference wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option. Projects should always explicitly state which third parties will get access to data, if any, so that it is clear who will be able to access and use the data…
Right Not To Be Sensed – Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate. All too often, sensing projects are established without any ethical framework or any commitment to informed consent. It is essential that the collection of any sensitive data, from social and mobile data to video and photographic records of houses, streets and individuals, is done with full public knowledge, community discussion, and the ability to opt out…
Learning from Mistakes – Big Data and Resilience projects need to be open to face, report, and discuss failures. Big Data technology is still very much in a learning phase. Failure and the learning and insights resulting from it should be accepted and appreciated. Without admitting what does not work we are not learning effectively as a community. Quality control and assessment for data-driven solutions is notably harder than comparable efforts in other technology fields. The uncertainty about quality of the solution is created by the uncertainty inherent in data…”

If big data is an atomic bomb, disarmament begins in Silicon Valley


at GigaOM: “Big data is like atomic energy, according to scientist Albert-László Barabási in a Monday column on Politico. It’s very beneficial when used ethically, and downright destructive when turned into a weapon. He argues scientists can help resolve the damage done by government spying by embracing the principles of nuclear nonproliferation that helped bring an end to Cold War fears and distrust.
Barabási’s analogy is rather poetic:

“Powered by the right type of Big Data, data mining is a weapon. It can be just as harmful, with long-term toxicity, as an atomic bomb. It poisons trust, straining everything from human relations to political alliances and free trade. It may target combatants, but it cannot succeed without sifting through billions of data points scraped from innocent civilians. And when it is a weapon, it should be treated like a weapon.”

I think he’s right, but I think the fight to disarm the big data bomb begins in places like Silicon Valley and Madison Avenue. And it’s not just scientists; all citizens should have a role…
I write about big data and data mining for a living, and I think the underlying technologies and techniques are incredibly valuable, even if the applications aren’t always ideal. On the one hand, advances in machine learning from companies such as Google and Microsoft are fantastic. On the other hand, Facebook’s newly expanded Graph Search makes Europe’s proposed right-to-be-forgotten laws seem a lot more sensible.
But it’s all within the bounds of our user agreements and beauty is in the eye of the beholder.
Perhaps the reason we don’t vote with our feet by moving to web platforms that embrace privacy, even though we suspect it’s being violated, is that we really don’t know what privacy means. Instead of regulating what companies can and can’t do, perhaps lawmakers can mandate a degree of transparency that actually lets users understand how data is being used, not just what data is being collected. Great, some company knows my age, race, ZIP code and web history: What I really need to know is how it’s using that information to target, discriminate against or otherwise serve me.
An intelligent national discussion about the role of the NSA is probably in order. For all anyone knows,  it could even turn out we’re willing to put up with more snooping than the goverment might expect. But until we get a handle on privacy from the companies we choose to do business with, I don’t think most Americans have the stomach for such a difficult fight.”

Undefined By Data: A Survey of Big Data Definitions


Paper by Jonathan Stuart Ward and Adam Barker: “The term big data has become ubiquitous. Owing to shared origin between academia, industry and the media there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term…
Despite the range and differences existing within each of the aforementioned definitions there are some points of similarity. Notably all definitions make at least one of the following assertions:
Size: the volume of the datasets is a critical factor.
Complexity: the structure, behaviour and permutations of the datasets is a critical factor.
Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.
The definitions surveyed here all encompass at least one of these factors, most encompass two. An extrapolation of these factors would therefore postulate the following: Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”

Cyberpsychology and New Media


A thematic reader, edited by Andrew Power, Grainne Kirwan:Cyberpsychology is the study of human interactions with the internet, mobile computing and telephony, games consoles, virtual reality, artificial intelligence, and other contemporary electronic technologies. The field has grown substantially over the past few years and this book surveys how researchers are tackling the impact of new technology on human behaviour and how people interact with this technology.

Examining topics as diverse as online dating, social networking, online communications, artificial intelligence, health-information seeking behaviour, education online, online therapies and cybercrime, Cyberpsychology and New Media book provides an in-depth overview of this burgeoning field, and allows those with little previous knowledge to gain an appreciation of the diversity of the research being undertaken in the area.”

Three ways to think of the future…


Geoff Mulgan’s blog: “Here I suggest three complementary ways of thinking about the future which provide partial protection against the pitfalls.
The shape of the future
First, create your own composite future by engaging with the trends. There are many methods available for mapping the future – from Foresight to scenarios to the Delphi method.
Behind all are implicit views about the shapes of change. Indeed any quantitative exploration of the future uses a common language of patterns (shown in this table above) which summarises the fact that some things will go up, some go down, some change suddenly and some not at all.
All of us have implicit or explicit assumptions about these. But it’s rare to interrogate them systematically and test whether our assumptions about what fits in which category are right.
Let’s start with the J shaped curves. Many of the long-term trends around physical phenomena look J-curved: rising carbon emissions, water useage and energy consumption have been exponential in shape over the centuries. As we know, physical constraints mean that these simply can’t go on – the J curves have to become S shaped sooner or later, or else crash. That is the ecological challenge of the 21st century.
New revolutions
But there are other J curves, particularly the ones associated with digital technology.  Moore’s Law and Metcalfe’s Law describe the dramatically expanding processing power of chips, and the growing connectedness of the world.  Some hope that the sheer pace of technological progress will somehow solve the ecological challenges. That hope has more to do with culture than evidence. But these J curves are much faster than the physical ones – any factor that doubles every 18 months achieves stupendous rates of change over decades.
That’s why we can be pretty confident that digital technologies will continue to throw up new revolutions – whether around the Internet of Things, the quantified self, machine learning, robots, mass surveillance or new kinds of social movement. But what form these will take is much harder to predict, and most digital prediction has been unreliable – we have Youtube but not the Interactive TV many predicted (when did you last vote on how a drama should end?); relatively simple SMS and twitter spread much more than ISDN or fibre to the home.  And plausible ideas like the long tail theory turned out to be largely wrong.
If the J curves are dramatic but unusual, much more of the world is shaped by straight line trends – like ageing or the rising price of disease that some predict will take costs of healthcare up towards 40 or 50% of GDP by late in the century, or incremental advances in fuel efficiency, or the likely relative growth of the Chinese economy.
Also important are the flat straight lines – the things that probably won’t change in the next decade or two:  the continued existence of nation states not unlike those of the 19th century? Air travel making use of fifty year old technologies?
Great imponderables
If the Js are the most challenging trends, the most interesting ones are the ‘U’s’- the examples of trends bending:  like crime which went up for a century and then started going down, or world population that has been going up but could start going down in the later part of this century, or divorce rates which seem to have plateaued, or Chinese labour supply which is forecast to turn down in the 2020s.
No one knows if the apparently remorseless upward trends of obesity and depression will turn downwards. No one knows if the next generation in the West will be poorer than their parents. And no one knows if democratic politics will reinvent itself and restore trust. In every case, much depends on what we do. None of these trends is a fact of nature or an act of God.
That’s one reason why it’s good to immerse yourself in these trends and interrogate what shape they really are. Out of that interrogation we can build a rough mental model and generate our own hypotheses – ones not based on the latest fashion or bestseller but hopefully on a sense of what the data shows and in particular what’s happening to the deltas – the current rates of change of different phenomena.”