Predictive Modeling With Big Data: Is Bigger Really Better?


New paper by Enric Junqué de Fortuny, David Martens, and Foster Provost in Big Data: “With the increasingly widespread collection and processing of “big data,” there is natural interest in using these data assets to improve decision making. One of the best understood ways to use data to improve decision making is via predictive analytics. An important, open question is: to what extent do larger data actually lead to better predictive models? In this article we empirically demonstrate that when predictive models are built from sparse, fine-grained data—such as data on low-level human behavior—we continue to see marginal increases in predictive performance even to very large scale. The empirical results are based on data drawn from nine different predictive modeling applications, from book reviews to banking transactions. This study provides a clear illustration that larger data indeed can be more valuable assets for predictive analytics. This implies that institutions with larger data assets—plus the skill to take advantage of them—potentially can obtain substantial competitive advantage over institutions without such access or skill. Moreover, the results suggest that it is worthwhile for companies with access to such fine-grained data, in the context of a key predictive task, to gather both more data instances and more possible data features. As an additional contribution, we introduce an implementation of the multivariate Bernoulli Naïve Bayes algorithm that can scale to massive, sparse data.”
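The paper's scalable implementation is not reproduced here, but the core idea of multivariate Bernoulli Naïve Bayes on sparse data can be sketched: because most features are absent for any given instance, the per-class log-likelihood of an "all features absent" instance can be precomputed once and then corrected only for the features that are present. The following minimal sketch illustrates that idea on a toy sparse matrix; the function names and data are illustrative, not the authors' code.

```python
# A minimal sketch (not the authors' implementation) of multivariate Bernoulli
# Naive Bayes on a sparse binary feature matrix. The trick that keeps it cheap
# on massive, sparse data: precompute the per-class "all features absent"
# log-likelihood once, then correct it only for the features actually present.
import numpy as np
from scipy.sparse import csr_matrix

def fit_bernoulli_nb(X, y, alpha=1.0):
    """X: csr_matrix of 0/1 features, y: integer class labels."""
    classes = np.unique(y)
    n_features = X.shape[1]
    log_prior = np.empty(len(classes))
    feat_logp = np.empty((len(classes), n_features))    # log P(x_j = 1 | c)
    feat_log1mp = np.empty((len(classes), n_features))  # log P(x_j = 0 | c)
    for i, c in enumerate(classes):
        Xc = X[y == c]
        n_c = Xc.shape[0]
        log_prior[i] = np.log(n_c / X.shape[0])
        # Laplace-smoothed estimate of P(x_j = 1 | c)
        p = (np.asarray(Xc.sum(axis=0)).ravel() + alpha) / (n_c + 2 * alpha)
        feat_logp[i] = np.log(p)
        feat_log1mp[i] = np.log(1.0 - p)
    return classes, log_prior, feat_logp, feat_log1mp

def predict(X, classes, log_prior, feat_logp, feat_log1mp):
    # Joint log-likelihood = prior + sum_j log P(x_j = 0 | c)
    #   + sum over present features of [log P(x_j = 1 | c) - log P(x_j = 0 | c)]
    base = log_prior + feat_log1mp.sum(axis=1)       # "all features absent" term
    scores = X @ (feat_logp - feat_log1mp).T + base  # sparse correction term
    return classes[np.asarray(scores).argmax(axis=1)]

# Toy usage with made-up data
X = csr_matrix(np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]))
y = np.array([1, 0, 0, 1])
model = fit_bernoulli_nb(X, y)
print(predict(X, *model))
```

For real workloads, scikit-learn's BernoulliNB estimator accepts the same kind of sparse input and follows the same presence/absence formulation.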

The Power to Decide


Special Report by Antonio Regalado in MIT Technology Review: “Back in 1956, an engineer and a mathematician, William Fair and Earl Isaac, pooled $800 to start a company. Their idea: a score to handicap whether a borrower would repay a loan.
It was all done with pen and paper. Income, gender, and occupation produced numbers that amounted to a prediction about a person’s behavior. By the 1980s the three-digit scores were calculated on computers and instead took account of a person’s actual credit history. Today, Fair Isaac Corp., or FICO, generates about 10 billion credit scores annually, calculating 50 times a year for many Americans.
This machinery hums in the background of our financial lives, so it’s easy to forget that the choice of whether to lend used to be made by a bank manager who knew a man by his handshake. Fair and Isaac understood that all this could change, and that their company didn’t merely sell numbers. “We sell a radically different way of making decisions that flies in the face of tradition,” Fair once said.
This anecdote suggests a way of understanding the era of “big data”—terabytes of information from sensors or social networks, new computer architectures, and clever software. But even supercharged data needs a job to do, and that job is always about a decision.
In this business report, MIT Technology Review explores a big question: how are data and the analytical tools to manipulate it changing decision making today? On Nasdaq, trading bots exchange a billion shares a day. Online, advertisers bid on hundreds of thousands of keywords a minute, in deals greased by heuristic solutions and optimization models rather than two-martini lunches. The number of variables and the speed and volume of transactions are just too much for human decision makers.
When there’s a person in the loop, technology takes a softer approach (see “Software That Augments Human Thinking”). Think of recommendation engines on the Web that suggest products to buy or friends to catch up with. This works because Internet companies maintain statistical models of each of us, our likes and habits, and use them to decide what we see. In this report, we check in with LinkedIn, which maintains the world’s largest database of résumés—more than 200 million of them. One of its newest offerings is University Pages, which crunches résumé data to offer students predictions about where they’ll end up working depending on what college they go to (see “LinkedIn Offers College Choices by the Numbers”).
These smart systems, and their impact, are prosaic next to what’s planned. Take IBM. The company is pouring $1 billion into its Watson computer system, the one that answered questions correctly on the game show Jeopardy! IBM now imagines computers that can carry on intelligent phone calls with customers, or provide expert recommendations after digesting doctors’ notes. IBM wants to provide “cognitive services”—computers that think, or seem to (see “Facing Doubters, IBM Expands Plans for Watson”).
Andrew Jennings, chief analytics officer for FICO, says automating human decisions is only half the story. Credit scores had another major impact. They gave lenders a new way to measure the state of their portfolios—and to adjust them by balancing riskier loan recipients with safer ones. Now, as other industries get exposed to predictive data, their approach to business strategy is changing, too. In this report, we look at one technique that’s spreading on the Web, called A/B testing. It’s a simple tactic—put up two versions of a Web page and see which one performs better (see “Seeking Edge, Websites Turn to Experiments” and “Startups Embrace a Way to Fail Fast”).
Until recently, such optimization was practiced only by the largest Internet companies. Now, nearly any website can do it. Jennings calls this phenomenon “systematic experimentation” and says it will be a feature of the smartest companies. They will have teams constantly probing the world, trying to learn its shifting rules and deciding on strategies to adapt. “Winners and losers in analytic battles will not be determined simply by which organization has access to more data or which organization has more money,” Jennings has said.
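To make the A/B testing tactic concrete, here is a minimal sketch with hypothetical traffic numbers: compare the conversion rates of two page variants and ask whether the observed difference could plausibly be chance. The figures and the choice of a chi-square test are illustrative, not drawn from the report.

```python
# Minimal A/B test sketch with hypothetical numbers: did variant B convert
# better than variant A, or could the difference be chance?
from scipy.stats import chi2_contingency

visitors_a, conversions_a = 10_000, 230  # hypothetical control
visitors_b, conversions_b = 10_000, 275  # hypothetical variant

table = [
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"A: {conversions_a / visitors_a:.2%}  B: {conversions_b / visitors_b:.2%}  p = {p_value:.3f}")
```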

Of course, there’s danger in letting the data decide too much. In this report, Duncan Watts, a Microsoft researcher specializing in social networks, outlines an approach to decision making that avoids the dangers of gut instinct as well as the pitfalls of slavishly obeying data. In short, Watts argues, businesses need to adopt the scientific method (see “Scientific Thinking in Business”).
To do that, they have been hiring a highly trained breed of business skeptics called data scientists. These are the people who create the databases, build the models, reveal the trends, and, increasingly, author the products. And their influence is growing in business. This could be why data science has been called “the sexiest job of the 21st century.” It’s not because mathematics or spreadsheets are particularly attractive. It’s because making decisions is powerful…”

How Internet surveillance predicts disease outbreaks before the WHO


Kurzweil News: “Have you ever Googled for an online diagnosis before visiting a doctor? If so, you may have helped provide early warning of an infectious disease epidemic.
In a new study published in Lancet Infectious Diseases, Internet-based surveillance has been found to detect infectious diseases such as Dengue Fever and Influenza up to two weeks earlier than traditional surveillance methods, according to Queensland University of Technology (QUT) research fellow and senior author of the paper Wenbiao Hu.
Hu, based at the Institute for Health and Biomedical Innovation, said there was often a lag time of two weeks before traditional surveillance methods could detect an emerging infectious disease.
“This is because traditional surveillance relies on the patient recognizing the symptoms and seeking treatment before diagnosis, along with the time taken for health professionals to alert authorities through their health networks. In contrast, digital surveillance can provide real-time detection of epidemics.”
Hu said the study used search engine algorithms such as Google Trends and Google Insights. It found that detecting the 2005–06 avian influenza outbreak “Bird Flu” would have been possible between one and two weeks earlier than official surveillance reports.
“In another example, a digital data collection network was found to be able to detect the SARS outbreak more than two months before the first publications by the World Health Organization (WHO),” Hu said.
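The study's own models are not reproduced here, but the basic intuition behind search-based surveillance can be sketched: if search volume leads reported cases, the two series should correlate best when the search series is shifted forward by the lead time. The example below uses synthetic data with a two-week lead built in for illustration.

```python
# Sketch of the underlying idea (not the study's method): does a search-volume
# series lead the reported-case series? Correlate the two at several lags.
# All data below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(52)
cases = 100 + 80 * np.exp(-0.5 * ((weeks - 30) / 4) ** 2) + rng.normal(0, 5, 52)
searches = np.roll(cases, -2) + rng.normal(0, 5, 52)  # searches peak ~2 weeks before cases

def lagged_corr(searches, cases, lag):
    # Correlation between search volume and cases reported `lag` weeks later.
    if lag == 0:
        return np.corrcoef(searches, cases)[0, 1]
    return np.corrcoef(searches[:-lag], cases[lag:])[0, 1]

for lag in range(5):
    print(f"lag {lag} weeks: r = {lagged_corr(searches, cases, lag):.2f}")
```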
(Figure: According to the CDC FluView report of Jan. 17, 2014, influenza activity in the United States remained high overall, with 3,745 laboratory-confirmed influenza-associated hospitalizations reported since October 1, 2013. Credit: CDC)
“Early detection means early warning and that can help reduce or contain an epidemic, as well as alert public health authorities to ensure risk management strategies such as the provision of adequate medication are implemented.”
Hu said the study found that social media including Twitter and Facebook and microblogs could also be effective in detecting disease outbreaks. “The next step would be to combine the approaches currently available such as social media, aggregator websites, and search engines, along with other factors such as climate and temperature, and develop a real-time infectious disease predictor.”
“The international nature of emerging infectious diseases, combined with the globalization of travel and trade, has increased the interconnectedness of all countries, and that means detecting, monitoring and controlling these diseases is a global concern.”
The other authors of the paper were Gabriel Milinovich (first author), Gail Williams, and Archie Clements from the University of Queensland School of Population Health.
Supramap 
Another powerful tool is Supramap, a web application that synthesizes large, diverse datasets, integrating genetic, evolutionary, geospatial, and temporal data, so that researchers can better understand the spread of infectious diseases across hosts and geography. It is now open source, so researchers can create their own maps with it.
Associate Professor Daniel Janies, Ph.D., an expert in computational genomics at the Wexner Medical Center at The Ohio State University (OSU), worked with software engineers at the Ohio Supercomputer Center (OSC) to allow researchers and public safety officials to develop other front-end applications that draw on the logic and computing resources of Supramap.
It was originally developed in 2007 to track the spread and evolution of pandemic (H1N1) and avian (H5N1) influenza.
“Using SUPRAMAP, we initially developed maps that illustrated the spread of drug-resistant influenza and host shifts in H1N1 and H5N1 influenza and in coronaviruses, such as SARS,” said Janies. “SUPRAMAP allows the user to track strains carrying key mutations in a geospatial browser such as Google Earth. Our software allows public health scientists to update and view maps on the evolution and spread of pathogens.”
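Supramap itself is a full web application, but the output side of the idea, placing annotated strain records on a map that a geospatial browser such as Google Earth can open, can be sketched with a few lines that write KML. The strain names, mutation, and coordinates below are hypothetical.

```python
# Illustrative sketch only (not Supramap): write hypothetical strain records,
# each with a location and a key mutation, into a KML file that Google Earth
# or another geospatial browser can open.
import xml.etree.ElementTree as ET

strains = [  # hypothetical records
    {"name": "A/example/2007 (H5N1)", "mutation": "H274Y", "lat": 22.3, "lon": 114.2},
    {"name": "A/example/2008 (H5N1)", "mutation": "H274Y", "lat": 30.0, "lon": 31.2},
]

kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
doc = ET.SubElement(kml, "Document")
for s in strains:
    placemark = ET.SubElement(doc, "Placemark")
    ET.SubElement(placemark, "name").text = s["name"]
    ET.SubElement(placemark, "description").text = f"Key mutation: {s['mutation']}"
    point = ET.SubElement(placemark, "Point")
    ET.SubElement(point, "coordinates").text = f"{s['lon']},{s['lat']},0"

ET.ElementTree(kml).write("strains.kml", xml_declaration=True, encoding="UTF-8")
```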
Grant funding through the U.S. Army Research Laboratory and Office supports this Innovation Group on Global Infectious Disease Research project. Support for the computational requirements of the project comes from the American Museum of Natural History (AMNH) and OSC. Ohio State’s Wexner Medical Center, Department of Biomedical Informatics and offices of Academic Affairs and Research provide additional support.”

Of course we share! Testing Assumptions about Social Tagging Systems


New paper by Stephan Doerfel, Daniel Zoller, Philipp Singer, Thomas Niebler, Andreas Hotho, Markus Strohmaier: “Social tagging systems have established themselves as an important part in today’s web and have attracted the interest from our research community in a variety of investigations. The overall vision of our community is that simply through interactions with the system, i.e., through tagging and sharing of resources, users would contribute to building useful semantic structures as well as resource indexes using uncontrolled vocabulary not only due to the easy-to-use mechanics. Hence, a variety of assumptions about social tagging systems have emerged, yet testing them has been difficult due to the absence of suitable data. In this work we thoroughly investigate three available assumptions – e.g., is a tagging system really social? – by examining live log data gathered from the real-world public social tagging system BibSonomy. Our empirical results indicate that while some of these assumptions hold to a certain extent, other assumptions need to be reflected and viewed in a very critical light. Our observations have implications for the design of future search and other algorithms to better reflect the actual user behavior.”

Algorithms and the Changing Frontier


A GMU School of Public Policy research paper by Hezekiah Agwara, Philip E. Auerswald, and Brian D. Higginbotham: “We first summarize the dominant interpretations of the “frontier” in the United States and predecessor colonies over the past 400 years: agricultural (1610s-1880s), industrial (1890s-1930s), scientific (1940s-1980s), and algorithmic (1990s-present). We describe the difference between the algorithmic frontier and the scientific frontier. We then propose that the recent phenomenon referred to as “globalization” is actually better understood as the progression of the algorithmic frontier, as enabled by standards that in turn have facilitated the interoperability of firm-level production algorithms. We conclude by describing implications of the advance of the algorithmic frontier for scientific discovery and technological innovation.”

Mapping the Data Shadows of Hurricane Sandy: Uncovering the Sociospatial Dimensions of ‘Big Data’


New paper by T. Shelton, A. Poorthuis, M. Graham, and M. Zook: “Digital social data are now practically ubiquitous, with increasingly large and interconnected databases leading researchers, politicians, and the private sector to focus on how such ‘big data’ can allow potentially unprecedented insights into our world. This paper investigates Twitter activity in the wake of Hurricane Sandy in order to demonstrate the complex relationship between the material world and its digital representations. Through documenting the various spatial patterns of Sandy-related tweeting both within the New York metropolitan region and across the United States, we make a series of broader conceptual and methodological interventions into the nascent geographic literature on big data. Rather than focus on how these massive databases are causing necessary and irreversible shifts in the ways that knowledge is produced, we instead find it more productive to ask how small subsets of big data, especially georeferenced social media information scraped from the internet, can reveal the geographies of a range of social processes and practices. Utilizing both qualitative and quantitative methods, we can uncover broad spatial patterns within this data, as well as understand how this data reflects the lived experiences of the people creating it. We also seek to fill a conceptual lacuna in studies of user-generated geographic information, which have often avoided any explicit theorizing of sociospatial relations, by employing Jessop et al.’s TPSN framework. Through these interventions, we demonstrate that any analysis of user-generated geographic information must take into account the existence of more complex spatialities than the relatively simple spatial ontology implied by latitude and longitude coordinates.”
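The paper's own methods are not reproduced here, but one elementary step in this kind of work, aggregating georeferenced tweets into a coarse latitude/longitude grid to expose broad spatial patterns of activity, can be sketched briefly. The coordinates below are hypothetical.

```python
# Minimal sketch (not the paper's method): bin georeferenced tweets into a
# coarse latitude/longitude grid and count activity per cell.
from collections import Counter

tweets = [  # hypothetical (lat, lon) pairs for geotagged tweets
    (40.72, -74.00), (40.73, -73.99), (40.58, -73.81), (39.95, -75.16),
]

def grid_cell(lat, lon, cell_size=0.1):
    # Snap a coordinate to the lower-left corner of its grid cell.
    return (round(lat // cell_size * cell_size, 2),
            round(lon // cell_size * cell_size, 2))

counts = Counter(grid_cell(lat, lon) for lat, lon in tweets)
for cell, n in counts.most_common():
    print(f"cell {cell}: {n} tweet(s)")
```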

When Does Transparency Generate Legitimacy? Experimenting on a Context-Bound Relationship


New paper by Jenny De Fine Licht, Daniel Naurin, Peter Esaiasson, and Mikael Gilljam in Governance: “We analyze the main rationale for public administrations and political institutions for supplying transparency, namely, that it generates legitimacy for these institutions. First, we discuss different theories of decision making from which plausible causal mechanisms that may drive a link between transparency and legitimacy may be derived. We find that the common notion of a straightforward positive correlation is naïve and that transparency reforms are rather unpredictable phenomena. Second, we test the effect of transparency on procedure acceptance using vignette experiments of representative decision making in schools. We find that transparency can indeed generate legitimacy. Interestingly, however, the form need not be “fishbowl transparency,” with full openness of the decision-making process. Decision makers may improve their legitimacy simply by justifying carefully afterward the decisions taken behind closed doors. Only when behavior close to a deliberative democratic ideal was displayed did openness of the process generate more legitimacy than closed-door decision making with postdecisional justifications.”

Enhancing Social Innovation by Rethinking Collaboration, Leadership and Public Governance


New paper by Professors Eva Sørensen & Jacob Torfing: “It is widely recognized that public innovation is the intelligent alternative to blind across-the-board cuts in times of shrinking budgets, and that innovation may help to break policy deadlocks and adjust welfare services to new and changing demands. At the same time, there is growing evidence that multi-actor collaboration in networks, partnerships and interorganizational teams can spur public innovation (Sørensen and Torfing, 2011). The involvement of different public and private actors in public innovation processes improves the understanding of the problem or challenge at hand and brings forth new ideas and proposals. It also ensures that the needs of users, citizens and civil society organizations are taken into account when innovative solutions are selected, tested and implemented.
While a lot of public innovation continues to be driven by particular public employees and managers, there seems to be a significant surge in collaborative forms of innovation that cut across the institutional and organizational boundaries within the public sector and involve a plethora of private actors with relevant innovation assets. Indeed, the enhancement of collaborative innovation has become a key aspiration of many public organizations around the world. However, if we fail to develop a more precise and sophisticated understanding of the concepts of ‘innovation’ and ‘collaboration’, we risk that both terms are reduced to empty and tiresome buzzwords that will not last to the end of the season. Moreover, in reality, collaborative and innovative processes are difficult to trigger and sustain without proper innovation management and a supporting cultural and institutional environment. This insight calls for further reflections on the role of public leadership and management and for a transformation of the entire system of public governing.
Hence, in order to spur collaborative innovation in the public sector, we need to clarify the basic terms of the debate and explore how collaborative innovation can be enhanced by new forms of innovation management and new forms of public governing. To this end, we shall first define the notions of innovation and public innovation and discuss the relation between public innovation and social innovation in order to better understand the purposes of different forms of innovation.
We shall then seek to clarify the notion of collaboration and pinpoint why and how collaboration enhances public innovation. Next, we shall offer some theoretical and practical reflections about how public leaders and managers can advance collaborative innovation. Finally, we shall argue that the enhancement of collaborative forms of social innovation calls for a transformation of the system of public governing that shifts the balance from New Public Management towards New Public Governance.”

E-government and organisational transformation of government: Black box revisited?


New paper in Government Information Quarterly: “During the e-government era the role of technology in the transformation of public sector organisations has significantly increased, whereby the relationship between ICT and organisational change in the public sector has become the subject of increasingly intensive research over the last decade. However, an overview of the literature to date indicates that the impacts of e-government on the organisational transformation of administrative structures and processes are still relatively poorly understood and vaguely defined.

The main purpose of the paper is therefore the following: (1) to examine the interdependence of e-government development and organisational transformation in public sector organisations and propose a clearer explanation of ICT’s role as a driving force of organisational transformation in further e-government development; and (2) to specify the main characteristics of organisational transformation in the e-government era through the development of a new framework. This framework describes organisational transformation in two dimensions, i.e. the ‘depth’ and the ‘nature’ of changes, and specifies the key attributes related to the three typical organisational levels.”

Prospects for Online Crowdsourcing of Social Science Research Tasks: A Case Study Using Amazon Mechanical Turk


New paper by Catherine E. Schmitt-Sands and Richard J. Smith: “While the internet has created new opportunities for research, managing the increased complexity of relationships and knowledge also creates challenges. Amazon.com has a Mechanical Turk service that allows people to crowdsource simple tasks for a nominal fee. The online workers may be anywhere in North America or India and range in ability. Social science researchers are only beginning to use this service. While researchers have used crowdsourcing to find research subjects or classify texts, we used Mechanical Turk to conduct a policy scan of local government websites. This article describes the process used to train and ensure quality of the policy scan. It also examines choices in the context of research ethics.”
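The authors' actual task design is not reproduced here, but a policy-scan task of the kind described could in principle be posted through the Mechanical Turk API. The sketch below uses boto3 against the requester sandbox; the question text, reward, and assignment settings are illustrative.

```python
# Rough sketch of posting a website-review task to Mechanical Turk via boto3.
# The HIT text, reward, and settings are illustrative, not the authors' setup.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",  # sandbox for testing
)

question_xml = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <p>Visit https://www.example-city.gov and report whether it publishes a social media policy.</p>
      <!-- A real HIT would include a form that posts the worker's answer back to MTurk. -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""

response = mturk.create_hit(
    Title="Check a local government website for a policy",
    Description="Open the linked website and answer one question about it.",
    Keywords="website, policy, data collection",
    Reward="0.10",
    MaxAssignments=3,                  # multiple workers per site supports quality checks
    LifetimeInSeconds=3 * 24 * 3600,
    AssignmentDurationInSeconds=600,
    Question=question_xml,
)
print("HIT created:", response["HIT"]["HITId"])
```

Assigning each website to several workers, as above, is one simple way to support the kind of quality checks the article describes.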