The Secret Science of Retweets


Emerging Technology From the arXiv: “If you send a tweet to a stranger asking them to retweet it, you probably wouldn’t be surprised if they ignored you entirely. But if you sent out lots of tweets like this, perhaps a few might end up being passed on.

How come? What makes somebody retweet information from a stranger? That’s the question addressed today by Kyumin Lee from Utah State University in Logan and a few pals from IBM’s Almaden research center in San Jose…. By studying the characteristics of Twitter users, it is possible to identify strangers who are more likely to pass on your message than others. And in doing this, the researchers say they’ve been able to improve the retweet rate of messages sent to strangers by up to 680 percent.
So how did they do it? The new technique is based on the idea that some people are more likely to tweet than others, particularly on certain topics and at certain times of the day. So the trick is to find these individuals and target them when they are likely to be most effective.
The approach was straightforward: study individuals on Twitter, examining their profiles and past tweeting behavior for clues that they might be more likely to retweet certain types of information. Having found these individuals, send your tweets to them.
That’s the theory. In practice, it’s a little more involved. Lee and co wanted to test people’s response to two types of information: local news (in San Francisco) and tweets about bird flu, a significant issue at the time of their research. They then created several Twitter accounts with a few followers, specifically to broadcast information of this kind.
Next, they selected people to receive their tweets. For the local news broadcasts, they searched for Twitter users geolocated in the Bay Area, finding over 34,000 of them and choosing 1,900 at random.
They then sent a single message to each user in the following format:
“@ SFtargetuser “A man was killed and three others were wounded in a shooting … http://bit.ly/KOl2sC” Plz RT this safety news”
So the tweet included the user’s name, a short headline, a link to the story and a request to retweet.
Of these 1,900 people, 52 retweeted the message they received. That’s 2.8 percent.
For the bird flu information, Lee and co hunted for people who had already tweeted about bird flu, finding 13,000 of them and choosing 1,900 at random. Of these, 155 retweeted the message they received, a retweet rate of 8.4 percent.
But Lee and co found a way to significantly improve these retweet rates. They went back to the original lists of Twitter users and collected publicly available information about each of them, such as their personal profile, the number of followers, the people they followed, their 200 most recent tweets and whether they had retweeted the message they received.
Next, the team used a machine learning algorithm to search for correlations in this data that might predict whether somebody was more likely to retweet. For example, they looked at whether people with older accounts were more likely to retweet or how the ratio of friends to followers influenced the retweet likelihood, or even how the types of negative or positive words they used in previous tweets showed any link. They also looked at the time of day that people were most active in tweeting.
The result was a machine learning algorithm capable of picking users who were most likely to retweet on a particular topic.
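As a concrete, purely illustrative picture of that kind of pipeline, here is a minimal sketch in Python. It is not the authors’ actual system: the feature set, the field names and the choice of scikit-learn’s logistic regression are assumptions based only on the signals described above (account age, follower counts, friend-to-follower ratio, sentiment of past tweets, most active hour).

```python
# Minimal sketch of a retweet-likelihood ranker; not the paper's pipeline.
# Feature names and the model choice (logistic regression) are illustrative.
from dataclasses import dataclass
from sklearn.linear_model import LogisticRegression

@dataclass
class UserRecord:
    account_age_days: float     # older vs. newer accounts
    followers: int
    friends: int                # accounts this user follows
    positive_word_ratio: float  # crude sentiment proxy from the last 200 tweets
    most_active_hour: int       # hour of day with the most past activity
    retweeted_us: int = 0       # training label: 1 if the user retweeted our message

def features(u: UserRecord) -> list[float]:
    friend_follower_ratio = u.friends / max(u.followers, 1)
    return [u.account_age_days, u.followers, friend_follower_ratio,
            u.positive_word_ratio, u.most_active_hour]

def train(history: list[UserRecord]) -> LogisticRegression:
    X = [features(u) for u in history]
    y = [u.retweeted_us for u in history]
    return LogisticRegression(max_iter=1000).fit(X, y)

def rank_candidates(model: LogisticRegression, candidates: list[UserRecord]) -> list[UserRecord]:
    # Target the users with the highest predicted probability of retweeting.
    scores = model.predict_proba([features(u) for u in candidates])[:, 1]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]
```

Timing the message to each target’s most active hour would be a separate step layered on top of this ranking, which is where the additional gain reported below comes from.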
And the results show that it is surprisingly effective. When the team sent local information tweets to individuals identified by the algorithm, 13.3 percent retweeted it, compared to just 2.6 percent of people chosen at random.
And they got even better results when they timed the request to match the periods when people had been most active in the past. In that case, the retweet rate rose to 19.3 percent. That’s an improvement of over 600 percent.
Similarly, the rate for bird flu information rose from 8.3 percent for users chosen at random to 19.7 percent for users chosen by the algorithm.
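As a quick arithmetic check on those headline figures (the “up to 680 percent” quoted at the top presumably comes from the paper’s own baseline comparison), relative improvement is simply (new − old) / old:

```python
# Relative improvement of the targeted retweet rates over the random baselines quoted above.
def relative_improvement(baseline_pct: float, new_pct: float) -> float:
    return (new_pct - baseline_pct) / baseline_pct * 100

print(round(relative_improvement(2.6, 19.3)))   # local news, targeted + timed: ~642 -> "over 600 percent"
print(round(relative_improvement(8.3, 19.7)))   # bird flu: ~137
```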
That’s a significant result that marketers, politicians and news organizations will be eyeing with envy.
An interesting question is how they can make this technique more generally applicable. It raises the prospect of an app that allows anybody to enter a topic of interest and which then creates a list of people most likely to retweet on that topic in the next few hours.
Lee and co do not mention any plans of this kind. But if they don’t exploit it, then there will surely be others who will.
Ref: arxiv.org/abs/1405.3750 : Who Will Retweet This? Automatically Identifying and Engaging Strangers on Twitter to Spread Information”

Saving Big Data from Big Mouths


Cesar A. Hidalgo in Scientific American: “It has become fashionable to bad-mouth big data. In recent weeks the New York Times, Financial Times, Wired and other outlets have all run pieces bashing this new technological movement. To be fair, many of the critiques have a point: There has been a lot of hype about big data and it is important not to inflate our expectations about what it can do.
But little of this hype has come from the actual people working with large data sets. Instead, it has come from people who see “big data” as a buzzword and a marketing opportunity—consultants, event organizers and opportunistic academics looking for their 15 minutes of fame.
Most of the recent criticism, however, has been weak and misguided. Naysayers have been attacking straw men, focusing on worst practices, post hoc failures and secondary sources. The common theme has been, to a great extent, obvious: “Correlation does not imply causation,” and “data has biases.”
Critics of big data have been making three important mistakes:
First, they have misunderstood big data, framing it narrowly as a failed revolution in social science hypothesis testing. In doing so they ignore areas where big data has made substantial progress, such as data-rich Web sites, information visualization and machine learning. If there is one group of big-data practitioners that the critics should worship, they are the big-data engineers building the social media sites where their platitudes spread. Engineering a site rich in data, like Facebook, YouTube, Vimeo or Twitter, is extremely challenging. These sites are possible because of advances made quietly over the past five years, including improvements in database technologies and Web development frameworks.
Big data has also contributed to machine learning and computer vision. Thanks to big data, Facebook algorithms can now match faces almost as accurately as humans do.
And detractors have overlooked big data’s role in the proliferation of computational design, data journalism and new forms of artistic expression. Computational artists, journalists and designers—the kinds of people who congregate at meetings like Eyeo—are using huge sets of data to give us online experiences that are unlike anything we experienced on paper. If we step away from hypothesis testing, we find that big data has made big contributions.
The second mistake critics often make is to confuse the limitations of prototypes with fatal flaws. This is something I have experienced often. For example, in Place Pulse—a project I created with my team at the M.I.T. Media Lab—we used Google Street View images and crowdsourced visual surveys to map people’s perception of a city’s safety and wealth. The original method was rife with limitations that we dutifully acknowledged in our paper. Google Street View images are taken at arbitrary times of the day and show cities from the perspective of a car. City boundaries were also arbitrary. To overcome these limitations, however, we needed a first data set. Producing that first limited version of Place Pulse was a necessary part of the process of making a working prototype.
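For a concrete sense of how a crowdsourced visual survey can be turned into per-image scores, here is a minimal sketch. Place Pulse’s actual scoring and data model are more sophisticated; the pairwise “which looks safer?” vote format and the simple win-rate aggregation below are assumptions for illustration only.

```python
# Illustrative only: aggregate pairwise "which place looks safer?" votes into
# a per-image win rate. Place Pulse's real scoring is more involved.
from collections import defaultdict

def win_rates(votes: list[tuple[str, str]]) -> dict[str, float]:
    """votes: (winner_image_id, loser_image_id) pairs collected from the survey."""
    wins: dict[str, int] = defaultdict(int)
    games: dict[str, int] = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {image: wins[image] / games[image] for image in games}

votes = [("img_a", "img_b"), ("img_a", "img_c"), ("img_c", "img_b")]
print(win_rates(votes))  # {'img_a': 1.0, 'img_b': 0.0, 'img_c': 0.5}
```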
A year has passed since we published Place Pulse’s first data set. Now, thanks to our focus on “making,” we have computer vision and machine-learning algorithms that we can use to correct for some of these easy-to-spot distortions. Making is allowing us to correct for time of the day and dynamically define urban boundaries. Also, we are collecting new data to extend the method to new geographical boundaries.
Those who fail to understand that the process of making is iterative are in danger of being too quick to condemn promising technologies. In 1920 the New York Times published a prediction that a rocket would never be able to leave the atmosphere. Similarly erroneous predictions were made about the car or, more recently, about the iPhone’s market share. In 1969 the Times had to publish a retraction of their 1920 claim. What similar retractions will need to be published in the year 2069?
Finally, the doubters have relied too heavily on secondary sources. For instance, they made a piñata out of the 2008 Wired piece by Chris Anderson framing big data as “the end of theory.” Others have criticized projects for claims that their creators never made. A couple of weeks ago, for example, Gary Marcus and Ernest Davis published a piece on big data in the Times. There they wrote about another of my group’s projects, Pantheon, which is an effort to collect, visualize and analyze data on historical cultural production. Marcus and Davis wrote that Pantheon “suggests a misleading degree of scientific precision.” As an author of the project, I have been unable to find where I made such a claim. Pantheon’s method section clearly states that “Pantheon will always be—by construction—an incomplete resource.” That same section contains a long list of limitations and caveats as well as the statement that “we interpret this data set narrowly, as the view of global cultural production that emerges from the multilingual expression of historical figures in Wikipedia as of May 2013.”
Bickering is easy, but it is not of much help. So I invite the critics of big data to lead by example. Stop writing op–eds and start developing tools that improve on the state of the art. They are much appreciated. What we need are projects that are worth imitating and that we can build on, not obvious advice such as “correlation does not imply causation.” After all, true progress is not something that is written, but made.”

Seven Principles for Big Data and Resilience Projects


PopTech & Rockefeller Bellagio Fellows: “The following is a draft “Code of Conduct” that seeks to provide guidance on best practices for resilience building projects that leverage Big Data and Advanced Computing. These seven core principles serve to guide data projects so that they are socially just, encourage local wealth- and skill-creation, require informed consent, and remain maintainable over long timeframes. This document is a work in progress, so we very much welcome feedback. Our aim is not to enforce these principles on others but rather to hold ourselves accountable and in the process encourage others to do the same. Initial versions of this draft were written during the PopTech & Rockefeller Foundation workshop in Bellagio in August 2013.
Open Source Data Tools – Wherever possible, data analytics and manipulation tools should be open source, architecture independent and broadly prevalent (R, python, etc.). Open source, hackable tools are generative, and building generative capacity is an important element of resilience….
Transparent Data Infrastructure – Infrastructure for data collection and storage should operate based on transparent standards to maximize the number of users that can interact with it. Data infrastructure should strive for built-in documentation that is extensive and easy to access. Data is only as useful as the data scientist’s understanding of how it was collected is accurate…
Develop and Maintain Local Skills – Make “Data Literacy” more widespread. Leverage local data labor and build on existing skills. The key and most constrained ingredient in effective data solutions remains human skill and knowledge, which needs to be retained locally. In doing so, consider cultural issues and language. Catalyze the next generation of data scientists and generate the new skills required in the cities where the data is being collected…
Local Data Ownership – Use Creative Commons and licenses that state that data is not to be used for commercial purposes. The community directly owns the data it generates, along with the learning algorithms (machine learning classifiers) and derivatives. Strong data protection protocols need to be in place to protect identities and personally identifying information…
Ethical Data Sharing – Adopt existing data sharing protocols like the ICRC’s (2013). Permission for sharing is essential. How the data will be used should be clearly articulated. An opt-in approach should be the preference wherever possible, and the ability for individuals to remove themselves from a data set after it has been collected must always be an option. Projects should always explicitly state which third parties will get access to data, if any, so that it is clear who will be able to access and use the data…
Right Not To Be Sensed – Local communities have a right not to be sensed. Large scale city sensing projects must have a clear framework for how people are able to be involved or choose not to participate. All too often, sensing projects are established without any ethical framework or any commitment to informed consent. It is essential that the collection of any sensitive data, from social and mobile data to video and photographic records of houses, streets and individuals, is done with full public knowledge, community discussion, and the ability to opt out…
Learning from Mistakes – Big Data and Resilience projects need to be open to facing, reporting, and discussing failures. Big Data technology is still very much in a learning phase. Failure, and the learning and insights resulting from it, should be accepted and appreciated. Without admitting what does not work, we are not learning effectively as a community. Quality control and assessment for data-driven solutions is notably harder than comparable efforts in other technology fields. The uncertainty about the quality of a solution is created by the uncertainty inherent in the data…”

If big data is an atomic bomb, disarmament begins in Silicon Valley


at GigaOM: “Big data is like atomic energy, according to scientist Albert-László Barabási in a Monday column on Politico. It’s very beneficial when used ethically, and downright destructive when turned into a weapon. He argues scientists can help resolve the damage done by government spying by embracing the principles of nuclear nonproliferation that helped bring an end to Cold War fears and distrust.
Barabási’s analogy is rather poetic:

“Powered by the right type of Big Data, data mining is a weapon. It can be just as harmful, with long-term toxicity, as an atomic bomb. It poisons trust, straining everything from human relations to political alliances and free trade. It may target combatants, but it cannot succeed without sifting through billions of data points scraped from innocent civilians. And when it is a weapon, it should be treated like a weapon.”

I think he’s right, but I think the fight to disarm the big data bomb begins in places like Silicon Valley and Madison Avenue. And it’s not just scientists; all citizens should have a role…
I write about big data and data mining for a living, and I think the underlying technologies and techniques are incredibly valuable, even if the applications aren’t always ideal. On the one hand, advances in machine learning from companies such as Google and Microsoft are fantastic. On the other hand, Facebook’s newly expanded Graph Search makes Europe’s proposed right-to-be-forgotten laws seem a lot more sensible.
But it’s all within the bounds of our user agreements and beauty is in the eye of the beholder.
Perhaps the reason we don’t vote with our feet by moving to web platforms that embrace privacy, even though we suspect it’s being violated, is that we really don’t know what privacy means. Instead of regulating what companies can and can’t do, perhaps lawmakers can mandate a degree of transparency that actually lets users understand how data is being used, not just what data is being collected. Great, some company knows my age, race, ZIP code and web history: What I really need to know is how it’s using that information to target, discriminate against or otherwise serve me.
An intelligent national discussion about the role of the NSA is probably in order. For all anyone knows, it could even turn out we’re willing to put up with more snooping than the government might expect. But until we get a handle on privacy from the companies we choose to do business with, I don’t think most Americans have the stomach for such a difficult fight.”

Undefined By Data: A Survey of Big Data Definitions


Paper by Jonathan Stuart Ward and Adam Barker: “The term big data has become ubiquitous. Owing to a shared origin between academia, industry and the media, there is no single unified definition, and various stakeholders provide diverse and often contradictory definitions. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions which have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term…
Despite the range and differences existing within each of the aforementioned definitions there are some points of similarity. Notably all definitions make at least one of the following assertions:
Size: the volume of the datasets is a critical factor.
Complexity: the structure, behaviour and permutations of the datasets are a critical factor.
Technologies: the tools and techniques which are used to process a sizable or complex dataset are a critical factor.
The definitions surveyed here all encompass at least one of these factors; most encompass two. An extrapolation of these factors would therefore postulate the following: Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”
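Of the techniques named in that working definition, MapReduce is the easiest to show in miniature. The toy, single-process word count below only illustrates the map and reduce phases; real frameworks such as Hadoop distribute this work across many machines, which is the whole point at terabyte scale.

```python
# Toy, single-process illustration of the MapReduce pattern; real frameworks
# (e.g. Hadoop) run the map, shuffle and reduce phases across a cluster.
from itertools import groupby
from operator import itemgetter

def map_phase(doc: str):
    for word in doc.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))          # stand-in for shuffle/sort
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["big data big tools", "data about data"]
mapped = [kv for doc in docs for kv in map_phase(doc)]
print(dict(reduce_phase(mapped)))  # {'about': 1, 'big': 2, 'data': 3, 'tools': 1}
```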

Three ways to think of the future…


Geoff Mulgan’s blog: “Here I suggest three complementary ways of thinking about the future which provide partial protection against the pitfalls.
The shape of the future
First, create your own composite future by engaging with the trends. There are many methods available for mapping the future – from Foresight to scenarios to the Delphi method.
Behind all of them are implicit views about the shapes of change. Indeed, any quantitative exploration of the future uses a common language of patterns, which summarises the fact that some things will go up, some go down, some change suddenly and some not at all.
All of us have implicit or explicit assumptions about these. But it’s rare to interrogate them systematically and test whether our assumptions about what fits in which category are right.
Let’s start with the J-shaped curves. Many of the long-term trends around physical phenomena look J-curved: rising carbon emissions, water usage and energy consumption have been exponential in shape over the centuries. As we know, physical constraints mean that these simply can’t go on – the J curves have to become S-shaped sooner or later, or else crash. That is the ecological challenge of the 21st century.
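One conventional way to picture a J curve bending into an S curve is the logistic function; this is a generic sketch of the shape, not a model of any particular trend:

```latex
% Generic logistic (S-shaped) curve with growth rate r and ceiling K:
% early on it looks exponential (J-shaped), later it saturates at K.
\[
  x(t) = \frac{K}{1 + e^{-r(t - t_0)}},
  \qquad
  x(t) \approx K e^{r(t - t_0)} \ \text{for } t \ll t_0,
  \qquad
  x(t) \to K \ \text{as } t \to \infty .
\]
```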
New revolutions
But there are other J curves, particularly the ones associated with digital technology. Moore’s Law and Metcalfe’s Law describe the dramatically expanding processing power of chips, and the growing connectedness of the world. Some hope that the sheer pace of technological progress will somehow solve the ecological challenges. That hope has more to do with culture than evidence. But these J curves are much faster than the physical ones – any factor that doubles every 18 months achieves stupendous rates of change over decades.
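To make “stupendous rates of change” concrete, a quick back-of-the-envelope calculation: anything that doubles every 18 months multiplies roughly a thousandfold in 15 years and roughly a millionfold in 30.

```python
# Back-of-the-envelope growth for a quantity that doubles every 18 months.
def growth_factor(years: float, doubling_time_years: float = 1.5) -> float:
    return 2 ** (years / doubling_time_years)

print(growth_factor(15))   # 10 doublings -> 1024x
print(growth_factor(30))   # 20 doublings -> ~1,048,576x
```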
That’s why we can be pretty confident that digital technologies will continue to throw up new revolutions – whether around the Internet of Things, the quantified self, machine learning, robots, mass surveillance or new kinds of social movement. But what form these will take is much harder to predict, and most digital prediction has been unreliable – we have YouTube but not the interactive TV many predicted (when did you last vote on how a drama should end?); relatively simple SMS and Twitter spread much more than ISDN or fibre to the home. And plausible ideas like the long tail theory turned out to be largely wrong.
If the J curves are dramatic but unusual, much more of the world is shaped by straight-line trends – like ageing, or the rising cost of disease that some predict will take healthcare costs up towards 40 or 50% of GDP by late in the century, or incremental advances in fuel efficiency, or the likely relative growth of the Chinese economy.
Also important are the flat straight lines – the things that probably won’t change in the next decade or two: the continued existence of nation states not unlike those of the 19th century? Air travel making use of fifty-year-old technologies?
Great imponderables
If the Js are the most challenging trends, the most interesting ones are the ‘U’s – the examples of trends bending: like crime, which went up for a century and then started going down, or world population, which has been going up but could start going down in the later part of this century, or divorce rates, which seem to have plateaued, or Chinese labour supply, which is forecast to turn down in the 2020s.
No one knows if the apparently remorseless upward trends of obesity and depression will turn downwards. No one knows if the next generation in the West will be poorer than their parents. And no one knows if democratic politics will reinvent itself and restore trust. In every case, much depends on what we do. None of these trends is a fact of nature or an act of God.
That’s one reason why it’s good to immerse yourself in these trends and interrogate what shape they really are. Out of that interrogation we can build a rough mental model and generate our own hypotheses – ones not based on the latest fashion or bestseller but hopefully on a sense of what the data shows and in particular what’s happening to the deltas – the current rates of change of different phenomena.”

Frontiers in Massive Data Analysis


New report from the National Academy of Sciences: “Data mining of massive data sets is transforming the way we think about crisis response, marketing, entertainment, cybersecurity and national intelligence. Collections of documents, images, videos, and networks are being thought of not merely as bit strings to be stored, indexed, and retrieved, but as potential sources of discovery and knowledge, requiring sophisticated analysis techniques that go far beyond classical indexing and keyword counting, aiming to find relational and semantic interpretations of the phenomena underlying the data.
Frontiers in Massive Data Analysis examines the frontier of analyzing massive amounts of data, whether in a static database or streaming through a system. Data at that scale–terabytes and petabytes–is increasingly common in science (e.g., particle physics, remote sensing, genomics), Internet commerce, business analytics, national security, communications, and elsewhere. The tools that work to infer knowledge from data at smaller scales do not necessarily work, or work well, at such massive scale. New tools, skills, and approaches are necessary, and this report identifies many of them, plus promising research directions to explore. Frontiers in Massive Data Analysis discusses pitfalls in trying to infer knowledge from massive data, and it characterizes seven major classes of computation that are common in the analysis of massive data. Overall, this report illustrates the cross-disciplinary knowledge–from computer science, statistics, machine learning, and application disciplines–that must be brought to bear to make useful inferences from massive data.”

New! Humanitarian Computing Library


Patrick Meier at iRevolution: “The field of “Humanitarian Computing” applies Human Computing and Machine Computing to address major information-based challenges in the humanitarian space. Human Computing refers to crowdsourcing and microtasking, which is also referred to as crowd computing. In contrast, Machine Computing draws on natural language processing and machine learning, amongst other disciplines. The Next Generation Humanitarian Technologies we are prototyping at QCRI are powered by Humanitarian Computing research and development (R&D).
My QCRI colleagues and I just launched the first ever Humanitarian Computing Library, which is publicly available here. The purpose of this library, or wiki, is to consolidate existing and future research that relates to Humanitarian Computing in order to support the development of next generation humanitarian tech. The repository currently holds over 500 publications that span topics such as Crisis Management, Trust and Security, Software and Tools, Geographical Analysis and Crowdsourcing. These publications are largely drawn from (but not limited to) peer-reviewed papers submitted at leading conferences around the world. We invite you to add your own research on humanitarian computing to this growing collection of resources.”

Big data, crowdsourcing and machine learning tackle Parkinson’s


Successful Workingplace: “Parkinson’s is a very tough disease to fight. People suffering from the disease often have significant tremors that keep them from being able to create accurate records of their daily challenges. Without this information, doctors are unable to fine tune drug dosages and other treatment regimens that can significantly improve the lives of sufferers.
It was a perfect catch-22 situation until recently, when the Michael J. Fox Foundation announced that LIONsolver, a company specializing in machine learning software, was able to differentiate Parkinson’s patients from healthy individuals and to also show the trend in symptoms of the disease over time.
To set up the competition, the Foundation worked with Kaggle, an organization that specializes in crowdsourced big data analysis competitions. Crowdsourcing gets to the heart of very difficult Big Data problems by allowing people the world over, from a myriad of backgrounds and with diverse experiences, to devote time to personally chosen challenges where they can bring the most value. It’s a genius idea for bringing some of the scarcest resources together with the most intractable problems.”
 

Data Science for Social Good


Data Science for Social Good: “By analyzing data from police reports to website clicks to sensor signals, governments are starting to spot problems in real-time and design programs to maximize impact. More nonprofits are measuring whether or not they’re helping people, and experimenting to find interventions that work.
None of this is inevitable, however.
We’re just realizing the potential of using data for social impact and face several hurdles to its widespread adoption:

  • Most governments and nonprofits simply don’t know what’s possible yet. They have data – but often not enough and maybe not the right kind.
  • There are too few data scientists out there – and too many spending their days optimizing ads instead of bettering lives.

To make an impact, we need to show social good organizations the power of data and analytics. We need to work on analytics projects that have high social impact. And we need to expose data scientists to the problems that really matter.

The fellowship

That’s exactly why we’re doing the Eric and Wendy Schmidt Data Science for Social Good summer fellowship at the University of Chicago.
We want to bring three dozen aspiring data scientists to Chicago, and have them work on data science projects with social impact.
Working closely with governments and nonprofits, fellows will take on real-world problems in education, health, energy, transportation, and more.
Over the next three months, they’ll apply their coding, machine learning, and quantitative skills, collaborate in a fast-paced atmosphere, and learn from mentors in industry, academia, and the Obama campaign.
The program is led by a strong interdisciplinary team from the Computation Institute and the Harris School of Public Policy at the University of Chicago.”