A New Source of Data for Public Health Surveillance: Facebook Likes


Paper by Steven Gittelman et al in the Journal of Medical Internet Research: “The development of the Internet and the explosion of social media have provided many new opportunities for health surveillance. The use of the Internet for personal health and participatory health research has exploded, largely due to the availability of online resources and health care information technology applications [1-8]. These online developments, plus a demand for more timely, widely available, and cost-effective data, have led to new ways epidemiological data are collected, such as digital disease surveillance and Internet surveys [8-25]. Over the past 2 decades, Internet technology has been used to identify disease outbreaks, track the spread of infectious disease, monitor self-care practices among those with chronic conditions, and to assess, respond, and evaluate natural and artificial disasters at a population level [6,8,11,12,14,15,17,22,26-28]. Use of these modern communication tools for public health surveillance has proven to be less costly and more timely than traditional population surveillance modes (eg, mail surveys, telephone surveys, and face-to-face household surveys).

The Internet has spawned several sources of big data, such as Facebook [29], Twitter [30], Instagram [31], Tumblr [32], Google [33], and Amazon [34]. These online communication channels and market places provide a wealth of passively collected data that may be mined for purposes of public health, such as sociodemographic characteristics, lifestyle behaviors, and social and cultural constructs. Moreover, researchers have demonstrated that these digital data sources can be used to predict otherwise unavailable information, such as sociodemographic characteristics among anonymous Internet users [35-38]. For example, Goel et al [36] found no difference by demographic characteristics in the usage of social media and email. However, the frequency with which individuals accessed the Web for news, health care, and research was a predictor of gender, race/ethnicity, and educational attainment, potentially providing useful targeting information based on ethnicity and income [36]. Integrating these big data sources into the practice of public health surveillance is vital to move the field of epidemiology into the 21st century as called for in the 2012 US “Big Data Research and Development Initiative” [19,39].

Understanding how big data can be used to predict lifestyle behavior and health-related data is a step toward the use of these electronic data sources for epidemiologic needs…(More)”

CrowdFlower Launches Open Data Project


Anthony Ha at Techcrunch: “Crowdsourcing company CrowdFlower allows businesses to tap into a distributed workforce of 5 million contributors for basic tasks like sentiment analysis. Today it’s releasing some of that data to the public through its new Data for Everyone initiative…. hope is to turn CrowdFlower into a central repository where open data can be found by researchers and entrepreneurs. (Factual was another startup trying to become a hub for open data, though in recent years, it’s become more focused on gathering location data to power mobile ads.)…

As for the data that’s available now, …There’s a lot of Twitter sentiment analysis covering things like attitudes towards brands and products, yogurt (?), and climate change. Among the more recent data sets, I was particularly taken with the gender breakdown of who’s been on the cover of Time magazine and, yes, the analysis of who thought the dress (you know the one) was gold and white versus blue and black…. (More)”

Pantheon: A Dataset for the Study of Global Cultural Production


Paper by Amy Zhao Yu, Shahar Ronen, Kevin Hu, Tiffany Lu, and César A. Hidalgo: “We present the Pantheon 1.0 dataset: a manually curated dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually curated demographic information (place of birth, date of birth, and gender), (ii) a cultural domain classification categorizing each biography at three levels of aggregation (i.e. Arts/Fine Arts/Painting), and (iii) measures of global visibility (fame) including the number of languages in which a biography is present in Wikipedia, the monthly page-views received by a biography (2008-2013), and a global visibility metric we name the Historical Popularity Index (HPI). We validate our measures of global visibility (HPI and Wikipedia language editions) using external measures of accomplishment in several cultural domains: Tennis, Swimming, Car Racing, and Chess. In all of these cases we find that measures of accomplishments and fame (HPI) correlate with an R² ≥ 0.5, suggesting that measures of global fame are appropriate proxies for measures of accomplishment….(More)”

How to Convince Men to Help the Poor


at Pacific Standard: “Please give. It’s a plea we are confronted with constantly, as a variety of charities implore us to help them help the less fortunate.

Whether we get out our checkbook or throw the request in the recycling bin is determined, in part, by the specific way the request is framed. But a new study suggests non-profits might want to create two separate appeals: One aimed at men, and another at women.

A research team led by Stanford University sociologist Robb Willer reports empathy-based appeals tend to be effective with women. But as a rule, men—who traditionally give somewhat less to anti-poverty charities—need to be convinced that their self-interest aligns with that of the campaign.

“Framing poverty as an issue that negatively affects all Americans increased men’s willingness to donate to the cause, eliminating the gender gap,” the researchers write in the journal Social Science Research….

“While this reframing resonated with men, who were otherwise less likely to spontaneously express concern about poverty,” Willer and his colleagues write, “it had the opposite effect for women, who might have felt less motivated to express concern about poverty when doing so seemed inconsistent with feeling empathy for the poor.”…(More)”

The Data Manifesto


Development Initiatives: “Staging a Data Revolution

Accessible, useable, timely and complete data is core to sustainable development and social progress. Access to information provides people with a base to make better choices and have more control over their lives. Too often attempts to deliver sustainable economic, social and environmental results are hindered by the failure to get the right information, in the right format, to the right people, at the right time. Worse still, the most acute data deficits often affect the people and countries facing the most acute problems.

The Data Revolution should be about data grounded in real life. Data and information that gets to the people who need it at national and sub-national levels to help with the decisions they face – hospital directors, school managers, city councillors, parliamentarians. Data that goes beyond averages – that is disaggregated to show the different impacts of decisions, policies and investments on gender, social groups and people living in different places and over time.

We need a Data Revolution that sets a new political agenda, that puts existing data to work, that improves the way data is gathered and ensures that information can be used. To deliver this vision, we need the following steps.


12 steps to a Data Revolution

1. Implement a national ‘Data Pledge’ to citizens that is supported by governments, private and non-governmental sectors
2. Address real world questions with joined up and disaggregated data
3. Empower and up-skill data users of the future through education
4. Examine existing frameworks and publish existing data
5. Build an information bank of data assets
6. Allocate funding available for better data according to national and sub-national priorities
7. Strengthen national statistical systems’ capacity to collect data
8. Implement a policy that data is ‘open by default’
9. Improve data quality by subjecting it to public scrutiny
10. Put information users’ needs first
11. Recognise technology cannot solve all barriers to information
12. Invest in infomediaries’ capacity to translate data into information that policymakers, civil society and the media can actually use…”

Can We Build a Safer Internet?


in the New York Times: “We often take it as a given that the Internet is a cruel place, a natural haven for those who seek to harass and threaten others. But to some people, social networks are not mere conduits for our worst impulses. They’re structures whose design can influence how we behave, for good as well as for ill.

Right now, having a social media account can mean facing down a torrent of harassment — including, for some, attacks that are misogynist, racist or both. “Just as you create a space for people to use something in innovative, creative ways, there are also people who will use it for other means,” Moya Bailey, a postdoctoral fellow at Northeastern University who writes about race, gender and media, told Op-Talk. She mentioned Anita Sarkeesian, the video game critic who has faced harassment for critiquing the portrayal of women in games.

“Because she is doing that work, she becomes a target of a lot of violence and hate,” said Ms. Bailey. The rise of online communication is “a gift and a curse always. It’s always both/and.”

And the way we behave online may depend on which site we’re using. Ms. Bailey cites Tumblr as an example. “I think there’s something about Tumblr that is really attractive to social-justice folks, and the kinds of conversations that people have on Tumblr are very different from what’s possible on Facebook,” she explained. “The platforms themselves help shape the kind of content that people post to those different sites.”

The design of those platforms can also determine who sees what we post. Kate Losse, a writer on technology and culture and a former product manager at Facebook, told Op-Talk that Facebook has widened the scope of some of our conversations.

“Pre-Facebook there would be all these different kinds of interactions you might have socially,” she said. “You might talk to one person, you might talk to three people, you might talk to a hundred people. But Facebook’s interesting because you’re always talking to a hundred people when you post, or more.”

“You have to look at something like Facebook as structuring social interactions,” she added. And interacting via what Ms. Losse called “large-scale announcements” can introduce problems. “The Internet is the classic case of tragedy of the commons,” she said. “If something that’s important to me gets viewed by someone across the world, who has no attachment to me, doesn’t care about me at all, doesn’t have any reason to know me or have empathy for me, it’s much easier for that person to do something hateful with the content than to be respectful of it.”

But if platforms can structure our interactions, can they steer us toward kindness rather than toward bile? Batya Friedman, a professor at the University of Washington’s Information School who studies the relationship between technology and human priorities, thinks it’s possible. “Any time people talk to each other,” she told Op-Talk, “we have all kinds of social norms that check how we say things to each other. We give each other social cues, we tell each other when somebody’s starting to go too far.”

The question for designers of online communities, she said, is “how do we either create virtual norms that are comparable, or how do we represent those things so that people are getting those cues, so they modulate their behavior?”…”

Citi Bike System Data


Citi Bike: “Where do Citi Bikers ride? When do they ride? How far do they go? Which stations are most popular? What days of the week are most rides taken on? We’ve heard all of these questions and more from you and now we are happy to provide the datasets to help you discover the answers to these questions and more. We invite developers, engineers, statisticians, artists, academics and other members of the interested public to use the data we provide for analysis, development, visualization and whatever else moves you.
This data is provided according to the NYCBS Data Use Policy.
Citi Bike Trip Histories
Below are links to downloadable files of Citi Bike trip data. The data includes:

  • Trip Duration (seconds)
  • Start Time and Date
  • Stop Time and Date
  • Start Station Name
  • End Station Name
  • Station ID
  • Station Lat/Long
  • Bike ID
  • User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
  • Gender
  • Year of Birth”
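
The questions Citi Bike poses above (which stations are most popular, which days see the most rides) map directly onto the listed fields. What follows is only a minimal sketch of how such a tally might look: the column names (`tripduration`, `starttime`, `start station name`), the timestamp format, and the three sample rows are all assumptions made up for illustration, not taken from the published files, so check the header of a real download before adapting it.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample data. Column names, timestamp format, and rows are
# assumptions for illustration; the real files may differ.
SAMPLE_CSV = """tripduration,starttime,start station name,end station name,usertype
634,2015-03-02 08:15:00,Station A,Station B,Subscriber
1200,2015-03-02 17:40:00,Station A,Station C,Customer
300,2015-03-07 11:05:00,Station B,Station A,Subscriber
"""

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def summarize_trips(csv_file):
    """Count rides per start station and per weekday, and average the duration."""
    station_counts = Counter()
    weekday_counts = Counter()
    total_seconds = 0
    trips = 0
    for row in csv.DictReader(csv_file):
        station_counts[row["start station name"]] += 1
        start = datetime.strptime(row["starttime"], "%Y-%m-%d %H:%M:%S")
        weekday_counts[WEEKDAYS[start.weekday()]] += 1
        total_seconds += int(row["tripduration"])
        trips += 1
    mean_seconds = total_seconds / trips if trips else 0.0
    return station_counts, weekday_counts, mean_seconds


stations, weekdays, mean_duration = summarize_trips(io.StringIO(SAMPLE_CSV))
print("Busiest start station:", stations.most_common(1))
print("Busiest weekday:", weekdays.most_common(1))
print("Mean trip length (minutes):", round(mean_duration / 60, 1))
```

To run this against a downloaded file instead of the sample, pass an opened file object, for example `summarize_trips(open(path, newline=""))`, where `path` points at a trip-history CSV.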

Big Data’s Dangerous New Era of Discrimination


Michael Schrage in HBR blog: “Congratulations. You bought into Big Data and it’s paying off Big Time. You slice, dice, parse and process every screen-stroke, clickstream, Like, tweet and touch point that matters to your enterprise. You now know exactly who your best — and worst — customers, clients, employees and partners are.  Knowledge is power.  But what kind of power does all that knowledge buy?
Big Data creates Big Dilemmas. Greater knowledge of customers creates new potential and power to discriminate. Big Data — and its associated analytics — dramatically increase both the dimensionality and degrees of freedom for detailed discrimination. So where, in your corporate culture and strategy, does value-added personalization and segmentation end and harmful discrimination begin?
Let’s say, for example, that your segmentation data tells you the following:
Your most profitable customers by far are single women between the ages of 34 and 55 closely followed by “happily married” women with at least one child. Divorced women are slightly more profitable than “never marrieds.” Gay males — single and in relationships — are also disproportionately profitable. The “sweet spot” is urban and 28 to 50. These segments collectively account for roughly two-thirds of your profitability.  (Unexpected factoid: Your most profitable customers are overwhelmingly Amazon Prime subscribers. What might that mean?)
Going more granular, as Big Data does, offers even sharper ethno-geographic insight into customer behavior and influence:

  • Single Asian, Hispanic, and African-American women with urban post codes are most likely to complain about product and service quality to the company. Asian and Hispanic complainers happy with resolution/refund tend to be in the top quintile of profitability. African-American women do not.
  • Suburban Caucasian mothers are most likely to use social media to share their complaints, followed closely by Asian and Hispanic mothers. But if resolved early, they’ll promote the firm’s responsiveness online.
  • Gay urban males receiving special discounts and promotions are the most effective at driving traffic to your sites.

My point here is that these data are explicit, compelling and undeniable. But how should sophisticated marketers and merchandisers use them?
Campaigns, promotions and loyalty programs targeting women and gay males seem obvious. But should Asian, Hispanic and white females enjoy preferential treatment over African-American women when resolving complaints? After all, they tend to be both more profitable and measurably more willing to effectively use social media. Does it make more marketing sense encouraging African-American female customers to become more social media savvy? Or are resources better invested in getting more from one’s best customers? Similarly, how much effort and ingenuity should go into making more gay male customers better social media evangelists? What kinds of offers and promotions could go viral on their networks?…
Of course, the difference between price discrimination and discrimination positively correlated with gender, ethnicity, geography, class, personality and/or technological fluency is vanishingly small. Indeed, the entire epistemological underpinning of Big Data for business is that it cost-effectively makes informed segmentation and personalization possible…..
But the main source of concern won’t be privacy, per se — it will be whether and how companies and organizations like your own use Big Data analytics to justify their segmentation/personalization/discrimination strategies. The more effective Big Data analytics are in profitably segmenting and serving customers, the more likely those algorithms will be audited by regulators or litigators.
Tomorrow’s Big Data challenge isn’t technical; it’s whether managements have algorithms and analytics that are both fairly transparent and transparently fair. Big Data champions and practitioners had better be discriminating about how discriminating they want to be.”

The Moneyball Effect: How smart data is transforming criminal justice, healthcare, music, and even government spending


TED: “When Anne Milgram became the Attorney General of New Jersey in 2007, she was stunned to find out just how little data was available on who was being arrested, who was being charged, who was serving time in jails and prisons, and who was being released. “It turns out that most big criminal justice agencies like my own didn’t track the things that matter,” she says in today’s talk, filmed at TED@BCG. “We didn’t share data, or use analytics, to make better decisions and reduce crime.”
Milgram’s idea for how to change this: “I wanted to moneyball criminal justice.”
Moneyball, of course, is the name of a 2011 movie starring Brad Pitt and the book it’s based on, written by Michael Lewis in 2003. The term refers to a practice adopted by the Oakland A’s general manager Billy Beane in 2002 — the organization began basing decisions not on star power or scout instinct, but on statistical analysis of measurable factors like on-base and slugging percentages. This worked exceptionally well. On a tiny budget, the Oakland A’s made it to the playoffs in 2002 and 2003, and — since then — nine other major league teams have hired sabermetric analysts to crunch these types of numbers.
Milgram is working hard to bring smart statistics to criminal justice. To hear the results she’s seen so far, watch this talk. And below, take a look at a few surprising sectors that are getting the moneyball treatment as well.

Moneyballing music. Last year, Forbes magazine profiled the firm Next Big Sound, a company using statistical analysis to predict how musicians will perform in the market. The idea is that — rather than relying on the instincts of A&R reps — past performance on Pandora, Spotify, Facebook, etc can be used to predict future potential. The article reads, “For example, the company has found that musicians who gain 20,000 to 50,000 Facebook fans in one month are four times more likely to eventually reach 1 million. With data like that, Next Big Sound promises to predict album sales within 20% accuracy for 85% of artists, giving labels a clearer idea of return on investment.”
Moneyballing human resources. In November, The Atlantic took a look at the practice of “people analytics” and how it’s affecting employers. (Billy Beane had something to do with this idea — in 2012, he gave a presentation at the TLNT Transform Conference called “The Moneyball Approach to Talent Management.”) The article describes how Bloomberg reportedly logs its employees’ keystrokes and the casino, Harrah’s, tracks employee smiles. It also describes where this trend could be going — for example, how a video game called Wasabi Waiter could be used by employers to judge potential employees’ ability to take action, solve problems and follow through on projects. The article looks at the ways these types of practices are disconcerting, but also how they could level an inherently unequal playing field. After all, the article points out that gender, race, age and even height biases have been demonstrated again and again in our current hiring landscape.
Moneyballing healthcare. Many have wondered: what about a moneyball approach to medicine? (See this call out via Common Health, this piece in Wharton Magazine or this op-ed on The Huffington Post from the President of the New York State Health Foundation.) In his TED Talk, “What doctors can learn from each other,” Stefan Larsson proposed an idea that feels like something of an answer to this question. In the talk, Larsson gives a taste of what can happen when doctors and hospitals measure their outcomes and share this data with each other: they are able to see which techniques are proving the most effective for patients and make adjustments. (Watch the talk for a simple way surgeons can make hip surgery more effective.) He imagines a continuous learning process for doctors — that could transform the healthcare industry to give better outcomes while also reducing cost.
Moneyballing government. This summer, John Bridgeland (the director of the White House Domestic Policy Council under President George W. Bush) and Peter Orszag (the director of the Office of Management and Budget in Barack Obama’s first term) teamed up to pen a provocative piece for The Atlantic called, “Can government play moneyball?” In it, the two write, “Based on our rough calculations, less than $1 out of every $100 of government spending is backed by even the most basic evidence that the money is being spent wisely.” The two explain how, for example, there are 339 federally-funded programs for at-risk youth, the grand majority of which haven’t been evaluated for effectiveness. And while many of these programs might show great results, some that have been evaluated show troubling results. (For example, Scared Straight has been shown to increase criminal behavior.) Yet, some of these ineffective programs continue because a powerful politician champions them. While Bridgeland and Orszag show why Washington is so averse to making data-based appropriation decisions, the two also see the ship beginning to turn around. They applaud the Obama administration for a 2014 budget with an “unprecedented focus on evidence and results.” The pair also gave a nod to the nonprofit Results for America, which advocates that for every $99 spent on a program, $1 be spent on evaluating it. The pair even suggest a “Moneyball Index” to encourage politicians not to support programs that don’t show results.
In any industry, figuring out what to measure, how to measure it and how to apply the information gleaned from those measurements is a challenge. Which of the applications of statistical analysis has you the most excited? And which has you the most terrified?”

The Power to Decide


Special Report by Antonio Regalado in MIT Technology Review: “Back in 1956, an engineer and a mathematician, William Fair and Earl Isaac, pooled $800 to start a company. Their idea: a score to handicap whether a borrower would repay a loan.
It was all done with pen and paper. Income, gender, and occupation produced numbers that amounted to a prediction about a person’s behavior. By the 1980s the three-digit scores were calculated on computers and instead took account of a person’s actual credit history. Today, Fair Isaac Corp., or FICO, generates about 10 billion credit scores annually, calculating 50 times a year for many Americans.
This machinery hums in the background of our financial lives, so it’s easy to forget that the choice of whether to lend used to be made by a bank manager who knew a man by his handshake. Fair and Isaac understood that all this could change, and that their company didn’t merely sell numbers. “We sell a radically different way of making decisions that flies in the face of tradition,” Fair once said.
This anecdote suggests a way of understanding the era of “big data”—terabytes of information from sensors or social networks, new computer architectures, and clever software. But even supercharged data needs a job to do, and that job is always about a decision.
In this business report, MIT Technology Review explores a big question: how are data and the analytical tools to manipulate it changing decision making today? On Nasdaq, trading bots exchange a billion shares a day. Online, advertisers bid on hundreds of thousands of keywords a minute, in deals greased by heuristic solutions and optimization models rather than two-martini lunches. The number of variables and the speed and volume of transactions are just too much for human decision makers.
When there’s a person in the loop, technology takes a softer approach (see “Software That Augments Human Thinking”). Think of recommendation engines on the Web that suggest products to buy or friends to catch up with. This works because Internet companies maintain statistical models of each of us, our likes and habits, and use them to decide what we see. In this report, we check in with LinkedIn, which maintains the world’s largest database of résumés—more than 200 million of them. One of its newest offerings is University Pages, which crunches résumé data to offer students predictions about where they’ll end up working depending on what college they go to (see “LinkedIn Offers College Choices by the Numbers”).
These smart systems, and their impact, are prosaic next to what’s planned. Take IBM. The company is pouring $1 billion into its Watson computer system, the one that answered questions correctly on the game show Jeopardy! IBM now imagines computers that can carry on intelligent phone calls with customers, or provide expert recommendations after digesting doctors’ notes. IBM wants to provide “cognitive services”—computers that think, or seem to (see “Facing Doubters, IBM Expands Plans for Watson”).
Andrew Jennings, chief analytics officer for FICO, says automating human decisions is only half the story. Credit scores had another major impact. They gave lenders a new way to measure the state of their portfolios—and to adjust them by balancing riskier loan recipients with safer ones. Now, as other industries get exposed to predictive data, their approach to business strategy is changing, too. In this report, we look at one technique that’s spreading on the Web, called A/B testing. It’s a simple tactic—put up two versions of a Web page and see which one performs better (see “Seeking Edge, Websites Turn to Experiments” and “Startups Embrace a Way to Fail Fast”).
Until recently, such optimization was practiced only by the largest Internet companies. Now, nearly any website can do it. Jennings calls this phenomenon “systematic experimentation” and says it will be a feature of the smartest companies. They will have teams constantly probing the world, trying to learn its shifting rules and deciding on strategies to adapt. “Winners and losers in analytic battles will not be determined simply by which organization has access to more data or which organization has more money,” Jennings has said.

Of course, there’s danger in letting the data decide too much. In this report, Duncan Watts, a Microsoft researcher specializing in social networks, outlines an approach to decision making that avoids the dangers of gut instinct as well as the pitfalls of slavishly obeying data. In short, Watts argues, businesses need to adopt the scientific method (see “Scientific Thinking in Business”).
To do that, they have been hiring a highly trained breed of business skeptics called data scientists. These are the people who create the databases, build the models, reveal the trends, and, increasingly, author the products. And their influence is growing in business. This could be why data science has been called “the sexiest job of the 21st century.” It’s not because mathematics or spreadsheets are particularly attractive. It’s because making decisions is powerful…”
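
The A/B testing tactic mentioned in the report (put up two versions of a Web page and see which one performs better) leaves open how "better" gets judged. One common way, shown here only as a sketch and not as anything the report itself describes, is a two-proportion z-test on conversion counts. The function name `ab_test` and the visitor numbers in the usage lines are hypothetical.

```python
import math


def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided two-proportion z-test comparing page A with page B.

    Returns (rate_a, rate_b, z, p_value). A small p_value suggests the
    difference in conversion rates is unlikely to be chance alone.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both pages convert equally.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if std_err == 0:
        # No variation at all (every visitor or no visitor converted).
        return rate_a, rate_b, 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical traffic: 1,000 visitors per version.
rate_a, rate_b, z, p_value = ab_test(100, 1000, 150, 1000)
print(f"A converts at {rate_a:.1%}, B at {rate_b:.1%}, z = {z:.2f}, p = {p_value:.4f}")
```

A test like this assumes visitors are assigned to the two versions at random and that the sample size was fixed in advance; checking results repeatedly and stopping at the first significant reading inflates the false-positive rate.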