The Stasi, casinos and the Big Data rush


Book Review by Hannah Kuchler of “What Stays in Vegas” (by Adam Tanner) in the Financial Times: “Books with sexy titles and decidedly unsexy topics – like, say, data – have a tendency to disappoint. But What Stays in Vegas is an engrossing, story-packed takedown of the data industry.

It begins, far from America’s gambling capital, in communist East Germany. The author, Adam Tanner, now a fellow at Harvard’s Institute for Quantitative Social Science, was in the late 1980s a travel writer taking notes on Dresden. What he did not realise was that the Stasi was busy taking notes on him – 50 pages in all – which he found when the files were opened after reunification. The secret police knew where he had stopped to consult a map, whom he had questioned and when he had looked in on a hotel.
Today, Tanner explains: “Thanks to meticulous data gathering from both public documents and commercial records, companies . . . know far more about typical consumers than the feared East German secret police recorded about me.”
Shining a light on how businesses outside the tech sector have become data addicts, Tanner focuses on Las Vegas casinos, which spotted the value in data decades ago. He was given access to Caesars Entertainment, one of the world’s largest casino operators. When chief executive Gary Loveman joined in the late 1990s, the former Harvard Business School professor bet the company’s future on harvesting personal data from its loyalty scheme. Rather than wooing the “whales” who spent the most, the company would use the data to decide which freebies were worth giving away to lure in mid-spenders who came back often – a strategy credited with helping the business grow.
The real revelations come when Tanner examines the data brokers’ “Cheez Whiz”. Like the maker of a popular processed dairy spread, he argues, data brokers blend ingredients from a range of sources, such as public records, marketing lists and commercial records, to create a detailed picture of your identity – and you will never quite be able to pin down the origin of any component…
The Big Data rush has gone into overdrive since the global economic crisis as marketers from different industries have sought new methods to grab the limited consumer spending available. Tanner argues that while users have in theory given permission for much of this information to be made public in bits and pieces, its increasingly industrial-scale aggregation often feels like an invasion of privacy.
Privacy policies are so long and obtuse (one study Tanner quotes found that it would take a person more than a month, working full-time, to read all the privacy statements they come across in a year) that people are unwittingly littering their data all over the internet. Anyway, marketers can intuit what we are like from the people we are connected to online. And as the data brokers’ lists are usually private, there is no way to check that the compilers have got their facts right…”

Citizen Science: The Law and Ethics of Public Access to Medical Big Data


New Paper by Sharona Hoffman: “Patient-related medical information is becoming increasingly available on the Internet, spurred by government open data policies and private sector data sharing initiatives. Websites such as HealthData.gov, GenBank, and PatientsLikeMe allow members of the public to access a wealth of health information. As the medical information terrain quickly changes, the legal system must not lag behind. This Article provides a base on which to build a coherent data policy. It canvasses emergent data troves and wrestles with their legal and ethical ramifications.
Publicly accessible medical data have the potential to yield numerous benefits, including scientific discoveries, cost savings, the development of patient support tools, healthcare quality improvement, greater government transparency, public education, and positive changes in healthcare policy. At the same time, the availability of electronic personal health information that can be mined by any Internet user raises concerns related to privacy, discrimination, erroneous research findings, and litigation. This Article analyzes the benefits and risks of health data sharing and proposes balanced legislative, regulatory, and policy modifications to guide data disclosure and use.”

Agency Liability Stemming from Citizen-Generated Data


Paper by Bailey Smith for The Wilson Center’s Science and Technology Innovation Program: “New ways to gather data are on the rise. One of these is citizen science. According to a new paper by Bailey Smith, JD, federal agencies can feel confident about using citizen science for a few reasons. First, the legal system provides significant protection from liability through the Federal Tort Claims Act (FTCA) and the Administrative Procedure Act (APA). Second, training and technological innovation have made it easier for the non-scientist to collect high-quality data.”

What Is Big Data?


datascience@berkeley Blog: ““Big Data.” It seems like the phrase is everywhere. The term was added to the Oxford English Dictionary in 2013, appeared in Merriam-Webster’s Collegiate Dictionary by 2014, and Gartner’s just-released 2014 Hype Cycle shows “Big Data” passing the “Peak of Inflated Expectations” and on its way down into the “Trough of Disillusionment.” Big Data is all the rage. But what does it actually mean?
A commonly repeated definition cites the three Vs: volume, velocity, and variety. But others argue that it’s not the size of data that counts, but the tools being used, or the insights that can be drawn from a dataset.
To settle the question once and for all, we asked 40+ thought leaders in publishing, fashion, food, automobiles, medicine, marketing and every industry in between how exactly they would define the phrase “Big Data.” Their answers might surprise you! Take a look below to find out what big data is:

  1. John Akred, Founder and CTO, Silicon Valley Data Science
  2. Philip Ashlock, Chief Architect of Data.gov
  3. Jon Bruner, Editor-at-Large, O’Reilly Media
  4. Reid Bryant, Data Scientist, Brooks Bell
  5. Mike Cavaretta, Data Scientist and Manager, Ford Motor Company
  6. Drew Conway, Head of Data, Project Florida
  7. Rohan Deuskar, CEO and Co-Founder, Stylitics
  8. Amy Escobar, Data Scientist, 2U
  9. Josh Ferguson, Chief Technology Officer, Mode Analytics
  10. John Foreman, Chief Data Scientist, MailChimp

FULL LIST at datascience@berkeley Blog”

Data Mining Reveals How Social Coding Succeeds (And Fails)


Emerging Technology from the arXiv: “Collaborative software development can be hugely successful or fail spectacularly. An analysis of the metadata associated with these projects is teasing apart the difference….
The process of developing software has undergone huge transformation in the last decade or so. One of the key changes has been the evolution of social coding websites, such as GitHub and BitBucket.
These allow anyone to start a collaborative software project that other developers can contribute to on a voluntary basis. Millions of people have used these sites to build software, sometimes with extraordinary success.
Of course, some projects are more successful than others. And that raises an interesting question: what are the differences between successful and unsuccessful projects on these sites?
Today, we get an answer from Yuya Yoshikawa at the Nara Institute of Science and Technology in Japan and a couple of pals at NTT Laboratories, also in Japan. These guys have analysed the characteristics of over 300,000 collaborative software projects on GitHub to tease apart the factors that contribute to success. Their results provide the first insights into social coding success from this kind of data mining.
A social coding project begins when a group of developers outline a project and begin work on it. These are the “internal developers” and have the power to update the software in a process known as a “commit”. The number of commits is a measure of the activity on the project.
External developers can follow the progress of the project by “starring” it, a form of bookmarking on GitHub. The number of stars is a measure of the project’s popularity. These external developers can also request changes, such as additional features and so on, in a process known as a pull request.
Yoshikawa and co begin by downloading the data associated with over 300,000 projects from the GitHub website. This includes the number of internal developers, the number of stars a project receives over time and the number of pull requests it gets.
The team then analyse the effectiveness of the project by calculating factors such as the number of commits per internal team member, the popularity of the project over time, the number of pull requests that are fulfilled and so on.
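To make these calculations concrete, here is a minimal Python sketch of that kind of per-project bookkeeping. The record fields and metric names (internal_developers, stars_by_month, pull_requests_merged, and so on) are illustrative assumptions, not the variables actually used in the paper; see the arXiv reference below for the authors’ own definitions.

```python
# Hypothetical sketch of per-project metrics of the kind described above.
# Field and function names are illustrative assumptions, not the paper's own.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Project:
    name: str
    internal_developers: int                 # members with commit rights
    commits: int                             # total commits (activity)
    stars_by_month: List[int] = field(default_factory=list)  # popularity over time
    pull_requests_opened: int = 0            # external change requests
    pull_requests_merged: int = 0            # requests actually fulfilled

def activity_per_member(p: Project) -> float:
    """Commits per internal team member."""
    return p.commits / max(p.internal_developers, 1)

def fulfillment_rate(p: Project) -> float:
    """Share of external pull requests that were merged."""
    return p.pull_requests_merged / max(p.pull_requests_opened, 1)

def popularity_trend(p: Project) -> float:
    """Crude popularity trend: average month-over-month change in stars."""
    if len(p.stars_by_month) < 2:
        return 0.0
    deltas = [b - a for a, b in zip(p.stars_by_month, p.stars_by_month[1:])]
    return sum(deltas) / len(deltas)

# Toy usage on a single made-up record:
demo = Project("demo/repo", internal_developers=4, commits=620,
               stars_by_month=[10, 25, 60, 90],
               pull_requests_opened=40, pull_requests_merged=28)
print(activity_per_member(demo), fulfillment_rate(demo), popularity_trend(demo))
```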
The results provide a fascinating insight into the nature of social coding. Yoshikawa and co say the number of internal developers on a project plays a significant role in its success. “Projects with larger numbers of internal members have higher activity, popularity and sociality,” they say….
Ref: arxiv.org/abs/1408.6012: Collaboration on Social Media: Analyzing Successful Projects on Social Coding”

Using Crowds for Evaluation Tasks: Validity by Numbers vs. Validity by Expertise


Paper by Christoph Hienerth and Frederik Riar: “Developing and commercializing novel ideas is central to innovation processes. As the outcome of such ideas cannot fully be foreseen, their evaluation is crucial. With the rise of the internet and ICT, more evaluations, and new kinds of evaluation, are done by crowds. This raises the question of whether individuals in crowds possess the necessary capabilities to evaluate and whether their outcomes are valid. As empirical insights are not yet available, this paper examines evaluation processes and general evaluation components, and discusses the underlying characteristics and mechanisms of these components that affect evaluation outcomes (i.e. evaluation validity). We further investigate differences between firm- and crowd-based evaluation using different cases of application, and develop a theoretical framework for evaluation validity, i.e. validity by numbers vs. validity by expertise. The identified factors that influence the validity of evaluations are: (1) the number of evaluation tasks, (2) complexity, (3) expertise, (4) costs, and (5) time to outcome. For each of these factors, hypotheses are developed based on theoretical arguments. We conclude with implications, proposing a model of evaluation validity.”
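As a rough, hypothetical illustration of the “validity by numbers vs. validity by expertise” contrast, the toy simulation below compares the average of many noisy crowd ratings with a single, less noisy expert rating. The true quality value, noise levels and crowd sizes are assumptions made purely for illustration and are not drawn from the paper.

```python
# Toy simulation (not from the paper): many noisy crowd raters vs. one
# low-noise expert. All parameters below are illustrative assumptions.
import random

random.seed(42)
TRUE_QUALITY = 7.0   # unobserved "true" quality of an idea on a 0-10 scale

def crowd_estimate(n_raters: int, noise_sd: float = 2.0) -> float:
    """Average of n independent, noisy crowd ratings."""
    ratings = [random.gauss(TRUE_QUALITY, noise_sd) for _ in range(n_raters)]
    return sum(ratings) / n_raters

def expert_estimate(noise_sd: float = 0.5) -> float:
    """A single, less noisy expert rating."""
    return random.gauss(TRUE_QUALITY, noise_sd)

for n in (1, 10, 100, 1000):
    print(f"crowd of {n:4d}: {crowd_estimate(n):.2f}")
print(f"single expert:  {expert_estimate():.2f}")
```

As the crowd grows, its average drifts toward the true value, which is the intuition behind “validity by numbers”; a single expert gets close with far fewer judgments, which is “validity by expertise”.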

A Few Useful Things to Know about Machine Learning


A new research paper by Pedro Domingos: “Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.”
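As a minimal sketch of what “generalizing from examples” means in practice, the snippet below (which assumes scikit-learn is installed and uses its bundled iris dataset) judges a learner by its accuracy on held-out data rather than by how well it fits its own training set. It illustrates the general idea only; it is not code from the article.

```python
# Minimal illustration of generalization: evaluate on data the model never saw.
# Assumes scikit-learn is available; the dataset and model choice are arbitrary.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # often near-perfect
print("held-out accuracy:", model.score(X_test, y_test))    # the number that matters
```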
 

The wisest choices depend on instinct and careful analysis


John Kay in the Financial Times: “Moneyball, the 2003 book on the science of picking baseball teams, was perhaps written by Michael Lewis to distract himself from his usual work of attacking the financial services industry. Even after downloading the rules of baseball, I still could not fully understand what was going on. But I caught the drift: sabermetrics, the statistical analysis of the records of players, proved a better guide than the accumulated wisdom of experienced coaches.

Another lesson, important for business strategy, was the brevity of the benefits gained by the Oakland A’s, Lewis’s sporting heroes. If the only source of competitive advantage is better quantitative analysis – whether in baseball or quant strategies in the financial sector – such an advantage can be rapidly and accurately imitated.

At the same time, another genre of books proclaims the virtues of instinctive decision-making. Malcolm Gladwell’s Blink (2005) begins with accounts of how experts could immediately identify the Getty kouros – a statue of a naked youth purported to be of ancient Greek provenance and purchased in 1985 for $9m – as a fake, even though it had supposedly been authenticated through extended scientific tests.

Gary Klein, a cognitive psychologist, has for many years monitored the capabilities of experienced practical decision makers – firefighters, nurses and military personnel – who make immediate judgments that are vindicated by the more elaborate assessments possible only with hindsight.
Of course, there is no real inconsistency between the two propositions. The experienced coaches disparaged by sabermetrics enthusiasts were right to believe they knew a lot about spotting baseball talent; they just did not know as much as they thought they did. The art experts and firefighters who made instantaneous, but accurate, judgments were not hearing voices in the air. But no expert can compete with chemical analysis and carbon dating in assessing the age of a work of art.
There are two ways of reconciling expertise with analysis. One takes the worst of both worlds, combining the overconfidence of experience with the naive ignorance of the quant. The resulting bogus rationality seeks to objectivise expertise by fitting it into a template.
It is exemplified in the processes by which interviewers for jobs, and managers who make personnel assessments, are required to complete checklists explaining how they reached their conclusion using prescribed criteria….”

Values at Play in Digital Games


New book by Mary Flanagan and Helen Nissenbaum: “All games express and embody human values, providing a compelling arena in which we play out beliefs and ideas. “Big ideas” such as justice, equity, honesty, and cooperation—as well as other kinds of ideas, including violence, exploitation, and greed—may emerge in games whether designers intend them or not. In this book, Mary Flanagan and Helen Nissenbaum present Values at Play, a theoretical and practical framework for identifying socially recognized moral and political values in digital games. Values at Play can also serve as a guide to designers who seek to implement values in the conception and design of their games.
After developing a theoretical foundation for their proposal, Flanagan and Nissenbaum provide detailed examinations of selected games, demonstrating the many ways in which values are embedded in them. They introduce the Values at Play heuristic, a systematic approach for incorporating values into the game design process. Interspersed among the book’s chapters are texts by designers who have put Values at Play into practice by accepting values as a design constraint like any other, offering a real-world perspective on the design challenges involved.”

Riding the Second Wave of Civic Innovation


Jeremy Goldberg at Governing: “Innovation and entrepreneurship in local government increasingly require mobilizing talent from many sectors and skill sets. Fortunately, the opportunities for nurturing cross-pollination between the public and private sectors have never been greater, thanks in large part to the growing role of organizations such as Bayes Impact, Code for America, Data Science for Social Good and Fuse Corps.
Indeed, there’s reason to believe that we might be entering an even more exciting period of public-private collaboration. As one local-government leader recently put it to me when talking about the critical mass of pro-bono civic-innovation efforts taking place across the San Francisco Bay area, “We’re now riding the second wave of civic pro-bono and civic innovation.”
As an alumnus of Fuse Corps’ executive fellows program, I’m convinced that the opportunities initiated by it and similar organizations are integral to civic innovation. Fuse Corps brings civic entrepreneurs with experience across the public, private and nonprofit sectors to work closely with government employees to help them negotiate project design, facilitation and management hurdles. The organization’s leadership training emphasizes “smallifying” — building innovation capacity by breaking big challenges down into smaller tasks in a shorter timeframe — and making “little bets” — low-risk actions aimed at developing and testing an idea.
Since 2012, I have managed programs and cross-sector networks for the Silicon Valley Talent Partnership. I’ve witnessed a groundswell of civic entrepreneurs from across the region stepping up to participate in discussions and launch rapid-prototyping labs focused on civic innovation.
Cities across the nation are creating new roles and programs to engage these civic start-ups. They’re learning that what makes these projects, and specifically civic pro-bono programs, work best is a process of designing, building, operationalizing and bringing them to scale. If you’re setting out to create such a program, here’s a short list of best practices:
Assets: Explore existing internal resources and knowledge to understand the history, departmental relationships and overall functions of the relevant agencies or departments. Develop a compendium of current service/volunteer programs.
City policies/legal framework: Determine what the city charter, city attorney’s office or employee-relations rules and policies say about procurement, collective bargaining and public-private partnerships.
Leadership: The support of the city’s top leadership is especially important during the formative stages of a civic-innovation program, so it is essential to understand how the city’s form of government will affect the program. For example, in a “strong mayor” government, a definitive decision on a public-private collaboration may not face the same scrutiny it would under a “council/mayor” government.
Cross-departmental collaboration: This is essential. Without the support of city staff across departments, innovation projects are unlikely to take off. Convening a “tiger team” of individuals who are early adopters of such initiatives is an important step. Ultimately, city staffers best understand the needs and demands of their departments or agencies.
Partners from corporations and philanthropy: Leveraging existing partnerships will help to bring together an advisory group of cross-sector leaders and executives to participate in the early stages of program development.
Business and member associations: For the Silicon Valley Talent Partnership, the Silicon Valley Leadership Group has been instrumental in advocating for pro-bono volunteerism with the cities of Fremont, San Jose and Santa Clara….”