Reality Mining: Using Big Data to Engineer a Better World


New book by Nathan Eagle and Kate Greene: “Big Data is made up of lots of little data: numbers entered into cell phones, addresses entered into GPS devices, visits to websites, online purchases, ATM transactions, and any other activity that leaves a digital trail. Although the abuse of Big Data—surveillance, spying, hacking—has made headlines, it shouldn’t overshadow the abundant positive applications of Big Data. In Reality Mining, Nathan Eagle and Kate Greene cut through the hype and the headlines to explore the positive potential of Big Data, showing the ways in which the analysis of Big Data (“Reality Mining”) can be used to improve human systems as varied as political polling and disease tracking, while considering user privacy.

Eagle, a recognized expert in the field, and Greene, an experienced technology journalist, describe Reality Mining at five different levels: the individual, the neighborhood and organization, the city, the nation, and the world. For each level, they first offer a nontechnical explanation of data collection methods and then describe applications and systems that have been or could be built. These include a mobile app that helps smokers quit smoking; a workplace “knowledge system”; the use of GPS, Wi-Fi, and mobile phone data to manage and predict traffic flows; and the analysis of social media to track the spread of disease. Eagle and Greene argue that Big Data, used respectfully and responsibly, can help people live better, healthier, and happier lives.”

Monitoring Arms Control Compliance With Web Intelligence


Chris Holden and Maynard Holliday at Commons Lab: “Traditional monitoring of arms control treaties, agreements, and commitments has required the use of National Technical Means (NTM)—large satellites, phased array radars, and other technological solutions. NTM was a good solution when the treaties focused on large items for observation, such as missile silos or nuclear test facilities. As the targets of interest have shrunk by orders of magnitude, the need for other, more ubiquitous, sensor capabilities has increased. The rise in web-based, or cloud-based, analytic capabilities will have a significant influence on the future of arms control monitoring and the role of citizen involvement.
Since 1999, the U.S. Department of State has had at its disposal the Key Verification Assets Fund (V Fund), which was established by Congress. The Fund helps preserve critical verification assets and promotes the development of new technologies that support the verification of and compliance with arms control, nonproliferation, and disarmament requirements.
Sponsored by the V Fund to advance web-based analytic capabilities, Sandia National Laboratories, in collaboration with Recorded Future (RF), synthesized open-source data streams from a wide variety of traditional and nontraditional web sources in multiple languages along with topical texts and articles on national security policy to determine the efficacy of monitoring chemical and biological arms control agreements and compliance. The team used novel technology involving linguistic algorithms to extract temporal signals from unstructured text and organize that unstructured text into a multidimensional structure for analysis. In doing so, the algorithm identifies the underlying associations between entities and events across documents and sources over time. Using this capability, the team analyzed several events that could serve as analogs to treaty noncompliance, technical breakout, or an intentional attack. These events included the H7N9 bird flu outbreak in China, the Shanghai pig die-off and the fungal meningitis outbreak in the United States last year.
For H7N9 we found that open-source social media were the first to report the outbreak and give ongoing updates. The Sandia RF system was able to roughly estimate lethality based on temporal hospitalization and fatality reporting. For the Shanghai pig die-off, the analysis tracked the rapid assessment by Chinese authorities that H7N9 was not the cause of the pig die-off, as had been originally speculated. Open-source reporting highlighted a reduced market for pork in China due to the very public dead pig display in Shanghai. Possible downstream health effects were predicted (e.g., contaminated water supply and other overall food ecosystem concerns). In addition, legitimate U.S. food security concerns were raised based on the Chinese purchase of the largest U.S. pork producer (Smithfield) because of a fear of potential import of tainted pork into the United States….”
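As a rough illustration of how lethality might be estimated from time-stamped hospitalization and fatality reports, the sketch below computes a running, naive fatality ratio. The record layout and all numbers are invented for illustration; the excerpt does not describe the method the Sandia RF system actually used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Report:
    """One open-source report (hypothetical record layout)."""
    day: date
    new_hospitalized: int
    new_deaths: int


def running_fatality_ratio(reports):
    """Return (day, cumulative deaths / cumulative hospitalizations) per report.

    Naive estimate: it ignores reporting lag and cases that are still unresolved.
    """
    cases = deaths = 0
    out = []
    for r in sorted(reports, key=lambda r: r.day):
        cases += r.new_hospitalized
        deaths += r.new_deaths
        if cases > 0:
            out.append((r.day, deaths / cases))
    return out


# Invented numbers, for illustration only.
sample = [
    Report(date(2013, 4, 1), 10, 2),
    Report(date(2013, 4, 8), 30, 4),
    Report(date(2013, 4, 15), 40, 10),
]
for day, ratio in running_fatality_ratio(sample):
    print(day.isoformat(), f"{ratio:.2f}")
```

A ratio computed this way shifts as new reports arrive, which is one reason such an estimate can only be rough.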

Thousands Can Fact-Check The News With Grasswire


in TechCrunch: “We all know you can’t believe everything you read on the Internet. But with Grasswire, you can at least “refute” it.
Austen Allred’s new venture allows news junkies to confirm and refute posts about breaking news. The “real-time newsroom controlled by everyone” divides posts into popular news topics, such as the Malaysia Airlines Crash in Ukraine and the Israeli-Palestinian conflict.
Once you select a topic, you can then upvote posts, as on Reddit, to make them appear at the top of the page. If you see something that is incorrect, you can refute it by posting a source URL to information that disproves it. You can do the same to confirm a report. When you share the post on social media, all of these links are shared with it….
“Obviously there are some journalists who think turning journalism over to people who aren’t professional journalists is dangerous, but we disagree with those people,” Allred said. “I feel like the ability to refute something is not that incredibly difficult. The real power of journalism is when we have massive amounts of people trying to scrutinize whether or not that is accurate enough.”…
But despite these flaws, other attempts to fact check breaking news online have faltered. We still see false reports tweeted by verified accounts all the time, for instance. Something like Grasswire could serve the same role as a correction or a revision posted on an article. By linking to source material that continues to appear every time the post is shared, it is much like an article with an editor’s note that explains why something has been altered or changed.
For journalists trying to balance old-school ethics with new media tools, this option could be crucial. If executed correctly, it could lead to far fewer false reports because thousands of people could be fact checking information, not just a handful in a newsroom….”
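The upvote, confirm, and refute mechanics described in the article can be sketched as a small data model. This is a hypothetical illustration, not Grasswire's actual schema or code; the class, method names, and example URL are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """A breaking-news post (hypothetical model, not Grasswire's schema)."""
    text: str
    upvotes: int = 0
    confirmations: list = field(default_factory=list)  # source URLs
    refutations: list = field(default_factory=list)    # source URLs

    def confirm(self, source_url):
        self.confirmations.append(source_url)

    def refute(self, source_url):
        self.refutations.append(source_url)

    def share_text(self):
        """Text for sharing: the post plus every attached source link."""
        lines = [self.text]
        lines += [f"Confirmed by: {u}" for u in self.confirmations]
        lines += [f"Refuted by: {u}" for u in self.refutations]
        return "\n".join(lines)


def rank(posts):
    """Order posts so the most upvoted appear first."""
    return sorted(posts, key=lambda p: p.upvotes, reverse=True)


a = Post("Report A", upvotes=3)
b = Post("Report B", upvotes=7)
b.refute("https://example.org/source")  # placeholder URL
print([p.text for p in rank([a, b])])
print(b.share_text())
```

Keeping the source links on the post itself is what lets them travel with it every time it is shared, much like the editor's note the article compares it to.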

Time for 21st century democracy


Martin Smith and Dave Richards at Policy Network (UK): “…The way that the world has changed is leading to a clash between two contrasting cultures. Traditional, top down, elite models of democracy and accountability are no longer sustainable in the age of a more digitally open society. As the recent Hansard Society Report into PMQs clearly reveals, the people see politicians as out of touch and remote. What we need are two major changes. One is the recognition by institutions that they are now making decisions in an open world: even if they make decisions in private (which in certain cases they clearly have to), they should recognise that at some point those decisions may need to be justified. Therefore every decision should be made on the basis that if it were open it would be deemed as legitimate.
The second is the development of bottom up accountability – we have to develop mechanisms where accountability is not mediated through institutions (as is the case with parliamentary accountability). In its conclusion, the Hansard Society report proposes that new technology could be used to allow citizens rather than MPs to ask questions at Prime Minister’s question time. This is one of many forms of citizen led accountability that could reinforce the openness of decision making.
New technology creates the opportunity to move away from 19th century democracy. Technology can be used to change the way decisions are made, how citizens are involved and how institutions are held to account. This is already happening with social groups using social media, on-line petitions and mobile technologies as part of their campaigns. However, this process needs to be formalised (such as in the Hansard Society’s suggestion for citizens’ questions). There is also a need for more user friendly ways of analysing big data around government performance. Big data creates many new ways in which decisions can be opened up and critically reviewed. We also need much more explicit policies on leaks and whistleblowing so that those who do reveal the inner workings of governments are not criminalised….
Fundamentally, the real change is about treating citizens as grown-ups, recognising that they can be privy to the details of the policy-making process. There is a great irony in the playground behaviour of Prime Minister’s question time and the patronising attitudes of political elites towards voters (which tend to infantilise citizens as lacking the expertise to fully participate). The most important change is that institutions start to act as if they are operating in an open society where they are directly accountable and hence are in a position to start regaining the trust of the people. The closed world of institutions is no longer viable in a digital age.”

Using the Wisdom of the Crowd to Democratize Markets


David Weidner at the Wall Street Journal: “For years investors have largely depended on three sources to distill the relentless onslaught of information about public companies: the companies themselves, Wall Street analysts and the media.
Each of these has its strengths, but they may have even bigger weaknesses. Companies spin. Analysts have conflicts of interest. The financial media is under deadline pressure and ill-equipped to act as a catch-all watchdog.
But in recent years, the tech whizzes out of Silicon Valley have been trying to democratize the markets. In 2010 I wrote about an effort called Moxy Vote, an online system for shareholders to cast ballots in proxy contests. Moxy Vote had some initial success but ran into regulatory trouble and failed to gain traction.
Some newer efforts are more promising, mostly because they depend on users, or some form of crowdsourcing, for their content. Crowdsourcing is when a need is turned over to a large group, usually an online community, rather than traditional paid employees or outside providers….
Estimize.com is one. It was founded in 2011 by former trader Leigh Drogan, but recently has undergone some significant expansion, adding a crowd-sourced prediction for mergers and acquisitions. Estimize also boasts a track record. It claims it beats Wall Street analysts 65.9% of the time during earnings season. Like SeekingAlpha, Estimize does, however, lean heavily on pros or semi-pros. Nearly 5,000 of its contributors are analysts.
Closer to the social networking world there’s scutify.com, a website and mobile app that aggregates what’s being said about individual stocks on social networks, blogs and other sources. It highlights trending stocks and links to chatter on social networks. (The site is owned by Cody Willard, a contributor to MarketWatch, which is owned by Dow Jones, the publisher of The Wall Street Journal.)
Perhaps the most intriguing startup is TwoMargins.com. The site allows investors, analysts, average Joes — anyone, really — to annotate company releases. In that way, Two Margins potentially can tap the power of the crowd to provide a fourth source for the marketplace.
Two Margins, a startup funded by Bloomberg L.P.’s venture capital fund, borrows annotation technology that’s already in use on other sites such as genius.com and scrible.com. Participants can sign in with their Twitter or Facebook accounts and post to those networks from the site. (Dow Jones competes with Bloomberg in the provision of news and financial data.)
At this moment, Two Margins isn’t a game changer. Founders Gniewko Lubecki and Akash Kapur said the site is in a pre-beta phase, which is to say it’s sort of up and running and being constantly tweaked.
Right now there’s nothing close to the critical mass needed for an exhaustive look at company filings. There’s just a handful of users and fewer than a dozen company releases and filings available.
Still, in the first moments after Twitter Inc.’s earnings were released Tuesday, Two Margins’ most loyal users began to scour the release. “Looks like Twitter is getting significantly better at monetizing users,” wrote a user named “George” who had annotated the revenue line from the company’s financial statement. Another user, “Scott Paster,” noted Twitter’s stock option grants to executives were nearly as high as its reported loss.
“The sum is greater than its parts when you pull together a community of users,” Mr. Kapur said. “Widening access to these documents is one goal. The other goal is broadening the pool of knowledge that’s brought to bear on these documents.”
In the end, this new wave of tech-driven services may never capture enough users to make it into the investing mainstream. They all struggle with uninformed and inaccurate content, especially if they gain critical mass. Vetting is a problem.
For that reason, it’s hard to predict whether these new entries will flourish or even survive. That’s not a bad thing. The march of technology will either improve on the idea or come up with a new one.
Ultimately, technology is making possible what hasn’t been. That is, free discussion, access and analysis of information. Some may see it as a threat to Wall Street, which has always charged for expert analysis. Really, though, these efforts are good for markets, which pride themselves on being fair and transparent.
It’s not just companies that should compete, but ideas too.”

Powerful new patent service shows every US invention, and a new view of R&D relationships


at GigaOm: “The website for the U.S. Patent Office is famously clunky: searching and sorting patents can feel like playing an old Atari game, rather than watching innovation at work. But now a young inventor has come along with a tool to build a better patent office.
The service is called Trea, and was launched by Max Yuan, an engineer who received a patent of his own for a bike motor in 2007. After writing a tool to download patents related to his own invention, he expanded the process to slurp every patent and image in the USPTO database, and compile the information in a user-friendly interface.
Trea has been in beta for a while, but will formally launch on Wednesday. The tool not only provides an easy way to see what inventions a company or inventor is patenting, but also shows the fields in which they are most active. Here is a screenshot from Trea that shows what Apple has been up to in the last 12 months:
[Image: Trea screenshot of Apple inventions]
Such information could be valuable to investors or to companies that want to use the filings as a way to track what might be in their competitors’ product pipelines. The Trea database also probes the USPTO for new filings, and can send alerts to subscribers. Yuan has also created a Twitter account just for new Apple filings.
Trea also draws on the patent database to display what Yuan calls a “unified knowledge graph” of relationships between inventors. Pictures, like the one below for IBM, show clusters of inventors and, at a broader level, the viral transmission of human ideas within a company:
[Image: Trea IBM screenshot]
This type of information, gleaned from patent filings, could be valuable to corporate strategists, or to journalists, scholars or business historians. And making government websites more user-friendly, as Rankandfiled.com is attempting to do with Securities and Exchange Commission filings, can certainly help people understand what their regulators are doing….”

How to harness the wisdom of crowds to improve public service delivery and policymaking


Eddie Copeland in PolicyBytes: “…In summary, government has used technology to streamline transactions and better understand the public’s opinions. Yet it has failed to use it to radically change the way it works. Have public services been reinvented? Is government smaller and leaner? Have citizens, businesses and civic groups been offered the chance to take part in the work of government and improve their own communities? On all counts the answer is unequivocally, no. What is needed, therefore, is a means to enable citizens to provide data to government to inform policymaking and to improve – or even help deliver – public services. What is needed is a Government Data Marketplace.

Government Data Marketplace

A Government Data Marketplace (GDM) would be a website that brought together public sector bodies that needed data, with individuals, businesses and other organisations that could provide it. Imagine an open data portal in reverse: instead of government publishing its own datasets to be used by citizens and businesses, it would instead publish its data needs and invite citizens, businesses or community groups to provide that data (for free or in return for payment). Just as open data portals aim to provide datasets in standard, machine-readable formats, GDM would operate according to strict open standards, and provide a consistent and automated way to deliver data to government through APIs.
How would it work? Imagine a local council that wished to know where instances of graffiti occurred within its borough. The council would create an account on GDM and publish a new request, outlining the data it required (not dissimilar to someone posting a job on a site like Freelancer). Citizens, businesses and other organisations would be able to view that request on GDM and bid to offer the service. For example, an app-development company could offer to build an app that would enable citizens to photograph and locate instances of graffiti in the borough. The app would be able to upload the data to GDM. The council could connect its own IT system to GDM to pass the data to their own database.
Importantly, the app-development company would specify via GDM how much it would charge to provide the data. Other companies and organisations could offer competing bids for delivering the same – or an even better service – at different prices. Supportive local civic hacker groups could even offer to provide the data for free. Either way, the council would get the data it needed without having to collect it for itself, whilst also ensuring it paid the best price from a number of competing providers.
Since GDM would be a public marketplace, other local authorities would be able to see that a particular company had designed a graffiti-reporting solution for one council, and could ask for the same data to be collected in their own boroughs. This would be quick and easy for the developer, as instead of having to create a bespoke solution to work with each council’s IT system, they could connect to all of them using one common interface via GDM. That would be good for the company, as they could sell to a much larger market (the same solution would work for one council or all), and good for the councils, as they would benefit from cheaper prices generated from economies of scale. And since GDM would use open standards, if a council was unhappy with the data provided by one supplier, it could simply look to another company to provide the same information.
What would be the advantages of such a system? Firstly, innovation. GDM would free government from having to worry about what software it needed, and instead allow it to focus on the data it required to provide a service. To be clear: councils themselves do not need a graffiti app – they need data on where graffiti is. By focusing attention on its data needs, the public sector could let the market innovate to find the best solutions for providing it. That might be via an app, perhaps via a website, social media, or Internet of Things sensors, or maybe even using a completely new service that collected information in a radically different way. It would not matter – the right information would be provided in a common format via GDM.
Secondly, the potential cost savings of this approach would be many and considerable. At the very least, by creating a marketplace, the public sector would be able to source data at a competitive price. If several public sector bodies needed the same service via GDM, companies providing that data would be able to offer much cheaper prices for all, as instead of having to deal with hundreds of different organisations (and different interfaces) they could create one solution that worked for all of them. As prices became cheaper for standard solutions, this would in turn encourage more public sector bodies to converge on common ways of working, driving down costs still further. Yet these savings would be dwarfed by those possible if GDM could be used to source data that public sector bodies currently have to manually collect themselves. Imagine if, instead of employing teams of inspectors to locate instances of X, Y or Z, a public body could source the same data from citizens via GDM.
There would be no limit to the potential applications to which GDM could be put by central and local government and other public sector bodies: for graffiti, traffic levels, environmental issues, education or welfare. It could be used to crowdsource facts, figures, images, map coordinates, text – anything that can be collected as data. Government could request information on areas on which it previously had none, helping it to assign its finite resources and money in a much more targeted way. New York City’s Mayor’s Office of Data Analytics has demonstrated that up to 500% increases in the efficiency of providing some public services can be achieved, if only the right data is available.
For the private sector, GDM would stimulate the growth of innovative new companies offering community data, and make it easier for them to sell data solutions across the whole of the public sector. They could pioneer new data methods, and potentially even take over the provision of entire services which the public sector currently has to provide itself. For citizens, it would offer a means to genuinely get involved in solving issues that matter to their local communities, either by using apps made by businesses, or working to provide the data themselves.
And what about the benefits for policymaking? It is important to acknowledge that the idea of harnessing the wisdom of crowds for policymaking is currently experimental. In the case of Policy Futures Markets, some applications have also been considered to be highly controversial. So which methods would be most effective? What would they look like? In what policy domains would they provide most value? The simple fact is that we do not know. What is certain, however, is that innovation in open policymaking and crowdsourcing ideas will never be achieved until a platform is available that allows such ideas to be tried and tested. GDM could be that platform.
Public sector bodies could experiment with asking citizens for information or answers to particular, fact-based questions, or even for predictions on future outcomes, to help inform their policymaking activities. The market could then innovate to develop solutions to source that data from citizens, using the many different models for harnessing the wisdom of crowds. The effectiveness of those initiatives could then be judged, and the techniques honed. In the worst case scenario that it did not work, money would not have been wasted on building the wrong platform – GDM would continue to have value in providing data for public service needs as described above….”
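The request-and-bid flow in the graffiti example above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the class names, fields, provider names, and prices are all hypothetical, and GDM is a proposal rather than an existing system with a defined API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Bid:
    provider: str
    price: float  # 0.0 means the data is offered for free


@dataclass
class DataRequest:
    """A data need published by a public body (hypothetical structure)."""
    requester: str
    description: str
    bids: list = field(default_factory=list)

    def submit(self, bid: Bid) -> None:
        self.bids.append(bid)

    def cheapest(self) -> Optional[Bid]:
        """Return the lowest-priced bid, or None if no one has bid yet."""
        return min(self.bids, key=lambda b: b.price, default=None)


# Invented requester, providers, and prices, for illustration only.
request = DataRequest("Example Borough Council", "Locations of graffiti")
request.submit(Bid("App company A", 5000.0))
request.submit(Bid("App company B", 3500.0))
request.submit(Bid("Civic hacker group", 0.0))
print(request.cheapest())
```

The sketch compares bids on price alone. A real marketplace would presumably also weigh data quality and coverage, since the article notes that competing bids might offer "the same – or an even better service".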

Interpreting Hashtag Politics – Policy Ideas in an Era of Social Media


New book by Stephen Jeffares: “Why do policy actors create branded terms like Big Society and does launching such policy ideas on Twitter extend or curtail their life? This book argues that the practice of hashtag politics has evolved in response to an increasingly congested and mediatised environment, with the recent and rapid growth of high speed internet connections, smart phones and social media. It examines how policy analysis can adapt to offer interpretive insights into the life and death of policy ideas in an era of hashtag politics.
This text reveals that policy ideas can at the same time be ideas, instruments, visions, containers and brands, and advises readers on how to tell if a policy idea is dead or dying, how to map the diversity of viewpoints, how to capture the debate, when to engage and when to walk away. Each chapter showcases innovative analytic techniques, illustrated by application to contemporary policy ideas.”

OkCupid reveals it’s been lying to some of its users. Just to see what’ll happen.


Brian Fung in the Washington Post: “It turns out that OkCupid has been performing some of the same psychological experiments on its users that landed Facebook in hot water recently.
In a lengthy blog post, OkCupid cofounder Christian Rudder explains that OkCupid has on occasion played around with removing text from people’s profiles, removing photos, and even telling some users they were an excellent match when in fact they were only a 30 percent match according to the company’s systems. Just to see what would happen.
OkCupid defends this behavior as something that any self-respecting Web site would do.
“OkCupid doesn’t really know what it’s doing. Neither does any other Web site,” Rudder wrote. “But guess what, everybody: if you use the Internet, you’re the subject of hundreds of experiments at any given time, on every site. That’s how websites work.”…
we have a bigger problem on our hands: A problem about how to reconcile the sometimes valuable lessons of data science with the creep factor — particularly when you aren’t notified about being studied. But as I’ve written before, these kinds of studies happen all the time; it’s just rare that the public is presented with the results.
Short of banning the practice altogether, which seems totally unrealistic, corporate data science seems like an opportunity on a number of levels, particularly if it’s disclosed to the public. First, it helps us understand how human beings tend to behave at Internet scale. Second, it tells us more about how Internet companies work. And third, it helps consumers make better decisions about which services they’re comfortable using.
I suspect that what bothers us most of all is not that the research took place, but that we’re slowly coming to grips with how easily we ceded control over our own information — and how the machines that collect all this data may all know more about us than we do ourselves. We had no idea we were even in a rabbit hole, and now we’ve discovered we’re 10 feet deep. As many as 62.5 percent of Facebook users don’t know the news feed is generated by a company algorithm, according to a recent study conducted by Christian Sandvig, an associate professor at the University of Michigan, and Karrie Karahalios, an associate professor at the University of Illinois.
OkCupid’s blog post is distinct in several ways from Facebook’s psychological experiment. OkCupid didn’t try to publish its findings in a scientific journal. It isn’t even claiming that what it did was science. Moreover, OkCupid’s research is legitimately useful to users of the service — in ways that Facebook’s research is arguably not….
But in any case, there’s no such motivating factor when it comes to Facebook. Unless you’re a page administrator or news organization, understanding how the newsfeed works doesn’t really help the average user in the way that understanding how OkCupid works does. That’s because people use Facebook for all kinds of reasons that have nothing to do with Facebook’s commercial motives. But people would stop using OkCupid if they discovered it didn’t “work.”
If you’re lying to your users in an attempt to improve your service, what’s the line between A/B testing and fraud?”

The Social Laboratory


Shane Harris in Foreign Policy: “…, Singapore has become a laboratory not only for testing how mass surveillance and big-data analysis might prevent terrorism, but for determining whether technology can be used to engineer a more harmonious society….Months after the virus abated, Ho and his colleagues ran a simulation using Poindexter’s TIA ideas to see whether they could have detected the outbreak. Ho will not reveal what forms of information he and his colleagues used — by U.S. standards, Singapore’s privacy laws are virtually nonexistent, and it’s possible that the government collected private communications, financial data, public transportation records, and medical information without any court approval or private consent — but Ho claims that the experiment was very encouraging. It showed that if Singapore had previously installed a big-data analysis system, it could have spotted the signs of a potential outbreak two months before the virus hit the country’s shores. Prior to the SARS outbreak, for example, there were reports of strange, unexplained lung infections in China. Threads of information like that, if woven together, could in theory warn analysts of pending crises.
The RAHS system was operational a year later, and it immediately began “canvassing a range of sources for weak signals of potential future shocks,” one senior Singaporean security official involved in the launch later recalled.
The system uses a mixture of proprietary and commercial technology and is based on a “cognitive model” designed to mimic the human thought process — a key design feature influenced by Poindexter’s TIA system. RAHS, itself, doesn’t think. It’s a tool that helps human beings sift huge stores of data for clues on just about everything. It is designed to analyze information from practically any source — the input is almost incidental — and to create models that can be used to forecast potential events. Those scenarios can then be shared across the Singaporean government and be picked up by whatever ministry or department might find them useful. Using a repository of information called an ideas database, RAHS and its teams of analysts create “narratives” about how various threats or strategic opportunities might play out. The point is not so much to predict the future as to envision a number of potential futures that can tell the government what to watch and when to dig further.
The officials running RAHS today are tight-lipped about exactly what data they monitor, though they acknowledge that a significant portion of “articles” in their databases come from publicly available information, including news reports, blog posts, Facebook updates, and Twitter messages. (“These articles have been trawled in by robots or uploaded manually” by analysts, says one program document.) But RAHS doesn’t need to rely only on open-source material or even the sorts of intelligence that most governments routinely collect: In Singapore, electronic surveillance of residents and visitors is pervasive and widely accepted…”