How Data Scientists Are Uncovering War Crimes in Syria


Lorenzo Franceschi-Bicchierai at Mashable: “For more than three years, Syria has been crippled by a bloody civil war that has laid waste to cities and exacted a heavy civilian toll. But because reporting in Syria is so dangerous, the bloodletting has largely taken place away from the media spotlight. One group of researchers, though, is determined to document every single killing.
Through painstaking data-gathering and assiduous verification, the group Syria Tracker has tallied 111,915 deaths in the course of the conflict so far.
Syria Tracker gets reports from eyewitnesses and volunteers on the ground. Researchers also cull data from news reports.
The database has yielded some important insights, including evidence of possible war crimes committed by the Syrian regime.
Working in collaboration with researchers from the nonprofit organization SumAll.org, the researchers discovered that a growing share of those killed in the conflict were women. In April 2011, women made up only 1% of those killed. Today, 13% of victims are women, according to the latest data.

Syria Female Deaths

Image: SumAll

Those numbers alone don’t tell the whole story, though. Taking a closer look at how women were killed, the researchers discovered a pattern. Women weren’t random victims of bombings, for example. Instead, many were killed by snipers, indicating a deliberate policy of targeting female civilians, which would constitute a war crime.
Data on how children were killed suggests a similar conclusion. Of the thousands killed in the conflict, at least 700 have been summarily executed and tortured, and about 200 boys under the age of 13 have been killed by sniper fire, according to the data…”

Thousands Can Fact-Check The News With Grasswire


in TechCrunch: “We all know you can’t believe everything you read on the Internet. But with Grasswire, you can at least “refute” it.
Austen Allred’s new venture allows news junkies to confirm and refute posts about breaking news. The “real-time newsroom controlled by everyone” divides posts into popular news topics, such as the Malaysia Airlines Crash in Ukraine and the Israeli-Palestinian conflict.
Once you select a topic, you then can upvote posts like Reddit to make them appear at the top of the page. If you see something that is incorrect, you can refute it by posting a source URL to information that disproves it. You can do the same to confirm a report. When you share the post on social media, all of these links are shared with it….
“Obviously there are some journalists who think turning journalism over to people who aren’t professional journalists is dangerous, but we disagree with those people,” Allred said. “I feel like the ability to refute something is not that incredibly difficult. The real power of journalism is when we have massive amounts of people trying to scrutinize whether or not that is accurate enough.”…
But despite these flaws, other attempts to fact check breaking news online have faltered. We still see false reports tweeted by verified accounts all the time, for instance. Something like Grasswire could serve the same role as a correction or a revision posted on an article. By linking to source material that continues to appear every time the post is shared, it is much like an article with an editor’s note that explains why something has been altered or changed.
For journalists trying to balance old-school ethics with new media tools, this option could be crucial. If executed correctly, it could lead to far fewer false reports because thousands of people could be fact checking information, not just a handful in a newsroom….”

Time for 21st century democracy


Martin Smith and Dave Richards at Policy Network (UK): “…The way that the world has changed is leading to a clash between two contrasting cultures. Traditional, top-down, elite models of democracy and accountability are no longer sustainable in an age of a digitally more open society. As the recent Hansard Society report into PMQs clearly reveals, the people see politicians as out of touch and remote. What we need are two major changes. One is the recognition by institutions that they are now making decisions in an open world: even if they make decisions in private (which in certain cases they clearly have to), they should recognise that at some point those decisions may need to be justified. Therefore every decision should be made on the basis that, if it were open, it would be deemed legitimate.
The second is the development of bottom up accountability – we have to develop mechanisms where accountability is not mediated through institutions (as is the case with parliamentary accountability).  In its conclusion, the Hansard Society report proposes new technology could be used to allow citizens rather than MPs to ask questions at Prime Minister’s question time.  This is one of many forms of citizen led accountability that could reinforce the openness of decision making.
New technology creates the opportunity to move away from 19th century democracy.  Technology can be used to change the way decisions are made, how citizens are involved and how institutions are held to account.  This is already happening with social groups using social media, on-line petitions and mobile technologies as part of their campaigns.  However, this process needs to be formalised (such as in the Hansard Society’s suggestion for citizen’s questions).  There is also a need for more user friendly ways of analysing big data around government performance.  Big data creates many new ways in which decisions can be opened up and critically reviewed.  We also need much more explicit policies of leak and whistleblowing so that those who do reveal the inner workings of governments are not criminalised….”
Fundamentally, the real change is about treating citizens as grown-ups, recognising that they can be privy to the details of the policy-making process. There is a great irony in the playground behaviour of Prime Minister’s question time and the patronising attitudes of political elites towards voters (which tend to infantilise citizens as lacking the expertise to fully participate). The most important change is that institutions start to act as if they are operating in an open society where they are directly accountable, and hence are in a position to start regaining the trust of the people. The closed world of institutions is no longer viable in a digital age.

Using the Wisdom of the Crowd to Democratize Markets


David Weidner at the Wall Street Journal: “For years investors have largely depended on three sources to distill the relentless onslaught of information about public companies: the companies themselves, Wall Street analysts and the media.
Each of these has their strengths, but they may have even bigger weaknesses. Companies spin. Analysts have conflicts of interest. The financial media is under deadline pressure and ill-equipped to act as a catch-all watchdog.
But in recent years, the tech whizzes out of Silicon Valley have been trying to democratize the markets. In 2010 I wrote about an effort called Moxy Vote, an online system for shareholders to cast ballots in proxy contests. Moxy Vote had some initial success but ran into regulatory trouble and failed to gain traction.
Some newer efforts are more promising, mostly because they depend on users, or some form of crowdsourcing, for their content. Crowdsourcing is when a need is turned over to a large group, usually an online community, rather than traditional paid employees or outside providers….
Estimize.com is one. It was founded in 2011 by former trader Leigh Drogen, but recently has undergone some significant expansion, adding a crowdsourced prediction for mergers and acquisitions. Estimize also boasts a track record: it claims it beats Wall Street analysts 65.9% of the time during earnings season. Like SeekingAlpha, however, Estimize leans heavily on pros or semi-pros. Nearly 5,000 of its contributors are analysts.
Closer to the social networking world there’s scutify.com, a website and mobile app that aggregates what’s being said about individual stocks on social networks, blogs and other sources. It highlights trending stocks and links to chatter on social networks. (The site is owned by Cody Willard, a contributor to MarketWatch, which is owned by Dow Jones, the publisher of The Wall Street Journal.)
Perhaps the most intriguing startup is TwoMargins.com. The site allows investors, analysts, average Joes — anyone, really — to annotate company releases. In that way, Two Margins potentially can tap the power of the crowd to provide a fourth source for the marketplace.
Two Margins, a startup funded by Bloomberg L.P.’s venture capital fund, borrows annotation technology that’s already in use on other sites such as genius.com and scrible.com. Participants can sign in with their Twitter or Facebook accounts and post to those networks from the site. (Dow Jones competes with Bloomberg in the provision of news and financial data.)
At this moment, Two Margins isn’t a game changer. Founders Gniewko Lubecki and Akash Kapur said the site is in a pre-beta phase, which is to say it’s sort of up and running and being constantly tweaked.
Right now there’s nothing close to the critical mass needed for an exhaustive look at company filings. There’s just a handful of users and less than a dozen company releases and filings available.
Still, in the first moments after Twitter Inc.’s earnings were released Tuesday, Two Margins’ most loyal users began to scour the release. “Looks like Twitter is getting significantly better at monetizing users,” wrote a user named “George” who had annotated the revenue line from the company’s financial statement. Another user, “Scott Paster,” noted Twitter’s stock option grants to executives were nearly as high as its reported loss.
“The sum is greater than its parts when you pull together a community of users,” Mr. Kapur said. “Widening access to these documents is one goal. The other goal is broadening the pool of knowledge that’s brought to bear on these documents.”
In the end, this new wave of tech-driven services may never capture enough users to make it into the investing mainstream. They all struggle with uninformed and inaccurate content, especially if they gain critical mass. Vetting is a problem.
For that reason, it’s hard to predict whether these new entries will flourish or even survive. That’s not a bad thing. The march of technology will either improve on the idea or come up with a new one.
Ultimately, technology is making possible what hasn’t been. That is, free discussion, access and analysis of information. Some may see it as a threat to Wall Street, which has always charged for expert analysis. Really, though, these efforts are good for markets, which pride themselves on being fair and transparent.
It’s not just companies that should compete, but ideas too.”

Policy bubbles: What factors drive their birth, maturity and death?


Moshe Maor at LSE Blog: “A policy bubble is a real or perceived policy overreaction that is reinforced by positive feedback over a relatively long period of time. This type of policy imposes objective and/or perceived social costs without producing offsetting objective and/or perceived benefits over a considerable length of time. A case in point is when government spending over a policy problem increases due to public demand for more policy while the severity of the problem decreases over an extended period of time. Another case is when governments raise ‘green’ or other standards due to public demand while the severity of the problem does not justify this move…
Drawing on insights from a variety of fields – including behavioural economics, psychology, sociology, political science and public policy – three phases of the life-cycle of a policy bubble may be identified: birth, maturity and death. A policy bubble may emerge when certain individuals perceive opportunities to gain from public policy or to exploit it by rallying support for the policy, promoting word-of-mouth enthusiasm and widespread endorsement of the policy, heightening expectations for further policy, and increasing demand for this policy….
How can one identify a policy bubble? A policy bubble may be identified by measuring parliamentary concerns, media concerns, public opinion regarding the policy at hand, and the extent of a policy problem, against the budget allocation to said policy over the same period, preferably over 50 years or more. Measuring the operation of different transmission mechanisms in emotional contagion and human herding, particularly the spread of social influence and feeling, can also work to identify a policy bubble.
Here, computer-aided content analysis of verbal and non-verbal communication in social networks, especially instant messaging, may capture emotional and social contagion. A further way to identify a policy bubble revolves around studying bubble expectations and individuals’ confidence over time by distributing a questionnaire to a random sample of the population, experts in the relevant policy sub-field, as well as decision makers, and comparing the results across time and nations.
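The identification method described above, comparing attention and spending against the measured severity of the problem over a long period, can be sketched in a few lines. This is a minimal illustration only: the function name, the made-up yearly indices and the three-year threshold are all assumptions for the sake of example, not part of the author's methodology.

```python
# Flag candidate policy-bubble periods: runs of consecutive years in which
# the budget allocated to a policy grew while the measured severity of the
# underlying problem shrank. All numbers below are invented for illustration.

def bubble_years(budget, severity, min_run=3):
    """Return the years inside any run of `min_run`+ consecutive years
    where budget rose year-on-year while severity fell."""
    years = sorted(budget)
    flagged, run = [], []
    for prev, cur in zip(years, years[1:]):
        if budget[cur] > budget[prev] and severity[cur] < severity[prev]:
            run.append(cur)
        else:
            if len(run) >= min_run:
                flagged.extend(run)
            run = []
    if len(run) >= min_run:
        flagged.extend(run)
    return flagged

# Hypothetical series: spending on a policy keeps growing for four years
# while the problem it addresses steadily recedes.
budget   = {2000: 10, 2001: 12, 2002: 15, 2003: 19, 2004: 24, 2005: 24}
severity = {2000: 50, 2001: 45, 2002: 41, 2003: 38, 2004: 36, 2005: 36}
print(bubble_years(budget, severity))  # → [2001, 2002, 2003, 2004]
```

A real analysis would of course use the 50-plus-year series of parliamentary concern, media concern and public opinion the author describes, rather than a single budget-versus-severity comparison.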
To sum up, my interpretation of the process that leads to the emergence of policy bubbles allows for the possibility that different modes of policy overreaction lead to different types of human herding, thereby resulting in different types of policy bubbles. This interpretation has the added benefit of contributing to the explanation of economic, financial, technological and social bubbles as well.”

How to harness the wisdom of crowds to improve public service delivery and policymaking


Eddie Copeland in PolicyBytes: “…In summary, government has used technology to streamline transactions and better understand the public’s opinions. Yet it has failed to use it to radically change the way it works. Have public services been reinvented? Is government smaller and leaner? Have citizens, businesses and civic groups been offered the chance to take part in the work of government and improve their own communities? On all counts the answer is, unequivocally, no. What is needed, therefore, is a means to enable citizens to provide data to government to inform policymaking and to improve – or even help deliver – public services. What is needed is a Government Data Marketplace.

Government Data Marketplace

A Government Data Marketplace (GDM) would be a website that brought together public sector bodies that needed data, with individuals, businesses and other organisations that could provide it. Imagine an open data portal in reverse: instead of government publishing its own datasets to be used by citizens and businesses, it would instead publish its data needs and invite citizens, businesses or community groups to provide that data (for free or in return for payment). Just as open data portals aim to provide datasets in standard, machine-readable formats, GDM would operate according to strict open standards, and provide a consistent and automated way to deliver data to government through APIs.
How would it work? Imagine a local council that wished to know where instances of graffiti occurred within its borough. The council would create an account on GDM and publish a new request, outlining the data it required (not dissimilar to someone posting a job on a site like Freelancer). Citizens, businesses and other organisations would be able to view that request on GDM and bid to offer the service. For example, an app-development company could offer to build an app that would enable citizens to photograph and locate instances of graffiti in the borough. The app would be able to upload the data to GDM. The council could connect its own IT system to GDM to pass the data to their own database.
Importantly, the app-development company would specify via GDM how much it would charge to provide the data. Other companies and organisations could offer competing bids to deliver the same, or an even better, service at different prices. Supportive local civic hacker groups could even offer to provide the data for free. Either way, the council would get the data it needed without having to collect it for itself, whilst also ensuring it paid the best price from a number of competing providers.
Since GDM would be a public marketplace, other local authorities would be able to see that a particular company had designed a graffiti-reporting solution for one council, and could ask for the same data to be collected in their own boroughs. This would be quick and easy for the developer, as instead of having to create a bespoke solution to work with each council’s IT system, they could connect to all of them using one common interface via GDM. That would be good for the company, as they could sell to a much larger market (the same solution would work for one council or all), and good for the councils, as they would benefit from cheaper prices generated from economies of scale. And since GDM would use open standards, if a council was unhappy with the data provided by one supplier, it could simply look to another company to provide the same information.
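The "one common interface" idea can be made concrete with a small sketch. The article specifies only that GDM would use open standards and APIs; the field names, the `request_id` convention and the validation helper below are purely illustrative assumptions, not a real specification.

```python
# Hypothetical sketch of a GDM-style open standard: every provider, whatever
# app or sensor it uses, delivers reports in one common machine-readable
# format keyed to the council's published data request. The schema here is
# an invented example, not an actual GDM specification.
import json

REQUIRED_FIELDS = {"request_id", "lat", "lon", "observed_at", "payload"}

def validate_report(report):
    """Return the (sorted) names of any required fields a report is missing."""
    return sorted(REQUIRED_FIELDS - set(report))

# A report from the graffiti example: a citizen photographs and locates an
# instance of graffiti, and the provider's app submits it against the
# council's request. All values are made up.
graffiti_report = {
    "request_id": "council-42-graffiti",  # the council's published data request
    "lat": 51.5072,
    "lon": -0.1276,
    "observed_at": "2014-07-29T10:15:00Z",
    "payload": {"photo_url": "https://example.org/p/123", "surface": "wall"},
}

missing = validate_report(graffiti_report)
print("valid" if not missing else f"missing: {missing}")  # → valid

# A conforming report would then travel to the council's GDM endpoint as JSON;
# any competing provider could produce the same wire format.
body = json.dumps(graffiti_report)
```

Because every provider emits the same format, a council switching suppliers, or a second council reusing the solution, would need no bespoke integration work: only the source of the reports changes, not the interface.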
What would be the advantages of such a system? Firstly, innovation. GDM would free government from having to worry about what software it needed, and instead allow it to focus on the data it required to provide a service. To be clear: councils themselves do not need a graffiti app – they need data on where graffiti is. By focusing attention on its data needs, the public sector could let the market innovate to find the best solutions for providing it. That might be via an app, perhaps via a website, social media, or Internet of Things sensors, or maybe even using a completely new service that collected information in a radically different way. It will not matter – the right information would be provided in a common format via GDM.
Secondly, the potential cost savings of this approach would be many and considerable. At the very least, by creating a marketplace, the public sector would be able to source data at a competitive price. If several public sector bodies needed the same service via GDM, companies providing that data would be able to offer much cheaper prices for all: instead of having to deal with hundreds of different organisations (and different interfaces), they could create one solution that worked for all of them. As prices became cheaper for standard solutions, this would in turn encourage more public sector bodies to converge on common ways of working, driving down costs still further. Yet these savings would be dwarfed by those possible if GDM could be used to source data that public sector bodies currently have to collect manually themselves. Imagine if, instead of having teams of inspectors to locate instances of X, Y or Z, they could source the same data from citizens via GDM?
There would be no limit to the potential applications to which GDM could be put by central and local government and other public sector bodies: for graffiti, traffic levels, environmental issues, education or welfare. It could be used to crowdsource facts, figures, images, map coordinates, text – anything that can be collected as data. Government could request information on areas on which it previously had none, helping it to assign its finite resources and money in a much more targeted way. New York City’s Mayor’s Office of Data Analytics has demonstrated that up to 500% increases in the efficiency of providing some public services can be achieved, if only the right data is available.
For the private sector, GDM would stimulate the growth of innovative new companies offering community data, and make it easier for them to sell data solutions across the whole of the public sector. They could pioneer in new data methods, and potentially even take over the provision of entire services which the public sector currently has to provide itself. For citizens, it would offer a means to genuinely get involved in solving issues that matter to their local communities, either by using apps made by businesses, or working to provide the data themselves.
And what about the benefits for policymaking? It is important to acknowledge that the idea of harnessing the wisdom of crowds for policymaking is currently experimental. In the case of Policy Futures Markets, some applications have also been considered to be highly controversial. So which methods would be most effective? What would they look like? In what policy domains would they provide most value? The simple fact is that we do not know. What is certain, however, is that innovation in open policymaking and crowdsourcing ideas will never be achieved until a platform is available that allows such ideas to be tried and tested. GDM could be that platform.
Public sector bodies could experiment with asking citizens for information or answers to particular, fact-based questions, or even for predictions on future outcomes, to help inform their policymaking activities. The market could then innovate to develop solutions to source that data from citizens, using the many different models for harnessing the wisdom of crowds. The effectiveness of those initiatives could then be judged, and the techniques honed. In the worst case scenario that it did not work, money would not have been wasted on building the wrong platform – GDM would continue to have value in providing data for public service needs as described above….”

Interpreting Hashtag Politics – Policy Ideas in an Era of Social Media


New book by Stephen Jeffares: “Why do policy actors create branded terms like Big Society and does launching such policy ideas on Twitter extend or curtail their life? This book argues that the practice of hashtag politics has evolved in response to an increasingly congested and mediatised environment, with the recent and rapid growth of high speed internet connections, smart phones and social media. It examines how policy analysis can adapt to offer interpretive insights into the life and death of policy ideas in an era of hashtag politics.
This text reveals that policy ideas can at the same time be ideas, instruments, visions, containers and brands, and advises readers on how to tell if a policy idea is dead or dying, how to map the diversity of viewpoints, how to capture the debate, when to engage and when to walk away. Each chapter showcases innovative analytic techniques, illustrated by application to contemporary policy ideas.”

Request for Proposals: Exploring the Implications of Government Release of Large Datasets


“The Berkeley Center for Law & Technology and Microsoft are issuing this request for proposals (RFP) to fund scholarly inquiry to examine the civil rights, human rights, security and privacy issues that arise from recent initiatives to release large datasets of government information to the public for analysis and reuse.  This research may help ground public policy discussions and drive the development of a framework to avoid potential abuses of this data while encouraging greater engagement and innovation.
This RFP seeks to:

    • Gain knowledge of the impact of the online release of large amounts of data generated by citizens’ interactions with government
    • Imagine new possibilities for technical, legal, and regulatory interventions that avoid abuse
    • Begin building a body of research that addresses these issues

– BACKGROUND –

 
Governments at all levels are releasing large datasets for analysis by anyone for any purpose—“Open Data.”  Using Open Data, entrepreneurs may create new products and services, and citizens may use it to gain insight into the government.  A plethora of time saving and other useful applications have emerged from Open Data feeds, including more accurate traffic information, real-time arrival of public transportation, and information about crimes in neighborhoods.  Sometimes governments release large datasets in order to encourage the development of unimagined new applications.  For instance, New York City has made over 1,100 databases available, some of which contain information that can be linked to individuals, such as a parking violation database containing license plate numbers and car descriptions.
Data held by the government is often implicitly or explicitly about individuals—acting in roles that have recognized constitutional protection, such as lobbyist, signatory to a petition, or donor to a political cause; in roles that require special protection, such as victim of, witness to, or suspect in a crime; in the role as businessperson submitting proprietary information to a regulator or obtaining a business license; and in the role of ordinary citizen.  While open government is often presented as an unqualified good, sometimes Open Data can identify individuals or groups, leading to a more transparent citizenry.  The citizen who foresees this growing transparency may be less willing to engage in government, as these transactions may be documented and released in a dataset to anyone to use for any imaginable purpose—including to deanonymize the database—forever.  Moreover, some groups of citizens may have few options or no choice as to whether to engage in governmental activities.  Hence, open data sets may have a disparate impact on certain groups. The potential impact of large-scale data and analysis on civil rights is an area of growing concern.  A number of civil rights and media justice groups banded together in February 2014 to endorse the “Civil Rights Principles for the Era of Big Data” and the potential of new data systems to undermine longstanding civil rights protections was flagged as a “central finding” of a recent policy review by White House adviser John Podesta.
The Berkeley Center for Law & Technology (BCLT) and Microsoft are issuing this request for proposals in an effort to better understand the implications and potential impact of the release of data related to U.S. citizens’ interactions with their local, state and federal governments. BCLT and Microsoft will fund up to six grants, with a combined total of $300,000.  Grantees will be required to participate in a workshop to present and discuss their research at the Berkeley Technology Law Journal (BTLJ) Spring Symposium.  All grantees’ papers will be published in a dedicated monograph.  Grantees’ papers that approach the issues from a legal perspective may also be published in the BTLJ. We may also hold a follow-up workshop in New York City or Washington, DC.
While we are primarily interested in funding proposals that address issues related to the policy impacts of Open Data, many of these issues are intertwined with general societal implications of “big data.” As a result, proposals that explore Open Data from a big data perspective are welcome; however, proposals solely focused on big data are not.  We are open to proposals that address the following difficult questions.  We are also open to a range of methods and disciplines, and are particularly interested in proposals from cross-disciplinary teams.

    • To what extent does existing Open Data made available by city and state governments affect individual profiling?  Do the effects change depending on the level of aggregation (neighborhood vs. cities)?  What releases of information could foreseeably cause discrimination in the future? Will different groups in society be disproportionately impacted by Open Data?
    • Should the use of Open Data be governed by a code of conduct or subject to a review process before being released? In order to enhance citizen privacy, should governments develop guidelines to release sampled or perturbed data, instead of entire datasets? When datasets contain potentially identifiable information, should there be a notice-and-comment proceeding that includes proposed technological solutions to anonymize, de-identify or otherwise perturb the data?
    • Is there something fundamentally different about government services and the government’s collection of citizen’s data for basic needs in modern society such as power and water that requires governments to exercise greater due care than commercial entities?
    • Companies have legal and practical mechanisms to shield data submitted to government from public release.  What mechanisms do individuals have or should have to address misuse of Open Data?  Could developments in the constitutional right to information policy as articulated in Whalen and Westinghouse Electric Co address Open Data privacy issues?
    • Collecting data costs money, and its release could affect civil liberties.  Yet it is being given away freely, sometimes to immensely profitable firms.  Should governments license data for a fee and/or impose limits on its use, given its value?
    • The privacy principle of “collection limitation” is under siege, with many arguing that use restrictions will be more efficacious for protecting privacy and more workable for big data analysis.  Does the potential of Open Data justify eroding state and federal privacy act collection limitation principles?   What are the ethical dimensions of a government system that deprives the data subject of the ability to obscure or prevent the collection of data about a sensitive issue?  A move from collection restrictions to use regulation raises a number of related issues, detailed below.
    • Are use restrictions efficacious in creating accountability?  Consumer reporting agencies are regulated by use restrictions, yet they are not known for their accountability.  How could use regulations be implemented in the context of Open Data efficaciously?  Can a self-learning algorithm honor data use restrictions?
    • If an Open Dataset were regulated by a use restriction, how could individuals police wrongful uses?   How would plaintiffs overcome the likely defenses or proof of facts in a use regulation system, such as a burden to prove that data were analyzed and the product of that analysis was used in a certain way to harm the plaintiff?  Will plaintiffs ever be able to beat first amendment defenses?
    • The President’s Council of Advisors on Science and Technology big data report emphasizes that analysis is not a “use” of data.  Such an interpretation suggests that NSA metadata analysis and large-scale scanning of communications do not raise privacy issues.  What are the ethical and legal implications of the “analysis is not use” argument in the context of Open Data?
    • Open Data celebrates the idea that information collected by the government can be used by another person for various kinds of analysis.  When analysts are not involved in the collection of data, they are less likely to understand its context and limitations.  How do we ensure that this knowledge is maintained in a use regulation system?
    • Former President William Clinton was admitted under a pseudonym for a procedure at a New York Hospital in 2004.  The hospital detected 1,500 attempts by its own employees to access the President’s records.  With snooping such a tempting activity, how could incentives be crafted to cause self-policing of government data and the self-disclosure of inappropriate uses of Open Data?
    • It is clear that data privacy regulation could hamper some big data efforts.  However, many examples of big data successes hail from highly regulated environments, such as health care and financial services—areas with statutory, common law, and IRB protections.  What are the contours of privacy law that are compatible with big data and Open Data success and which are inherently inimical to it?
    • In recent years, the problem of “too much money in politics” has been addressed with increasing disclosure requirements.  Yet, distrust in government remains high, and individuals identified in donor databases have been subjected to harassment.  Is the answer to problems of distrust in government even more Open Data?
    • What are the ethical and epistemological implications of encouraging government decision-making based upon correlation analysis, without a rigorous understanding of cause and effect?  Are there decisions that should not be left to correlational proof alone? While enthusiasm for data science has increased, scientific journals are elevating their standards, with special scrutiny focused on hypothesis-free, multiple-comparison analysis. What could legal and policy experts learn from experts in statistics about the nature and limits of Open Data?…
      To submit a proposal, visit the Conference Management Toolkit (CMT) here.
      Once you have created a profile, the site will allow you to submit your proposal.
      If you have questions, please contact Chris Hoofnagle, principal investigator on this project.”

Sharing Data Is a Form of Corporate Philanthropy


Matt Stempeck in HBR Blog:  “Ever since the International Charter on Space and Major Disasters was signed in 1999, satellite companies like DMC International Imaging have had a clear protocol with which to provide valuable imagery to public actors in times of crisis. In a single week this February, DMCii tasked its fleet of satellites on flooding in the United Kingdom, fires in India, floods in Zimbabwe, and snow in South Korea. Official crisis response departments and relevant UN departments can request on-demand access to the visuals captured by these “eyes in the sky” to better assess damage and coordinate relief efforts.

DMCii is a private company, yet it provides enormous value to the public and social sectors simply by periodically sharing its data.
Back on Earth, companies create, collect, and mine data in their day-to-day business. This data has quickly emerged as one of this century’s most vital assets. Public sector and social good organizations may not have access to the same amount, quality, or frequency of data. This imbalance has inspired a new category of corporate giving foreshadowed by the 1999 Space Charter: data philanthropy.
The satellite imagery example is an area of obvious societal value, but data philanthropy holds even stronger potential closer to home, where a wide range of private companies could give back in meaningful ways by contributing data to public actors. Consider two promising contexts for data philanthropy: responsive cities and academic research.
The centralized institutions of the 20th century allowed for the most sophisticated economic and urban planning to date. But in recent decades, the information revolution has helped the private sector speed ahead in data aggregation, analysis, and applications. It’s well known that there’s enormous value in real-time usage of data in the private sector, but there are similarly huge gains to be won in the application of real-time data to mitigate common challenges.
What if sharing economy companies shared their real-time housing, transit, and economic data with city governments or public interest groups? For example, Uber maintains a “God’s Eye view” of every driver on the road in a city.
Imagine combining this single data feed with an entire portfolio of real-time information. An early leader in this space is the City of Chicago’s urban data dashboard, WindyGrid. The dashboard aggregates an ever-growing variety of public datasets to allow for more intelligent urban management.
Over time, we could design responsive cities that react to this data. A responsive city is one where services, infrastructure, and even policies can flexibly respond to the rhythms of its denizens in real-time. Private sector data contributions could greatly accelerate these nascent efforts.
Data philanthropy could similarly benefit academia. Access to data remains an unfortunate barrier to entry for many researchers. The result is that only researchers with access to certain data, such as full-volume social media streams, can analyze and produce knowledge from this compelling information. Twitter, for example, sells access to a range of real-time APIs to marketing platforms, but the price point often exceeds researchers’ budgets. To accelerate the pursuit of knowledge, Twitter has piloted a program called Data Grants offering access to segments of their real-time global trove to select groups of researchers. With this program, academics and other researchers can apply to receive access to relevant bulk data downloads, such as a period of time before and after an election, or a certain geographic area.
Humanitarian response, urban planning, and academia are just three sectors within which private data can be donated to improve the public condition. There are many more possible applications, but few examples to date. For companies looking to expand their corporate social responsibility initiatives, sharing data should be part of the conversation…
Companies considering data philanthropy can take the following steps:

  • Inventory the information your company produces, collects, and analyzes. Consider which data would be easy to share and which data will require long-term effort.
  • Think about who could benefit from this information. Who in your community doesn’t have access to it?
  • Who could be harmed by the release of this data? If the datasets are about people, have they consented to their release? (i.e., don’t pull a Facebook emotional manipulation experiment).
  • Begin conversations with relevant public agencies and nonprofit partners to get a sense of the sort of information they might find valuable and their capacity to work with the formats you might eventually make available.
  • If you expect an onslaught of interest, an application process can help qualify partnership opportunities to maximize positive impact relative to time invested in the program.
  • Consider how you’ll handle distribution of the data to partners. Even if you don’t have the resources to set up an API, regular releases of bulk data could still provide enormous value to organizations used to relying on less-frequently updated government indices.
  • Consider your needs regarding privacy and anonymization. Strip the data of anything remotely resembling personally identifiable information (here are some guidelines).
  • If you’re making data available to researchers, plan to allow researchers to publish their results without obstruction. You might also require them to share the findings with the world under Open Access terms….”
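The anonymization step in the checklist above can be sketched as a simple scrub pass run before release. The field names and regex patterns below are hypothetical placeholders that would need to be adapted to a company’s actual schema, and real de-identification requires far more care than pattern matching:

```python
import re

# Hypothetical PII field names and patterns; a real release would
# need these tuned to the actual schema, plus expert review.
PII_FIELDS = {"name", "email", "phone", "address", "user_id"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_record(record):
    """Return a copy of `record` with PII fields dropped and
    PII-like patterns redacted from remaining string values."""
    clean = {}
    for key, value in record.items():
        if key.lower() in PII_FIELDS:
            continue  # drop the field entirely
        if isinstance(value, str):
            value = EMAIL_RE.sub("[REDACTED]", value)
            value = PHONE_RE.sub("[REDACTED]", value)
        clean[key] = value
    return clean
```

Pattern-based scrubbing catches only obvious identifiers; linked or quasi-identifying fields (timestamps plus locations, say) can still re-identify individuals, which is why the guidelines referenced above matter.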

Selected Readings on Sentiment Analysis


The Living Library’s Selected Readings series seeks to build a knowledge base on innovative approaches for improving the effectiveness and legitimacy of governance. This curated and annotated collection of recommended works on the topic of sentiment analysis was originally published in 2014.

Sentiment Analysis is a field of Computer Science that uses techniques from natural language processing, computational linguistics, and machine learning to predict subjective meaning from text. The term opinion mining is often used interchangeably with Sentiment Analysis, although it is technically a subfield focusing on the extraction of opinions (the umbrella under which sentiment, evaluation, appraisal, attitude, and emotion all lie).
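As a minimal illustration of the kind of prediction Sentiment Analysis performs, the sketch below scores text against a tiny hand-built lexicon. The word lists are illustrative stand-ins rather than a real sentiment lexicon, and production systems rely on the machine learning techniques described above:

```python
# Minimal lexicon-based sentiment scorer. The word lists are
# illustrative stand-ins, not a real sentiment lexicon.
POSITIVE = {"good", "great", "excellent", "support", "agree"}
NEGATIVE = {"bad", "terrible", "poor", "oppose", "disagree"}

def sentiment_score(text):
    """Return >0 for net-positive text, <0 for net-negative,
    and 0 for neutral, by counting lexicon hits per token."""
    tokens = text.lower().split()
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
```

Even this toy version shows why the field is hard: negation (“not good”), sarcasm, and domain-specific vocabulary all defeat simple word counting.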

The rise of Web 2.0 and increased information flow has led to growing interest in Sentiment Analysis — especially as applied to social networks and media. Events causing large spikes in media — such as the 2012 Presidential Election Debates — are especially ripe for analysis. Such analyses raise a variety of implications for the future of crowd participation, elections, and governance.

Selected Reading List (in alphabetical order)

Annotated Selected Reading List (in alphabetical order)

Choi, Eunsol et al. “Hedge detection as a lens on framing in the GMO debates: a position paper.” Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics 13 Jul. 2012: 70-79. http://bit.ly/1wweftP

  • Understanding the ways in which participants in public discussions frame their arguments is important for understanding how public opinion is formed. This paper adopts the position that it is time for more computationally-oriented research on problems involving framing. In the interests of furthering that goal, the authors propose the following question: In the controversy regarding the use of genetically-modified organisms (GMOs) in agriculture, do pro- and anti-GMO articles differ in whether they choose to adopt a more “scientific” tone?
  • Prior work on the rhetoric and sociology of science suggests that hedging may distinguish popular-science text from text written by professional scientists for their colleagues. The paper proposes a detailed approach to studying whether hedge detection can be used to understand scientific framing in the GMO debates, and provides corpora to facilitate this study. Some of the preliminary analyses suggest that hedges occur less frequently in scientific discourse than in popular text, a finding that contradicts prior assertions in the literature.
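A rough sense of the hedge detection discussed above can be conveyed with a simple frequency measure. The cue list below is a small illustrative sample of common hedging terms, not the authors’ actual feature set:

```python
# Hedge-cue frequency: a crude proxy for the hedging signal the
# paper studies. The cue list is a small illustrative sample.
HEDGE_CUES = {"may", "might", "could", "suggests", "possibly",
              "appears", "likely", "perhaps"}

def hedge_rate(text):
    """Fraction of whitespace-separated tokens that are hedge cues."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in HEDGE_CUES for t in tokens) / len(tokens)
```

Comparing this rate across pro- and anti-GMO corpora would be a first, very coarse approximation of the framing analysis the paper proposes.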

Michael, Christina, Francesca Toni, and Krysia Broda. “Sentiment analysis for debates.” (Unpublished MSc thesis). Department of Computing, Imperial College London (2013). http://bit.ly/Wi86Xv

  • This project aims to expand on existing solutions used for automatic sentiment analysis on text in order to capture support/opposition and agreement/disagreement in debates. In addition, it looks at visualizing the classification results for enhancing the ease of understanding the debates and for showing underlying trends. Finally, it evaluates proposed techniques on an existing debate system for social networking.

Murakami, Akiko, and Rudy Raymond. “Support or oppose?: classifying positions in online debates from reply activities and opinion expressions.” Proceedings of the 23rd International Conference on Computational Linguistics: Posters 23 Aug. 2010: 869-875. https://bit.ly/2Eicfnm

  • In this paper, the authors propose a method for identifying the general positions of users in online debates, i.e., whether they support or oppose the main topic of an online debate, by exploiting local information in their remarks within the debate. An online debate is a forum where each user posts an opinion on a particular topic while other users state their positions by posting remarks within the debate. Supporting or opposing remarks are made by directly replying to the opinion, or indirectly replying to other remarks (to express local agreement or disagreement), which makes the task of identifying users’ general positions difficult.
  • A prior study has shown that a link-based method, which completely ignores the content of the remarks, can achieve higher accuracy for the identification task than methods based solely on the contents of the remarks. In this paper, it is shown that incorporating the textual content of the remarks into the link-based method yields higher accuracy in the identification task.
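A toy version of combining reply links with textual content might propagate stance down the reply tree, flipping it whenever a remark’s text signals local disagreement. The cue set and structure here are purely illustrative, not the paper’s actual method:

```python
# Toy stance propagation over a reply tree: a remark's position on
# the debate topic is its parent's position, flipped when its text
# signals local disagreement. Illustrative only.
DISAGREE_CUES = {"disagree", "wrong", "no"}

def classify_positions(parent_of, text_of, root):
    """parent_of: reply -> remark it replies to; text_of: remark -> text.
    Returns remark -> +1 (supports the topic) or -1 (opposes it)."""
    positions = {root: +1}  # assume the root opinion supports the topic

    def local_agreement(text):
        # -1 if the remark disagrees with what it replies to, else +1
        return -1 if set(text.lower().split()) & DISAGREE_CUES else +1

    def resolve(remark):
        if remark not in positions:
            positions[remark] = (local_agreement(text_of[remark])
                                 * resolve(parent_of[remark]))
        return positions[remark]

    for remark in parent_of:
        resolve(remark)
    return positions
```

Note how a disagreement with an opposing remark comes out as support for the topic; this indirect chaining is exactly what makes the identification task difficult.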

Pang, Bo, and Lillian Lee. “Opinion mining and sentiment analysis.” Foundations and trends in information retrieval 2.1-2 (2008): 1-135. http://bit.ly/UaCBwD

  • This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Its focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. It includes material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.

Ranade, Sarvesh et al. “Online debate summarization using topic directed sentiment analysis.” Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining 11 Aug. 2013: 7. http://bit.ly/1nbKtLn

  • Social networking sites provide users a virtual community interaction platform to share their thoughts, life experiences and opinions. Online debate forum is one such platform where people can take a stance and argue in support or opposition of debate topics. An important feature of such forums is that they are dynamic and grow rapidly. In such situations, effective opinion summarization approaches are needed so that readers need not go through the entire debate.
  • This paper aims to summarize online debates by extracting highly topic-relevant and sentiment-rich sentences. The proposed approach takes into account topic-relevant, document-relevant, and sentiment-based features to capture topic-opinionated sentences. ROUGE (Recall-Oriented Understudy for Gisting Evaluation, which employs a set of metrics and a software package to compare an automatically produced summary or translation against human-produced ones) scores are used to evaluate the system. The system significantly outperforms several baseline systems and shows improvement over the state-of-the-art opinion summarization system. The results verify that topic-directed sentiment features are most important for generating effective debate summaries.
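The core of a ROUGE-style score is n-gram overlap with a reference summary. A minimal ROUGE-1 recall computation (unigrams only, single reference) might look like the following sketch, which omits the stemming and multi-reference handling of the full toolkit:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: the fraction of reference unigrams that also
    appear in the candidate summary, with counts clipped so repeated
    candidate words are not over-credited."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / sum(ref.values())
```

For example, a candidate covering three of four reference words scores 0.75, regardless of any extra words the candidate adds (those would instead lower ROUGE-1 precision).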

Schneider, Jodi. “Automated argumentation mining to the rescue? Envisioning argumentation and decision-making support for debates in open online collaboration communities.” http://bit.ly/1mi7ztx

  • Argumentation mining, a relatively new area of discourse analysis, involves automatically identifying and structuring arguments. Following a basic introduction to argumentation, the authors describe a new possible domain for argumentation mining: debates in open online collaboration communities.
  • Based on their experience with manual annotation of arguments in debates, the authors propose argumentation mining as the basis for three kinds of support tools: for authoring more persuasive arguments, for finding weaknesses in others’ arguments, and for summarizing a debate’s overall conclusions.