Continue reading “NOAA announces RFI to unleash power of 'big data'”
Mapping Twitter Topic Networks: From Polarized Crowds to Community Clusters
Pew Internet: “Conversations on Twitter create networks with identifiable contours as people reply to and mention one another in their tweets. These conversational structures differ, depending on the subject and the people driving the conversation. Six structures are regularly observed: divided, unified, fragmented, clustered, and inward and outward hub and spoke structures. These are created as individuals choose whom to reply to or mention in their Twitter messages and the structures tell a story about the nature of the conversation.
If a topic is political, it is common to see two separate, polarized crowds take shape. They form two distinct discussion groups that mostly do not interact with each other. Frequently these are recognizably liberal or conservative groups. The participants within each separate group commonly mention very different collections of website URLs and use distinct hashtags and words. The split is clearly evident in many highly controversial discussions: people in clusters that we identified as liberal used URLs for mainstream news websites, while groups we identified as conservative used links to conservative news websites and commentary sources. At the center of each group are discussion leaders, the prominent people who are widely replied to or mentioned in the discussion. In polarized discussions, each group links to a different set of influential people or organizations that can be found at the center of each conversation cluster.
While these polarized crowds are common in political conversations on Twitter, it is important to remember that the people who take the time to post and talk about political issues on Twitter are a special group. Unlike many other Twitter members, they pay attention to issues, politicians, and political news, so their conversations are not representative of the views of the full Twitterverse. Moreover, Twitter users are only 18% of internet users and 14% of the overall adult population. Their demographic profile is not reflective of the full population. Additionally, other work by the Pew Research Center has shown that tweeters’ reactions to events are often at odds with overall public opinion— sometimes being more liberal, but not always. Finally, forthcoming survey findings from Pew Research will explore the relatively modest size of the social networking population who exchange political content in their network.
Still, the structure of these Twitter conversations says something meaningful about political discourse these days and the tendency of politically active citizens to sort themselves into distinct partisan camps. Social networking maps of these conversations provide new insights because they combine analysis of the opinions people express on Twitter, the information sources they cite in their tweets, analysis of who is in the networks of the tweeters, and how big those networks are. And to the extent that these online conversations are followed by a broader audience, their impact may reach well beyond the participants themselves.
Our approach combines analysis of the size and structure of the network and its sub-groups with analysis of the words, hashtags and URLs people use. Each person who contributes to a Twitter conversation is located in a specific position in the web of relationships among all participants in the conversation. Some people occupy rare positions in the network that suggest that they have special importance and power in the conversation.
Social network maps of Twitter crowds and other collections of social media can be created with innovative data analysis tools that provide new insight into the landscape of social media. These maps highlight the people and topics that drive conversations and group behavior – insights that add to what can be learned from surveys or focus groups or even sentiment analysis of tweets. Maps of previously hidden landscapes of social media highlight the key people, groups, and topics being discussed.
Conversational archetypes on Twitter
The Polarized Crowd network structure is only one of several different ways that crowds and conversations can take shape on Twitter. There are at least six distinctive structures of social media crowds which form depending on the subject being discussed, the information sources being cited, the social networks of the people talking about the subject, and the leaders of the conversation. Each has a different social structure and shape: divided, unified, fragmented, clustered, and inward and outward hub and spokes.
After an analysis of many thousands of Twitter maps, we found six different kinds of network crowds.

Polarized Crowd: Polarized discussions feature two big and dense groups that have little connection between them. The topics being discussed are often highly divisive and heated political subjects. In fact, there is usually little conversation between these groups despite the fact that they are focused on the same topic. Polarized Crowds on Twitter are not arguing. They are ignoring one another while pointing to different web resources and using different hashtags.
Why this matters: It shows that partisan Twitter users rely on different information sources. While liberals link to many mainstream news sources, conservatives link to a different set of websites.

Tight Crowd: These discussions are characterized by highly interconnected people with few isolated participants. Many conferences, professional topics, hobby groups, and other subjects that attract communities take this Tight Crowd form.
Why this matters: These structures show how networked learning communities function and how sharing and mutual support can be facilitated by social media.

Brand Clusters: When well-known products or services or popular subjects like celebrities are discussed in Twitter, there is often commentary from many disconnected participants: These “isolates” participating in a conversation cluster are on the left side of the picture on the left). Well-known brands and other popular subjects can attract large fragmented Twitter populations who tweet about it but not to each other. The larger the population talking about a brand, the less likely it is that participants are connected to one another. Brand-mentioning participants focus on a topic, but tend not to connect to each other.
Why this matters: There are still institutions and topics that command mass interest. Often times, the Twitter chatter about these institutions and their messages is not among people connecting with each other. Rather, they are relaying or passing along the message of the institution or person and there is no extra exchange of ideas.

Community Clusters: Some popular topics may develop multiple smaller groups, which often form around a few hubs each with its own audience, influencers, and sources of information. These Community Clusters conversations look like bazaars with multiple centers of activity. Global news stories often attract coverage from many news outlets, each with its own following. That creates a collection of medium-sized groups—and a fair number of isolates (the left side of the picture above).
Why this matters: Some information sources and subjects ignite multiple conversations, each cultivating its own audience and community. These can illustrate diverse angles on a subject based on its relevance to different audiences, revealing a diversity of opinion and perspective on a social media topic.

Broadcast Network: Twitter commentary around breaking news stories and the output of well-known media outlets and pundits has a distinctive hub and spoke structure in which many people repeat what prominent news and media organizations tweet. The members of the Broadcast Network audience are often connected only to the hub news source, without connecting to one another. In some cases there are smaller subgroups of densely connected people— think of them as subject groupies—who do discuss the news with one another.
Why this matters: There are still powerful agenda setters and conversation starters in the new social media world. Enterprises and personalities with loyal followings can still have a large impact on the conversation.

Support Network: Customer complaints for a major business are often handled by a Twitter service account that attempts to resolve and manage customer issues around their products and services. This produces a hub and spoke structure that is different from the Broadcast Network pattern. In the Support Network structure, the hub account replies to many otherwise disconnected users, creating outward spokes. In contrast, in the Broadcast pattern, the hub gets replied to or retweeted by many disconnected people, creating inward spokes.
Why this matters: As government, businesses, and groups increasingly provide services and support via social media, support network structures become an important benchmark for evaluating the performance of these institutions. Customer support streams of advice and feedback can be measured in terms of efficiency and reach using social media network maps.
Why is it useful to map the social landscape this way?
Social media is increasingly home to civil society, the place where knowledge sharing, public discussions, debates, and disputes are carried out. As the new public square, social media conversations are as important to document as any other large public gathering. Network maps of public social media discussions in services like Twitter can provide insights into the role social media plays in our society. These maps are like aerial photographs of a crowd, showing the rough size and composition of a population. These maps can be augmented with on the ground interviews with crowd participants, collecting their words and interests. Insights from network analysis and visualization can complement survey or focus group research methods and can enhance sentiment analysis of the text of messages like tweets.
Like topographic maps of mountain ranges, network maps can also illustrate the points on the landscape that have the highest elevation. Some people occupy locations in networks that are analogous to positions of strategic importance on the physical landscape. Network measures of “centrality” can identify key people in influential locations in the discussion network, highlighting the people leading the conversation. The content these people create is often the most popular and widely repeated in these networks, reflecting the significant role these people play in social media discussions.
While the physical world has been mapped in great detail, the social media landscape remains mostly unknown. However, the tools and techniques for social media mapping are improving, allowing more analysts to get social media data, analyze it, and contribute to the collective construction of a more complete map of the social media world. A more complete map and understanding of the social media landscape will help interpret the trends, topics, and implications of these new communication technologies.”
Can We Balance Data Protection With Value Creation?
A “privacy perspective” by Sara Degli Esposti: “In the last few years there has been a dramatic change in the opportunities organizations have to generate value from the data they collect about customers or service users. Customers and users are rapidly becoming collections of “data points” and organizations can learn an awful lot from the analysis of this huge accumulation of data points, also known as “Big Data.”
Some may ask whether it’s even possible to balance the two.
Enter the Big Data Protection Project (BDPP): an Open University study on organizations’ ability to leverage Big Data while complying with EU data protection principles. The study represents a chance for you to contribute to, and learn about, the debate on the reform of the EU Data Protection Directive. It is open to staff with interests in data management or use, from all types of organizations, both for-profit and nonprofit, with interests in Europe.
Join us by visiting the study’s page on the Open University website. Participants will receive a report with all the results. The BDP is a scientific project—no commercial organization is involved—with implications relevant to both policy-makers and industry representatives..
What kind of legislation do we need to create that positive system of incentive for organizations to innovate in the privacy field?
There is no easy answer.
That’s why we need to undertake empirical research into actual information management practices to understand the effects of regulation on people and organizations. Legal instruments conceived with the best intentions can be ineffective or detrimental in practice. However, other factors can also intervene and motivate business players to develop procedures and solutions which go far beyond compliance. Good legislation should complement market forces in bringing values and welfare to both consumers and organizations.
Is European data protection law keeping its promise of protecting users’ information privacy while contributing to the flourishing of the digital economy or not? Will the proposed General Data Protection Regulation (GDPR) be able to achieve this goal? What would you suggest to do to motivate organizations to invest in information security and take information privacy seriously?
Let’s consider for a second some basic ideas such as the eight fundamental data protection principles: notice, consent, purpose specification and limitation, data quality, respect of data subjects’ rights, information security and accountability. Many of these ideas are present in the EU 1995 Data Protection Directive, the U.S. Fair Information Practice Principles (FIPPs) andthe 1980 OECD Guidelines. The fundamental question now is, should all these ideas be brought into the future, as suggested in the proposed new GDPR, orshould we reconsider our approach and revise some of them, as recommended in the 21st century version of the 1980 OECD Guidelines?
As you may know, notice and consent are often taken as examples of how very good intentions can be transformed into actions of limited importance. Rather than increase people’s awareness of the growing data economy, notice and consent have produced a tick-box tendency accompanied by long and unintelligible privacy policies. Besides, consent is rarely freely granted. Individuals give their consent in exchange for some product or service or as part of a job relationship. The imbalance between the two goods traded—think about how youngsters perceive not having access to some social media as a form of social exclusion—and the lack of feasible alternatives often make an instrument, such as the current use made of consent, meaningless.
On the other hand, a principle such as data quality, which has received very limited attention, could offer opportunities to policy-makers and businesses to reopen the debate on users’ control of their personal data. Having updated, accurate data is something very valuable for organizations. Data quality is also key to the success of many business models. New partnerships between users and organizations could be envisioned under this principle.
Finally, data collection limitation and purpose specification could be other examples of the divide between theory and practice: The tendency we see is that people and businesses want to share, merge and reuse data over time and to do new and unexpected things. Of course, we all want to avoid function creep and prevent any detrimental use of our personal data. We probably need new, stronger mechanisms to ensure data are used for good purposes.
Digital data have become economic assets these days. We need good legislation to stop the black market for personal data and open the debate on how each of us wants to contribute to, and benefit from, the data economy.”
Crowdsourcing voices to study Parkinson’s disease
TedMed: “Mathematician Max Little is launching a project that aims to literally give Parkinson’s disease (PD) patients a voice in their own diagnosis and help them monitor their disease progression.
Patients Voice Analysis (PVA) is an open science project that uses phone-based voice recordings and self-reported symptoms, along with software Little designed, to track disease progression. Little, a TEDMED 2013 speaker and TED Fellow, is partnering with the online community PatientsLikeMe, co-founded by TEDMED 2009 speaker James Heywood, and Sage Bionetworks, a non-profit research organization, to conduct the research.
The new project is an extension of Little’s Parkinson’s Voice Initiative, which used speech analysis algorithms to diagnose Parkinson’s from voice records with the help of 17,000 volunteers. This time, he seeks to not only detect markers of PD, but also to add information reported by patients using PatientsLikeMe’s Parkinson’s Disease Rating Scale (PDRS), a tool that documents patients’ answers to questions that measure treatment effectiveness and disease progression….
As openly shared information, the collected data has potential to help vast numbers of individuals by tapping into collective ingenuity. Little has long argued that for science to progress, researchers need to democratize research and move past jostling for credit. Sage Bionetworks has designed a platform called Synapse to allow data sharing with collaborative version control, an effort led by open data advocate John Wilbanks.
“If you can’t share your data, how can you reproduce your science? One of the big problems we’re facing with this kind of medical research is the data is not open and getting access to it is a nightmare,” Little says.
With the PVA project, “Basically anyone can log on download the anonymized data and play around with data mining techniques. We don’t really care what people are able to come up with. We just want the most accurate prediction we can get.
“In research, you’re almost always constrained by what you think is the best way to do things. Unless you open it to the community at large, you’ll never know,” he says.”
OpenRFPs
Clay Johnson: “Tomorrow at CodeAcross we’ll be launching our first community-based project, OpenRFPs. The goal is to liberate the data inside of every state RFP listing website in the country. We hope you’ll find your own state’s RFP site, and contribute a parser.
The Department of Better Technology’s goal is to improve the way government works by making it easier for small, innovative businesses to provide great technology to government. But those businesses can barely make it through the front door when the RFPs themselves are stored in archaic systems, with sloppy user interfaces and disparate data formats, or locked behind paywalls.
What are a business’s current choices? They can pay for subscription based services that are either shady vendors disguised as officious sounding associations, or have less than stellar user interfaces.
The sum of these experiences? The barrier to entry is so high that the small companies who are great at technology are effectively turned away at the door. The only ones who get through are those that can afford to spend employee time scouring for the right RFPs.
But the harm to business isn’t the big harm. Government contracting becomes an overlooked area by the transparency community, (especially at the state and local level,) because the data isn’t available en-masse unless it’s paid for commercially, which comes with a restrictive usage license.”
NatureNet: a model for crowdsourcing the design of citizen science systems
Paper in CSCW Companion ’14, the companion publication of the 17th ACM conference on Computer supported cooperative work & social computing: “NatureNet is citizen science system designed for collecting bio-diversity data in nature park settings. Park visitors are encouraged to participate in the design of the system in addition to collecting bio-diversity data. Our goal is to increase the motivation to participate in citizen science via crowdsourcing: the hypothesis is that when the crowd plays a role in the design and development of the system, they become stakeholders in the project and work to ensure its success. This paper presents a model for crowdsourcing design and citizen science data collection, and the results from early trials with users that illustrate the potential of this approach.”
Can Twitter Predict Major Events Such As Mass Protests?
Emerging Technology From the arXiv : “The idea that social media sites such as Twitter can predict the future has a controversial history. In the last few years, various groups have claimed to be able to predict everything from the outcome of elections to the box office takings for new movies.
It’s fair to say that these claims have generated their fair share of criticism. So it’s interesting to see a new claim come to light.
Today, Nathan Kallus at the Massachusetts Institute of Technology in Cambridge says he has developed a way to predict crowd behaviour using statements made on Twitter. In particular, he has analysed the tweets associated with the 2013 coup d’état in Egypt and says that the civil unrest associated with this event was clearly predictable days in advance.
It’s not hard to imagine how the future behaviour of crowds might be embedded in the Twitter stream. People often signal their intent to meet in advance and even coordinate their behaviour using social media. So this social media activity is a leading indicator of future crowd behaviour.
That makes it seem clear that predicting future crowd behaviour is simply a matter of picking this leading indicator out of the noise.
Kallus says this is possible by mining tweets for any mention of future events and then analysing trends associated with them. “The gathering of crowds into a single action can often be seen through trends appearing in this data far in advance,” he says.
It turns out that exactly this kind of analysis is available from a company called Recorded Future based in Cambridge, which scans 300,000 different web sources in seven different languages from all over the world. It then extracts mentions of future events for later analysis….
The bigger question is whether it’s possible to pick out this evidence in advance. In other words, is possible to make predictions before the events actually occur?
That’s not so clear but there are good reasons to be cautious. First of all, while it’s possible to correlate Twitter activity to real protests, it’s also necessary to rule out false positives. There may be significant Twitter trends that do not lead to significant protests in the streets. Kallus does not adequately address the question of how to tell these things apart.
Then there is the question of whether tweets are trustworthy. It’s not hard to imagine that when it comes to issues of great national consequence, propaganda, rumor and irony may play a significant role. So how to deal with this?
There is also the question of demographics and whether tweets truly represent the intentions and activity of the population as a whole. People who tweet are overwhelmingly likely to be young but there is another silent majority that plays hugely important role. So can the Twitter firehose really represent the intentions of this part of the population too?
The final challenge is in the nature of prediction. If the Twitter feed is predictive, then what’s needed is evidence that it can be used to make real predictions about the future and not just historical predictions about the past.
We’ve looked at some of these problems with the predictive power of social media before and the challenge is clear: if there is a claim to be able to predict the future, then this claim must be accompanied by convincing evidence of an actual prediction about an event before it happens.
Until then, it would surely be wise to be circumspect about the predictive powers of Twitter and other forms of social media.
Ref: arxiv.org/abs/1402.2308: Predicting Crowd Behavior with Big Public Data”
Choosing Not to Choose
New paper by Cass Sunstein: “Choice can be an extraordinary benefit or an immense burden. In some contexts, people choose not to choose, or would do so if they were asked. For example, many people prefer not to make choices about their health or retirement plans; they want to delegate those choices to a private or public institution that they trust (and may well be willing to pay a considerable amount for such delegations). This point suggests that however well-accepted, the line between active choosing and paternalism is often illusory. When private or public institutions override people’s desire not to choose, and insist on active choosing, they may well be behaving paternalistically, through a form of choice-requiring paternalism. Active choosing can be seen as a form of libertarian paternalism, and a frequently attractive one, if people are permitted to opt out of choosing in favor of a default (and in that sense not to choose); it is a form of nonlibertarian paternalism insofar as people are required to choose. For both ordinary people and private or public institutions, the ultimate judgment in favor of active choosing, or in favor of choosing not to choose, depends largely on the costs of decisions and the costs of errors. But the value of learning, and of developing one’s own preferences and values, is also important, and may argue on behalf of active choosing, and against the choice not to choose. For law and policy, these points raise intriguing puzzles about the idea of “predictive shopping,” which is increasingly feasible with the rise of large data sets containing information about people’s previous choices. Some empirical results are presented about people’s reactions to predictive shopping; the central message is that most (but not all) people reject predictive shopping in favor of active choosing.”
Structuring Big Data to Facilitate Democratic Participation in International Law
New paper by Roslyn Fuller: “This is an interdisciplinary article focusing on the interplay between information and communication technology (ICT) and international law (IL). Its purpose is to open up a dialogue between ICT and IL practitioners that focuses on the ways in which ICT can enhance equitable participation in international legal structures, particularly through capturing the possibilities associated with big data. This depends on the ability of individuals to access big data, for it to be structured in a manner that makes it accessible and for the individual to be able to take action based on it.”
LocalWiki turns open local data into open local knowledge
Marina Kukso at OpenGovVoices:” LocalWiki is an open knowledge project focusing on giving everyone the opportunity to collaborate to create and share all kinds of information about the place where they live.
The project started in 2004 in Davis, Calif. as the Davis Wiki, now the primary local information resource for Davis residents. One-in-seven residents have contributed to the project and, in a given month, almost every resident uses it.
In 2010, we received funding from the Knight Foundation to bring LocalWiki to many more communities. We created a wiki software specifically designed for local collaboration and have seen adoption in more than 70 communities worldwide. People now use LocalWiki for everything from mapping out nature trails to planning a grassroots mayoral election candidate debate….
There’s a great deal of expertise within our communities, and at LocalWiki we see part of the mission of our work as providing a platform for people to contextualize and make meaning out of the information made available through open data and open gov efforts at the local level.
There are obviously limitations to the ability of programming laypeople to make use of open data to create new knowledge to drive action, most notably many people’s lack of expertise in data analysis, but with LocalWiki we hope to at least address some of those limitations by making it significantly easier for people to collaborate to create meaning out of open data and to share it with others. This is why LocalWiki has a wysiwyg editor, which includes mapping as a core feature and prioritizes usability in design.
Finally, adding information about a community on LocalWiki is a way to create new open data. It’s incredibly important to make things like internal city crime statistics public, but residents’ perspectives on the relative safety of their neighborhoods is a different kind of data that provides additional insights into public safety challenges and adds complexity to the picture created by statistics.”