How Data Mining could have prevented Tunisia’s Terror attack at the Bardo Museum


Wassim Zoghlami at Medium: “…Data mining is the process of posing queries and extracting useful patterns or trends, often previously unknown, from large amounts of data using various techniques such as those from pattern recognition and machine learning. Lately there has been considerable interest in leveraging data mining for counter-terrorism applications.

Using data on more than 50,000 ISIS-connected Twitter accounts, I was able to establish an understanding of some of the factors that determine how often ISIS attacks occur, what types of terror strikes are used in which geopolitical situations, and other criteria, through graphs of hashtag frequencies and of the frequency of particular groups of words used in the tweets.

A simple data mining project on some of the hashtags and word sequences typically used by ISIS militants in their tweets yielded surprising results. The results show a rise in certain keywords starting March 15, three days before the Bardo Museum attack.

Some of the frequent keywords and hashtags that showed an unusual peak beginning March 15, three days before the attack:

#طواغيت تونس : Tyrants of Tunisia = a reference to the military

بشرى تونس : Good news for Tunisia.

قريبا تونس : Soon in Tunisia.

#إفريقية_للإعلام : The head of social media of Afriqiyah

#غزوة_تونس : The foray of Tunis…
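The kind of frequency analysis described above can be approximated in a few lines of code. The sketch below is not the author’s actual code: the tweet records, field names, and spike rule are illustrative assumptions. It counts how often each hashtag appears per day and flags days on which a tag suddenly exceeds its historical average.

```python
# Count daily hashtag frequencies in a set of tweets and flag unusual peaks.
# Toy data and a simple "factor above historical mean" rule for illustration.
from collections import Counter, defaultdict
from datetime import date
import re
import statistics

tweets = [
    {"date": date(2015, 3, 14), "text": "..."},
    {"date": date(2015, 3, 15), "text": "قريبا تونس #غزوة_تونس"},
    {"date": date(2015, 3, 16), "text": "#غزوة_تونس بشرى تونس"},
]

HASHTAG = re.compile(r"#\w+")

# daily_counts[hashtag][day] -> number of tweets mentioning it that day
daily_counts = defaultdict(Counter)
for t in tweets:
    for tag in set(HASHTAG.findall(t["text"])):
        daily_counts[tag][t["date"]] += 1

def spikes(counts, factor=3.0):
    """Flag days on which a hashtag appears `factor` times its historical mean."""
    days = sorted(counts)
    flagged = []
    for i, day in enumerate(days[1:], start=1):
        baseline = statistics.mean(counts[d] for d in days[:i])
        if baseline > 0 and counts[day] >= factor * baseline:
            flagged.append(day)
    return flagged

for tag, counts in daily_counts.items():
    for day in spikes(counts):
        print(f"{tag}: unusual peak on {day}")
```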

Big Data and Data Mining should be used for national security intelligence

The Tunisian national security services have to leverage big data to predict such attacks and to achieve their objectives as the volume of digital data grows. Some of the challenges facing data mining techniques are that, to carry out effective data mining and extract useful information for counterterrorism and national security, we need to gather all kinds of information about individuals. However, this information could be a threat to individuals’ privacy and civil liberties…(More)”

How Crowdsourcing And Machine Learning Will Change The Way We Design Cities


Shaunacy Ferro at FastCompany: “In 2011, researchers at the MIT Media Lab debuted Place Pulse, a website that served as a kind of “hot or not” for cities. Given two Google Street View images culled from a select few cities including New York City and Boston, the site asked users to click on the one that seemed safer, more affluent, or more unique. The result was an empirical way to measure urban aesthetics.

Now, that data is being used to predict what parts of cities feel the safest. StreetScore, a collaboration between the MIT Media Lab’s Macro Connections and Camera Culture groups, uses an algorithm to create a super high-resolution map of urban perceptions. The algorithmically generated data could one day be used to research the connection between urban perception and crime, as well as to inform urban design decisions.

The algorithm, created by Nikhil Naik, a Ph.D. student in the Camera Culture lab, breaks an image down into its composite features—such as building texture, colors, and shapes. Based on how Place Pulse volunteers rated similar features, the algorithm assigns the streetscape a perceived safety score between 1 and 10. These scores are visualized as geographic points on a map, designed by MIT rising sophomore Jade Philipoom. Each image available from Google Maps in the two cities is represented by a colored dot: red for the locations that the algorithm tags as unsafe, and dark green for those that appear safest. The site, now limited to New York and Boston, will be expanded to feature Chicago and Detroit later this month, and eventually, with data collected from a new version of Place Pulse, will feature dozens of cities around the world….(More)”
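The sketch below only illustrates the regression idea behind StreetScore: learn a mapping from image features to crowd-derived safety scores, then score unseen street views on a 1 to 10 scale. The random feature vectors stand in for the texture, colour, and shape descriptors computed from real images, and the model choice is an assumption, not the actual MIT pipeline.

```python
# Learn a mapping from image-derived features to perceived-safety scores,
# then predict scores for new images. Synthetic data stands in for real
# descriptors and Place Pulse ratings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins for descriptors computed from Place Pulse images
# (e.g. colour histograms, texture statistics).
X_train = rng.random((500, 64))
# Perceived-safety scores in [1, 10], aggregated from pairwise
# "which looks safer?" votes by volunteers.
y_train = 1 + 9 * rng.random(500)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score new street-view images: same descriptors, predicted safety in [1, 10].
X_new = rng.random((3, 64))
print(model.predict(X_new))
```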

Modern Methods for Sentiment Analysis


Review by Michael Czerny: “Sentiment analysis is a common application of Natural Language Processing (NLP) methodologies, particularly classification, whose goal is to extract the emotional content in text. In this way, sentiment analysis can be seen as a method to quantify qualitative data with some sentiment score. While sentiment is largely subjective, sentiment quantification has enjoyed many useful implementations, such as businesses gaining understanding about consumer reactions to a product, or detecting hateful speech in online comments.

The simplest form of sentiment analysis is to use a dictionary of good and bad words. Each word in a sentence has a score, typically +1 for positive sentiment and -1 for negative. Then, we simply add up the scores of all the words in the sentence to get a final sentiment total. Clearly, this has many limitations, the most important being that it neglects context and surrounding words. For example, in our simple model the phrase “not good” may be classified as 0 sentiment, given “not” has a score of -1 and “good” a score of +1. A human would likely classify “not good” as negative, despite the presence of “good”.
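A minimal sketch of this dictionary approach, with a toy lexicon in place of a real word list, shows both the method and its “not good” failure mode:

```python
# Dictionary-based sentiment: each known word carries +1 or -1, and the
# sentence score is the sum. The tiny lexicon is illustrative only.
LEXICON = {"good": +1, "great": +1, "bad": -1, "not": -1, "terrible": -1}

def dictionary_sentiment(sentence):
    return sum(LEXICON.get(w, 0) for w in sentence.lower().split())

print(dictionary_sentiment("the movie was good"))  #  1
print(dictionary_sentiment("not good"))            #  0: the context of "not" is lost
print(dictionary_sentiment("terrible acting"))     # -1
```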

Another common method is to treat a text as a “bag of words”. We treat each text as a 1 by N vector, where N is the size of our vocabulary. Each column is a word, and the value is the number of times that word appears. For example, the phrase “bag of bag of words” might be encoded as [2, 2, 1]. This could then be fed into a machine learning algorithm for classification, such as logistic regression or SVM, to predict sentiment on unseen data. Note that this requires data with known sentiment to train on in a supervised fashion. While this is an improvement over the previous method, it still ignores context, and the size of the data increases with the size of the vocabulary.
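A bag-of-words classifier of this kind can be sketched with scikit-learn; the toy training texts and labels below are made up for illustration:

```python
# Bag-of-words sentiment classification: vectorise texts into word-count
# vectors, then train a supervised classifier on examples with known labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, works well", "awful, waste of money",
         "really good value", "bad quality, very disappointed"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict sentiment of unseen texts.
print(model.predict(["good value for money", "bad product"]))
```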

Word2Vec and Doc2Vec

Recently, Google developed a method called Word2Vec that captures the context of words, while at the same time reducing the size of the data. Word2Vec is actually two different methods: Continuous Bag of Words (CBOW) and Skip-gram. In the CBOW method, the goal is to predict a word given the surrounding words. Skip-gram is the converse: we want to predict a window of words given a single word (see Figure 1). Both methods use artificial neural networks as their classification algorithm. Initially, each word in the vocabulary is a random N-dimensional vector. During training, the algorithm learns the optimal vector for each word using the CBOW or Skip-gram method….(More)
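A small sketch of training Word2Vec with the gensim library follows; the sentences are toy examples and the hyperparameters are illustrative. The sg flag switches between the CBOW and Skip-gram methods described above.

```python
# Train a Word2Vec model (gensim 4.x API) on a toy corpus.
# sg=0 selects CBOW, sg=1 selects Skip-gram.
from gensim.models import Word2Vec

sentences = [
    ["the", "movie", "was", "surprisingly", "good"],
    ["the", "film", "was", "really", "enjoyable"],
    ["a", "terrible", "waste", "of", "time"],
]

model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

# Each word is now an N-dimensional vector; similar words end up close together.
print(model.wv["movie"][:5])
print(model.wv.most_similar("movie", topn=3))
```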

Citizen Science for Citizen Access to Law


Paper by Michael Curtotti, Wayne Weibel, Eric McCreath, Nicolas Ceynowa, Sara Frug, and Tom R Bruce: “This paper sits at the intersection of citizen access to law, legal informatics and plain language. The paper reports the results of a joint project of the Cornell University Legal Information Institute and the Australian National University which collected thousands of crowdsourced assessments of the readability of law through the Cornell LII site. The aim of the project is to enhance accuracy in the prediction of the readability of legal sentences. The study requested readers on legislative pages of the LII site to rate passages from the United States Code and the Code of Federal Regulations and other texts for readability and other characteristics. The research provides insight into who uses legal rules and how they do so. The study enables conclusions to be drawn as to the current readability of law and spread of readability among legal rules. The research is intended to enable the creation of a dataset of legal rules labelled by human judges as to readability. Such a dataset, in combination with machine learning, will assist in identifying factors in legal language which impede readability and access for citizens. As far as we are aware, this research is the largest ever study of readability and usability of legal language and the first research which has applied crowdsourcing to such an investigation. The research is an example of the possibilities open for enhancing access to law through engagement of end users in the online legal publishing environment for enhancement of legal accessibility and through collaboration between legal publishers and researchers….(More)”
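A speculative sketch of the kind of model such a labelled dataset could support (predicting crowd-rated readability of a legal sentence from simple surface features) is shown below. The features, example ratings, and model are assumptions for illustration, not the paper’s methodology.

```python
# Predict crowd-rated readability of legal sentences from simple surface
# features. Hypothetical features, ratings, and model for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

def features(sentence):
    words = sentence.split()
    return [
        len(words),                                    # sentence length
        float(np.mean([len(w) for w in words])),       # average word length
        sum(w.lower() in {"notwithstanding", "pursuant", "heretofore"}
            for w in words),                           # count of legalese terms
    ]

sentences = [
    "The permit must be renewed every year.",
    "Notwithstanding any other provision of law, the Secretary may, pursuant "
    "to subsection (b), waive the requirements heretofore described.",
]
crowd_ratings = [6.5, 2.0]  # hypothetical readability ratings from readers

X = np.array([features(s) for s in sentences])
model = LinearRegression().fit(X, crowd_ratings)
print(model.predict(X))
```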

The End of Asymmetric Information


Essay by Alex Tabarrok and Tyler Cowen: “Might the age of asymmetric information – for better or worse – be over?  Market institutions are rapidly evolving to a situation where very often the buyer and the seller have roughly equal knowledge. Technological developments are giving everyone who wants it access to the very best information when it comes to product quality, worker performance, matches to friends and partners, and the nature of financial transactions, among many other areas.

These developments will have implications for how markets work, how much consumers benefit, and also economic policy and the law. As we will see, there may be some problematic sides to these new arrangements, specifically when it comes to privacy. Still, a large amount of economic regulation seems directed at a set of problems which, in large part, no longer exist…

Many “public choice” problems are really problems of asymmetric information. In William Niskanen’s (1974) model of bureaucracy, government workers usually benefit from larger bureaus, and they are able to expand their bureaus to inefficient size because they are the primary providers of information to politicians. Some bureaus, such as the NSA and the CIA, may still be able to use secrecy to benefit from information asymmetry. For instance they can claim to politicians that they need more resources to deter or prevent threats, and it is hard for the politicians to have well-informed responses on the other side of the argument. Timely, rich information about most other bureaucracies, however, is easily available to politicians and increasingly to the public as well. As information becomes more symmetric, Niskanen’s (1974) model becomes less applicable, and this may help check the growth of unneeded bureaucracy.

Cheap sensors are greatly extending how much information can be economically gathered and analyzed. It’s not uncommon for office workers to have every keystroke logged. When calling customer service, who has not been told that “this call may be monitored for quality control purposes”? Service-call workers have their location tracked through cell phones. Even information that once was thought to be purely subjective can now be collected and analyzed, often with the aid of smart software or artificial intelligence. One firm, for example, uses badges equipped with microphones, accelerometers, and location sensors to measure tone of voice, posture, and body language, as well as who spoke to whom and for how long (Lohr 2014). The purpose is not only to monitor workers but to deduce when, where and why workers are the most productive. We are again seeing trade-offs which bring greater productivity, and limit asymmetric information, albeit at the expense of some privacy.

As information becomes more prevalent and symmetric, earlier solutions to asymmetric problems will become less necessary. When employers do not easily observe workers, for example, employers may pay workers unusually high wages, generating a rent. Workers will then work at high levels despite infrequent employer observation, to maintain their future rents (Shapiro and Stiglitz 1984). But those higher wages involved a cost, namely that fewer workers were hired, and the hires that were made often were directed to people who were already known to the firm. Better monitoring of workers will mean that employers will hire more people and furthermore they may be more willing to take chances on risky outsiders, rather than those applicants who come with impeccable pedigree. If the outsider does not work out and produce at an acceptable level, it is easy enough to figure this out and fire them later on….(More)”
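For reference, the wage premium the excerpt alludes to can be made explicit with the textbook no-shirking condition from Shapiro and Stiglitz (1984); the formula below is the standard form from the literature, not something quoted in the essay.

```latex
% Standard no-shirking condition from Shapiro and Stiglitz (1984).
%   w  : wage            e : effort cost      q : probability of detecting shirking
%   b  : exogenous separation rate            r : discount rate
%   V_u: expected lifetime utility of an unemployed worker
\[
  w \;\ge\; r V_u + e + \frac{e}{q}\,(r + b)
\]
% The harder shirking is to detect (small q), the larger the wage premium
% (the "rent") the firm must pay; better monitoring (large q) shrinks it.
```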

Big Data for Social Good


Introduction to a Special Issue of the Journal “Big Data” by Charlie Catlett and Rayid Ghani: “…organizations focused on social good are realizing the potential as well but face several challenges as they seek to become more data-driven. The biggest challenge they face is a paucity of examples and case studies on how data can be used for social good. This special issue of Big Data is targeted at tackling that challenge and focuses on highlighting some exciting and impactful examples of work that uses data for social good. The special issue is just one example of the recent surge in such efforts by the data science community. …

This special issue solicited case studies and problem statements that would either highlight (1) the use of data to solve a social problem or (2) social challenges that need data-driven solutions. From roughly 20 submissions, we selected 5 articles that exemplify this type of work. These cover five broad application areas: international development, healthcare, democracy and government, human rights, and crime prevention.

“Understanding Democracy and Development Traps Using a Data-Driven Approach” (Ranganathan et al.) details a data-driven model linking democracy, cultural values, and socioeconomic indicators to identify two types of “traps” that hinder the development of democracy. They use historical data to detect causal factors and make predictions about the time expected for a given country to overcome these traps.

“Targeting Villages for Rural Development Using Satellite Image Analysis” (Varshney et al.) discusses two case studies that use data and machine learning techniques for international economic development—solar-powered microgrids in rural India and targeting financial aid to villages in sub-Saharan Africa. In the process, the authors stress the importance of understanding the characteristics and provenance of the data and the criticality of incorporating local “on the ground” expertise.

In “Human Rights Event Detection from Heterogeneous Social Media Graphs,” Chen and Neil describe efficient and scalable techniques to use social media in order to detect emerging patterns in human rights events. They test their approach on recent events in Mexico and show that they can accurately detect relevant human rights–related tweets prior to international news sources, and in some cases, prior to local news reports, which could potentially lead to more timely, targeted, and effective advocacy by relevant human rights groups.

“Finding Patterns with a Rotten Core: Data Mining for Crime Series with Core Sets” (Wang et al.) describes a case study with the Cambridge Police Department, using a subspace clustering method to analyze the department’s full housebreak database, which contains detailed information from thousands of crimes from over a decade. They find that the method allows human crime analysts to handle vast amounts of data and provides new insights into true patterns of crime committed in Cambridge…..(More)

Data scientists rejoice! There’s an online marketplace selling algorithms from academics


SiliconRepublic: “Algorithmia, an online marketplace that connects computer science researchers’ algorithms with developers who may have uses for them, has exited its private beta.

Algorithms are essential to our online experience. Google uses them to determine which search results are the most relevant. Facebook uses them to decide what should appear in your news feed. Netflix uses them to make movie recommendations.

Founded in 2013, Algorithmia could be described as an app store for algorithms, with over 800 of them available in its library. These algorithms provide the means of completing various tasks in the fields of machine learning, audio and visual processing, and computer vision.

Algorithmia found a way to monetise algorithms by creating a platform where academics can share their creations and charge a royalty fee per use, while developers and data scientists can request specific algorithms in return for a monetary reward. One such suggestion is for ‘punctuation prediction’, which would insert correct punctuation and capitalisation in speech-to-text translation.

While it’s not the first algorithm marketplace online, Algorithmia will accept and sell any type of algorithm, hosting them on its servers. What this means is that developers need only add a simple piece of code to their software in order to send a query to Algorithmia’s servers, so the algorithm itself doesn’t have to be integrated in its entirety….
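A hypothetical sketch of what that “simple piece of code” might look like is given below; the endpoint URL, algorithm path, and auth header are placeholders for illustration, not Algorithmia’s documented API.

```python
# Query a hosted algorithm over HTTP. Endpoint, algorithm path, and auth
# scheme are placeholders, not the real marketplace API.
import requests

API_KEY = "YOUR_API_KEY"                           # issued by the marketplace
ALGO = "demo_user/punctuation_prediction/1.0"      # hypothetical algorithm path

response = requests.post(
    f"https://api.example-algorithm-marketplace.com/v1/algo/{ALGO}",
    json={"text": "hello how are you im fine thanks"},
    headers={"Authorization": f"Simple {API_KEY}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g. punctuated, capitalised text returned by the server
```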

Computer science researchers can spend years developing algorithms, only for them to be published in a scientific journal, never to be read by software engineers.

Algorithmia intends to create a community space where academics and engineers can meet to discuss and refine these algorithms for practical use. A voting and commenting system on the site will allow users to engage and even share insights on how contributions can be improved.

To that end, Algorithmia’s ultimate goal is to advance the development of algorithms as well as their discovery and use….(More)”

Encyclopedia of Social Network Analysis and Mining


“The Encyclopedia of Social Network Analysis and Mining (ESNAM) is the first major reference work to integrate fundamental concepts and research directions in the areas of social networks and their applications to data mining. While ESNAM reflects the state of the art in social network research, the field had its start in the 1930s, when fundamental issues in social network research were broadly defined. The networks studied then were limited to relatively small numbers of nodes (actors) and links. More recently, the advent of electronic communication, and in particular of online communities, has created social networks of hitherto unimaginable sizes. People around the world are directly or indirectly connected by popular social networks established using web-based platforms rather than by physical proximity.

Reflecting the interdisciplinary nature of this unique field, the essential contributions of diverse disciplines, from computer science, mathematics, and statistics to sociology and behavioral science, are described among the 300 authoritative yet highly readable entries. Students will find a world of information and insight behind the familiar façade of the social networks in which they participate. Researchers and practitioners will benefit from a comprehensive perspective on the methodologies for analysis of constructed networks, and the data mining and machine learning techniques that have proved attractive for sophisticated knowledge discovery in complex applications. Also addressed is the application of social network methodologies to other domains, such as web networks and biological networks….(More)”

Making emotive games from open data


Katie Collins at WIRED: “Microsoft researcher Kati London’s aim is “to try to get people to think of data in terms of personalities, relationships and emotions”, she tells the audience at the Story Festival in London. Through Project Sentient Data, she uses her background in games development to create fun but meaningful experiences that bridge online interactions and things that are happening in the real world.
One such experience invited children to play against the real-time flow of London traffic through an online game called the Code of Everand. The aim was to test the road safety knowledge of 9-11 year olds and “make alertness something that kids valued”.
The core mechanic of the game was that of a normal world populated by little people, containing spirit channels that only kids could see and go through. Within these spirit channels, the lorries and cars from the streets became monsters. The children had to assess what kind of dangers the monsters posed and use their tools to dispel them.
“Games are great ways to blur and observe the ways people interact with real-world data,” says London.
In one of her earlier projects back in 2005, London used her knowledge of horticulture to bring artificial intelligence to plants. “Almost every workspace I go into has a half-dead plant in it, so we gave plants the ability to tell us what they need.” It was, she says, an exercise in “humanising data” that led to further projects that saw her create self-aware street signs and a dynamic city map that expressed shame neighbourhood by neighbourhood depending on the open dataset of public complaints in New York.
A further project turned complaint data into cartoons on Instagram every week. London praised the open data initiative in New York, but added that for people to access it, they had to know it existed and know where to find it. The cartoons were a “lightweight” form of “civic engagement” that helped to integrate hyperlocal issues into everyday conversation.
London also gamified community engagement through a project commissioned by the Knight Foundation called Macon Money….(More)”.

Cultures of Code


Brian Hayes in the American Scientist: “Kim studies parallel algorithms, designed for computers with thousands of processors. Chris builds computer simulations of fluids in motion, such as ocean currents. Dana creates software for visualizing geographic data. These three people have much in common. Computing is an essential part of their professional lives; they all spend time writing, testing, and debugging computer programs. They probably rely on many of the same tools, such as software for editing program text. If you were to look over their shoulders as they worked on their code, you might not be able to tell who was who.
Despite the similarities, however, Kim, Chris, and Dana were trained in different disciplines, and they belong to different intellectual traditions and communities. Kim, the parallel algorithms specialist, is a professor in a university department of computer science. Chris, the fluids modeler, also lives in the academic world, but she is a physicist by training; sometimes she describes herself as a computational scientist (which is not the same thing as a computer scientist). Dana has been programming since junior high school but didn’t study computing in college; at the startup company where he works, his title is software developer.
These factional divisions run deeper than mere specializations. Kim, Chris, and Dana belong to different professional societies, go to different conferences, read different publications; their paths seldom cross. They represent different cultures. The resulting Balkanization of computing seems unwise and unhealthy, a recipe for reinventing wheels and making the same mistake three times over. Calls for unification go back at least 45 years, but the estrangement continues. As a student and admirer of all three fields, I find the standoff deeply frustrating.
Certain areas of computation are going through a period of extraordinary vigor and innovation. Machine learning, data analysis, and programming for the web have all made huge strides. Problems that stumped earlier generations, such as image recognition, finally seem to be yielding to new efforts. The successes have drawn more young people into the field; suddenly, everyone is “learning to code.” I am cheered by (and I cheer for) all these events, but I also want to whisper a question: Will the wave of excitement ever reach other corners of the computing universe?…
What’s the difference between computer science, computational science, and software development?…(More)”