Frontiers in Massive Data Analysis


New Report from the National Research Council: “From Facebook to Google searches to bookmarking a webpage in our browsers, today’s society generates an enormous amount of data. Some internet-based companies, such as Yahoo!, are storing exabytes (10^18 bytes) of data. Like these companies and the rest of the world, scientific communities are also generating large amounts of data—mostly terabytes, and in some cases approaching petabytes—from experiments, observations, and numerical simulation. Indeed, the scientific community, along with the defense enterprise, has been a leader in generating and using large data sets for many years. The issue that arises with this new scale of data is how to handle it—this includes sharing the data, enabling data security, working with different data formats and structures, dealing with highly distributed data sources, and more.
Frontiers in Massive Data Analysis presents the Committee on the Analysis of Massive Data’s work to assess the current state of data analysis for mining massive data sets, to identify gaps in current practice, and to develop methods to fill those gaps. The committee examines the frontiers of research enabling the analysis of massive data, including data representation and methods for keeping humans in the data-analysis loop. The report includes the committee’s recommendations, details on the types of data that make up massive data sets, and information on the seven computational giants of massive data analysis.”

City Data: Big, Open and Linked


Working Paper by Mark S. Fox (University of Toronto): “Cities are moving towards policymaking based on data. They are publishing data using Open Data standards, linking data from disparate sources, allowing the crowd to update their data with smartphone apps that use open APIs, and applying “Big Data” techniques to discover relationships that lead to greater efficiencies.
One Big City Data example is from New York City (Mayer-Schönberger & Cukier, 2013). Building owners were illegally converting their buildings into rooming houses that held 10 times the number of people they were designed for. These buildings posed a number of problems, including fire hazards, drugs, crime, disease, and pest infestations. New York City has over 900,000 properties but only 200 inspectors, who received over 25,000 illegal-conversion complaints per year. The challenge was to distinguish nuisance complaints from those worth investigating; with the methods then in use, only 13% of inspections resulted in vacate orders.
New York’s Analytics team created a dataset that combined data from 19 agencies, including buildings, preservation, police, fire, tax, and building permits. By combining data analysis with expertise gleaned from inspectors (e.g., buildings that had recently received a building permit were less likely to be a problem, since they were being well maintained), the team was able to develop a rating system for complaints. Based on this analysis, they were able to rate complaints such that inspectors issued vacate orders in 70% of their visits, a fivefold increase in efficiency…
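The rating approach described above can be sketched in miniature. The snippet below is purely illustrative: the field names, signals, and weights are hypothetical stand-ins, not those actually used by the NYC Analytics team.

```python
# Illustrative only: a toy complaint scorer in the spirit of the NYC
# approach. Every field name and weight here is hypothetical.

def score_complaint(building):
    """Return a priority score for an illegal-conversion complaint,
    combining signals merged from several (hypothetical) agency feeds."""
    score = 0.0
    if building.get("tax_arrears"):             # finance data
        score += 2.0
    if building.get("prior_vacate_order"):      # buildings data
        score += 3.0
    if building.get("fire_incidents", 0) > 0:   # fire department data
        score += 1.5
    if building.get("recent_building_permit"):  # a well-maintained-building
        score -= 2.0                            # signal: lowers priority
    return score

def triage(complaints, top_n):
    """Rank complaints so inspectors visit the highest-scoring first."""
    return sorted(complaints, key=score_complaint, reverse=True)[:top_n]
```

The point of the design is the join, not the arithmetic: each signal only exists because data from a different agency was merged onto the same property record.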
This paper provides an introduction to the concepts that underlie Big City Data. It explains the concepts of Open, Unified, Linked and Grounded data that lie at the heart of the Semantic Web. It then builds on this by discussing Data Analytics, which includes Statistics, Pattern Recognition and Machine Learning. Finally, we discuss Big Data as the extension of Data Analytics to the Cloud, where massive amounts of computing power and storage are available for processing large data sets. We use city data to illustrate each.”

Microsensors help map crowdsourced pollution data


Elena Craft in GreenBiz: Michael Heimbinder, a Brooklyn entrepreneur, hopes to empower individuals with his small-scale air quality monitoring system, AirCasting. The AirCasting system uses a mobile, Bluetooth-enabled air monitor not much larger than a smartphone to measure carbon dioxide, carbon monoxide, nitrogen dioxide, particulate matter and other pollutants. An accompanying Android app records the readings and formats them into an emissions map.
Alternatively, another instrument, the Air Quality Egg, comes pre-assembled and ready to use. Innovative air-monitoring systems such as AirCasting and the Air Quality Egg empower ordinary citizens to monitor the pollution they encounter daily and to proactively address problematic sources of pollution.
This technology is part of a growing movement to enable the use of small sensors. In response to inquiries about small-sensor data, the EPA is researching the next generation of air measuring technologies. EPA experts are working with sensor developers to evaluate data quality and understand useful sensor applications. Through this ongoing collaboration, the EPA hopes to bolster measurements from conventional, stationary air-monitoring systems with data collected from individuals’ air quality microsensors….
Like many technologies emerging from the big data revolution and innovations in the energy sector, microsensing technology provides a wealth of high-quality data at a relatively low cost. It allows us to track previously undetected air pollution from traditional sources of urban smog, such as highways, and unconventional sources of pollution. Microsensing technology not only educates the public, but also helps to enlighten regulators so that policymakers can work from the facts to protect citizens’ health and welfare.

Capitol Words


About Capitol Words: “For every day Congress is in session, Capitol Words visualizes the most frequently used words in the Congressional Record, giving you an at-a-glance view of which issues lawmakers address on a daily, weekly, monthly and yearly basis. Capitol Words lets you see the most popular words spoken by lawmakers on the House and Senate floor.

Methodology

The contents of the Congressional Record are downloaded daily from the website of the Government Printing Office. The GPO distributes the Congressional Record in ZIP files containing the contents of the record in plain-text format.

Each text file is parsed and turned into an XML document, with things like the title and speaker marked up. The contents of each file are then split up into words and phrases — from one word to five.
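The “words and phrases — from one word to five” step described above is standard n-gram extraction. A minimal sketch in Python follows; the tokenization rule is an assumption for illustration, and Capitol Words’ actual parser may differ.

```python
import re

def ngrams(text, max_n=5):
    """Split plain text into word n-grams of length 1 through max_n."""
    # Assumed tokenization: lowercase, keep letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams
```

Counting how often each resulting phrase appears per day is then a simple aggregation before the data is indexed.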

The resulting data is saved to a search engine. Capitol Words has data from 1996 to the present.”

Open Data Directory – Use Cases and Requirements


World Wide Web Foundation: “Today, we’re pleased to be publishing a report entitled “Open Data Directory: Use Cases and Requirements”. The full report can be downloaded here.

As we noted in April when we released a draft for comment, high-quality information and content references are essential in innovative environments such as Open Data, where sharing and reuse are routine, both to advance the field and to give Open Data initiatives the visibility and recognition they need.

Although only a few years ago it was nearly impossible to find information and examples of Open Government Data initiatives and their components, there is now a growing and varied range of Open Data resources all over the Web.
Given the increasing number of Open Data-related activities all around the world, and the social, economic or cultural diversity within the different countries, no single person or organization could grasp the whole scope of such a huge amount of information.
Any government or organization interested in Open Data would benefit greatly from this existing and growing knowledge base, so this scenario represents an invaluable opportunity to construct a neutral and trustworthy central directory that can help us to structure references, share best practices, and, generally speaking, mobilize the global Open Data community around it….”

Analyzing the Analyzers


An Introspective Survey of Data Scientists and Their Work, by Harlan Harris, Sean Murphy, and Marck Vaisman: “There has been intense excitement in recent years around activities labeled “data science,” “big data,” and “analytics.” However, the lack of clarity around these terms and, particularly, around the skill sets and capabilities of their practitioners has led to inefficient communication between “data scientists” and the organizations requiring their services. This lack of clarity has frequently led to missed opportunities. To address this issue, we surveyed several hundred practitioners via the Web to explore the varieties of skills, experiences, and viewpoints in the emerging data science community.

We used dimensionality reduction techniques to divide potential data scientists into five categories based on their self-ranked skill sets (Statistics, Math/Operations Research, Business, Programming, and Machine Learning/Big Data), and four categories based on their self-identification (Data Researchers, Data Businesspeople, Data Engineers, and Data Creatives). Further examining the respondents based on their division into these categories provided additional insights into the types of professional activities, educational background, and even scale of data used by different types of Data Scientists.
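As a rough illustration of this kind of pipeline, the sketch below projects self-ranked skill vectors onto two principal components and groups respondents with a basic k-means loop. It is a generic stand-in, not the authors’ actual method (the report’s own technique may differ), and it uses a naive initialization for brevity.

```python
import numpy as np

def cluster_skills(X, k=5, iters=50):
    """Cluster respondents by their self-ranked skill vectors:
    PCA down to 2 components, then a basic k-means loop.
    Naive init: the first k respondents seed the clusters."""
    Xc = X - X.mean(axis=0)                      # center each skill column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:2].T                            # 2-D PCA projection
    centers = Z[:k].copy()
    for _ in range(iters):
        # Assign each respondent to the nearest center, then recenter.
        dists = np.linalg.norm(Z[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels
```

With five skill-group columns (Statistics, Math/OR, Business, Programming, ML/Big Data) as input, the returned labels would correspond to categories like the four self-identification groups the authors describe.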
In this report, we combine our results with insights and data from others to provide a better understanding of the diversity of practitioners, and to argue for the value of clearer communication around roles, teams, and careers.”

Xerox PARC Tackles Online Dating’s Biggest Conundrum


The Physics arXiv Blog: “Online dating has changed the way people start relationships. In 2000, a few hundred thousand individuals were experimenting with online dating. Today, more than 40 million people have signed up to meet their dream man or woman online. That kind of success is reflected in the fact that this industry is currently worth some $1.9 billion in annual revenue.
Of course, nobody would claim that online dating is the perfect way to meet a mate. One problem in particular is whether to trust the information that a potential date has given. How do you know that this person isn’t being economical with the truth?…
The new approach is simple. The idea these guys have come up with is to use an app that connects to a person’s Facebook page (or other social network page) and then compare the information there with the information on the dating profile. If the data is the same, then it is certified. The beauty of this system is that the Facebook details are not open to external scrutiny—the app does not take, make public or display any information from the social network. It simply compares the information from the two sites.
Any discrepancy indicates that something, somewhere, is wrong, and the ambiguous details are not certified…. This process of certification gives users a greater sense of security because Facebook data is largely peer-reviewed already.
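The comparison scheme described above amounts to a field-by-field match. The sketch below is a hypothetical illustration of the idea, not the researchers’ implementation; the field names are invented, and, as in the described design, nothing from the social-network side is exposed, only which fields matched.

```python
# Hypothetical sketch of the certification idea: compare a dating
# profile against fields pulled from a social-network profile,
# certifying only the fields whose values agree.

def certify(dating_profile, social_profile):
    """Return the set of fields whose values agree across both profiles.
    Fields that disagree, or are missing on either side, stay uncertified."""
    certified = set()
    for field, value in dating_profile.items():
        if field in social_profile and social_profile[field] == value:
            certified.add(field)
    return certified
```

For example, a profile claiming `{"age": 34, "city": "Palo Alto"}` checked against social data listing a different age would come back with only `city` certified; the true age is never revealed, only the mismatch.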
Ref: arxiv.org/abs/1303.4155: Bootstrapping Trust in Online Dating: Social Verification of Online Dating Profiles”

Gamification: A Short History


Ty McCormick in Foreign Policy: “If you’re checking in on Foursquare or ramping up the “strength” of your LinkedIn profile, you’ve just been gamified — whether or not you know it. “Gamification,” today’s hottest business buzzword, is gaining traction everywhere from corporate boardrooms to jihadi chat forums, and its proponents say it can revolutionize just about anything, from education to cancer treatment to ending poverty. While the global market for gamification is expected to explode from $242 million in 2012 to $2.8 billion in 2016, according to market analysis firm M2 Research, there is a growing chorus of critics who think it’s little more than a marketing gimmick. So is the application of game mechanics to everyday life more than just a passing fad? You decide.
1910
Kellogg’s offers its first cereal “premium,” the Funny Jungleland Moving-Pictures book, free with every two boxes. Two years later, Cracker Jack starts putting prizes, from stickers to baseball cards, in its boxes of caramel-coated corn snacks. “A prize in every box” is an instant hit; over the next 100 years, Cracker Jack gives away more than 23 billion in-package treasures. By the 1950s, the concept of gamification is yet to be born, but its primary building block — fun — is motivating billions of consumers around the world.
1959
Duke University sociologist Donald F. Roy publishes “Banana Time,” an ethnographic study of garment workers in Chicago. Roy chronicles how workers use “fun” and “fooling” on the factory room floor — including a daily ritual game in which workers steal a banana — to stave off the “beast of monotony.” The notion that fun can enhance job satisfaction and productivity inspires reams of research on games in the workplace….”

How Open Data Can Fight Climate Change


New blog post by Joel Gurin, Founder and Editor, OpenDataNow.com: “When people point to the value of Open Data from government, they often cite the importance of weather data from NOAA, the National Oceanic and Atmospheric Administration. That data has given us the Weather Channel, more accurate forecasts, and a number of weather-based companies. But the most impressive – and one of the best advertisements for government Open Data – may well be The Climate Corporation, headquartered in San Francisco.
Founded in 2006 under the name WeatherBill, The Climate Corporation was started to sell a better kind of weather insurance. But it’s grown into a company that could help farmers around the world plan around climate change, increase their crop yields, and become part of a new green revolution.
The company’s work is especially relevant in light of President Obama’s speech yesterday on new plans to fight climate change. We know that whatever we do to reduce carbon emissions now, we’ll still need to deal with changes that are already irreversible. The Climate Corporation’s work can be part of that solution…
The company has developed a new service, Climate.com, that is free to policyholders and available to others for a fee….
Their work may become part of a global Green Revolution 2.0. The U.S. Government’s satellite data doesn’t stop at the border: It covers the entire planet.  The Climate Corporation is now looking for ways to apply its work internationally, probably starting with Australia, which has relevant data of its own.
Start with insurance sales, end up by changing the world. The power of Open Data has never been clearer.”

Knight News Challenge on Open Gov


Press Release: “Knight Foundation today named eight projects as winners of the Knight News Challenge on Open Gov, awarding the recipients more than $3.2 million for their ideas.
The projects will provide new tools and approaches to improve the way people and governments interact. They tackle a range of issues from making it easier to open a local business to creating a simulator that helps citizens visualize the impact of public policies on communities….
Each of the winning projects offers a solution to a real-world need. They include:
Civic Insight: Providing up-to-date information on vacant properties so that communities can find ways to make tangible improvements to local spaces;
OpenCounter: Making it easier for residents to register and create new businesses by building open source software that governments can use to simplify the process;
Open Gov for the Rest of Us: Providing residents in low-income neighborhoods in Chicago with the tools to access and demand better data around issues important to them, like housing and education;
Outline.com: Launching a public policy simulator that helps people visualize the impact that public policies like health care reform and school budget changes might have on local economies and communities;
Oyez Project: Making state and appellate court documents freely available and useful to journalists, scholars and the public, by providing straightforward summaries of decisions, free audio recordings and more;
Procure.io: Making government contract bidding more transparent by simplifying the way smaller companies bid on government work;
GitMachines: Supporting government innovation by creating tools and servers that meet government regulations, so that developers can easily build and adopt new technology;
Plan in a Box: Making it easier to discover information about local planning projects, by creating a tool that governments and contractors can use to easily create websites with updates that also allow public input into the process.

Now in its sixth year, the Knight News Challenge accelerates media innovation by funding breakthrough ideas in news and information. Winners receive a share of $5 million in funding and support from Knight’s network of influential peers and advisors to help advance their ideas. Past News Challenge winners have created a lasting impact. They include: DocumentCloud, which analyzes and annotates public documents – turning them into data; Tools for OpenStreetMap, which makes it easier to contribute to the editable map of the world; and Safecast, which helps people measure air quality and became the leading provider of pollution data following the 2011 earthquake and tsunami in Japan.
For more, visit newschallenge.org and follow #newschallenge on Twitter.”