Hachem, Sara et al in Research and Technologies for Society and Industry Leveraging a better tomorrow (RTSI): “While the design of smart city ICT systems of today is still largely focused on (and therefore limited to) passive sensing, the emergence of mobile crowd-sensing calls for more active citizen engagement in not only understanding but also shaping of our societies. The Urban Civics Internet of Things (IoT) middleware enables such involvement while effectively closing several feedback loops by including citizens in the decision-making process thus leading to smarter and healthier societies. We present our initial design and planned experimental evaluation of city-scale architecture components where data assimilation, actuation and citizen engagement are key enablers toward democratization of urban data, longer-term transparency, and accountability of urban development policies. All of these are building blocks of smart cities and societies….(More)”
Do We Need to Educate Open Data Users?
Tony Hirst at IODC: “Whilst promoting the publication of open data is a key, indeed necessary, ingredient in driving the global open data agenda, promoting initiatives that support the use of open data is perhaps an even more pressing need….
This, then, is the first issue we need to address: improving basic levels of literacy in interpreting – and manipulating (for example, sorting and grouping) – simple tables and charts. Sensemaking, in other words: what does the chart you’ve just produced actually say? What story does it tell? And there’s an added benefit that arises from learning to read and critique charts better – it makes you better at creating your own.
Associated with reading stories from data comes the reason for telling the story and putting the data to work. How does “data” help you make a decision, or track the impact of a particular intervention? (Your original question should also have informed the data you searched for in the first place). Here we have a need to develop basic skills in how to actually use data, from finding anomalies to hold publishers to account, to using the data as part of a positive advocacy campaign.
After a quick read, on site, of some of the stories the data might have to tell, there may be a need to do further analysis, or more elaborate visualization work. At this point, a range of technical craft skills often come into play, as well as statistical knowledge.
Many openly published datasets just aren’t that good – they’re “dirty”, full of misspellings, missing data, things in the wrong place or wrong format, even if the data they do contain is true. A significant amount of time that should be spent analyzing the data gets spent trying to clean the data set and get it into a form where it can be worked with. I would argue that a data technician, with a wealth of craft knowledge about how to repair what is essentially a broken dataset, can play an important timesaving role here, getting data into a state where an analyst can actually start to do their job analyzing the data.
But at the same time, there are a range of tools and techniques that can help the everyday user improve the quality of their data. Many of these tools require an element of programming knowledge, but less than you might at first think. In the Open University/FutureLearn MOOC “Learn to Code for Data Analysis” we use an interactive notebook style of computing to show how you can use code literally one line at a time to perform powerful data cleaning, analysis, and visualization operations on a range of open datasets, including data from the World Bank and Comtrade.
Here, then, is yet another area where skills development may be required: statistical literacy. At its heart, statistics simply provide us with a range of tools for comparing sets of numbers. But knowing what comparisons to make, or the basis on which particular comparisons can be made, knowing what can be said about those comparisons or how they might be interpreted, in short, understanding what story the stats appear to be telling, can quickly become bewildering. Just as we need to improve sensemaking skills associated with reading charts, so too we need to develop skills in making sense of statistics, even if not actually producing those statistics ourselves.
As more data gets published, there are more opportunities for more people to make use of that data. In many cases, what’s likely to hold back that final data use is a skills gap: primary among these are the skills required to interpret simple datasets and the statistics associated with them, along with developing knowledge about how to make decisions or track progress based on that interpretation. However, the path to producing the statistics or visualizations used by the end-users from the originally published open dataset may also be a winding one, requiring skills not only in analyzing data and uncovering – and then telling – the stories it contains, but also in more mundane technical operational concerns such as actually accessing, and cleaning, dirty datasets….(More)”
Open Government: Missing Questions
Vadym Pyrozhenko at Administration & Society: “This article places the Obama administration’s open government initiative within the context of evolution of the U.S. information society. It examines the concept of openness along the three dimensions of Daniel Bell’s social analysis of the postindustrial society: structure, polity, and culture. Four “missing questions” raise the challenge of the compatibility of public service values with the culture of openness, address the right balance between postindustrial information management practices and the capacity of public organizations to accomplish their missions, and ask to reconsider the idea that greater structural openness of public organizations will necessarily increase their democratic legitimacy….(More)”
Big Data and Privacy: Emerging Issues
O’Leary, Daniel E. at Intelligent Systems, IEEE: “The goals of big data and privacy are fundamentally opposed to each other. Big data and knowledge discovery are aimed at reducing information asymmetries between organizations and the data sources, whereas privacy is aimed at maintaining information asymmetries of data sources. A number of different definitions of privacy are used to investigate some of the tensions between different characteristics of big data and potential privacy concerns. Specifically, the author examines the consequences of unevenness in big data, digital data going from local controlled settings to uncontrolled global settings, privacy effects of reputation monitoring systems, and inferring knowledge from social media. In addition, the author briefly analyzes two other emerging sources of big data: police cameras and stingray for location information….(More)”
Analyzing 1.1 Billion NYC Taxi and Uber Trips
Todd W. Schneider: “The New York City Taxi & Limousine Commission has released a staggeringly detailed historical dataset covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015. Taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York. How bad is the rush hour traffic from Midtown to JFK? Where does the Bridge and Tunnel crowd hang out on Saturday nights? What time do investment bankers get to work? How has Uber changed the landscape for taxis? And could Bruce Willis and Samuel L. Jackson have made it from 72nd and Broadway to Wall Street in less than 30 minutes? The dataset addresses all of these questions and many more.
I mapped the coordinates of every trip to local census tracts and neighborhoods, then set about in an attempt to extract stories and meaning from the data. This post covers a lot, but for those who want to pursue more analysis on their own: everything in this post—the data, software, and code—is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.
Open government data: Out of the box
The Economist: “The open-data revolution has not lived up to expectations. But it is only getting started…
The app that helped save Mr Rich’s leg is one of many that incorporate government data—in this case, supplied by four health agencies. Six years ago America became the first country to make all data collected by its government “open by default”, except for personal information and that related to national security. Almost 200,000 datasets from 170 outfits have been posted on the data.gov website. Nearly 70 other countries have also made their data available: mostly rich, well-governed ones, but also a few that are not, such as India (see chart). The Open Knowledge Foundation, a London-based group, reckons that over 1m datasets have been published on open-data portals using its CKAN software, developed in 2010.
Jakarta’s Participatory Budget
Ramda Yanurzha in GovInsider: “…This is a map of Musrenbang 2014 in Jakarta. Red is a no-go, green means the proposal is approved.
To give you a brief background, musrenbang is Indonesia’s flavor of participatory, bottom-up budgeting. The idea is that people can propose any development for their neighbourhood through a multi-stage budgeting process, thus actively participating in shaping the final budget for the city level, which will then determine the allocation for each city at the provincial level, and so on.
The catch is, I’m confident enough to say that not many people (especially in big cities) are actually aware of this process. While civic activists tirelessly lament that the process itself is neither inclusive nor transparent, I’m leaning towards a simpler explanation that most people simply couldn’t connect the dots.
People know that the public works agency fixed that 3-foot pothole last week. But it’s less clear how they can determine who is responsible for fixing a new streetlight in that dark alley and where the money comes from. Someone might have complained to the neighbourhood leader (Pak RT) and somehow the message gets through, but it’s very hard to trace how it got through. Just keep complaining to the black box until you don’t have to. There are very few people (mainly researchers) who get to see the whole picture.
This has now changed because the brand-new Jakarta open data portal provides musrenbang data from 2009. Who proposed what to whom, for how much, where it should be implemented (geotagged!), down to kelurahan/village level, and whether the proposal is accepted into the final city budget. For someone who advocates for better availability of open data in Indonesia and is eager to practice my data wrangling skill, it’s a goldmine.
Diving In

The data is also, as expected, incredibly messy. While surprisingly most of the projects proposed are geotagged, there are a lot of formatting inconsistencies that make the clean-up stage painful. Some of them are minor (m? meter? meter2? m2? meter persegi?) while others are perplexing (latitude: -6,547,843,512,000 – yes, that’s a value of more than a billion). Annoyingly, hundreds of proposals point to the center of the National Monument so it’s not exactly a representative dataset.
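The kind of clean-up described above can be sketched in a few lines of Python. This is only an illustration, not the author's actual OpenRefine workflow: the alias table, the function names, and the rule for interpreting commas are all assumptions, and the real dataset may contain other variants.

```python
import re

# Hypothetical alias table built from the spellings quoted in the post.
# "meter persegi" is Indonesian for "square metre".
UNIT_ALIASES = {
    "m": "m",
    "meter": "m",
    "m2": "m2",
    "meter2": "m2",
    "meter persegi": "m2",
}


def normalize_unit(raw):
    """Return a canonical unit label, or None if the spelling is unknown."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return UNIT_ALIASES.get(key)


def parse_latitude(raw):
    """Parse a latitude string; return None if unparseable or outside [-90, 90]."""
    text = raw.strip()
    if text.count(",") == 1 and "." not in text:
        # Assumption: a single comma is a decimal comma, e.g. "-6,2".
        text = text.replace(",", ".")
    else:
        # Assumption: several commas are digit grouping and can be dropped.
        text = text.replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return None
    return value if -90.0 <= value <= 90.0 else None


if __name__ == "__main__":
    print(normalize_unit("Meter Persegi"))       # canonical label for square metres
    print(parse_latitude("-6,547,843,512,000"))  # out of range, so rejected
```

Rejected values come back as `None` rather than being silently corrected, so a reviewer can decide what to do with them.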
For fellow data wranglers, pull requests to improve the data are gladly welcome over here. Ibam generously wrote an RT extractor to yield further location data, and I’m looking into OpenStreetMap RW boundary data to create a reverse geocoder for the points.
A couple hours of scrubbing in OpenRefine yields me a dataset that is clean enough for me to generate the CartoDB map I embedded at the beginning of this piece. More precisely, it is a map of geotagged projects where each point is colored depending on whether it’s rejected or accepted.
Numbers and Patterns
40,511 proposals, some of them merged into broader ones, which gives us a grand total of 26,364 projects valued at over IDR 3,852,162,060,205, just over $250 million at the current exchange rate. This amount represents over 5% of Jakarta’s annual budget for 2015, with projects ranging from an IDR 27,500 (~$2) trash bin (that doesn’t sound right, does it?) in Sumur Batu to an IDR 54 billion, 1.5-kilometer drainage improvement in Koja….(More)”
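The headline figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the excerpt; the variable names are illustrative, and the dollar amounts are treated as the rounded approximations they appear to be.

```python
# Figures quoted in the excerpt.
proposals = 40_511
projects = 26_364
total_idr = 3_852_162_060_205
total_usd_approx = 250_000_000   # "just over $250 million"
budget_share = 0.05              # "over 5%" of the 2015 budget

# Reduction in count from merging proposals into broader projects.
merged_away = proposals - projects

# Average value of a project, in IDR.
mean_project_idr = total_idr / projects

# Exchange rate (IDR per USD) implied by the quoted total.
implied_rate_total = total_idr / total_usd_approx

# Exchange rate implied by the "IDR 27,500 (~$2)" trash bin.
implied_rate_bin = 27_500 / 2

# If the total is over 5% of the budget, the budget is below this cap.
implied_budget_cap_idr = total_idr / budget_share

if __name__ == "__main__":
    print(f"merged away: {merged_away:,}")
    print(f"mean project value: IDR {mean_project_idr:,.0f}")
    print(f"implied IDR/USD (total): {implied_rate_total:,.0f}")
    print(f"implied IDR/USD (trash bin): {implied_rate_bin:,.0f}")
    print(f"implied budget cap: IDR {implied_budget_cap_idr:,.0f}")
```

The two dollar conversions imply somewhat different exchange rates, which is what one would expect from rounded figures rather than a sign of an error in the data.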
Will Open Data Policies Contribute to Solving Development Challenges?
Fabrizio Scrollini at IODC: “As the international open data charter gains momentum in the context of the wider development agenda related to the sustainable development goals set by the United Nations, a pertinent question to ask is: will open data policies contribute to solving development challenges? In this post I try to answer this question grounded in recent Latin American experience to contribute to a global debate.
Latin America has been exploring open data since 2013, when the first open data unconference (Abrelatam) and conference took place in Montevideo. In September 2015 in Santiago de Chile a vibrant community of activists, public servants, and entrepreneurs gathered in the third edition of Abrelatam and Condatos. It is now a more mature community. The days when it was sufficient to just open a few datasets and set up a portal are now gone. The focus of this meeting was on collaboration and use of data to address several social challenges.
Take for instance the health sector. Transparency in this sector is key to delivering better development goals. One of the panels at Condatos showed three different ways to use data to promote transparency and citizen empowerment in this sector. A tu servicio, a joint venture of DATA and the Uruguayan Ministry of Health, helped to standardize and open public datasets that allowed around 30,000 users to improve the way they choose health providers. Government-civil society collaboration was crucial in this process in terms of pooling resources and skills. The first prototype was only possible because some data was already open.
This contrasts with Cuidados Intensivos, a Peruvian endeavour aiming to provide key information about the health sector. Peruvian activists had to file right-to-information requests, then transform and standardize the data before eventually releasing it. Both experiences demanded a great deal of technical, policy, and communication craft. And both show the attitudes the public sector can take: either engaging or at the very best ignoring the potential of open data.
In the same sector, look at a recent study dealing with Dengue and open data developed by our research initiative. If international organizations and countries were persuaded to adopt common standards for reporting Dengue outbreaks, those outbreaks could potentially be predicted, provided the right public data is available and standardized. Open data in this sector not only delivers accountability but also efficiency and foresight to allocate scarce resources.
Latin American countries – gathered in the open data group of the Red Gealc – acknowledge the increasing public value of open data. This group engaged constructively in Condatos with the principles enshrined in the charter and will foster the formalization of open data policies in the region. A data revolution won’t yield results if data is closed. When you open data you allow for several initiatives to emerge and show its value.
Once a certain level of maturity is reached in a particular sector, more than data is needed. Standards are crucial to ensure comparability and ease the collection, processing, and use of open government data. Fostering and engaging with open data users is also needed, as several strategies deployed by some Latin American cities show.
Coming back to our question: will open data policies contribute to solving development challenges? The Latin American experience shows evidence that it will….(More)”
Tackling quality concerns around (volunteered) big data
University of Twente: “… Improvements in online information communication and mobile location-aware technologies have led to a dramatic increase in the amount of volunteered geographic information (VGI) in recent years. The collection of volunteered data on geographic phenomena has a rich history worldwide. For example, the Christmas Bird Count has studied the impacts of climate change on spatial distribution and population trends of selected bird species in North America since 1900. Nowadays, several citizen observatories collect information about our environment. This information is complementary or, in some cases, essential to tackle a wide range of geographic problems.
Despite the wide applicability and acceptability of VGI in science, many studies argue that the quality of the observations remains a concern. Data collected by volunteers does not often follow scientific principles of sampling design, and levels of expertise vary among volunteers. This makes it hard for scientists to integrate VGI in their research.
Low-quality, inconsistent observations can bias analysis and modelling results because they are not representative of the variable studied, or because they decrease the ratio of signal to noise. Hence, the identification of inconsistent observations clearly benefits VGI-based applications and provides more robust datasets to the scientific community.
In their paper the researchers describe a novel automated workflow to identify inconsistencies in VGI. “Leveraging a digital control mechanism means we can give value to the millions of observations collected by volunteers” and “it allows a new kind of science where citizens can directly contribute to the analysis of global challenges like climate change” say Hamed Mehdipoor and Dr. Raul Zurita-Milla, who work at the Geo-Information Processing department of ITC….
While some inconsistent observations may reflect real, unusual events, the researchers demonstrated that these observations also bias the trends (advancement rates), in this case of the date of lilac flowering onset. This shows that identifying inconsistent observations is a pre-requisite for studying and interpreting the impact of climate change on the timing of life cycle events….(More)”
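The excerpt does not describe the researchers' workflow, so the following is not their method. As a generic illustration of what flagging inconsistent observations can look like, this sketch applies a standard robust outlier test (the modified z-score based on the median absolute deviation) to made-up flowering-onset dates. The function name, the threshold of 3.5 (a common rule of thumb), and the sample values are all assumptions.

```python
from statistics import median


def flag_inconsistent(values, threshold=3.5):
    """Mark values whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Scores are undefined when MAD is zero; flag nothing in that case.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


if __name__ == "__main__":
    # Hypothetical day-of-year values for lilac flowering onset at one site.
    onset_days = [118, 120, 121, 119, 122, 117, 60]
    for day, flagged in zip(onset_days, flag_inconsistent(onset_days)):
        print(day, "inconsistent" if flagged else "ok")
```

A flagged value is only a candidate for review: as the excerpt notes, some inconsistent observations may reflect real, unusual events.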
How Big Data is Helping to Tackle Climate Change
Bernard Marr at DataInformed: “Climate scientists have been gathering a great deal of data for a long time, but analytics technology’s catching up is comparatively recent. Now that cloud, distributed storage, and massive amounts of processing power are affordable for almost everyone, those data sets are being put to use. On top of that, the growing number of Internet of Things devices we are carrying around are adding to the amount of data we are collecting. And the rise of social media means more and more people are reporting environmental data and uploading photos and videos of their environment, which also can be analyzed for clues.
Perhaps one of the most ambitious projects that employ big data to study the environment is Microsoft’s Madingley, which is being developed with the intention of creating a simulation of all life on Earth. The project already provides a working simulation of the global carbon cycle, and it is hoped that, eventually, everything from deforestation to animal migration, pollution, and overfishing will be modeled in a real-time “virtual biosphere.” Just a few years ago, the idea of a simulation of the entire planet’s ecosphere would have seemed like ridiculous, pie-in-the-sky thinking. But today it’s something into which one of the world’s biggest companies is pouring serious money. Microsoft is doing this because it believes that analytical technology has finally caught up with the ability to collect and store data.
Another data giant that is developing tools to facilitate analysis of climate and ecological data is EMC. Working with scientists at Acadia National Park in Maine, the company has developed platforms to pull in crowd-sourced data from citizen science portals such as eBird and iNaturalist. This allows park administrators to monitor the impact of climate change on wildlife populations as well as to plan and implement conservation strategies.
Last year, the United Nations, under its Global Pulse data analytics initiative, launched the Big Data Climate Challenge, a competition aimed at promoting innovative data-driven climate change projects. Among the first to receive recognition under the program is Global Forest Watch, which combines satellite imagery, crowd-sourced witness accounts, and public datasets to track deforestation around the world, which is believed to be a leading man-made cause of climate change. The project has been promoted as a way for ethical businesses to ensure that their supply chain is not complicit in deforestation.
Other initiatives are targeted at a more personal level, for example by analyzing transit routes that could be used for individual journeys, using Google Maps, and making recommendations based on carbon emissions for each route.
The idea of “smart cities” is central to the concept of the Internet of Things – the idea that everyday objects and tools are becoming increasingly connected, interactive, and intelligent, and capable of communicating with each other independently of humans. Many of the ideas put forward by smart-city pioneers are grounded in climate awareness, such as reducing carbon dioxide emissions and energy waste across urban areas. Smart metering allows utility companies to increase or restrict the flow of electricity, gas, or water to reduce waste and ensure adequate supply at peak periods. Public transport can be efficiently planned to avoid wasted journeys and provide a reliable service that will encourage citizens to leave their cars at home.
These examples raise an important point: It’s apparent that data – big or small – can tell us if, how, and why climate change is happening. But, of course, this is only really valuable to us if it also can tell us what we can do about it. Some projects, such as Weathersafe, which helps coffee growers adapt to changing weather patterns and soil conditions, are designed to help humans deal with climate change. Others are designed to tackle the problem at the root, by highlighting the factors that cause it in the first place and showing us how we can change our behavior to minimize damage….(More)”