Analyzing 1.1 Billion NYC Taxi and Uber Trips


Todd W. Schneider: “The New York City Taxi & Limousine Commission has released a staggeringly detailed historical dataset covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015. Taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop-off coordinates: it’s a story of New York. How bad is the rush hour traffic from Midtown to JFK? Where does the Bridge and Tunnel crowd hang out on Saturday nights? What time do investment bankers get to work? How has Uber changed the landscape for taxis? And could Bruce Willis and Samuel L. Jackson have made it from 72nd and Broadway to Wall Street in less than 30 minutes? The dataset addresses all of these questions and many more.

I mapped the coordinates of every trip to local census tracts and neighborhoods, then set about extracting stories and meaning from the data. This post covers a lot, but for those who want to pursue more analysis on their own: everything in this post—the data, software, and code—is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.
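Mapping over a billion pickup coordinates to census tracts amounts to one very large point-in-polygon spatial join. Below is a minimal sketch of that step in Python with geopandas; file and column names are illustrative assumptions, and the post's actual pipeline (documented on GitHub) differs:

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical file names; the real trip data ships as monthly CSVs and
# tract polygons come from a shapefile such as NYC Planning's nyct2010.
trips = pd.read_csv("yellow_tripdata_sample.csv")
tracts = gpd.read_file("nyct2010.shp")

# Build point geometries from pickup coordinates (WGS84 lon/lat),
# then reproject into the tract layer's CRS.
pickups = gpd.GeoDataFrame(
    trips,
    geometry=[Point(xy) for xy in zip(trips["pickup_longitude"],
                                      trips["pickup_latitude"])],
    crs="EPSG:4326",
).to_crs(tracts.crs)

# Point-in-polygon join: tag each trip with the census tract it started in.
joined = gpd.sjoin(pickups, tracts, how="left", predicate="within")
print(joined.groupby("BoroCT2010").size().sort_values(ascending=False).head())
```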

Table of Contents

  1. Maps
  2. The Data
  3. Borough Trends, and the Rise of Uber
  4. Airport Traffic
  5. On the Realism of Die Hard 3
  6. How Does Weather Affect Taxi and Uber Ridership?
  7. NYC Late Night Taxi Index
  8. The Bridge and Tunnel Crowd
  9. Northside Williamsburg
  10. Privacy Concerns
  11. Investment Bankers
  12. Parting Thoughts…(More)

Open government data: Out of the box


The Economist on “The open-data revolution has not lived up to expectations. But it is only getting started…

The app that helped save Mr Rich’s leg is one of many that incorporate government data—in this case, supplied by four health agencies. Six years ago America became the first country to make all data collected by its government “open by default”, except for personal information and that related to national security. Almost 200,000 datasets from 170 outfits have been posted on the data.gov website. Nearly 70 other countries have also made their data available: mostly rich, well-governed ones, but also a few that are not, such as India (see chart). The Open Knowledge Foundation, a London-based group, reckons that over 1m datasets have been published on open-data portals using its CKAN software, developed in 2010.
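Because CKAN portals expose a standard JSON action API, the datasets counted above can be enumerated programmatically. A small sketch against data.gov's CKAN endpoint; the query term is illustrative:

```python
import json
from urllib.request import urlopen

# CKAN's standard action API; data.gov's catalog runs CKAN at catalog.data.gov.
URL = "https://catalog.data.gov/api/3/action/package_search?q=health&rows=3"

with urlopen(URL) as resp:
    payload = json.load(resp)

print("matching datasets:", payload["result"]["count"])
for ds in payload["result"]["results"]:
    org = (ds.get("organization") or {}).get("title", "unknown publisher")
    print("-", ds["title"], "|", org)
```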

Jakarta’s Participatory Budget


Ramda Yanurzha in GovInsider: “…This is a map of Musrenbang 2014 in Jakarta. Red is a no-go, green means the proposal is approved.

To give you a brief background, musrenbang is Indonesia’s flavor of participatory, bottom-up budgeting. The idea is that people can propose any development for their neighbourhood through a multi-stage budgeting process, thus actively participating in shaping the final budget for the city level, which will then determine the allocation for each city at the provincial level, and so on.

The catch is, I’m confident enough to say that not many people (especially in big cities) are actually aware of this process. While civic activists tirelessly lament that the process itself is neither inclusive nor transparent, I’m leaning towards a simpler explanation that most people simply couldn’t connect the dots.

People know that the public works agency fixed that 3-foot pothole last week. But it’s less clear how they can determine who is responsible for fixing a new streetlight in that dark alley and where the money comes from. Someone might have complained to the neighbourhood leader (Pak RT) and somehow the message gets through, but it’s very hard to trace how it got through. Just keep complaining to the black box until you don’t have to. There are very few people (mainly researchers) who get to see the whole picture.

This has now changed because the brand-new Jakarta open data portal provides musrenbang data from 2009. Who proposed what to whom, for how much, where it should be implemented (geotagged!), down to kelurahan/village level, and whether the proposal is accepted into the final city budget. For someone who advocates for better availability of open data in Indonesia and is eager to practice my data wrangling skills, it’s a goldmine.

Diving In

[Data screenshot: all the different units of goods proposed.]

The data is also, as expected, incredibly messy. While, surprisingly, most of the proposed projects are geotagged, there are a lot of formatting inconsistencies that make the clean-up stage painful. Some of them are minor (m? meter? meter2? m2? meter persegi?) while others are perplexing (latitude: -6,547,843,512,000 – yes, that’s a value of more than a billion). Annoyingly, hundreds of proposals point to the center of the National Monument, so it’s not exactly a representative dataset.
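A hedged sketch of the kind of normalization this clean-up implies; the file name, column names, and the comma-mangling hypothesis behind the billion-degree latitudes are assumptions, not the author's actual OpenRefine recipe:

```python
import pandas as pd

# Collapse spelling variants of area units seen in the data.
UNIT_MAP = {"m": "m", "meter": "m",
            "m2": "m2", "meter2": "m2", "meter persegi": "m2"}

def clean_unit(raw):
    u = str(raw).strip().lower()
    return UNIT_MAP.get(u, u)

def clean_latitude(raw):
    # Working hypothesis: "-6,547,843,512,000" is a decimal latitude
    # (-6.547843512) mangled by thousands separators. Strip the commas,
    # rescale until the magnitude is a single digit (Jakarta latitudes
    # are), and reject anything outside the greater-Jakarta band.
    try:
        x = float(str(raw).replace(",", ""))
    except ValueError:
        return None
    while abs(x) >= 10:
        x /= 10
    return x if -7.0 < x < -5.5 else None

df = pd.read_csv("musrenbang_2014.csv")  # hypothetical file name
df["unit_clean"] = df["unit"].map(clean_unit)
df["lat_clean"] = df["latitude"].map(clean_latitude)
```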

For fellow data wranglers, pull requests to improve the data are gladly welcomed over here. Ibam generously wrote an RT extractor to yield further location data, and I’m looking into OpenStreetMap RW boundary data to create a reverse geocoder for the points.

A couple of hours of scrubbing in OpenRefine yields a dataset clean enough to generate the CartoDB map embedded at the beginning of this piece. More precisely, it is a map of geotagged projects where each point is colored depending on whether it’s rejected or accepted.

Numbers and Patterns

There were 40,511 proposals, some of them merged into broader ones, which gives a grand total of 26,364 projects valued at over IDR 3,852,162,060,205, just over $250 million at the current exchange rate. This amount represents over 5% of Jakarta’s annual budget for 2015, with projects ranging from an IDR 27,500 (~$2) trash bin (that doesn’t sound right, does it?) in Sumur Batu to an IDR 54 billion, 1.5-kilometer drainage improvement in Koja….(More)”

RethinkCityHall.org


Press Release (Boston): “Mayor Martin J. Walsh today announced the launch of RethinkCityHall.org, a website designed to encourage civic participation in the City Hall campus plan study, a one-year comprehensive planning process that will serve as a roadmap for operational and design improvements to City Hall and the plaza.

This announcement is one of three interrelated efforts that the City is pursuing to reinvigorate and bring new life to both City Hall and City Hall Plaza. As part of the Campus Plan Request for Qualifications (RFQ) released on June 8, 2015, the City has selected Utile, a local architecture and planning firm, to partner with the City to lead the campus plan study. Utile is teamed with Grimshaw Architects and Reed Hilderbrand for the design phases of the effort.

“I am excited to have Utile on board as we work to identify ways to activate our civic spaces,” said Mayor Walsh. “As we progress in the planning process, it is important to take inventory of all of our assets to be able to identify opportunities for improvement. This study will help us develop a thoughtful and forward-thinking plan to reimagine City Hall and the plaza as thriving, healthy and innovative civic spaces.”

“We are energized by Mayor Walsh’s challenge and are excited to work with the various constituencies to develop an innovative plan,” said Tim Love, a principal at Utile. “Thinking about the functional, programmatic and experiential aspects of both the building and plaza provides the opportunity to fundamentally rethink City Hall.”

Both the City and Utile are committed to an open and interactive process that engages members of the public, community groups, and professional organizations; as part of that effort, the website will include information about stakeholder meetings and public forums. Additionally, the website will be updated on an ongoing basis with the research, analysis, concepts and design scenarios generated by the consultant team….(More)”

Will Open Data Policies Contribute to Solving Development Challenges?


Fabrizio Scrollini at IODC: “As the international open data charter gains momentum in the context of the wider development agenda related to the sustainable development goals set by the United Nations, a pertinent question to ask is: will open data policies contribute to solving development challenges? In this post I try to answer this question, grounded in recent Latin American experience, to contribute to a global debate.

Latin America has been exploring open data since 2013, when the first open data unconference (Abrelatam) and conference took place in Montevideo. In September 2015 in Santiago de Chile a vibrant community of activists, public servants, and entrepreneurs gathered for the third edition of Abrelatam and Condatos. It is now a more mature community. The days when it was sufficient to just open a few datasets and set up a portal are now gone. The focus of this meeting was on collaboration and the use of data to address several social challenges.

Take for instance the health sector. Transparency in this sector is key to delivering better development outcomes. One of the panels at Condatos showed three different ways to use data to promote transparency and citizen empowerment in this sector. A tu servicio, a joint venture of DATA and the Uruguayan Ministry of Health, helped to standardize and open public datasets that allowed around 30,000 users to improve the way they choose health providers. Government-civil society collaboration was crucial in this process in terms of pooling resources and skills. The first prototype was only possible because some data was already open.

This contrasts with Cuidados Intensivos, a Peruvian endeavour aiming to provide key information about the health sector. Peruvian activists had to file right-to-information requests, then transform and standardize the data to eventually release it. Both experiences demanded a great deal of technical, policy, and communication craft. And both show the attitudes the public sector can take: either engaging with, or at best ignoring, the potential of open data.

In the same sector, look at a recent study dealing with Dengue and open data developed by our research initiative. If international organizations and countries were persuaded to adopt common standards for reporting Dengue outbreaks, outbreaks could potentially be predicted, provided the right public data is available and standardized. Open data in this sector delivers not only accountability but also efficiency and foresight in allocating scarce resources.

Latin American countries – gathered in the open data group of the Red Gealc – acknowledge the increasing public value of open data. This group engaged constructively in Condatos with the principles enshrined in the charter and will foster the formalization of open data policies in the region. A data revolution won’t yield results if data is closed. When you open data you allow several initiatives to emerge and show their value.

Once a certain level of maturity is reached in a particular sector, more than data is needed. Standards are crucial to ensure comparability and to ease the collection, processing, and use of open government data. Fostering and engaging with open data users is also needed, as several strategies deployed by some Latin American cities show.

Coming back to our question: will open data policies contribute to solving development challenges? The Latin American experience shows evidence that they will….(More)”

Batea: a Wikipedia hack for medical students


Tom Sullivan at HealthCareIT: “Medical students use Wikipedia in great numbers, but what if it were a more trusted source of information?

That’s the idea behind Batea, a piece of software that essentially collects data from clinical reference URLs medical students visit, then aggregates that information to share with WikiProject Medicine, such that relevant medical editors can glean insights about how best to enhance Wikipedia’s medical content.
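The announcement doesn't detail Batea's internals, but the aggregation it describes amounts to discarding user identity and counting visits per article. A toy sketch under that assumption, with a hypothetical record format:

```python
from collections import Counter
from urllib.parse import urlparse, unquote

# Hypothetical donated-history records: (user_hash, visited_url) pairs.
visits = [
    ("u1", "https://en.wikipedia.org/wiki/Aspirin"),
    ("u2", "https://en.wikipedia.org/wiki/Aspirin"),
    ("u2", "https://en.wikipedia.org/wiki/Sepsis"),
]

def article_counts(records):
    """Count visits per Wikipedia article, discarding who visited."""
    counts = Counter()
    for _user, url in records:
        parts = urlparse(url)
        if parts.netloc.endswith("wikipedia.org") and parts.path.startswith("/wiki/"):
            counts[unquote(parts.path[len("/wiki/"):])] += 1
    return counts

print(article_counts(visits).most_common())
# [('Aspirin', 2), ('Sepsis', 1)]
```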

Batea takes its name from the Spanish word for a gold pan, according to Fred Trotter, a data journalist at DocGraph.

“It’s a data mining project,” Trotter explained, “so we wanted a short term that positively referenced mining.”

DocGraph built Batea with support from the Robert Wood Johnson Foundation and, prior to releasing it on Tuesday, operated beta testing pilots of the browser extension at the University of California, San Francisco and the University of Texas, Houston.

UCSF, for instance, has what Trotter described as “a unique program where medical students edit Wikipedia for credit. They helped us tremendously in testing the alpha versions of the software.”

Wikipedia houses some 25,000 medical articles that receive more than 200 million views each month, according to the DocGraph announcement, while 8,000 pharmacology articles are read more than 40 million times a month.

DocGraph is encouraging medical students around the country to download the Batea extension – and anonymously donate their clinically related browsing history. Should Batea gain critical mass, the potential exists for it to substantively enhance Wikipedia….(More)”

Fudging Nudging: Why ‘Libertarian Paternalism’ is the Contradiction It Claims It’s Not


Paper by Heidi M. Hurd: “In this piece I argue that so-called “libertarian paternalism” is as self-contradictory as it sounds. The theory of libertarian paternalism originally advanced by Richard Thaler and Cass Sunstein, and given further defense by Sunstein alone, is itself just a sexy ad campaign designed to nudge gullible readers into thinking that there is no conflict between libertarianism and welfare utilitarianism. But no one should lose sight of the fact that welfare utilitarianism just is welfare utilitarianism only if it sacrifices individual liberty whenever it is at odds with maximizing societal welfare. And thus no one who believes that people have rights to craft their own lives through the exercise of their own choices ought to be duped into thinking that just because paternalistic nudges are cleverly manipulative and often invisible, rather than overtly coercive, standard welfare utilitarianism can lay claim to being libertarian.

After outlining four distinct strains of libertarian theory and sketching their mutual incompatibility with so-called “libertarian paternalism,” I go on to demonstrate at some length how the two most prevalent strains — namely, opportunity set libertarianism and motivational libertarianism — make paternalistically-motivated nudges abuses of state power. As I argue, opportunity set libertarians should recognize nudges for what they are — namely, state incursions into the sphere of liberty in which individual choice is a matter of moral right, the boundaries of which are rightly defined, in part, by permissions to do actions that do not maximize welfare. And motivational libertarians should similarly recognize nudges for what they are — namely, illicitly motivated forms of legislative intervention that insult autonomy no less than do flat bans that leave citizens with no choice but to substitute the state’s agenda for their own. As I conclude, whatever its name, a political theory that recommends to state officials the use of “nudges” as means of ensuring that citizens advance the state’s understanding of their own best interests is no more compatible with libertarianism than is a theory that recommends more coercive means of paternalism….(More)”

Of Remixology: Ethics and Aesthetics after Remix


New book by David J. Gunkel: “Remix—or the practice of recombining preexisting content—has proliferated across media both digital and analog. Fans celebrate it as a revolutionary new creative practice; critics characterize it as a lazy and cheap (and often illegal) recycling of other people’s work. In Of Remixology, David Gunkel argues that to understand remix, we need to change the terms of the debate. The two sides of the remix controversy, Gunkel contends, share certain underlying values—originality, innovation, artistic integrity. And each side seeks to protect these values from the threat that is represented by the other. In reevaluating these shared philosophical assumptions, Gunkel not only provides a new way to understand remix, he also offers an innovative theory of moral and aesthetic value for the twenty-first century.

In a section called “Premix,” Gunkel examines the terminology of remix (including “collage,” “sample,” “bootleg,” and “mashup”) and its material preconditions, the technology of recording. In “Remix,” he takes on the distinction between original and copy; makes a case for repetition; and considers the question of authorship in a world of seemingly endless recompiled and repurposed content. Finally, in “Postmix,” Gunkel outlines a new theory of moral and aesthetic value that can accommodate remix and its cultural significance, remixing—or reconfiguring and recombining—traditional philosophical approaches in the process….(More)”

Tackling quality concerns around (volunteered) big data


University of Twente: “… Improvements in online information communication and mobile location-aware technologies have led to a dramatic increase in the amount of volunteered geographic information (VGI) in recent years. The collection of volunteered data on geographic phenomena has a rich history worldwide. For example, the Christmas Bird Count has studied the impacts of climate change on spatial distribution and population trends of selected bird species in North America since 1900. Nowadays, several citizen observatories collect information about our environment. This information is complementary or, in some cases, essential to tackle a wide range of geographic problems.

Despite the wide applicability and acceptability of VGI in science, many studies argue that the quality of the observations remains a concern. Data collected by volunteers often does not follow scientific principles of sampling design, and levels of expertise vary among volunteers. This makes it hard for scientists to integrate VGI into their research.

Low-quality, inconsistent observations can bias analysis and modelling results because they are not representative of the variable studied, or because they decrease the signal-to-noise ratio. Hence, the identification of inconsistent observations clearly benefits VGI-based applications and provides more robust datasets to the scientific community.

In their paper the researchers describe a novel automated workflow to identify inconsistencies in VGI. “Leveraging a digital control mechanism means we can give value to the millions of observations collected by volunteers” and “it allows a new kind of science where citizens can directly contribute to the analysis of global challenges like climate change” say Hamed Mehdipoor and Dr. Raul Zurita-Milla, who work at the Geo-Information Processing department of ITC….

While some inconsistent observations may reflect real, unusual events, the researchers demonstrated that such observations also bias trends (advancement rates), in this case of lilac flowering onset dates. This shows that identifying inconsistent observations is a prerequisite for studying and interpreting the impact of climate change on the timing of life-cycle events….(More)”
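The paper's exact workflow isn't reproduced in the excerpt, but the general move, flagging observations that sit far outside their local distribution, can be sketched with a generic median-absolute-deviation rule; the grouping, column names, and threshold below are assumptions, not the authors' method:

```python
import pandas as pd

def flag_inconsistent(df, group_col="region", value_col="onset_doy", k=3.5):
    """Mark onset dates more than k robust standard deviations from
    their group median. A generic outlier rule, not the paper's method."""
    def _flag(g):
        med = g[value_col].median()
        mad = (g[value_col] - med).abs().median()
        robust_sd = 1.4826 * mad if mad > 0 else 1e-9
        g = g.copy()
        g["inconsistent"] = (g[value_col] - med).abs() / robust_sd > k
        return g
    return df.groupby(group_col, group_keys=False).apply(_flag)

obs = pd.DataFrame({"region": ["NE"] * 5,
                    "onset_doy": [120, 118, 123, 119, 170]})
print(flag_inconsistent(obs))  # the 170 day-of-year entry gets flagged
```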

How Big Data is Helping to Tackle Climate Change


Bernard Marr at DataInformed: “Climate scientists have been gathering a great deal of data for a long time, but analytics technology has only recently caught up. Now that cloud, distributed storage, and massive amounts of processing power are affordable for almost everyone, those data sets are being put to use. On top of that, the growing number of Internet of Things devices we carry around is adding to the amount of data we are collecting. And the rise of social media means more and more people are reporting environmental data and uploading photos and videos of their environment, which also can be analyzed for clues.

Perhaps one of the most ambitious projects that employ big data to study the environment is Microsoft’s Madingley, which is being developed with the intention of creating a simulation of all life on Earth. The project already provides a working simulation of the global carbon cycle, and it is hoped that, eventually, everything from deforestation to animal migration, pollution, and overfishing will be modeled in a real-time “virtual biosphere.” Just a few years ago, the idea of a simulation of the entire planet’s ecosphere would have seemed like ridiculous, pie-in-the-sky thinking. But today it’s something into which one of the world’s biggest companies is pouring serious money. Microsoft is doing this because it believes that analytical technology has finally caught up with the ability to collect and store data.

Another data giant that is developing tools to facilitate analysis of climate and ecological data is EMC. Working with scientists at Acadia National Park in Maine, the company has developed platforms to pull in crowd-sourced data from citizen science portals such as eBird and iNaturalist. This allows park administrators to monitor the impact of climate change on wildlife populations as well as to plan and implement conservation strategies.

Last year, the United Nations, under its Global Pulse data analytics initiative, launched the Big Data Climate Challenge, a competition aimed at promoting innovative data-driven climate change projects. Among the first to receive recognition under the program is Global Forest Watch, which combines satellite imagery, crowd-sourced witness accounts, and public datasets to track deforestation around the world, which is believed to be a leading man-made cause of climate change. The project has been promoted as a way for ethical businesses to ensure that their supply chains are not complicit in deforestation.

Other initiatives are targeted at a more personal level, for example analyzing the transit routes available for an individual journey using Google Maps and making recommendations based on the carbon emissions of each route.
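A toy version of such a recommender scores each candidate route by distance times a per-mode emission factor and picks the lowest. The factors below are illustrative placeholders rather than published figures:

```python
# Illustrative kg-CO2-per-passenger-km factors; a real recommender would use
# published emission factors and live distances from a routing API.
EMISSION_FACTORS = {"car": 0.19, "bus": 0.09, "subway": 0.04, "bike": 0.0}

def rank_routes(routes):
    """routes: iterable of (label, mode, distance_km); lowest emissions first."""
    scored = [(label, mode, dist * EMISSION_FACTORS[mode])
              for label, mode, dist in routes]
    return sorted(scored, key=lambda r: r[2])

candidates = [("A", "car", 12.0), ("B", "subway", 14.5), ("C", "bus", 13.0)]
for label, mode, kg in rank_routes(candidates):
    print(f"route {label} ({mode}): {kg:.2f} kg CO2")
```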

The idea of “smart cities” is central to the concept of the Internet of Things – the idea that everyday objects and tools are becoming increasingly connected, interactive, and intelligent, and capable of communicating with each other independently of humans. Many of the ideas put forward by smart-city pioneers are grounded in climate awareness, such as reducing carbon dioxide emissions and energy waste across urban areas. Smart metering allows utility companies to increase or restrict the flow of electricity, gas, or water to reduce waste and ensure adequate supply at peak periods. Public transport can be efficiently planned to avoid wasted journeys and provide a reliable service that will encourage citizens to leave their cars at home.

These examples raise an important point: It’s apparent that data – big or small – can tell us if, how, and why climate change is happening. But, of course, this is only really valuable to us if it also can tell us what we can do about it. Some projects, such as Weathersafe, which helps coffee growers adapt to changing weather patterns and soil conditions, are designed to help humans deal with climate change. Others are designed to tackle the problem at the root, by highlighting the factors that cause it in the first place and showing us how we can change our behavior to minimize damage….(More)”