Analyzing 1.1 Billion NYC Taxi and Uber Trips


Todd W. Schneider: “The New York City Taxi & Limousine Commission has released a staggeringly detailed historical dataset covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015. Taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop-off coordinates: it’s a story of New York. How bad is the rush hour traffic from Midtown to JFK? Where does the Bridge and Tunnel crowd hang out on Saturday nights? What time do investment bankers get to work? How has Uber changed the landscape for taxis? And could Bruce Willis and Samuel L. Jackson have made it from 72nd and Broadway to Wall Street in less than 30 minutes? The dataset addresses all of these questions and many more.

I mapped the coordinates of every trip to local census tracts and neighborhoods, then set about extracting stories and meaning from the data. This post covers a lot, but for those who want to pursue more analysis on their own: everything in this post—the data, software, and code—is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.
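
Mapping over a billion pickup coordinates to census tracts is, at its core, a point-in-polygon spatial join. Purely as an illustration of that step, here is a minimal Python sketch; the file, column, and field names are hypothetical, and Schneider’s published pipeline on GitHub may be organized quite differently.

```python
# Minimal sketch of a point-in-polygon spatial join: assign each taxi
# pickup to the census tract that contains it. File names and column
# names here are hypothetical.
import geopandas as gpd
import pandas as pd

trips = pd.read_csv("trips.csv")  # pickup_longitude, pickup_latitude, ...
tracts = gpd.read_file("nyc_census_tracts.shp").to_crs(epsg=4326)

# Turn each pickup coordinate pair into a point geometry.
points = gpd.GeoDataFrame(
    trips,
    geometry=gpd.points_from_xy(trips.pickup_longitude, trips.pickup_latitude),
    crs="EPSG:4326",
)

# Left join: each trip gets the tract whose polygon contains its pickup.
joined = gpd.sjoin(points, tracts[["tract_id", "geometry"]],
                   how="left", predicate="within")
print(joined.groupby("tract_id").size().sort_values(ascending=False).head())
```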

Table of Contents

  1. Maps
  2. The Data
  3. Borough Trends, and the Rise of Uber
  4. Airport Traffic
  5. On the Realism of Die Hard 3
  6. How Does Weather Affect Taxi and Uber Ridership?
  7. NYC Late Night Taxi Index
  8. The Bridge and Tunnel Crowd
  9. Northside Williamsburg
  10. Privacy Concerns
  11. Investment Bankers
  12. Parting Thoughts…(More)

RethinkCityHall.org


Press Release (Boston): “Mayor Martin J. Walsh today announced the launch of RethinkCityHall.org, a website designed to encourage civic participation in the City Hall campus plan study, a one-year comprehensive planning process that will serve as a roadmap for operational and design improvements to City Hall and the plaza.

This announcement is one of three interrelated efforts that the City is pursuing to reinvigorate and bring new life to both City Hall and City Hall Plaza. As part of the Campus Plan Request for Qualifications (RFQ) that was released on June 8, 2015, the City has selected Utile, a local architecture and planning firm, to partner with the city to lead the campus plan study. Utile is teamed with Grimshaw Architects and Reed Hilderbrand for the design phases of the effort.

“I am excited to have Utile on board as we work to identify ways to activate our civic spaces,” said Mayor Walsh. “As we progress in the planning process, it is important to take inventory of all of our assets to be able to identify opportunities for improvement. This study will help us develop a thoughtful and forward-thinking plan to reimagine City Hall and the plaza as thriving, healthy and innovative civic spaces.”

“We are energized by Mayor Walsh’s challenge and are excited to work with the various constituencies to develop an innovative plan,” said Tim Love, a principal at Utile. “Thinking about the functional, programmatic and experiential aspects of both the building and plaza provides the opportunity to fundamentally rethink City Hall.”

Both the City and Utile are committed to an open and interactive process that engages members of the public, community groups, and professional organizations; as part of that effort, the website will include information about stakeholder meetings and public forums. Additionally, the website will be updated on an ongoing basis with the research, analysis, concepts and design scenarios generated by the consultant team….(More)”

Will Open Data Policies Contribute to Solving Development Challenges?


Fabrizio Scrollini at IODC: “As the international open data charter gains momentum in the context of the wider development agenda related to the sustainable development goals set by the United Nations, a pertinent question to ask is: will open data policies contribute to solving development challenges? In this post I try to answer this question, grounded in recent Latin American experience, to contribute to a global debate.

Latin America has been exploring open data since 2013, when the first open data unconference (Abrelatam) and conference took place in Montevideo. In September 2015 in Santiago de Chile, a vibrant community of activists, public servants, and entrepreneurs gathered for the third edition of Abrelatam and Condatos. It is now a more mature community. The days when it was sufficient to just open a few datasets and set up a portal are gone. The focus of this meeting was on collaboration and on using data to address several social challenges.

Take, for instance, the health sector. Transparency in this sector is key to delivering better development outcomes. One of the panels at Condatos showed three different ways to use data to promote transparency and citizen empowerment in this sector. A tu servicio, a joint venture of DATA and the Uruguayan Ministry of Health, helped to standardize and open public datasets that allowed around 30,000 users to improve the way they choose health providers. Government-civil society collaboration was crucial in this process in terms of pooling resources and skills. The first prototype was only possible because some data was already open.

This contrasts with Cuidados Intensivos, a Peruvian endeavour aiming to provide key information about the health sector. Peruvian activists had to file right-to-information requests, then transform and standardize the data before eventually releasing it. Both experiences demanded a great deal of technical, policy, and communication craft. And both show the attitudes the public sector can take: either engaging with, or at best ignoring, the potential of open data.

In the same sector, consider a recent study dealing with dengue and open data developed by our research initiative. If international organizations and countries were persuaded to adopt common standards for dengue outbreaks, outbreaks could potentially be predicted wherever the right public data is available and standardized. Open data in this sector delivers not only accountability but also efficiency and foresight in allocating scarce resources.

Latin American countries – gathered in the open data group of the Red Gealc – acknowledge the increasing public value of open data. This group engaged constructively in Condatos with the principles enshrined in the charter and will foster the formalization of open data policies in the region. A data revolution won’t yield results if data is closed. Opening data allows many initiatives to emerge and demonstrate its value.

Once a certain level of maturity is reached in a particular sector, more than data is needed. Standards are crucial to ensure comparability and to ease the collection, processing, and use of open government data. Fostering and engaging with open data users is also needed, as several strategies deployed by Latin American cities show.

Coming back to our question: will open data policies contribute to solving development challenges? The Latin American experience shows evidence that they will….(More)”

Batea: a Wikipedia hack for medical students


Tom Sullivan at HealthCareIT: “Medical students use Wikipedia in great numbers, but what if it were a more trusted source of information?

That’s the idea behind Batea, a piece of software that essentially collects data from clinical reference URLs medical students visit, then aggregates that information to share with WikiProject Medicine, such that relevant medical editors can glean insights about how best to enhance Wikipedia’s medical content.
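
The aggregation step might look something like the following sketch, which counts visits per Wikipedia article from a donated browsing log. This is not Batea’s actual code; every name and URL in it is invented.

```python
# Hypothetical sketch (not Batea's actual code): aggregate a donated
# browsing log into per-article visit counts for Wikipedia pages.
from collections import Counter
from urllib.parse import urlparse

def aggregate_visits(urls):
    """Count visits per Wikipedia article, ignoring all other domains."""
    counts = Counter()
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            counts[parsed.path[len("/wiki/"):]] += 1
    return counts

donated = [
    "https://en.wikipedia.org/wiki/Metoprolol",
    "https://en.wikipedia.org/wiki/Metoprolol",
    "https://example.com/not-counted",
]
print(aggregate_visits(donated))  # Counter({'Metoprolol': 2})
```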

Batea takes its name from the Spanish word for a gold pan, according to Fred Trotter, a data journalist at DocGraph.

“It’s a data mining project,” Trotter explained, “so we wanted a short term that positively referenced mining.”

DocGraph built Batea with support from the Robert Wood Johnson Foundation and, prior to releasing it on Tuesday, operated beta testing pilots of the browser extension at the University of California, San Francisco and the University of Texas, Houston.

UCSF, for instance, has what Trotter described as “a unique program where medical students edit Wikipedia for credit. They helped us tremendously in testing the alpha versions of the software.”

Wikipedia houses some 25,000 medical articles that receive more than 200 million views each month, according to the DocGraph announcement, while 8,000 pharmacology articles are read more than 40 million times a month.

DocGraph is encouraging medical students around the country to download the Batea extension – and anonymously donate their clinical-related browsing history. Should Batea gain critical mass, the potential exists for it to substantively enhance Wikipedia….(More)”

Tackling quality concerns around (volunteered) big data


University of Twente: “… Improvements in online information communication and mobile location-aware technologies have led to a dramatic increase in the amount of volunteered geographic information (VGI) in recent years. The collection of volunteered data on geographic phenomena has a rich history worldwide. For example, the Christmas Bird Count has studied the impacts of climate change on spatial distribution and population trends of selected bird species in North America since 1900. Nowadays, several citizen observatories collect information about our environment. This information is complementary or, in some cases, essential to tackle a wide range of geographic problems.

Despite the wide applicability and acceptability of VGI in science, many studies argue that the quality of the observations remains a concern. Data collected by volunteers often does not follow scientific principles of sampling design, and levels of expertise vary among volunteers. This makes it hard for scientists to integrate VGI into their research.

Low-quality, inconsistent observations can bias analysis and modelling results, either because they are not representative of the variable studied or because they decrease the signal-to-noise ratio. Hence, identifying inconsistent observations clearly benefits VGI-based applications and provides more robust datasets to the scientific community.

In their paper the researchers describe a novel automated workflow to identify inconsistencies in VGI. “Leveraging a digital control mechanism means we can give value to the millions of observations collected by volunteers” and “it allows a new kind of science where citizens can directly contribute to the analysis of global challenges like climate change” say Hamed Mehdipoor and Dr. Raul Zurita-Milla, who work at the Geo-Information Processing department of ITC….

While some inconsistent observations may reflect real, unusual events, the researchers demonstrated that such observations also bias estimated trends (advancement rates), in this case of the date of lilac flowering onset. This shows that identifying inconsistent observations is a prerequisite for studying and interpreting the impact of climate change on the timing of life-cycle events….(More)”
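
The paper describes its own automated workflow; purely as a generic illustration of how inconsistent observations can be flagged, the sketch below applies a simple interquartile-range fence to invented flowering-onset dates (expressed as day of year).

```python
# Generic outlier-flagging sketch (not the authors' workflow): flag
# volunteered flowering-onset dates outside an interquartile fence.
import statistics

def iqr_fence(values, k=1.5):
    """Return (low, high) bounds; observations outside are flagged."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

onset_doy = [110, 112, 115, 118, 119, 121, 124, 180]  # invented data
low, high = iqr_fence(onset_doy)
flagged = [d for d in onset_doy if not (low <= d <= high)]
print(flagged)  # [180], the suspect observation
```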

The War on Campus Sexual Assault Goes Digital


As the problem of sexual assault on college campuses has become a hot-button issue for school administrators and federal education regulators, one question keeps coming up: Why don’t more students report attacks?

According to a recent study of 27 schools, about one-quarter of female undergraduates and students who identified as queer or transgender said they had experienced nonconsensual sex or touching since entering college, but most of the students said they did not report it to school officials or support services.

Some felt the incidents weren’t serious enough. Others said they did not think anyone would believe them or they feared negative social consequences. Some felt it would be too emotionally difficult.

Now, in an effort to give students additional options — and to provide schools with more concrete data — a nonprofit software start-up in San Francisco called Sexual Health Innovations has developed an online reporting system for campus sexual violence.

Students at participating colleges can use its site, called Callisto, to record details of an assault anonymously. The site saves and time-stamps those records. That allows students to decide later whether they want to formally file reports with their schools — identifying themselves by their school-issued email addresses — or download their information and take it directly to the police. The site also offers a matching system in which a user can elect to file a report with the school electronically only if someone else names the same assailant.
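
A heavily simplified sketch of that matching logic appears below. Callisto’s real system involves privacy and security engineering that this toy escrow omits entirely, and every name in the sketch is invented.

```python
# Toy escrow illustrating Callisto-style matching: a held report is
# released only once a second report names the same assailant.
# This is an invented sketch, not Callisto's design.
from collections import defaultdict
from datetime import datetime, timezone

class MatchEscrow:
    def __init__(self):
        self.pending = defaultdict(list)  # assailant id -> held reports

    def submit(self, reporter, assailant_id, details):
        record = {
            "reporter": reporter,
            "details": details,
            "timestamp": datetime.now(timezone.utc).isoformat(),  # time-stamped
        }
        self.pending[assailant_id].append(record)
        # Release the batch only when two or more reports match.
        if len(self.pending[assailant_id]) >= 2:
            return self.pending[assailant_id]  # forward to school officials
        return None  # keep in escrow

escrow = MatchEscrow()
escrow.submit("student_a@school.edu", "assailant_123", "record of assault")
matched = escrow.submit("student_b@school.edu", "assailant_123", "record of assault")
print("released" if matched else "held in escrow")  # released
```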

Callisto’s hypothesis is that some college students — who already socialize, study and shop online — will initially be more likely to document a sexual assault on a third-party site than to report it to school officials on the phone or in person.

“If you have to walk into a building to report, you can only go at certain times of day and you’re not certain who you have to talk to, how many people you have to talk to, what they will ask,” Jessica Ladd, the nonprofit’s founder and chief executive, said in a recent interview in New York. “Whereas online, you can fill out a form at any time of day or night from anywhere and push a button.”

Callisto is part of a wave of apps and sites that tackle different facets of the sexual assault problem on campus. Some colleges and universities have introduced third-party mobile apps that enable students to see maps of local crime hot spots, report suspicious activity, request a ride from campus security services or allow their friends to track their movements virtually as they walk home. Many schools now ask students to participate in online or in-person training programs that present different situations involving sexual assault, relationship violence and issues of consent….(More)”

The promise and perils of predictive policing based on big data


H. V. Jagadish in the Conversation: “Police departments, like everyone else, would like to be more effective while spending less. Given the tremendous attention to big data in recent years, and the value it has provided in fields ranging from astronomy to medicine, it should be no surprise that police departments are using data analysis to inform deployment of scarce resources. Enter the era of what is called “predictive policing.”

Some form of predictive policing is likely now in force in a city near you. Memphis was an early adopter. Cities from Minneapolis to Miami have embraced predictive policing. Time magazine named predictive policing (with particular reference to the city of Santa Cruz) one of the 50 best inventions of 2011. New York City Police Commissioner William Bratton recently said that predictive policing is “the wave of the future.”

The term “predictive policing” suggests that the police can anticipate a crime and be there to stop it before it happens and/or apprehend the culprits right away. As the Los Angeles Times points out, it depends on “sophisticated computer analysis of information about previous crimes, to predict where and when crimes will occur.”

At a very basic level, it’s easy for anyone to read a crime map and identify neighborhoods with higher crime rates. It’s also easy to recognize that burglars tend to target businesses at night, when they are unoccupied, and to target homes during the day, when residents are away at work. The challenge is to take a combination of dozens of such factors to determine where crimes are more likely to happen and who is more likely to commit them. Predictive policing algorithms are getting increasingly good at such analysis. Indeed, such was the premise of the movie Minority Report, in which the police can arrest and convict murderers before they commit their crime.

Predicting a crime with certainty is something that science fiction can have a field day with. But as a data scientist, I can assure you that in reality we can come nowhere close to certainty, even with advanced technology. To begin with, predictions can be only as good as the input data, and quite often these input data have errors.

But even with perfect, error-free input data and unbiased processing, ultimately what the algorithms are determining are correlations. Even if we have perfect knowledge of your troubled childhood, your socializing with gang members, your lack of steady employment, your wacko posts on social media and your recent gun purchases, all that the best algorithm can do is to say it is likely, but not certain, that you will commit a violent crime. After all, to believe such predictions as guaranteed is to deny free will….
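
To make that concrete, here is a toy model in the spirit of such systems, trained on a handful of invented (location, time) records; note that it outputs a probability, never a certainty.

```python
# Toy "predictive policing" model on invented data. Real systems
# combine dozens of factors and vastly more history; this only
# illustrates that the output is a probability, not a verdict.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per (location, time) cell: [is_night, is_business_district,
# burglaries_in_last_90_days]; label: burglary occurred next week.
X = np.array([
    [1, 1, 4], [1, 1, 2], [0, 1, 0], [0, 0, 3],
    [1, 0, 0], [0, 0, 1], [1, 1, 5], [0, 1, 1],
])
y = np.array([1, 1, 0, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)

cell = np.array([[1, 1, 3]])  # business district, at night, 3 recent burglaries
print(model.predict_proba(cell)[0, 1])  # a probability strictly below 1.0
```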

What data can do is give us probabilities, rather than certainty. Good data coupled with good analysis can give us very good estimates of probability. If you sum probabilities over many instances, you can usually get a robust estimate of the total.

For example, data analysis can provide a probability that a particular house will be broken into on a particular day based on historical records for similar houses in that neighborhood on similar days. An insurance company may add this up over all days in a year to decide how much to charge for insuring that house….(More)”
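
A toy version of that arithmetic, with invented numbers:

```python
# Sum invented daily break-in probabilities over a year to get the
# expected number of incidents, which can then price a policy.
daily_prob = [0.0002] * 250 + [0.0005] * 115  # quieter days, riskier days
expected_breakins = sum(daily_prob)            # 0.1075 expected incidents
print(round(expected_breakins, 4))

# If an average claim costs $8,000 (also invented), the fair annual
# premium before overhead is about:
print(round(expected_breakins * 8000, 2))      # 860.0
```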

Questioning Smart Urbanism: Is Data-Driven Governance a Panacea?


At the Chicago Policy Review: “In the era of data explosion, urban planners are increasingly relying on real-time, streaming data generated by “smart” devices to assist with city management. “Smart cities,” referring to cities that implement pervasive and ubiquitous computing in urban planning, are widely discussed in academia, business, and government. These cities are characterized not only by their use of technology but also by their innovation-driven economies and collaborative, data-driven city governance. Smart urbanism can seem like an effective strategy to create more efficient, sustainable, productive, and open cities. However, there are emerging concerns about the potential risks in the long-term development of smart cities, including the political neutrality of big data, technocratic governance, technological lock-ins, data and network security, and privacy risks.

In a study entitled, “The Real-Time City? Big Data and Smart Urbanism,” Rob Kitchin provides a critical reflection on the potential negative effects of data-driven city governance on social development—a topic he claims deserves greater governmental, academic, and social attention.

In contrast to traditional datasets that rely on samples or are aggregated to a coarse scale, “big data” is huge in volume, high in velocity, and diverse in variety. Since the early 2000s, there has been explosive growth in data volume due to the rapid development and implementation of technology infrastructure, including networks, information management, and data storage. Big data can be generated from directed, automated, and volunteered sources. Automated data generation is of particular interest to urban planners. One example Kitchin cites is urban sensor networks, which allow city governments to monitor the movements and statuses of individuals, materials, and structures throughout the urban environment by analyzing real-time data.

With the huge amount of streaming data collected by smart infrastructure, many city governments use real-time analysis to manage different aspects of city operations. There has been a recent trend in centralizing data streams into a single hub, integrating all kinds of surveillance and analytics. These one-stop data centers make it easier for analysts to cross-reference data, spot patterns, identify problems, and allocate resources. The data are also often accessible by field workers via operations platforms. In London and some other cities, real-time data are visualized on “city dashboards” and communicated to citizens, providing convenient access to city information.

However, the real-time city is not a flawless solution to all the problems faced by city managers. The primary concern is the politics of big, urban data. Although raw data are often perceived as neutral and objective, no data are free of bias; the collection of data is a subjective process that can be shaped by various confounding factors. The presentation of data can also be manipulated to answer a specific question or enact a particular political vision….(More)”

The Power of Nudges, for Good and Bad


Richard H. Thaler in the New York Times: “Nudges, small design changes that can markedly affect individual behavior, have been catching on. These techniques rely on insights from behavioral science, and when used ethically, they can be very helpful. But we need to be sure that they aren’t being employed to sway people to make bad decisions that they will later regret.

Whenever I’m asked to autograph a copy of “Nudge,” the book I wrote with Cass Sunstein, the Harvard law professor, I sign it, “Nudge for good.” Unfortunately, that is meant as a plea, not an expectation.

Three principles should guide the use of nudges:

■ All nudging should be transparent and never misleading.

■ It should be as easy as possible to opt out of the nudge, preferably with as little as one mouse click.

■ There should be good reason to believe that the behavior being encouraged will improve the welfare of those being nudged.

As far as I know, the government teams in Britain and the United States that have focused on nudging have followed these guidelines scrupulously. But the private sector is another matter. In this domain, I see much more troubling behavior.

For example, last spring I received an email telling me that the first prominent review of a new book of mine had appeared: It was in The Times of London. Eager to read the review, I clicked on a hyperlink, only to run into a pay wall. Still, I was tempted by an offer to take out a one-month trial subscription for the price of just £1. As both a consumer and producer of newspaper articles, I have no beef with pay walls. But before signing up, I read the fine print. As expected, I would have to provide credit card information and would be automatically enrolled as a subscriber when the trial period expired. The subscription rate would then be £26 (about $40) a month. That wasn’t a concern because I did not intend to become a paying subscriber. I just wanted to read that one article.

But the details turned me off. To cancel, I had to give 15 days’ notice, so the one-month trial offer actually was good for just two weeks. What’s more, I would have to call London, during British business hours, and not on a toll-free number. That was both annoying and worrying. As an absent-minded American professor, I figured there was a good chance I would end up subscribing for several months, and that reading the article would end up costing me at least £100….

These examples are not unusual. Many companies are nudging purely for their own profit and not in customers’ best interests. In a recent column in The New York Times, Robert Shiller called such behavior “phishing.” Mr. Shiller and George Akerlof, both Nobel-winning economists, have written a book on the subject, “Phishing for Phools.”

Some argue that phishing — or evil nudging — is more dangerous in government than in the private sector. The argument is that government is a monopoly with coercive power, while we have more choice in the private sector over which newspapers we read and which airlines we fly.

I think this distinction is overstated. In a democracy, if a government creates bad policies, it can be voted out of office. Competition in the private sector, however, can easily work to encourage phishing rather than stifle it.

One example is the mortgage industry in the early 2000s. Borrowers were encouraged to take out loans that they could not repay when real estate prices fell. Competition did not eliminate this practice, because it was hard for anyone to make money selling the advice “Don’t take that loan.”

As customers, we can help one another by resisting these come-ons. The more we turn down questionable offers like trip insurance and scrutinize “one month” trials, the less incentive companies will have to use such schemes. Conversely, if customers reward firms that act in our best interests, more such outfits will survive and flourish, and the options available to us will improve….(More)

Open Data as Open Educational Resources: Case studies of emerging practice


Book edited by Javiera Atenas and Leo Havemann: “…is the outcome of a collective effort that has its origins in the 5th Open Knowledge Open Education Working Group call, in which the idea of using Open Data in schools was mentioned. It occurred to us that Open Data and open educational resources seemed almost to exist in separate open worlds.

We decided to seek out evidence of the use of open data as OER, initially by conducting a bibliographical search. As we could not find published evidence, we decided to ask educators whether they were, in fact, using open data in this way, and wrote a post for this blog (with Ernesto Priego) explaining our perspective, called The 21st Century’s Raw Material: Using Open Data as Open Educational Resources. We ended the post with a link to an exploratory survey, the results of which indicated a need for more awareness of the existence and potential value of Open Data amongst educators…..

The case studies themselves have been provided by scholars and practitioners from different disciplines and countries, and they reflect different approaches to the use of open data. The first case study presents an approach to educating both teachers and students in the use of open data for civil monitoring via Scuola di OpenCoesione in Italy, and has been written by Chiara Ciociola and Luigi Reggi. The second case, by Tim Coughlan from the Open University, UK, showcases practical applications in the use of local and contextualised open data for the development of apps. The third case, written by Katie Shamash, Juan Pablo Alperin & Alessandra Bordini from Simon Fraser University, Canada, demonstrates how publishing students can engage, through data analysis, in very current debates around scholarly communications and be encouraged to publish their own findings. The fourth case, by Alan Dix from Talis and University of Birmingham, UK, and Geoffrey Ellis from University of Konstanz, Germany, is unique because the data discussed in this case is self-produced, indeed ‘quantified self’ data, which was used with students as material for class discussion and, separately, as source data for another student’s dissertation project. Finally, the fifth case, presented by Virginia Power from University of the West of England, UK, examines strategies to develop data and statistical literacies in future librarians and knowledge managers, aiming to support and extend their theoretical understanding of the concept of the ‘knowledge society’ through the use of Open Data….(More)

The book can be downloaded here: Open Data as Open Educational Resources.