Looking for the Needle in a Stack of Needles: Tracking Shadow Economic Activities in the Age of Big Data


Manju Bansal in MIT Technology Review: “The undocumented guys hanging out in the home-improvement-store parking lot looking for day labor, the neighborhood kids running a lemonade stand, and Al Qaeda terrorists plotting to do harm all have one thing in common: They operate in the underground economy, a shadowy zone where businesses, both legitimate and less so, transact in the currency of opportunity, away from traditional institutions and their watchful eyes.
One might think that this alternative economy is limited to markets that are low on the Transparency International rankings (sub-Saharan Africa and South Asia, for instance). However, a recent University of Wisconsin report estimates the value of the underground economy in the United States at about $2 trillion, roughly 15% of total U.S. GDP. And a 2013 study coauthored by Friedrich Schneider, a noted authority on global shadow economies, estimated the European Union’s underground economy at more than 18% of GDP, or a whopping 2.1 trillion euros. More than two-thirds of the underground activity came from the most developed countries, including Germany, France, Italy, Spain, and the United Kingdom.
Underground economic activity is a multifaceted phenomenon, with implications across the board for national security, tax collections, public-sector services, and more. It includes the activity of any business that relies primarily on old-fashioned cash for most transactions — ranging from legitimate businesses (including lemonade stands) to drug cartels and organized crime.
Though it’s often soiled, heavy to lug around, and easy to lose to theft, cash is still king simply because it is so easy to hide from the authorities. With the help of the right bank or financial institution, “dirty” money can easily be laundered and come out looking fresh and clean, or at least legitimate. Case in point is the global bank HSBC, which agreed to pay U.S. regulators $1.9 billion in fines to settle charges of money laundering on behalf of Mexican drug cartels. According to a U.S. Senate subcommittee report, that process involved transferring $7 billion in cash from the bank’s branches in Mexico to those in the United States. Just for reference, each $100 bill weighs one gram, so to transfer $7 billion, HSBC had to physically transport 70 metric tons of cash across the U.S.-Mexican border.
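That back-of-the-envelope figure is easy to verify; a quick sketch of the arithmetic:

```python
# Checking the HSBC cash-weight figure: each $100 bill weighs
# roughly one gram, per the article.
total_dollars = 7_000_000_000
bill_value = 100
grams_per_bill = 1

bills = total_dollars // bill_value                  # 70 million bills
weight_tonnes = bills * grams_per_bill / 1_000_000   # grams -> metric tons

print(weight_tonnes)  # -> 70.0
```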
The Financial Action Task Force, an intergovernmental body established in 1989, has estimated the total amount of money laundered worldwide to be around 2% to 5% of global GDP. Many of these transactions seem, at first glance, to be perfectly legitimate. Therein lies the conundrum for a banker or a government official: How do you identify, track, control, and, one hopes, prosecute money launderers, when they are hiding in plain sight and their business is couched in networked layers of perfectly defensible legitimacy?
Enter big-data tools, such as those provided by SynerScope, a Holland-based startup that is a member of the SAP Startup Focus program. This company’s solutions help unravel the complex networks hidden behind the layers of transactions and interactions.
Networks, good or bad, are nearly omnipresent in any form of organized human activity, and particularly in banking and insurance. SynerScope takes both structured and unstructured data and transforms it into interactive visuals whose graphic patterns let humans quickly make sense of the information. The ability to spot deviations in complex networked processes can readily be applied to fraud detection in insurance, banking, e-commerce, and forensic accounting.
SynerScope’s approach to big-data business intelligence is centered on data-intense compute and visualization that extend the human “sense-making” capacity in much the same way that a telescope or microscope extends human vision.
To understand how SynerScope helps authorities track and halt money laundering, it’s important to understand how the networked laundering process works. It typically involves three stages.
1. In the initial, or placement, stage, launderers introduce their illegal profits into the financial system. This might be done by breaking up large amounts of cash into less-conspicuous smaller sums that are then deposited directly into a bank account, or by purchasing a series of monetary instruments (checks, money orders) that are then collected and deposited into accounts at other locations.
2. After the funds have entered the financial system, the launderer commences the second stage, called layering, which uses a series of conversions or transfers to distance the funds from their sources. The funds might be channeled through the purchase and sales of investment instruments, or the launderer might simply wire the funds through a series of accounts at various banks worldwide. 
Such use of widely scattered accounts for laundering is especially prevalent in those jurisdictions that do not cooperate in anti-money-laundering investigations. Sometimes the launderer disguises the transfers as payments for goods or services.
3. Having successfully processed the criminal profits through the first two phases, the launderer then proceeds to the third stage, integration, in which the funds re-enter the legitimate economy. The launderer might invest the funds in real estate, luxury assets, or business ventures.
Current detection tools compare individual transactions against preset profiles and rules. Sophisticated criminals quickly learn how to make their illicit transactions look normal for such systems. As a result, rules and profiles need constant and costly updating.
But SynerScope’s flexible visual analysis uses a network angle to detect money laundering. It shows the structure of the entire network with data coming in from millions of transactions, a structure that launderers cannot control. With just a few mouse clicks, SynerScope’s relation and sequence views reveal structural interrelationships and interdependencies. When those patterns are mapped on a time scale, it becomes virtually impossible to hide abnormal flows.

SynerScope’s relation and sequence views reveal structural and temporal transaction patterns which make it virtually impossible to hide abnormal money flows.”
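The network-level idea, flagging accounts by their structural position rather than by per-transaction rules, can be sketched in a few lines. This is purely illustrative (toy data, an arbitrary fan-in threshold), not SynerScope's actual algorithm:

```python
# Illustrative sketch: build a directed transaction network and flag
# accounts that collect funds from many sources and forward them on,
# a structural pattern an individual launderer cannot easily disguise.
from collections import defaultdict

transactions = [  # (sender, receiver, amount) -- toy data
    ("acct_a", "hub", 9000), ("acct_b", "hub", 8500),
    ("acct_c", "hub", 9900), ("hub", "acct_d", 27000),
    ("acct_e", "acct_f", 120),
]

in_degree = defaultdict(int)
out_degree = defaultdict(int)
for sender, receiver, _ in transactions:
    out_degree[sender] += 1
    in_degree[receiver] += 1

# Flag accounts with high fan-in that also forward funds onward.
suspicious = [a for a in set(in_degree) | set(out_degree)
              if in_degree[a] >= 3 and out_degree[a] >= 1]
print(suspicious)  # -> ['hub']
```

Real systems would weight edges by amount and timing, but even this degree count shows why network structure is harder to fake than any single transaction.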

Using data to treat the sickest and most expensive patients


Dan Gorenstein for Marketplace (radio):  “Driving to a big data conference a few weeks back, Dr. Jeffrey Brenner brought his compact SUV to a full stop – in the middle of a short highway entrance ramp in downtown Philadelphia…

Here’s what you need to know about Dr. Jeffrey Brenner: He really likes to figure out how things work. And he’s willing to go to extremes to do it – so far that he’s risking his health policy celebrity status.
Perhaps it’s not the smartest move from a guy who just last fall was named a MacArthur Genius, but this month, Brenner began to test his theory for treating some of the sickest and most expensive patients.
“We can actually take the sickest and most complicated patients, go to their bedside, go to their home, go with them to their appointments and help them for about 90 days and dramatically improve outcomes and reduce cost,” he says.
That’s the theory anyway. Like many ideas when it comes to treating the sickest patients, there’s little data to back up that it works.
Brenner’s willing to risk his reputation precisely because he’s not positive his approach for treating folks who cycle in and out of the healthcare system — “super-utilizers” — actually works.
“It’s really easy for me at this point having gotten a MacArthur award to simply declare what we do works and to drive this work forward without rigorously testing it,” Brenner said. “We are not going to do that,” he said. “We don’t think that’s the right thing to do. So we are going to do a randomized controlled trial on our work and prove whether it works and how well it works.”
Helping lower costs and improve care for the super-utilizers is one of the most pressing policy questions in healthcare today. And given its importance, there is a striking lack of data in the field.
People like to call randomized controlled trials (RCTs) the gold standard of scientific testing because two groups are randomly assigned – one gets the treatment, while the other doesn’t – and researchers closely monitor differences.
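The random assignment at the heart of an RCT can be sketched in a few lines (hypothetical patient IDs; real trials use far more careful randomization and blinding):

```python
# Minimal sketch of randomized assignment to treatment and control arms.
import random

random.seed(42)  # fixed seed so the split is reproducible

patients = [f"patient_{i}" for i in range(100)]
random.shuffle(patients)

treatment = patients[:50]   # receives the care-team intervention
control = patients[50:]     # receives usual care

# Randomization means any later difference in outcomes between the two
# groups can be attributed to the intervention, not to who was selected.
assert set(treatment).isdisjoint(control)
```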
But a 2012 British Medical Journal article found that over the last 25 years, a total of six RCTs have focused on care delivery for super-utilizers.

…Every major health insurance company – Medicare and Medicaid, too – has spent billions on programs for super-utilizers. The absence of rigorous evidence raises the question: Is all this effort built on health policy quicksand?
Not being 100 percent sure can be dangerous, says Duke behavioral scientist Peter Ubel, particularly in healthcare.
Ubel said that back in the 1980s and ’90s, doctors prescribed certain drugs for irregular heartbeats. The medication, he said, made those weird rhythms go away, leaving beautiful-looking EKGs.
“But no one had tested whether people receiving these drugs actually lived longer, and many people thought, ‘Why would you do that? We can look at their cardiogram and see that they’re getting better,’” Ubel said. “Finally when somebody put that evidence to the test of a randomized trial, it turned out that these drugs killed people.”
WellPoint’s Nussbaum said he hoped Brenner’s project would inspire others to follow his lead and insert data into the discussion.
“I believe more people should be bold in challenging the status quo of our delivery system,” Nussbaum said. “The Jeff Brenners of the world should be embraced. We should be advocating for them to take on these studies.”
So why aren’t more healthcare luminaries putting their brilliance to the test? There are a couple of reasons.
Harvard economist Kate Baicker said until now there have been few personal incentives pushing people.
“If you’re focused on branding and spreading your brand, you have no incentive to say, ‘How good is my brand after all?’” she said.
And Venrock healthcare venture capitalist Bob Kocher said no one would fault Brenner if he put his brand before science, an age-old practice in this business.
“Healthcare has benefitted from the fact that you don’t understand it. It’s a bit of an art, and it hasn’t been a science,” he said. “You made money in healthcare by putting a banner outside your building saying you are a top something without having to justify whether you really are top at whatever you do.”
Duke’s Ubel said it’s too easy – and frankly, wrong – to say the main reason doctors avoid these rigorous studies is because they’re afraid to lose money and status. He said doctors aren’t immune from the very human trap of being sure their own ideas are right.
He says psychologists call it confirmation bias.
“Everything you see is filtered through your hopes, your expectations and your pre-existing beliefs,” Ubel said. “And that’s why I might look at a grilled cheese sandwich and see a grilled cheese sandwich and you might see an image of Jesus.”
Even with all these hurdles, MIT economist Amy Finkelstein – who is running the RCT with Brenner – sees change coming.
“Providers have a lot more incentive now than they used to,” she said. “They have much more skin in the game.”
Finkelstein said hospital readmission penalties and new ways to pay doctors are bringing market incentives that have long been missing.
Brenner said he accepts that the truth of what he’s doing in Camden may be messier than the myth.

Digital Humanitarians


New book by Patrick Meier on how big data is changing humanitarian response: “The overflow of information generated during disasters can be as paralyzing to humanitarian response as the lack of information. This flash flood of information when amplified by social media and satellite imagery is increasingly referred to as Big Data—or Big Crisis Data. Making sense of Big Crisis Data during disasters is proving an impossible challenge for traditional humanitarian organizations, which explains why they’re increasingly turning to Digital Humanitarians.
Who exactly are these Digital Humanitarians? They’re you, me, all of us. Digital Humanitarians are volunteers and professionals from the world over and from all walks of life. What do they share in common? The desire to make a difference, and they do that by rapidly mobilizing online in collaboration with international humanitarian organizations. They make sense of vast volumes of social media and satellite imagery in virtually real-time to support relief efforts worldwide. How? They craft and leverage ingenious crowdsourcing solutions with trail-blazing insights from artificial intelligence.
In sum, this book charts the sudden and spectacular rise of Digital Humanitarians by sharing their remarkable, real-life stories, highlighting how their humanity coupled with innovative solutions to Big Data is changing humanitarian response forever. Digital Humanitarians will make you think differently about what it means to be humanitarian and will invite you to join the journey online.
Click here to be notified when the book becomes available. For speaking requests, please email [email protected].”

Passage Of The DATA Act Is A Major Advance In Government Transparency


OpEd by Hudson Hollister in Forbes: “Even as the debate over official secrecy grows on Capitol Hill, basic information about our government’s spending remains hidden in plain sight.
Information that is technically public — federal finance, awards, and expenditures — is effectively locked within a disconnected disclosure system that relies on outdated paper-based technology. Budgets, grants, contracts, and disbursements are reported manually and separately, using forms and spreadsheets. Researchers seeking insights into federal spending must invest time and resources crafting data sets out of these documents. Without common data standards across all government spending, analyses of cross-agency spending trends require endless conversions of apples to oranges.
For a nation whose tech industry leads the world, there is no reason to allow this antiquated system to persist.
That’s why we’re excited to welcome Thursday’s unanimous Senate approval of the Digital Accountability and Transparency Act — known as the DATA Act.
The DATA Act will mandate government-wide standards for federal spending data. It will also require agencies to publish this information online, fully searchable and open to everyone.
Watchdogs and transparency advocates from across the political spectrum have endorsed the DATA Act because all Americans will benefit from clear, accessible information about how their tax dollars are being spent.
It is darkly appropriate that the only organized opposition to this bill took place behind closed doors. In January, Senate sponsors Mark Warner (D-VA) and Rob Portman (R-OH) rejected amendments offered privately by the White House Office of Management and Budget. These nonpublic proposals would have gutted the DATA Act’s key data standards requirement. But Warner and Portman went public with their opposition, and Republicans and Democrats agreed to keep a strong standards mandate.
We now await swift action by the House of Representatives to pass this bill and put it on the President’s desk.
The tech industry is already delivering the technology and expertise that will use federal spending data, once it is open and standardized, to solve problems.
If the DATA Act is fully enforced, citizens will be able to track government spending on a particular contractor or from a particular program, payment by payment. Agencies will be able to deploy sophisticated Big Data analytics to illuminate, and eliminate, waste and fraud. And states and universities will be able to automate their complex federal grant reporting tasks, freeing up more tax dollars for their intended use. Our industry can perform these tasks — as soon as we get the data.
Chairman Earl Devaney’s Recovery Accountability and Transparency Board proved this is possible. Starting in 2009, the Recovery Board applied data standards to track stimulus spending. Our members’ software used that data to help inspectors general prevent and recover over $100 million in spending on suspicious grantees and contractors. The DATA Act applies that approach across the whole of government spending.
Congress is now poised to pass this landmark legislative mandate to transform spending from disconnected documents into open data. Next, the executive branch must implement that mandate.
So our Coalition’s work continues. We will press the Treasury Department and the White House to adopt robust, durable, and nonproprietary data standards for federal spending.
And we won’t stop with spending transparency. The American people deserve access to open data across all areas of government activity — financial regulatory reporting, legislative actions, judicial filings, and much more….”

Let’s get geeks into government


Gillian Tett in the Financial Times: “Fifteen years ago, Brett Goldstein seemed to be just another tech entrepreneur. He was working as IT director of OpenTable, then a start-up website for restaurant bookings. The company was thriving – and subsequently did a very successful initial public offering. Life looked very sweet for Goldstein. But when the World Trade Center was attacked in 2001, Goldstein had a moment of epiphany. “I spent seven years working in a startup but, directly after 9/11, I knew I didn’t want my whole story to be about how I helped people make restaurant reservations. I wanted to work in public service, to give something back,” he recalls – not just by throwing cash into a charity tin, but by doing public service. So he swerved: in 2006, he attended the Chicago police academy and then worked for a year as a cop in one of the city’s toughest neighbourhoods. Later he pulled the disparate parts of his life together and used his number-crunching skills to build the first predictive data system for the Chicago police (and one of the first in any western police force), to indicate where crime was likely to break out.

This was such a success that Goldstein was asked by Rahm Emanuel, the city’s mayor, to create predictive data systems for the wider Chicago government. The fruits of this effort – which include a website known as “WindyGrid” – went live a couple of years ago, to considerable acclaim inside the techie scene.

This tale might seem unremarkable. We are all used to hearing politicians, business leaders and management consultants declare that the computing revolution is transforming our lives. And as my colleague Tim Harford pointed out in these pages last week, the idea of using big data is now wildly fashionable in the business and academic worlds….

In America when top bankers become rich, they often want to “give back” by having a second career in public service: just think of all those Wall Street financiers who have popped up at the US Treasury in recent years. But hoodie-wearing geeks do not usually do the same. Sure, there are some former techie business leaders who are indirectly helping government. Steve Case, a co-founder of AOL, has supported White House projects to boost entrepreneurship and combat joblessness. Tech entrepreneurs also make huge donations to philanthropy. Facebook’s Mark Zuckerberg, for example, has given funds to Newark education. And the whizz-kids have also occasionally been summoned by the White House in times of crisis. When there was a disastrous launch of the government’s healthcare website late last year, the Obama administration enlisted the help of some of the techies who had been involved with the president’s election campaign.

But what you do not see is many tech entrepreneurs doing what Goldstein did: deciding to spend a few years in public service, as a government employee. There aren’t many Zuckerberg types striding along the corridors of federal or local government.
. . .
It is not difficult to work out why. To most young entrepreneurs, the idea of working in a state bureaucracy sounds like utter hell. But if there was ever a time when it might make sense for more techies to give back by doing stints of public service, that moment is now. The civilian public sector badly needs savvier tech skills (just look at the disaster of that healthcare website for evidence of this). And as the sector’s founders become wealthier and more powerful, they need to show that they remain connected to society as a whole. It would be smart political sense.
So I applaud what Goldstein has done. I also welcome that he is now trying to persuade his peers to do the same, and that places such as the University of Chicago (where he teaches) and New York University are trying to get more young techies to think about working for government in between doing those dazzling IPOs. “It is important to see more tech entrepreneurs in public service. I am always encouraging people I know to do a ‘stint in government’. I tell them that giving back cannot just be about giving money; we need people from the tech world to actually work in government,” Goldstein says.

But what is really needed is for more technology CEOs and leaders to get involved by actively talking about the value of public service – or even encouraging their employees to interrupt their private-sector careers with the occasional spell as a government employee (even if it is not in a sector quite as challenging as the police). Who knows? Maybe it could be Sheryl Sandberg’s next big campaigning mission. After all, if she does ever jump back to Washington, that could have a powerful demonstration effect for techie women and men. And shake DC a little too.”

Eight (No, Nine!) Problems With Big Data


Gary Marcus and Ernest Davis in the New York Times: “BIG data is suddenly everywhere. Everyone seems to be collecting it, analyzing it, making money from it and celebrating (or fearing) its powers. Whether we’re talking about analyzing zillions of Google search queries to predict flu outbreaks, or zillions of phone records to detect signs of terrorist activity, or zillions of airline stats to find the best time to buy plane tickets, big data is on the case. By combining the power of modern computing with the plentiful data of the digital era, it promises to solve virtually any problem — crime, public health, the evolution of grammar, the perils of dating — just by crunching the numbers.

Or so its champions allege. “In the next two decades,” the journalist Patrick Tucker writes in the latest big data manifesto, “The Naked Future,” “we will be able to predict huge areas of the future with far greater accuracy than ever before in human history, including events long thought to be beyond the realm of human inference.” Statistical correlations have never sounded so good.

Is big data really all it’s cracked up to be? There is no doubt that big data is a valuable tool that has already had a critical impact in certain areas. For instance, almost every successful artificial intelligence computer program in the last 20 years, from Google’s search engine to the I.B.M. “Jeopardy!” champion Watson, has involved the substantial crunching of large bodies of data. But precisely because of its newfound popularity and growing use, we need to be levelheaded about what big data can — and can’t — do.

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.
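The point is easy to demonstrate: any two series that trend the same way will correlate strongly. The numbers below are illustrative stand-ins, not the real murder-rate or browser-share data:

```python
# Two series that both decline over the same years correlate strongly
# even with no causal link between them.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

murder_rate = [5.8, 5.7, 5.4, 5.0, 4.8, 4.7]  # per 100k, 2006-2011 (illustrative)
ie_share = [80, 75, 68, 60, 52, 45]           # browser market share, % (illustrative)

r = pearson(murder_rate, ie_share)
print(round(r, 2))  # strongly positive -- yet plainly not causal
```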

Second, big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement. Molecular biologists, for example, would very much like to be able to infer the three-dimensional structure of proteins from their underlying DNA sequence, and scientists working on the problem use big data as one tool among many. But no scientist thinks you can solve this problem by crunching data alone, no matter how powerful the statistical analysis; you will always need to start with an analysis that relies on an understanding of physics and biochemistry.

Third, many tools that are based on big data can be easily gamed. For example, big data programs for grading student essays often rely on measures like sentence length and word sophistication, which are found to correlate well with the scores given by human graders. But once students figure out how such a program works, they start writing long sentences and using obscure words, rather than learning how to actually formulate and write clear, coherent text. Even Google’s celebrated search engine, rightly seen as a big data success story, is not immune to “Google bombing” and “spamdexing,” wily techniques for artificially elevating website search placement.

Fourth, even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data. In 2009, Google reported — to considerable fanfare — that by analyzing flu-related search queries, it had been able to detect the spread of the flu as accurately as, and more quickly than, the Centers for Disease Control and Prevention. A few years later, though, Google Flu Trends began to falter; for the last two years it has made more bad predictions than good ones.

As a recent article in the journal Science explained, one major contributing cause of the failures of Google Flu Trends may have been that the Google search engine itself constantly changes, such that patterns in data collected at one time do not necessarily apply to data collected at another time. As the statistician Kaiser Fung has noted, collections of big data that rely on web hits often merge data that was collected in different ways and with different purposes — sometimes to ill effect. It can be risky to draw conclusions from data sets of this kind.

A fifth concern might be called the echo-chamber effect, which also stems from the fact that much of big data comes from the web. Whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound. Consider translation programs like Google Translate, which draw on many pairs of parallel texts from different languages — for example, the same Wikipedia entry in two different languages — to discern the patterns of translation between those languages. This is a perfectly reasonable strategy, except for the fact that with some of the less common languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.
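A quick simulation makes that arithmetic concrete: test 100 pairs of genuinely unrelated variables at the conventional p < 0.05 threshold and count how many look "significant" purely by chance.

```python
# Multiple-comparisons simulation: 100 correlation tests on pairs of
# independent random variables, each with 30 samples.
import random

random.seed(0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

false_positives = 0
for _ in range(100):
    xs = [random.gauss(0, 1) for _ in range(30)]
    ys = [random.gauss(0, 1) for _ in range(30)]
    # For n = 30, |r| > 0.361 corresponds roughly to p < 0.05 (two-tailed).
    if abs(pearson(xs, ys)) > 0.361:
        false_positives += 1

print(false_positives)  # typically around 5 of the 100 tests
```

None of these variables has any real connection to another, yet a handful of pairs pass the significance test anyway.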

Seventh, big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions. In the past few months, for instance, there have been two separate attempts to rank people in terms of their “historical importance” or “cultural contributions,” based on data drawn from Wikipedia. One is the book “Who’s Bigger? Where Historical Figures Really Rank,” by the computer scientist Steven Skiena and the engineer Charles Ward. The other is an M.I.T. Media Lab project called Pantheon.

Both efforts get many things right — Jesus, Lincoln and Shakespeare were surely important people — but both also make some egregious errors. “Who’s Bigger?” claims that Francis Scott Key was the 19th most important poet in history; Pantheon has claimed that Nostradamus was the 20th most important writer in history, well ahead of Jane Austen (78th) and George Eliot (380th). Worse, both projects suggest a misleading degree of scientific precision with evaluations that are inherently vague, or even meaningless. Big data can reduce anything to a single number, but you shouldn’t be fooled by the appearance of exactitude.

FINALLY, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, often rely heavily on something called trigrams: sequences of three words in a row (like “in a row”). Reliable statistical information can be compiled about common trigrams, precisely because they appear frequently. But no existing body of data will ever be large enough to include all the trigrams that people might use, because of the continuing inventiveness of language.

To select an example more or less at random, a book review that the actor Rob Lowe recently wrote for this newspaper contained nine trigrams such as “dumbed-down escapist fare” that had never before appeared anywhere in all the petabytes of text indexed by Google. To witness the limitations that big data can have with novelty, Google-translate “dumbed-down escapist fare” into German and then back into English: out comes the incoherent “scaled-flight fare.” That is a long way from what Mr. Lowe intended — and from big data’s aspirations for translation.
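The trigram mechanism the authors describe is itself simple to sketch:

```python
# Extract word trigrams ("sequences of three words in a row") from text.
def trigrams(text):
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

sample = "no existing body of data will ever be large enough"
print(trigrams(sample)[:2])
# -> [('no', 'existing', 'body'), ('existing', 'body', 'of')]
```

Counting how often each trigram appears in a large corpus gives the statistics such systems rely on; the article's point is that a never-before-seen trigram like "dumbed-down escapist fare" has a count of zero no matter how big the corpus is.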

Wait, we almost forgot one last problem: the hype….

Smart cities are here today — and getting smarter


Computer World: “Smart cities aren’t a science fiction, far-off-in-the-future concept. They’re here today, with municipal governments already using technologies that include wireless networks, big data/analytics, mobile applications, Web portals, social media, sensors/tracking products and other tools.
These smart city efforts have lofty goals: Enhancing the quality of life for citizens, improving government processes and reducing energy consumption, among others. Indeed, cities are already seeing some tangible benefits.
But creating a smart city comes with daunting challenges, including the need to provide effective data security and privacy, and to ensure that myriad departments work in harmony.

The global urban population is expected to grow approximately 1.5% per year between 2025 and 2030, mostly in developing countries, according to the World Health Organization.

What makes a city smart? As with any buzz term, the definition varies. But in general, it refers to using information and communications technologies to deliver sustainable economic development and a higher quality of life, while engaging citizens and effectively managing natural resources.
Making cities smarter will become increasingly important. For the first time ever, the majority of the world’s population resides in a city, and this proportion continues to grow, according to the World Health Organization, the coordinating authority for health within the United Nations.
A hundred years ago, two out of every 10 people lived in an urban area, the organization says. As recently as 1990, less than 40% of the global population lived in a city — but by 2010 more than half of all people lived in an urban area. By 2050, the proportion of city dwellers is expected to rise to 70%.
As many city populations continue to grow, here’s what five U.S. cities are doing to help manage it all:

Scottsdale, Ariz.

The city of Scottsdale, Ariz., has several initiatives underway.
One is MyScottsdale, a mobile application the city deployed in the summer of 2013 that allows citizens to report cracked sidewalks, broken street lights and traffic lights, road and sewer issues, graffiti and other problems in the community….”

Open Data: What Is It and Why Should You Care?


Jason Shueh at Government Technology: “Though the debate about open data in government is an evolving one, it is indisputably here to stay — it can be heard in both houses of Congress, in state legislatures, and in city halls around the nation.
Already, 39 states and 46 localities provide data sets to data.gov, the federal government’s online open data repository. And 30 jurisdictions, including the federal government, have taken the additional step of institutionalizing their practices in formal open data policies.
Though the term “open data” is spoken of frequently — and has been since President Obama took office in 2009 — what it is and why it’s important aren’t always clear. That’s understandable, perhaps, given that open data lacks a unified definition.
“People tend to conflate it with big data,” said Emily Shaw, the national policy manager at the Sunlight Foundation, “and I think it’s useful to think about how it’s different from big data in the sense that open data is the idea that public information should be accessible to the public online.”
Shaw said the foundation, a Washington, D.C., non-profit advocacy group promoting open and transparent government, believes the term open data can be applied to a variety of information created or collected by public entities. Among the benefits of open data are improved measurement of policies, better government efficiency, deeper analytical insights, greater citizen participation, and a boost to local companies by way of products and services that use government data (think civic apps and software programs).
“The way I personally think of open data,” Shaw said, “is that it is a manifestation of the idea of open government.”

What Makes Data Open

For governments hoping to adopt open data in policy and in practice, simply making data available to the public isn’t enough to make that data useful. Open data, though straightforward in principle, requires a specific approach based on the agency or organization releasing it, the kind of data being released and, perhaps most importantly, its targeted audience.
According to the foundation’s California Open Data Handbook, published in collaboration with Stewards of Change Institute, a national group supporting innovation in human services, data must first be both “technically open” and “legally open.” The guide defines the terms in this way:
  • Technically open: [data] available in a machine-readable standard format, which means it can be retrieved and meaningfully processed by a computer application.
  • Legally open: [data] explicitly licensed in a way that permits commercial and non-commercial use and re-use without restrictions.
Technically open means that data is easily accessible to its intended audience. If the intended users are developers and programmers, Shaw said, the data should be presented within an application programming interface (API); if it’s intended for researchers in academia, data might be structured in a bulk download; and if it’s aimed at the average citizen, data should be available without requiring software purchases.
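The difference between data a computer can and cannot “meaningfully process” is easy to demonstrate. A minimal sketch using only Python’s standard library, with an invented permit dataset (the field names are hypothetical, not from any real portal):

```python
import csv
import io
import json

# A toy dataset as a portal might publish it in machine-readable CSV form.
CSV_DATA = """permit_id,issued,category
1001,2014-03-01,building
1002,2014-03-04,electrical
"""

def csv_to_records(csv_text):
    """Parse machine-readable CSV into records a program can work with."""
    return list(csv.DictReader(io.StringIO(csv_text)))

records = csv_to_records(CSV_DATA)
print(json.dumps(records, indent=2))  # ready to serve as an API response
```

The same table locked inside a PDF or a scanned image would convey identical information to a human reader but would fail the “technically open” test, since no straightforward program could retrieve the records.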
….

4 Steps to Open Data

Creating open data isn’t without its complexities. Many tasks need to happen before an open data project ever begins: a full endorsement from leadership is paramount, the project must be fitted into the existing workflow, and fears and misunderstandings must be allayed, as with any government project.
Once the basic table stakes are in place, the handbook prescribes four steps: choosing a data set, attaching an open license, releasing it in an appropriate format and ensuring the data is discoverable.
1. Choose a Data Set
Choosing a data set can appear daunting, but it doesn’t have to be. Shaw said ample resources are available from the foundation and others on how to get started with this — see our list of open data resources for more information. In the case of selecting a data set, or sets, she referred to the foundation’s recently updated guidelines that urge identifying data sets based on goals and the demand from citizen feedback.
2. Attach an Open License
Open licenses dispel ambiguity and encourage use. However, they need to be proactive, and this means users should not be forced to request the information in order to use it — a common symptom of data accessed through the Freedom of Information Act. Tips for reference can be found at Opendefinition.org, a site that has a list of examples and links to open licenses that meet the definition of open use.
3. Format the Data to Your Audience
As previously stated, Shaw recommends tailoring the format of data to the audience, with the ideal being that data is packaged in formats that can be digested by all users: developers, civic hackers, department staff, researchers and citizens. This could mean it’s put into APIs, spreadsheet docs, text and zip files, FTP servers and torrent networking systems (a way to download files from different sources). The file type and the system for download all depends on the audience.
“Part of learning about what formats government should offer data in is to engage with the prospective users,” Shaw said.
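One canonical dataset can feed several of the formats described above without maintaining separate copies. A sketch under that assumption, using an invented inspection dataset (the names and fields are illustrative only):

```python
import csv
import io
import json

# One canonical dataset, exported in two formats for two audiences.
INSPECTIONS = [
    {"site": "Main St Library", "score": 92},
    {"site": "Oak Ave Pool", "score": 87},
]

def to_json(records):
    """JSON for developers consuming an API."""
    return json.dumps(records)

def to_csv(records):
    """CSV for spreadsheet users and researchers doing bulk downloads."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["site", "score"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Keeping one source of truth and generating each format on export avoids the drift that creeps in when the API and the bulk download are maintained by hand.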
4. Make it Discoverable
If open data is strewn across multiple download links and wedged into various nooks and crannies of a website, it probably won’t be found. Shaw recommends a centralized hub that acts as a one-stop shop for all open data downloads. In many jurisdictions, these Web pages and websites have been called “portals”; they are the online repositories for a jurisdiction’s open data publishing.
“It is important for thinking about how people can become aware of what their governments hold. If the government doesn’t make it easy for people to know what kinds of data is publicly available on the website, it doesn’t matter what format it’s in,” Shaw said. She pointed to public participation — a recurring theme in open data development — as something to build into the process to improve accessibility.
Examples of portals can be found in numerous cities across the U.S., including San Francisco, New York, Los Angeles, Chicago and Sacramento, Calif.

Big data: are we making a big mistake?


Tim Harford in the Financial Times: “Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”. Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks. Absolute nonsense.”…
But big data do not solve the problem that has obsessed statisticians and scientists for centuries: the problem of insight, of inferring what is going on, and figuring out how we might intervene to change a system for the better.
“We have a new resource here,” says Professor David Hand of Imperial College London. “But nobody wants ‘data’. What they want are the answers.”
To use big data to produce such answers will require large strides in statistical methods.
“It’s the wild west right now,” says Patrick Wolfe of UCL. “People who are clever and driven will twist and turn and use every tool to get sense out of these data sets, and that’s cool. But we’re flying a little bit blind at the moment.”
Statisticians are scrambling to develop new methods to seize the opportunity of big data. Such new methods are essential but they will work by building on the old statistical lessons, not by ignoring them.
Recall big data’s four articles of faith. Uncanny accuracy is easy to overrate if we simply ignore false positives, as with Target’s pregnancy predictor. The claim that causation has been “knocked off its pedestal” is fine if we are making predictions in a stable environment but not if the world is changing (as with Flu Trends) or if we ourselves hope to change it. The promise that “N = All”, and therefore that sampling bias does not matter, is simply not true in most cases that count. As for the idea that “with enough data, the numbers speak for themselves” – that seems hopelessly naive in data sets where spurious patterns vastly outnumber genuine discoveries.
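The last claim, that the numbers speak for themselves, fails precisely because spurious patterns multiply with the number of comparisons: correlate enough pairs of pure noise and some will look impressively related. A quick simulation makes the point (the number of series and the 0.5 threshold are arbitrary choices):

```python
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# 200 short series of pure noise: no real relationships exist anywhere.
series = [[random.random() for _ in range(20)] for _ in range(200)]
strong = sum(
    1
    for i in range(len(series))
    for j in range(i + 1, len(series))
    if abs(corr(series[i], series[j])) > 0.5
)
print(strong)  # typically hundreds of "strong" correlations, none of them real
```

With nearly 20,000 pairs tested, a sizeable crop of spurious “discoveries” is guaranteed, which is why more data without statistical discipline just means more false patterns.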
“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.”

Potholes and Big Data: Crowdsourcing Our Way to Better Government


Phil Simon in Wired: “Big Data is transforming many industries and functions within organizations with relatively limited budgets.
Consider Thomas M. Menino, up until recently Boston’s longest-serving mayor. At some point in the past few years, Menino realized that it was no longer 1950. Perhaps he was hobnobbing with some techies from MIT at dinner one night. Whatever his motivation, he decided that there just had to be a better, more cost-effective way to maintain and fix the city’s roads. Maybe smartphones could help the city take a more proactive approach to road maintenance.
To that end, in July 2012, the Mayor’s Office of New Urban Mechanics launched a new project called Street Bump, an app that allows drivers to automatically report road hazards to the city as soon as they hear that unfortunate “thud,” with their smartphones doing all the work.
The app’s developers say their work has already sparked interest from other cities in the U.S., Europe, Africa and elsewhere that are imagining other ways to harness the technology.
Before they even start their trip, drivers using Street Bump fire up the app, then set their smartphones either on the dashboard or in a cup holder. The app takes care of the rest, using the phone’s accelerometer — a motion detector — to sense when a bump is hit. GPS records the location, and the phone transmits it to an AWS remote server.
But that’s not the end of the story. It turned out that the first version of the app reported far too many false positives (i.e., phantom potholes). This finding no doubt gave ammunition to the many naysayers who believe that technology will never be able to do what people can and that things are just fine as they are, thank you. Street Bump 1.0 “collected lots of data but couldn’t differentiate between potholes and other bumps.” After all, your smartphone or cell phone isn’t inert; it moves in the car naturally because the car is moving. And what about the scores of people whose phones “move” because they check their messages at a stoplight?
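The false-positive problem is visible even in a toy detector: flagging any sudden jump in vertical acceleration also flags a phone being picked up at a stoplight. A simplified sketch (the threshold and the readings are invented; Street Bump’s actual algorithm is not described in the excerpt):

```python
def detect_bumps(accel_z, threshold=2.0):
    """Flag sample indices where vertical acceleration jumps past a threshold."""
    return [
        i
        for i in range(1, len(accel_z))
        if abs(accel_z[i] - accel_z[i - 1]) > threshold
    ]

road = [0.1, 0.0, 3.2, 0.1, 0.0]      # a genuine pothole strike
handling = [0.0, 0.1, 2.6, 2.4, 0.2]  # a phone picked up at a stoplight
print(detect_bumps(road))       # flags the pothole
print(detect_bumps(handling))   # also flags the phone handling: a false positive
```

A naive threshold cannot tell the two signals apart, which is exactly the kind of distinction the crowdsourced improvements described next had to supply.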
To their credit, Menino and his motley crew weren’t entirely discouraged by this initial setback. In their gut, they knew that they were on to something. The idea and potential of the Street Bump app were worth pursuing and refining, even if the first version was a bit lacking. Plus, they have plenty of examples from which to learn. It’s not like the iPad, iPod, and iPhone haven’t evolved considerably over time.
Enter InnoCentive, a Massachusetts-based firm specializing in open innovation and crowdsourcing. The City of Boston contracted InnoCentive to improve Street Bump and reduce the amount of tail chasing. The company accepted the challenge and essentially turned it into a contest, a process sometimes called gamification. InnoCentive offered a network of 400,000 experts a share of $25,000 in prize money donated by Liberty Mutual.
Almost immediately, the ideas to improve Street Bump poured in from unexpected places. This crowd had wisdom. Ultimately, the best suggestions came from:

  • A group of hackers in Somerville, Massachusetts, that promotes community education and research
  • The head of the mathematics department at Grand Valley State University in Allendale, Michigan
  • An anonymous software engineer

…Crowdsourcing roadside maintenance isn’t just cool. Increasingly, projects like Street Bump are resulting in substantial savings — and better government.”