Todd Sherer, Chief Executive Officer of The Michael J. Fox Foundation for Parkinson’s Research, in Forbes: “The problem is, we all still work in a system that feeds on secrecy and competition. It’s hard enough work just to dream up win/win collaborative structures; getting them off the ground can feel like pushing a boulder up a hill. Yet there is no doubt that the realities of today’s research environment — everything from the accumulation of big data to the ever-shrinking availability of funds — demand new models for collaboration. Call it “collaboration 2.0.”… I share a few recent examples in the hope of increasing the reach of these initiatives, inspiring others like them, and encouraging frank commentary on how they’re working.
Open-Access Data
The successes of collaborations in the traditional sense, coupled with advanced techniques such as genomic sequencing, have yielded masses of data. Consortia of clinical sites around the world are working together to collect and characterize data and biospecimens through standardized methods, leading to ever-larger pools — more like Great Lakes — of data. Study investigators draw their own conclusions, but there is so much more to discover than any individual lab has the bandwidth for….
Crowdsourcing
A great way to grow engagement with resources you’re willing to share? Ask for it. Collaboration 2.0 casts a wide net. We dipped our toe in the crowdsourcing waters earlier this year with our Parkinson’s Data Challenge, which asked anyone interested to download a set of data that had been collected from PD patients and controls using smart phones. …
Cross-Disciplinary Collaboration 2.0
The more we uncover about the interconnectedness and complexity of the human system, the more proof we are gathering that findings and treatments for one disease may provide invaluable insights for others. We’ve seen some really intriguing crosstalk between the Parkinson’s and Alzheimer’s disease research communities recently…
The results should be: More ideas. More discovery. Better health.”
Crowd-Sourcing the Nation: Now a National Effort
Release from the U.S. Department of the Interior, U.S. Geological Survey: “The mapping crowd-sourcing program, known as The National Map Corps (TNMCorps), encourages citizens to collect structures data by adding new features, removing obsolete points, and correcting existing data for The National Map database. Structures being mapped in the project include schools, hospitals, post offices, police stations and other important public buildings.
Since the start of the project in 2012, more than 780 volunteers have made in excess of 13,000 contributions. In addition to basic editing, a second volunteer peer review process greatly enhances the quality of data provided back to The National Map. A few months ago, volunteers in 35 states were actively involved. This final release of states opens up the entire country for volunteer structures enhancement.
To show appreciation of our volunteers’ efforts, The National Map Corps has instituted a recognition program that awards “virtual” badges to volunteers. The badges consist of a series of antique surveying instruments ranging from the Order of the Surveyor’s Chain (25 – 50 points) to the Theodolite Assemblage (2000+ points). Additionally, volunteers are publicly acclaimed (with permission) via Twitter, Facebook and Google+….
Tools on the TNMCorps website explain how a volunteer can edit any area, regardless of their familiarity with the selected structures. Becoming a volunteer is easy: go to The National Map Corps website to learn more and to sign up. If you have access to the Internet and are willing to dedicate some time to editing map data, we hope you will consider participating!”
Five myths about big data
Samuel Arbesman, senior scholar at the Ewing Marion Kauffman Foundation and the author of “The Half-Life of Facts” in the Washington Post: “Big data holds the promise of harnessing huge amounts of information to help us better understand the world. But when talking about big data, there’s a tendency to fall into hyperbole. It is what compels contrarians to write such tweets as “Big Data, n.: the belief that any sufficiently large pile of s— contains a pony.” Let’s deflate the hype.
1. “Big data” has a clear definition.
The term “big data” has been in circulation since at least the 1990s, when it is believed to have originated in Silicon Valley. IBM offers a seemingly simple definition: Big data is characterized by the four V’s of volume, variety, velocity and veracity. But the term is thrown around so often, in so many contexts — science, marketing, politics, sports — that its meaning has become vague and ambiguous….
2. Big data is new.
By many accounts, big data exploded onto the scene quite recently. “If wonks were fashionistas, big data would be this season’s hot new color,” a Reuters report quipped last year. In a May 2011 report, the McKinsey Global Institute declared big data “the next frontier for innovation, competition, and productivity.”
It’s true that today we can mine massive amounts of data — textual, social, scientific and otherwise — using complex algorithms and computer power. But big data has been around for a long time. It’s just that exhaustive datasets were more exhausting to compile and study in the days when “computer” meant a person who performed calculations….
3. Big data is revolutionary.
In their new book, “Big Data: A Revolution That Will Transform How We Live, Work, and Think,” Viktor Mayer-Schonberger and Kenneth Cukier compare “the current data deluge” to the transformation brought about by the Gutenberg printing press.
If you want more precise advertising directed toward you, then yes, big data is revolutionary. Generally, though, it’s likely to have a modest and gradual impact on our lives….
4. Bigger data is better.
In science, some admittedly mind-blowing big-data analyses are being done. In business, companies are being told to “embrace big data before your competitors do.” But big data is not automatically better.
Really big datasets can be a mess. Unless researchers and analysts can reduce the number of variables and make the data more manageable, they get quantity without a whole lot of quality. Give me some quality medium data over bad big data any day…
5. Big data means the end of scientific theories.
Chris Anderson argued in a 2008 Wired essay that big data renders the scientific method obsolete: Throw enough data at an advanced machine-learning technique, and all the correlations and relationships will simply jump out. We’ll understand everything.
But you can’t just go fishing for correlations and hope they will explain the world. If you’re not careful, you’ll end up with spurious correlations. Even more important, to contend with the “why” of things, we still need ideas, hypotheses and theories. If you don’t have good questions, your results can be silly and meaningless.
Having more data won’t substitute for thinking hard, recognizing anomalies and exploring deep truths.”
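The correlation-fishing pitfall is easy to demonstrate with a quick simulation. The sketch below (toy data, invented for illustration) generates many completely unrelated random series and then searches for the strongest pairwise correlation; with enough comparisons, a seemingly impressive correlation appears by chance alone.

```python
# Toy simulation: 40 independent random series of 20 points each give 780
# pairwise comparisons. The strongest correlation found is spurious by
# construction, yet can look "significant" if you go fishing for it.
import random
import statistics

random.seed(0)
series = [[random.gauss(0, 1) for _ in range(20)] for _ in range(40)]

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

best = max(abs(corr(series[i], series[j]))
           for i in range(len(series)) for j in range(i + 1, len(series)))
print(f"strongest spurious |r| = {best:.2f}")
```

Nothing in the data means anything, yet the best correlation found is far from zero, which is exactly why hypotheses have to come before the data dredging.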
Announcing Project Open Data from Cloudant Labs
Yuriy Dybskiy from Cloudant: “There has been an emerging pattern over the last few years of more and more government datasets becoming available for public access. Earlier this year, the White House announced official policy on such data – Project Open Data.
Available resources
Here are four resources on the topic:
- Tim Berners-Lee: Open, Linked Data for a Global Community – [10 min video]
- Rufus Pollock: Open Data – How We Got Here and Where We’re Going – [24 min video]
- Open Knowledge Foundation Datasets – http://data.okfn.org/data
- Max Ogden: Project dat – collaborative data – [github repo]
One of the main challenges is access to the datasets. If only there were a database with easy access to its data baked right in.
Luckily, there are CouchDB and Cloudant, which share the same HTTP API for accessing data. That makes them a really great option for storing interesting datasets.
Cloudant Open Data
Today we are happy to announce a Cloudant Labs project – Cloudant Open Data!
Several datasets are available at the moment, for example, businesses_sf – data regarding businesses registered in San Francisco and sf_pd_incidents – a collection of incident reports (criminal and non-criminal) made by the San Francisco Police Department.
We’ll add more, but if you have one you’d like us to add faster – drop us a line at [email protected]
Create an account and play with these datasets yourself”
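Because CouchDB and Cloudant expose data over plain HTTP, reading a dataset needs no special client. The sketch below assumes a hypothetical account name ("examples"); the `_all_docs` endpoint and the response shape shown are standard CouchDB, and fetching the URL (e.g. with `urllib.request.urlopen`) returns JSON like the sample.

```python
# Sketch of reading rows from a CouchDB/Cloudant database over HTTP.
# The account name is an assumption for illustration; the dataset name
# businesses_sf is one of those mentioned above.
import json

def all_docs_url(account, db, limit=5):
    """Build the standard CouchDB/Cloudant _all_docs URL for a database."""
    return (f"https://{account}.cloudant.com/{db}/_all_docs"
            f"?include_docs=true&limit={limit}")

def doc_ids(response_text):
    """Pull document ids out of an _all_docs JSON response."""
    return [row["id"] for row in json.loads(response_text).get("rows", [])]

print(all_docs_url("examples", "businesses_sf"))

# Shape of a typical _all_docs response:
sample = '{"total_rows": 2, "offset": 0, "rows": [{"id": "a"}, {"id": "b"}]}'
print(doc_ids(sample))  # ['a', 'b']
```

The same two calls work against any CouchDB-compatible database, which is the point of "easy access baked right in": the data API and the storage layer are the same HTTP surface.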
Improved Governance? Exploring the Results of Peru's Participatory Budgeting Process
Paper by Stephanie McNulty for the 2013 Annual Meeting of the American Political Science Association (Aug. 29-Sept. 1, 2013): “Can a nationally mandated participatory budget process change the nature of local governance? Passed in 2003 to mandate participatory budgeting in all districts and regions of Peru, Peru’s National PB Law has garnered international attention from proponents of participatory governance. However, to date, the results of the process have not been widely documented. Presenting data that have been gathered through fieldwork, online databases, and primary documents, this paper explores the results of Peru’s PB after ten years of implementation. The paper finds that results are limited. While there are a significant number of actors engaged in the process, the PB is still dominated by elite actors who do not represent the diversity of the civil society sector in Peru. Participants approve important “pro-poor” projects, but they are not always executed. Finally, two important indicators of governance, sub-national conflict and trust in local institutions, have not improved over time. Until Peruvian politicians make a concerted effort to move beyond politics as usual, results will continue to be limited.”
Defense Against National Vulnerabilities in Public Data
DOD/DARPA Notice (See also Foreign Policy article): “OBJECTIVE: Investigate the national security threat posed by public data available either for purchase or through open sources. Based on principles of data science, develop tools to characterize and assess the nature, persistence, and quality of the data. Develop tools for the rapid anonymization and de-anonymization of data sources. Develop framework and tools to measure the national security impact of public data and to defend against the malicious use of public data against national interests.
DESCRIPTION: The vulnerabilities to individuals from a data compromise are well known and documented now as “identity theft.” These include regular stories published in the news and research journals documenting the loss of personally identifiable information by corporations and governments around the world. Current trends in social media and commerce, with voluntary disclosure of personal information, create other potential vulnerabilities for individuals participating heavily in the digital world. The Netflix Challenge in 2009 was launched with the goal of creating better customer pick prediction algorithms for the movie service [1]. An unintended consequence of the Netflix Challenge was the discovery that it was possible to de-anonymize the entire contest data set with very little additional data. This de-anonymization led to a federal lawsuit and the cancellation of the sequel challenge [2]. The purpose of this topic is to understand the national level vulnerabilities that may be exploited through the use of public data available in the open or for purchase.
Could a modestly funded group deliver nation-state type effects using only public data?…”
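The mechanism behind the Netflix de-anonymization can be sketched in a few lines. This toy example (invented records, not DARPA's tooling or the actual Netflix method) counts how many records share each combination of "quasi-identifier" values; any group of size 1 is uniquely re-identifiable by an adversary who knows just those few attributes.

```python
# Toy re-identification risk check: group records by their quasi-identifier
# values and flag combinations that single out exactly one record.
from collections import Counter

def group_sizes(records, quasi_ids):
    """Count records sharing each tuple of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_ids) for r in records)

# Invented example data: ratings stripped of names but not of zip/age.
people = [
    {"zip": "98101", "age": 34, "movie": "A"},
    {"zip": "98101", "age": 34, "movie": "B"},
    {"zip": "98102", "age": 51, "movie": "C"},
]
sizes = group_sizes(people, ["zip", "age"])
unique = [q for q, n in sizes.items() if n == 1]
print(unique)  # [('98102', 51)]: this (zip, age) pair pinpoints one person
```

Scaled up, this is why "very little additional data" sufficed in the Netflix case: with enough attributes, most real-world records fall into groups of size 1.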
The official link for this solicitation is: www.acq.osd.mil/osbp/sbir/solicitations/sbir20133.
Smartphones As Weather Surveillance Systems
Tom Simonite in MIT Technology Review: “You probably never think about the temperature of your smartphone’s battery, but it turns out to provide an interesting method for tracking outdoor air temperature. It’s a discovery that adds to other evidence that mobile apps could provide a new way to measure what’s happening in the atmosphere and improve weather forecasting.
Startup OpenSignal, whose app crowdsources data on cellphone reception, first noticed in 2012 that changes in battery temperature correlated with those outdoors. On Tuesday, the company published a scientific paper on that technique in a geophysics journal and announced that it will be used to interpret data from a weather crowdsourcing app. OpenSignal originally started collecting data on battery temperatures to try to understand the connections between signal strength and how quickly a device chews through its battery.
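A back-of-the-envelope sketch of the idea (all numbers invented; OpenSignal's published approach is a physical heat-transfer model, not this simple fit): if battery temperature tracks air temperature roughly linearly, a linear map fitted against known station readings can turn the mean battery temperature reported by many phones into an air-temperature estimate.

```python
# Toy calibration: ordinary least squares from mean battery temperature
# (across many phones) to station-measured air temperature.
def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Invented paired observations: mean battery temp vs. station air temp (degrees C).
battery = [28.0, 30.0, 33.0, 36.0]
air = [10.0, 14.0, 20.0, 26.0]
a, b = fit_linear(battery, air)
print(round(a * 31.0 + b, 1))  # 16.0: estimated air temp when phones average 31 C
```

Averaging over many phones is what makes this workable: individual handsets are heated by charging, pockets, and CPU load, but those effects wash out across a crowd.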
OpenSignal’s crowdsourced weather-tracking effort joins another accidentally enabled by smartphones. A project called PressureNET collects air pressure data by taking advantage of the fact that many Android phones have a barometer inside to aid their GPS function (see “App Feeds Scientists Atmospheric Data From Thousands of Smartphones”). Cliff Mass, an atmospheric scientist at the University of Washington, is working to incorporate PressureNET data into weather models that usually rely on data from weather stations. He believes that smartphones could provide valuable data from places where there are no weather stations, if enough people start sharing data using apps like PressureNET.
Other research suggests that logging changes in cell network signal strength perceived by smartphones could provide yet more weather data. In February researchers in the Netherlands produced detailed maps of rainfall compiled by monitoring fluctuations in the signal strength measured by cellular network masts, caused by water droplets in the atmosphere.”
Hackers Called Into Civic Duty
Wall Street Journal: “Cash-strapped cities are turning to an unusual source to improve their online services on the cheap: helpful hackers, who use city data to create tools tracking everything from real-time subway delays to where to get a free flu shot near your home and information about a contentious school-closing plan.
Hackers have been popularly portrayed as giving fits to national-security officials and credit-card companies, but the term also refers to people who like to write their own computer programs and help solve a variety of problems. Recently, hackers have begun working with cities to find ways of building applications, or apps, that make use of data—which gets stripped of personally identifiable information—that municipalities are collecting anyway in the regular course of governance….Last year, Chicago Mayor Rahm Emanuel signed an executive order mandating the city make available all data not protected by privacy laws. Today, the city has nearly 950 data sets publicly available, the most of any U.S. city, according to Code for America, a nonprofit that promotes openness in government.”
Guidelines for Open Data Policies
“The Sunlight Foundation created this living document to present a broad vision of the kinds of challenges that open data policies can actively address.
A few general notes: Although some provisions may carry more importance or heft than others, these Guidelines are not ranked in order of priority, but organized to help define What Data Should be Public, How to Make Data Public, and How to Implement Policy — three key elements of any legislation, executive order, or other policy seeking to include language about open data. Further, it’s worth repeating that these provisions are only a guide. As such, they do not address every question one should consider in preparing a policy. Instead, these provisions attempt to answer the specific question: What can or should an open data policy do?”
Data is Inert — It’s What You Do With It That Counts
Kevin Merritt, CEO and Founder, Socrata, in NextGov: “In its infancy, the open data movement was mostly about offering catalogs of government data online that concerned citizens and civic activists could download. But now, a wide variety of external stakeholders are using open data to deliver new applications and services. At the same time, governments themselves are harnessing open data to drive better decision-making.
In a relatively short period of time, open data has evolved from serving as fodder for data publishing to fuel for open innovation.
One of the keys to making this transformation truly work, however, is our ability to re-instrument or re-tool underlying business systems and processes so managers can receive open data in consumable forms on a regular, continuous basis in real-time….”