Christine L. Borgman at ERCIM News: “Data sharing has become policy enforced by governments, funding agencies, journals, and other stakeholders. Arguments in favor include leveraging investments in research, reducing the need to collect new data, addressing new research questions by reusing or combining extant data, and reproducing research, which would lead to greater accountability, transparency, and less fraud. Arguments against data sharing rarely are expressed in public fora, so popular is the idea. Much of the scholarship on data practices attempts to understand the socio-technical barriers to sharing, with goals to design infrastructures, policies, and cultural interventions that will overcome these barriers.
However, data sharing and reuse are common practice in only a few fields. Astronomy and genomics in the sciences, survey research in the social sciences, and archaeology in the humanities are the typical exemplars, which remain the exceptions rather than the rule. The lack of success of data sharing policies, despite accelerating enforcement over the last decade, indicates the need not just for a much deeper understanding of the roles of data in contemporary science but also for developing new models of scientific practice. Science progressed for centuries without data sharing policies. Why is data sharing deemed so important to scientific progress now? How might scientific practice be different if these policies were in place several generations ago?
Enthusiasm for “big data” and for data sharing are obscuring the complexity of data in scholarship and the challenges for stewardship. Data practices are local, varying from field to field, individual to individual, and country to country. Studying data is a means to observe how rapidly the landscape of scholarly work in the sciences, social sciences, and the humanities is changing. Inside the black box of data is a plethora of research, technology, and policy issues. Data are best understood as representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship. Rarely do they stand alone, separable from software, protocols, lab and field conditions, and other context. The lack of agreement on what constitutes data underlies the difficulties in sharing, releasing, or reusing research data.
Concerns for data sharing and open access raise broader questions about what data to keep, what to share, when, how, and with whom. Open data is sometimes viewed simply as releasing data without payment of fees. In research contexts, open data may pose complex issues of licensing, ownership, responsibility, standards, interoperability, and legal harmonization. To scholars, data can be assets, liabilities, or both. Data have utilitarian value as evidence, but they also serve social and symbolic purposes for control, barter, credit, and prestige. Incentives for scientific advancement often run counter to those for sharing data.
….
Rather than assume that data sharing is almost always a “good thing” and that doing so will promote the progress of science, more critical questions should be asked: What are the data? What is the utility of sharing or releasing data, and to whom? Who invests the resources in releasing those data and in making them useful to others? When, how, why, and how often are those data reused? Who benefits from what kinds of data transfer, when, and how? What resources must potential re-users invest in discovering, interpreting, processing, and analyzing data to make them reusable? Which data are most important to release, when, by what criteria, to whom, and why? What investments must be made in knowledge infrastructures, including people, institutions, technologies, and repositories, to sustain access to data that are released? Who will make those investments, and for whose benefit?
Only when these questions are addressed by scientists, scholars, data professionals, librarians, archivists, funding agencies, repositories, publishers, policy makers, and other stakeholders in research will satisfactory answers arise to the problems of data sharing…(More)”.
Breaking Public Administrations’ Data Silos. The Case of Open-DAI, and a Comparison between Open Data Platforms.
Paper by Raimondo Iemma, Federico Morando, and Michele Osella: “An open reuse of public data and tools can turn the government into a powerful ‘platform’ also involving external innovators. However, the typical information system of a public agency is not open by design. Several public administrations have started adopting technical solutions to overcome this issue, typically in the form of middleware layers operating as ‘buses’ between data centres and the outside world. Open-DAI is an open source platform designed to expose data as services, directly pulling from legacy databases of the data holder. The platform is the result of an ongoing project funded under the EU ICT PSP call 2011. We present the rationale and features of Open-DAI, also through a comparison with three other open data platforms: the Socrata Open Data portal, CKAN, and ENGAGE….(More)”
Open data could turn Europe’s digital desert into a digital rainforest
Joanna Roberts interviews Dirk Helbing, Professor of Computational Social Science at ETH Zurich, at Horizon: “…If we want to be competitive, Europe needs to find its own way. How can we differentiate ourselves and make things better? I believe Europe should not engage in the locked data strategy that we see in all these huge IT giants. Instead, Europe should engage in open data, open innovation, and value-sensitive design, particularly approaches that support informational self-determination. So everyone can use this data, generate new kinds of data, and build applications on top. This is going to create ever more possibilities for everyone else, so in a sense that will turn a digital desert into a digital rainforest full of opportunities for everyone, with a rich information ecosystem.”…
The Internet of Things is the next big emerging information communication technology. It’s based on sensors. In smartphones there are about 15 sensors; for light, for noise, for location, for all sorts of things. You could also buy additional external sensors for humidity, for chemical substances and almost anything that comes to your mind. So basically this allows us to measure the environment and all the features of our physical, biological, economic, social and technological environment.
‘Imagine if there was one company in the world controlling all the sensors and collecting all the information. I think that might potentially be a dystopian surveillance nightmare, because you couldn’t take a single step or speak a single word without it being recorded. Therefore, if we want the Internet of Things to be consistent with a stable democracy then I believe we need to run it as a citizen web, which means to create and manage the planetary nervous system together. The citizens themselves would buy the sensors and activate them or not, would decide themselves what sensor data they would share with whom and for what purpose, so informational self-determination would be at the heart, and everyone would be in control of their own data.’….
A lot of exciting things will become possible. We would have a real-time picture of the world and we could use this data to be more aware of what the implications of our decisions and actions are. We could avoid mistakes and discover opportunities we would otherwise have missed. We will also be able to measure what’s going on in our society and economy and why. In this way, we will eventually identify the hidden forces that determine the success or failure of a company, of our economy or even our society….(More)”
Making emotive games from open data
Katie Collins at WIRED: “Microsoft researcher Kati London’s aim is “to try to get people to think of data in terms of personalities, relationships and emotions”, she tells the audience at the Story Festival in London. Through Project Sentient Data, she uses her background in games development to create fun but meaningful experiences that bridge online interactions and things that are happening in the real world.
One such experience invited children to play against the real-time flow of London traffic through an online game called the Code of Everand. The aim was to test the road safety knowledge of 9- to 11-year-olds and “make alertness something that kids valued”.
The game’s core mechanic was a normal world populated by little people, containing spirit channels that only kids could see and go through. Within these spirit channels, the lorries and cars of the streets became monsters. The children had to assess what kind of dangers the monsters posed and use their tools to dispel them.
“Games are great ways to blur and observe the ways people interact with real-world data,” says London.
In one of her earlier projects back in 2005, London used her knowledge of horticulture to bring artificial intelligence to plants. “Almost every workspace I go into has a half-dead plant in it, so we gave plants the ability to tell us what they need.” It was, she says, an exercise in “humanising data” that led to further projects that saw her create self-aware street signs and a dynamic city map that expressed shame neighbourhood by neighbourhood, depending on the open dataset of public complaints in New York.
A further project turned complaint data into cartoons on Instagram every week. London praised the open data initiative in New York, but added that for people to access it, they had to know it existed and know where to find it. The cartoons were a “lightweight” form of “civic engagement” that helped to integrate hyperlocal issues into everyday conversation.
London also gamified community engagement through a project commissioned by the Knight Foundation called Macon Money….(More)”.
Data for good
Key Findings
- Citizens Advice (CAB) and DataKind partnered to develop the Civic Dashboard, a tool which mines data from CAB consultations to understand emerging social issues in the UK.
- Shooting Star Chase volunteers refined the referral system by which children come to its hospices, streamlining referral paths and saving up to £90,000 for children’s hospices around the country.
- In a study of open grant funding data, NCVO identified 33,000 ‘below the radar’ organisations not currently listed in registers and databases on the third sector.
- In their social media analysis of tweets related to the Somerset Floods, Demos found that 39,000 tweets were related to social action.
New ways of capturing, sharing and analysing data have the potential to transform how community and voluntary sector organisations work and how social action happens. However, while analysing and using data is core to how some of the world’s fastest-growing businesses understand their customers and develop new products and services, civil society organisations are still some way off from making the most of this potential.
Over the last 12 months Nesta has grant funded a number of research projects that explore two dimensions of how big and open data can be used for the common good. Firstly, how it can be used by charities to develop better products and services and secondly, how it can help those interested in civil society better understand social action and civil society activity.
- Citizens Advice Bureau (CAB) and DataKind, a global community of data scientists interested in how data can be used for a social purpose, were grant funded to explore how a data-driven approach to mining the rich data that CAB holds on social issues in the UK could be used to develop a real-time dashboard to identify emerging social issues. The project also explored how data-driven methods could better help other charities such as St Mungo’s and Buttle UK, and how data could be shared more effectively between charities as part of this process, to create collaborative data-driven projects.
- Five organisations (The RSA, Cardiff University, The Demos Centre for Analysis of Social Media, NCVO and European Alternatives) were grant funded to explore how data-driven methods, such as open data analysis and social media analysis, can help us understand informal social action, often referred to as ‘below the radar’ activity, in new ways.
This paper is not the definitive story of the opportunities in using big and open data for the common good, but it can hopefully provide insight on what can be done and lessons for others interested in exploring the opportunities in these methods….(More).”
Unleashing the Power of Data to Serve the American People
“Memorandum: Unleashing the Power of Data to Serve the American People
To: The American People
From: Dr. DJ Patil, Deputy U.S. CTO for Data Policy and Chief Data Scientist
….While there is a rich history of companies using data to their competitive advantage, the disproportionate beneficiaries of big data and data science have been Internet technologies like social media, search, and e-commerce. Yet transformative uses of data in other spheres are just around the corner. Precision medicine and other forms of smarter health care delivery, individualized education, and the “Internet of Things” (which refers to devices like cars or thermostats communicating with each other using embedded sensors linked through wired and wireless networks) are just a few of the ways in which innovative data science applications will transform our future.
The Obama administration has embraced the use of data to improve the operation of the U.S. government and the interactions that people have with it. On May 9, 2013, President Obama signed Executive Order 13642, which made open and machine-readable data the new default for government information. Over the past few years, the Administration has launched a number of Open Data Initiatives aimed at scaling up open data efforts across the government, helping make troves of valuable data — data that taxpayers have already paid for — easily accessible to anyone. In fact, I used data made available by the National Oceanic and Atmospheric Administration to improve numerical methods of weather forecasting as part of my doctoral work. So I know firsthand just how valuable this data can be — it helped get me through school!
Given the substantial benefits that responsibly and creatively deployed data can provide to us and our nation, it is essential that we work together to push the frontiers of data science. Given the importance this Administration has placed on data, along with the momentum that has been created, now is a unique time to establish a legacy of data supporting the public good. That is why, after a long time in the private sector, I am returning to the federal government as the Deputy Chief Technology Officer for Data Policy and Chief Data Scientist.
Organizations are increasingly realizing that in order to maximize their benefit from data, they require dedicated leadership with the relevant skills. Many corporations, local governments, federal agencies, and others have already created such a role, which is usually called the Chief Data Officer (CDO) or the Chief Data Scientist (CDS). The role of an organization’s CDO or CDS is to help their organization acquire, process, and leverage data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.
The Role of the First-Ever U.S. Chief Data Scientist
Similarly, my role as the U.S. CDS will be to responsibly source, process, and leverage data in a timely fashion to enable transparency, provide security, and foster innovation for the benefit of the American public, in order to maximize the nation’s return on its investment in data.
So what specifically am I here to do? As I start, I plan to focus on these four activities:
…(More)”
Amid Open Data Push, Agencies Feel Urge for Analytics
NextGov: “Federal agencies, thanks to their unique missions, have long been collectors of valuable, vital and, no doubt, arcane data. Under a nearly two-year-old executive order from President Barack Obama, agencies are releasing more of this data in machine-readable formats to the public and entrepreneurs than ever before.
But agencies still need a little help parsing this data for their own purposes. They are turning to industry, academia and outside researchers for cutting-edge analytics tools to derive insights from their data and to use those insights to drive decision-making.
Take the U.S. Agency for International Development, for example. The agency administers U.S. foreign aid programs aimed at ending extreme poverty and helping support democratic societies around the globe.
Under the agency’s own recent open data policy, it’s started collecting reams of data from its overseas missions. Starting Oct. 1, organizations doing development work on the ground – including through grants and contracts – have been directed to also collect data generated by their work and submit it back to agency headquarters. Teams go through the data, scrub it to remove sensitive material and then publish it.
The data runs the gamut from information on land ownership in South Sudan to livestock demographics in Senegal and HIV prevention activities in Zambia….The agency took the first step in solving that problem with a Jan. 20 request for information from outside groups for cutting-edge data analytics tools.
“Operating units within USAID are sometimes constrained by existing capacity to transform data into insights that could inform development programming,” the RFI stated.
The RFI queries industry on their capabilities in data mining, social media analytics, forecasting, and systems modeling.
USAID is far from alone in its quest for data-driven decision-making.
A Jan. 26 RFI from the Transportation Department’s Federal Highway Administration also seeks innovative ideas from industry for “advanced analytical capabilities.”…(More)”
'From Atoms to Bits': A Visual History of American Ideas
Derek Thompson in The Atlantic: “A new paper employs a simple technique—counting words in patent texts—to trace the history of American invention, from chemistry to computers….in a new paper, Mikko Packalen at the University of Waterloo and Jay Bhattacharya of Stanford University, devised a brilliant way to address this question empirically. In short, they counted words in patent texts.
In a series of papers studying the history of American innovation, Packalen and Bhattacharya indexed every one-word, two-word, and three-word phrase that appeared in more than 4 million patent texts in the last 175 years. To focus their search on truly new concepts, they recorded the year those phrases first appeared in a patent. Finally, they ranked each concept’s popularity based on how many times it reappeared in later patents. Essentially, they trawled the billion-word literature of patents to document the birth-year and the lifespan of American concepts, from “plastic” to “world wide web” and “instant messaging.”
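The indexing procedure described above can be sketched in a few lines of Python: record the first year each one-, two-, or three-word phrase appears, then count how often it recurs in later patents. The sample patents and the whitespace tokenization below are illustrative stand-ins, not the authors’ actual data or method.

```python
from collections import defaultdict

def first_appearances(patents):
    """Record the first year each 1-, 2-, and 3-word phrase appears,
    and count how many later patents reuse it.

    `patents` is an iterable of (year, text) pairs, a toy stand-in
    for the 4 million patent texts in the actual study.
    """
    first_year = {}
    later_uses = defaultdict(int)
    for year, text in sorted(patents):          # process chronologically
        words = text.lower().split()
        phrases_in_patent = set()               # count once per patent
        for n in (1, 2, 3):
            for i in range(len(words) - n + 1):
                phrases_in_patent.add(" ".join(words[i:i + n]))
        for phrase in phrases_in_patent:
            if phrase not in first_year:
                first_year[phrase] = year       # the concept's "birth year"
            else:
                later_uses[phrase] += 1         # popularity in later patents
    return first_year, later_uses

patents = [
    (1990, "polymerase chain reaction apparatus"),
    (1995, "improved polymerase chain reaction method"),
]
first, uses = first_appearances(patents)
# first["polymerase chain reaction"] is 1990; it recurs in one later patent
```

Ranking `later_uses` then gives, for each decade of first appearance, the most popular concepts born in that decade.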
Here are the 20 most popular sequences of words in each decade from the 1840s to the 2000s. You can see polymerase chain reactions in the middle of the 1980s stack. Since the timeline, as it appears in the paper, is too wide to be visible on this article page, I’ve chopped it up and inserted the color code both above and below the timeline….
Another theme of Packalen and Bhattacharya’s research is that innovation has become more collaborative. Indeed, computers have not only taken over the world of inventions but have also changed the geography of innovation, Bhattacharya said. Larger cities have historically held an innovative advantage, because (the theory goes) their density of smarties speeds up debate on the merits of new ideas, which are often born raw and poorly understood. But the researchers found that in the last few decades, larger cities are no more likely to produce new ideas in patents than smaller cities whose inventors can just as easily connect online with their co-authors. “Perhaps due to the Internet, the advantage of larger cities appears to be eroding,” Packalen wrote in an email….(More)”
Dataset Inventorying Tool
Waldo Jaquith at US Open Data: “Today we’re releasing Let Me Get That Data For You (LMGTDFY), a free, open source tool that quickly and automatically creates a machine-readable inventory of all the data files found on a given website.
When government agencies create an open data repository, they need to start by inventorying the data that the agency is already publishing on their website. This is a laborious process. It means searching their own site with a query like this:
site:example.gov filetype:csv OR filetype:xls OR filetype:json
Then they have to read through all of the results, download all of the files, and create a spreadsheet that they can load into their repository. It’s a lot of work, and as a result it too often goes undone, resulting in a data repository that doesn’t actually contain all of that government’s data.
Realizing that this was a common problem, we hired Silicon Valley Software Group to create a tool to automate the inventorying process. We worked with Dan Schultz and Ted Han, who created a system built on Django and Celery, using Microsoft’s great Bing Search API as its data source. The result is a free, installable tool, which produces a CSV file that lists all CSV, XML, JSON, XLS, XLSX, and Shapefile files found on a given domain name.
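As a rough sketch of that inventorying step (not LMGTDFY’s actual code), the snippet below assumes the candidate URLs have already been gathered, for example via a search-engine API, and simply classifies them by file extension and writes the CSV inventory:

```python
import csv
from urllib.parse import urlparse

# The extensions the tool looks for; the sample URLs are invented.
DATA_EXTENSIONS = {".csv", ".xml", ".json", ".xls", ".xlsx", ".shp"}

def inventory(urls, out_path):
    """Write a CSV inventory of the data files found among `urls`."""
    rows = []
    for url in urls:
        path = urlparse(url).path.lower()
        ext = path[path.rfind("."):] if "." in path else ""
        if ext in DATA_EXTENSIONS:
            rows.append({"url": url, "format": ext.lstrip(".")})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "format"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

rows = inventory(
    ["https://example.gov/budget.csv",
     "https://example.gov/about.html",
     "https://example.gov/parcels.shp"],
    "inventory.csv",
)
# keeps budget.csv and parcels.shp; the HTML page is filtered out
```

The resulting CSV is exactly the machine-readable inventory an agency needs to seed its open data repository.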
We use this tool to power our new Let Me Get That Data For You website. We’re trying to keep our site within Bing’s free usage tier, so we’re limiting results to 300 datasets per site….(More)”
The Tricky Task of Rating Neighborhoods on 'Livability'
Tanvi Misra at CityLab: “Jokubas Neciunas was looking to buy an apartment in Vilnius, Lithuania, almost two years back. He consulted real estate platforms and government data to help him decide on the best option. In the process, he realized that there was a lot of information out there, but no one was really using it very well.
Fast-forward two years, and Neciunas and his colleagues have created PlaceILive.com—a start-up trying to leverage open data from cities and information from social media to create a holistic, accessible tool that measures the “livability” of any apartment or house in a city.
“Smart cities are the ones that have smart citizens,” says PlaceILive co-founder Sarunas Legeckas.
The team recognizes that foraging for relevant information in the trenches of open data might not be for everyone. So they tried to “spice it up” by creating a visually appealing, user-friendly portal for people looking for a new home to buy or rent. The creators hope PlaceILive becomes a one-stop platform where people find ratings on every quality-of-life metric important to them before their housing hunt begins.
In its beta form, the site features five cities—New York, Chicago, San Francisco, London and Berlin. Once you click on the New York portal, for instance, you can search for the place you want to know about by borough, zip code, or address. I pulled up Brooklyn….The index is calculated using a variety of public information sources (from transit agencies, police departments, and the Census, for instance) as well as other available data (from the likes of Google, Socrata, and Foursquare)….(More)”
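An index like this can be sketched, at its simplest, as a weighted average of per-metric scores. The metric names and weights below are invented for illustration; they are not PlaceILive’s actual formula, which the article does not describe.

```python
def livability_index(metrics, weights):
    """Combine per-metric scores (each on a 0-100 scale) into one index
    via a weighted average. Names and weights are illustrative only."""
    total_weight = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight

# Hypothetical scores for one address, from transit, police, and health data
scores = {"transport": 80, "safety": 60, "health": 70}
weights = {"transport": 2, "safety": 3, "health": 1}   # user priorities
index = livability_index(scores, weights)   # (80*2 + 60*3 + 70*1) / 6
```

The interesting design questions are upstream of this arithmetic: how to normalize raw counts (crimes, transit stops, check-ins) onto a common scale, and at what spatial granularity to aggregate them.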
