“Data on the Web” Best Practices


W3C First Public Working Draft: “…The best practices described below have been developed to encourage and enable the continued expansion of the Web as a medium for the exchange of data. The growth of open data by governments across the world [OKFN-INDEX], the increasing publication of research data encouraged by organizations like the Research Data Alliance [RDA], the harvesting and analysis of social media, crowd-sourcing of information, the provision of important cultural heritage collections such as at the Bibliothèque nationale de France [BNF] and the sustained growth in the Linked Open Data Cloud [LODC], provide some examples of this phenomenon.

In broad terms, data publishers aim to share data either openly or with controlled access. Data consumers (who may also be producers themselves) want to be able to find and use data, especially if it is accurate, regularly updated and guaranteed to be available at all times. This creates a fundamental need for a common understanding between data publishers and data consumers. Without this agreement, data publishers’ efforts may be incompatible with data consumers’ desires.

Publishing data on the Web creates new challenges, such as how to represent, describe and make data available in a way that makes it easy to find and to understand. In this context, it becomes crucial to provide guidance to publishers that will improve consistency in the way data is managed, thus promoting the re-use of data and fostering trust in the data among developers, whatever technology they choose to use, increasing the potential for genuine innovation.

This document sets out a series of best practices that will help publishers and consumers face the new challenges and opportunities posed by data on the Web.

Best practices cover different aspects related to data publishing and consumption, like data formats, data access, data identification and metadata. In order to delimit the scope and elicit the required features for Data on the Web Best Practices, the DWBP working group compiled a set of use cases [UCR] that represent scenarios of how data is commonly published on the Web and how it is used. The set of requirements derived from these use cases was used to guide the development of the best practices.

The Best Practices proposed in this document are intended to serve a more general purpose than the practices suggested in Best Practices for Publishing Linked Data [LD-BP], since this document is domain-independent and, whilst it recommends the use of Linked Data, it also promotes best practices for data on the Web in formats such as CSV and JSON. The Best Practices related to the use of vocabularies incorporate practices that stem from Best Practices for Publishing Linked Data where appropriate….(More)
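To make the draft's emphasis on formats and metadata concrete, the sketch below shows one way a publisher might pair a small CSV dataset with a separate JSON metadata record describing its title, publisher, license, format and update frequency. It is a minimal illustration only: the file names, field names and metadata keys are assumptions loosely inspired by DCAT-style vocabularies, not requirements from the W3C draft.

```python
import csv
import json
from datetime import date

# Hypothetical sample data; in practice this would be the publisher's real dataset.
rows = [
    {"station": "S-01", "visitors": 1520, "date": "2015-03-01"},
    {"station": "S-02", "visitors": 980, "date": "2015-03-01"},
]

with open("visitors.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["station", "visitors", "date"])
    writer.writeheader()
    writer.writerows(rows)

# Basic descriptive metadata of the kind the best practices call for:
# what the data is, who publishes it, under what license, and how often it changes.
metadata = {
    "title": "Daily visitor counts (sample)",
    "publisher": "Example City Open Data",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "format": "text/csv",
    "distribution": "visitors.csv",
    "modified": date.today().isoformat(),
    "updateFrequency": "daily",
}

with open("visitors.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Publishing the metadata file alongside the CSV keeps the data usable in everyday tools while still giving consumers the context the best practices ask for.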

States Use Big Data to Nab Tax Fraudsters


at Governing: “It’s tax season again. For most of us, that means undergoing the laborious and thankless task of assembling financial records and calculating taxes for state and federal returns. But for a small group of us, tax season is profit season. It’s the time of year when fraudsters busy themselves with stealing identities and electronically submitting fraudulent tax returns for refunds.
Nobody knows for sure just how much tax return fraud is committed, but the amount is rising fast. According to the U.S. Treasury, the number of identified fraudulent federal returns has increased by 40 percent from 2011 to 2012, an increase of more than $4 billion. Ten years ago, New York state stopped refunds on 50,000 fraudulently filed tax returns. Last year, the number of stopped refunds was 250,000, according to Nonie Manion, executive deputy commissioner for the state’s Department of Taxation and Finance….
To combat the problem, state revenue and tax agencies are using software programs to sift through mounds of data and detect patterns that would indicate when a return is not valid. Just about every state with a tax fraud detection program already compares tax return data with information from other state agencies and private firms to spot incorrect mailing addresses and stolen identities. Because so many returns are filed electronically, fraud-spotting systems look for suspicious Internet protocol (IP) addresses. For example, tax auditors in New York noticed that similar IP addresses in Fort Lauderdale, Fla., were submitting a series of returns for refunds. When the state couldn’t match the returns with any employer data, they were flagged for further scrutiny and ultimately found to be fraudulent.
High-tech analytics is one way states keep up in the war on fraud. Another is accurate data. The third component is well-trained staff. But it takes time and money to put together the technology and the expertise to combat the growing sophistication of fraudsters….(More)”
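The kind of pattern-matching described above — clusters of refund claims submitted from similar IP addresses that cannot be matched to employer data — can be sketched in a few lines of Python. Everything in the snippet (record layout, sample values, cluster threshold) is invented for illustration and does not describe any state's actual system.

```python
from collections import defaultdict

# Hypothetical e-filed returns: (return_id, submitting_ip, filer_ssn).
returns = [
    ("r1", "203.0.113.7", "111-11-1111"),
    ("r2", "203.0.113.7", "222-22-2222"),
    ("r3", "203.0.113.7", "333-33-3333"),
    ("r4", "198.51.100.4", "444-44-4444"),
]

# Filers that can be matched to employer wage reports (illustrative stand-in).
employer_reported = {"444-44-4444"}

# Group returns by the IP address that submitted them.
by_ip = defaultdict(list)
for return_id, ip, ssn in returns:
    by_ip[ip].append((return_id, ssn))

CLUSTER_THRESHOLD = 3  # assumed cutoff for a "suspicious" volume from one address
for ip, filings in by_ip.items():
    unmatched = [rid for rid, ssn in filings if ssn not in employer_reported]
    if len(filings) >= CLUSTER_THRESHOLD and unmatched:
        print(f"Flag for review: {ip} -> returns {unmatched}")
```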

Evaluating Complex Social Initiatives


Srik Gopal at Stanford Social Innovation Review: “…the California Endowment (TCE) … and … the Grand Rapids Community Foundation (GRCF) … are just two funders who are actively “shifting the lens” in how they look at evaluation in the light of complexity. They are building on the recognition that a traditional approach to evaluation—assessing specific effects of a defined program according to a set of pre-determined outcomes, often in a way that connects those outcomes back to the initiative—is increasingly falling short. There is a clear message emerging that evaluation needs to accommodate complexity, not “assume it away.”

My colleagues at FSG and I have, based on our work with TCE, GRCF, and numerous other clients, articulated in a recent practice brief a set of nine “propositions” that help guide how we conceptualize, design, and implement evaluations of complex initiatives. We derived these propositions from what we now know (based on the emerging field of complexity science) to be distinctive characteristics of complex systems. We know, for example, that complex systems are always changing, often in unpredictable ways; they are never static. Hence, we need to design evaluations so that they are adaptive, flexible, and iterative, not rigid and cookie-cutter.

Below are three of the propositions in more detail, along with tools and methods that can help apply the proposition in practice.

It is important to note that many of the traditional tools and methods that form the backbone of sound evaluations—such as interviews, focus groups, and surveys—are still relevant. We would, however, suggest that organizations adapt those methods to reflect a complexity orientation. For example, interviews should explore the role of context; we should not confine them to the initiative’s boundaries. Focus groups should seek to understand local adaptation, not just adherence. And surveys should probe for relationships and interdependencies, not just perceived outcomes. In addition to traditional methods, we suggest incorporating newer, innovative techniques that provide a richer narrative, including:

  • Systems mapping—an iterative, often participatory process of graphically representing a system, including its components and connections
  • Appreciative inquiry—a group process that inquires into, identifies, and further develops the best of “what is” in organizations
  • Design thinking—a user-centered approach to developing new solutions to abstract, ill-defined, or complex problems… (More)”

New million dollar fund for participatory budgeting in South Australia


Medha Basu at Future Gov: “A new programme in South Australia is allowing citizens to determine which community projects should get funding.

The Fund My Community programme has a pool of AU$1 million (US$782,130) to fund projects by non-profit organisations aimed at supporting disadvantaged South Australians.

Organisations can nominate their projects for funding from this pool and anyone in the state can vote for the projects on the YourSAy web site.

All information about the projects submitted by the organisations will be available online to make the process transparent. “We hope that by providing the community with the right information about grant applications, people will support projects that will have the biggest impact in addressing disadvantage across South Australia,” the Fund My Community web site says.

The window to nominate community projects for funding is open until 2 April. Eligible applications will be opened for community assessment from 23 April to 4 May. The outcome will be announced and grants will be given out in June. See the full timeline here:

[Image: Fund my Community South Australia timeline]

There is a catch here though. The projects that receive the most support from the community are suggested for funding, but due to “a legal requirement” the final decision and grant approval come from the Board of the Charitable and Social Welfare Fund, according to the YourSAy web site….(More)”

New research project to map the impact of open budget data


Jonathan Gray at Open Knowledge: “…a new research project to examine the impact of open budget data, undertaken as a collaboration between Open Knowledge and the Digital Methods Initiative at the University of Amsterdam, supported by the Global Initiative for Financial Transparency (GIFT).

The project will include an empirical mapping of who is active around open budget data around the world, and what the main issues, opportunities and challenges are according to different actors. On the basis of this mapping it will provide a review of the various definitions and conceptions of open budget data, arguments for why it matters, best practices for publication and engagement, as well as applications and outcomes in different countries around the world.

As well as drawing on Open Knowledge’s extensive experience and expertise around open budget data (through projects such as Open Spending), it will utilise innovative tools and methods developed at the University of Amsterdam to harness evidence from the web, social media and collections of documents to inform and enrich our analysis.

As part of this project we’re launching a collaborative bibliography of existing research and literature on open budget data and associated topics which we hope will become a useful resource for other organisations, advocates, policy-makers, and researchers working in this area. If you have suggestions for items to add, please do get in touch.

This project follows on from other research projects we’ve conducted around this area – including on data standards for fiscal transparency, on technology for transparent and accountable public finance, and on mapping the open spending community….(More)”

CrowdFlower Launches Open Data Project


Anthony Ha at Techcrunch: “Crowdsourcing company CrowdFlower allows businesses to tap into a distributed workforce of 5 million contributors for basic tasks like sentiment analysis. Today it’s releasing some of that data to the public through its new Data for Everyone initiative…. the hope is to turn CrowdFlower into a central repository where open data can be found by researchers and entrepreneurs. (Factual was another startup trying to become a hub for open data, though in recent years, it’s become more focused on gathering location data to power mobile ads.)…

As for the data that’s available now, …There’s a lot of Twitter sentiment analysis covering things like attitudes towards brands and products, yogurt (?), and climate change. Among the more recent data sets, I was particularly taken with the gender breakdown of who’s been on the cover of Time magazine and, yes, the analysis of who thought the dress (you know the one) was gold and white versus blue and black…. (More)”
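For readers wondering what working with one of these releases might look like, here is a small, hypothetical sketch that tallies sentiment labels by brand from a downloaded CSV. The file name and column names are assumptions; the actual Data for Everyone files may be laid out differently.

```python
import csv
from collections import Counter

sentiment_by_brand = Counter()

# Assumed columns: "brand" and "sentiment" (e.g. positive / neutral / negative).
with open("brand_sentiment.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        sentiment_by_brand[(row["brand"], row["sentiment"])] += 1

for (brand, sentiment), count in sentiment_by_brand.most_common(10):
    print(f"{brand:20s} {sentiment:10s} {count}")
```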

Crowdsourcing America’s cybersecurity is an idea so crazy it might just work


at the Washington Post: “One idea that’s starting to bubble up from Silicon Valley is the concept of crowdsourcing cybersecurity. As Silicon Valley venture capitalist Robert R. Ackerman, Jr. has pointed out, due to “the interconnectedness of our society in cyberspace,” cyber networks are best viewed as an asset that we all have a shared responsibility to protect. Push on that concept hard enough and you can see how many of the core ideas from Silicon Valley – crowdsourcing, open source software, social networking, and the creative commons – can all be applied to cybersecurity.

Silicon Valley venture capitalists are already starting to fund companies that describe themselves as crowdsourcing cybersecurity. For example, take Synack, a “crowd security intelligence” company that received $7.5 million in funding from Kleiner Perkins (one of Silicon Valley’s heavyweight venture capital firms), Allegis Ventures, and Google Ventures in 2014. Synack’s two founders are ex-NSA employees, and they are using that experience to inform an entirely new type of business model. Synack recruits and vets a global network of “white hat hackers,” and then offers their services to companies worried about their cyber networks. For a fee, these hackers are able to find and repair any security risks.

So how would crowdsourced national cybersecurity work in practice?

For one, there would be free and transparent sharing, between the government and the private sector, of computer code used to detect cyber threats. In December, the U.S. Army Research Lab added a bit of free source code, a “network forensic analysis framework” known as Dshell, to the mega-popular code sharing site GitHub. Already, there have been 100 downloads and more than 2,000 unique visitors. The goal, says William Glodek of the U.S. Army Research Laboratory, is for this shared code to “help facilitate the transition of knowledge and understanding to our partners in academia and industry who face the same problems.”
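Dshell itself is a Python framework of pluggable decoders, so the snippet below is not Dshell code; it is only a minimal, hypothetical sketch of the kind of detection logic such open repositories make easy to exchange — checking a local connection log against a community-shared list of known-bad hosts.

```python
import csv

# Hypothetical inputs: a shared indicator list (one IP per line) and a local
# connection log with columns: timestamp, src_ip, dst_ip.
with open("shared_indicators.txt") as f:
    bad_ips = {line.strip() for line in f if line.strip()}

with open("connections.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["dst_ip"] in bad_ips:
            print(f"{row['timestamp']}: {row['src_ip']} contacted known-bad host {row['dst_ip']}")
```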

This open sourcing of cyber defense would be enhanced with a scaled-up program of recruiting “white hat hackers” to become officially part of the government’s cybersecurity efforts. Popular annual events such as the DEF CON hacking conference could be used to recruit talented cyber sleuths to work alongside the government.

There have already been examples of communities where people facing a common cyber threat gather together to share intelligence. Perhaps the best-known example is the Conficker Working Group, a security coalition that was formed in late 2008 to share intelligence about malicious Conficker malware. Another example is the Financial Services Information Sharing and Analysis Center, which was created by presidential mandate in 1998 to share intelligence about cyber threats to the nation’s financial system.

Of course, there are some drawbacks to this crowdsourcing idea. For one, such a collaborative approach to cybersecurity might open the door to government cyber defenses being infiltrated by the enemy. Ackerman makes the point that you never really know who’s contributing to any community. Even on a site such as GitHub, it’s theoretically possible that an ISIS hacker or someone like Edward Snowden could download the code, reverse-engineer it, and then use it to insert “Trojan Horses” intended for military targets into the code…. (More)

If Data Sharing is the Answer, What is the Question?


Christine L. Borgman at ERCIM News: “Data sharing has become policy enforced by governments, funding agencies, journals, and other stakeholders. Arguments in favor include leveraging investments in research, reducing the need to collect new data, addressing new research questions by reusing or combining extant data, and reproducing research, which would lead to greater accountability, transparency, and less fraud. Arguments against data sharing rarely are expressed in public fora, so popular is the idea. Much of the scholarship on data practices attempts to understand the socio-technical barriers to sharing, with goals to design infrastructures, policies, and cultural interventions that will overcome these barriers.
However, data sharing and reuse are common practice in only a few fields. Astronomy and genomics in the sciences, survey research in the social sciences, and archaeology in the humanities are the typical exemplars, which remain the exceptions rather than the rule. The lack of success of data sharing policies, despite accelerating enforcement over the last decade, indicates the need not just for a much deeper understanding of the roles of data in contemporary science but also for developing new models of scientific practice. Science progressed for centuries without data sharing policies. Why is data sharing deemed so important to scientific progress now? How might scientific practice be different if these policies were in place several generations ago?
Enthusiasm for “big data” and for data sharing are obscuring the complexity of data in scholarship and the challenges for stewardship. Data practices are local, varying from field to field, individual to individual, and country to country. Studying data is a means to observe how rapidly the landscape of scholarly work in the sciences, social sciences, and the humanities is changing. Inside the black box of data is a plethora of research, technology, and policy issues. Data are best understood as representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship. Rarely do they stand alone, separable from software, protocols, lab and field conditions, and other context. The lack of agreement on what constitutes data underlies the difficulties in sharing, releasing, or reusing research data.
Concerns for data sharing and open access raise broader questions about what data to keep, what to share, when, how, and with whom. Open data is sometimes viewed simply as releasing data without payment of fees. In research contexts, open data may pose complex issues of licensing, ownership, responsibility, standards, interoperability, and legal harmonization. To scholars, data can be assets, liabilities, or both. Data have utilitarian value as evidence, but they also serve social and symbolic purposes for control, barter, credit, and prestige. Incentives for scientific advancement often run counter to those for sharing data.
….
Rather than assume that data sharing is almost always a “good thing” and that doing so will promote the progress of science, more critical questions should be asked: What are the data? What is the utility of sharing or releasing data, and to whom? Who invests the resources in releasing those data and in making them useful to others? When, how, why, and how often are those data reused? Who benefits from what kinds of data transfer, when, and how? What resources must potential re-users invest in discovering, interpreting, processing, and analyzing data to make them reusable? Which data are most important to release, when, by what criteria, to whom, and why? What investments must be made in knowledge infrastructures, including people, institutions, technologies, and repositories, to sustain access to data that are released? Who will make those investments, and for whose benefit?
Only when these questions are addressed by scientists, scholars, data professionals, librarians, archivists, funding agencies, repositories, publishers, policy makers, and other stakeholders in research will satisfactory answers arise to the problems of data sharing…(More)”.

Breaking Public Administrations’ Data Silos. The Case of Open-DAI, and a Comparison between Open Data Platforms.


Paper by Raimondo Iemma, Federico Morando, and Michele Osella: “An open reuse of public data and tools can turn the government into a powerful ‘platform’ also involving external innovators. However, the typical information system of a public agency is not open by design. Several public administrations have started adopting technical solutions to overcome this issue, typically in the form of middleware layers operating as ‘buses’ between data centres and the outside world. Open-DAI is an open source platform designed to expose data as services, directly pulling from legacy databases of the data holder. The platform is the result of an ongoing project funded under the EU ICT PSP call 2011. We present the rationale and features of Open-DAI, also through a comparison with three other open data platforms: the Socrata Open Data portal, CKAN, and ENGAGE….(More)”
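As a rough illustration of the ‘bus’ idea — exposing a legacy database as a service rather than copying its contents into a portal — here is a minimal, hypothetical sketch using Python's standard sqlite3 module and the Flask microframework. The table name, schema and endpoint are invented; Open-DAI itself is a far more complete middleware stack.

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "legacy.db"  # stand-in for a public agency's legacy database

@app.route("/api/permits")
def list_permits():
    # Pull rows straight from the legacy store and serve them as JSON,
    # so external developers consume a service instead of a data dump.
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT id, holder, issued_on FROM permits").fetchall()
    conn.close()
    return jsonify(permits=[dict(row) for row in rows])

if __name__ == "__main__":
    app.run(port=8080)
```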

US government and private sector developing ‘precrime’ system to anticipate cyber-attacks


Martin Anderson at The Stack: “The USA’s Office of the Director of National Intelligence (ODNI) is soliciting the involvement of the private and academic sectors in developing a new ‘precrime’ computer system capable of predicting cyber-incursions before they happen, based on the processing of ‘massive data streams from diverse data sets’ – including social media and possibly deanonymised Bitcoin transactions….
At its core, the predictive technologies to be developed in association with the private sector and academia over 3-5 years are charged with the mission ‘to invest in high-risk/high-payoff research that has the potential to provide the U.S. with an overwhelming intelligence advantage over our future adversaries’.
The R&D program is intended to generate completely automated, human-free prediction systems for four categories of event: unauthorised access, Denial of Service (DoS), malicious code, and scans and probes seeking access to systems.
The CAUSE project is an unclassified program, and participating companies and organisations will not be granted access to NSA intercepts. The scope of the project, in any case, seems focused on the analysis of publicly available Big Data, including web searches, social media exchanges and trawling ungovernable avalanches of information in which clues to future maleficent actions are believed to be discernible.
Program manager Robert Rahmer says: “It is anticipated that teams will be multidisciplinary and might include computer scientists, data scientists, social and behavioral scientists, mathematicians, statisticians, content extraction experts, information theorists, and cyber-security subject matter experts having applied experience with cyber capabilities.”
Battelle, one of the concerns interested in participating in CAUSE, proposes employing Hadoop and Apache Spark as an approach to the data mountain, and includes in its preliminary proposal an intent to ‘de-anonymize Bitcoin sale/purchase activity to capture communication exchanges more accurately within threat-actor forums…’.
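Purely as an illustration of how Spark might be pointed at such a ‘data mountain’, the hypothetical PySpark snippet below counts forum posts that mention a watch-list of terms, grouped by forum. The input path, schema and terms are invented and are not drawn from Battelle’s proposal.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("forum-scan-sketch").getOrCreate()

# Assumed input: newline-delimited JSON posts with 'forum' and 'text' fields.
posts = spark.read.json("hdfs:///data/threat_forums/*.json")

# Invented watch-list; a real system would rely on far richer features.
pattern = "|".join(["zero-day", "exploit kit", "credential dump"])

flagged = (
    posts
    .filter(F.lower(F.col("text")).rlike(pattern))
    .groupBy("forum")
    .count()
    .orderBy(F.desc("count"))
)

flagged.show(20, truncate=False)
```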
Identifying and categorising quality signal in the ‘white noise’ of Big Data is a central plank in CAUSE, and IARPA (the Intelligence Advanced Research Projects Activity) maintains several offices to deal with different aspects of it. Its pointedly-named ‘Office for Anticipating Surprise’ frames the CAUSE project best, since it initiated it. The OAS is occupied with ‘Detecting and forecasting the emergence of new technical capabilities’, ‘Early warning of social and economic crises, disease outbreaks, insider threats, and cyber attacks’ and ‘Probabilistic forecasts of major geopolitical trends and rare events’.
Another concerned department is the Office of Incisive Analysis, which is attempting to break down the ‘data static’ problem into manageable mission stages:
1) Large data volumes and varieties – “Providing powerful new sources of information from massive, noisy data that currently overwhelm analysts”
2) Social-Cultural and Linguistic Factors – “Analyzing language and speech to produce insights into groups and organizations.”
3) Improving Analytic Processes – “Dramatic enhancements to the analytic process at the individual and group level.”
The Office of Smart Collection develops ‘new sensor and transmission technologies’, seeking ‘Innovative approaches to gain access to denied environments’ as part of its core mission, while the Office of Safe and Secure Operations concerns itself with ‘Revolutionary advances in science and engineering to solve problems intractable with today’s computers’.
The CAUSE program, which attracted 150 developers, organisations, academics and private companies to the initial event, will announce specific figures about funding later in the year, and practice ‘predictions’ from participants will begin in the summer, in an accelerating and stage-managed program over five years….(More)”