“The CIO Council Innovation Committee has released its first Open Data case study, The Data Disclosure Decision, showcasing the Department of Education (Education) Disclosure Review Board.
The Department of Education is a national warehouse for open data across a decentralized educational system, managing and exchanging education-related data from across the country. Education collects large amounts of aggregate data at the state, district, and school level, disaggregated by a number of demographic variables. A majority of the data Education collects is considered personally identifiable information (PII), making data disclosure avoidance plans a mandatory component of Education’s data releases. With its expansive data sets and a need to protect sensitive information, Education quickly realized the need to organize and standardize its data disclosure protocol.
Education formally established the Disclosure Review Board when Secretary of Education Arne Duncan signed its Charter in August 2013. Since its inception, the Disclosure Review Board has achieved substantial successes and has greatly increased the volume and quality of data being released. Education’s Disclosure Review Board is continually learning through its open data journey and improving its approach through cultural change and leadership buy-in.
Learn more about the story of Education’s Disclosure Review Board by reading The Data Disclosure Decision, where you will find the full account of its experience and what it learned along the way. Read The Data Disclosure Decision.”
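The case study does not spell out Education’s disclosure avoidance methods, but one of the most common steps in such plans is primary suppression of small cells, where aggregate counts below a minimum size are withheld before release. The sketch below is a minimal illustration of that idea in Python; the threshold, field names and sample records are hypothetical.

```python
# Minimal sketch of primary small-cell suppression, a common disclosure
# avoidance step: aggregate counts below a threshold are masked before
# release so that small groups of students cannot be re-identified.
# The threshold and record layout are illustrative assumptions,
# not Education's actual rules.

SUPPRESSION_THRESHOLD = 10  # hypothetical minimum reportable cell size

def suppress_small_cells(rows, count_field="student_count"):
    """Return a copy of aggregate rows with small counts masked."""
    released = []
    for row in rows:
        cleaned = dict(row)
        if cleaned[count_field] < SUPPRESSION_THRESHOLD:
            cleaned[count_field] = None  # published as a suppression symbol, e.g. "*"
        released.append(cleaned)
    return released

# Example: district-level counts disaggregated by subgroup.
table = [
    {"district": "A", "subgroup": "all", "student_count": 412},
    {"district": "A", "subgroup": "migrant", "student_count": 7},   # suppressed
    {"district": "B", "subgroup": "all", "student_count": 128},
]
print(suppress_small_cells(table))
```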
New portal to crowdsource captions, transcripts of old photos, national archives
Members of the public are invited to contribute to an upcoming portal that will carry some 3,000 unidentified photographs dating back to the late 1800s, and 3,000 pages of Straits Settlements records, including letters written during Sir Stamford Raffles’ administration of Singapore.
These are collections from the Government and individuals waiting to be “tagged” on the new portal – The Citizen Archivist Project at www.nas.gov.sg/citizenarchivist….
Without tagging – such as by photo captioning and digital transcription – these records cannot be searched. There are over 140,000 photos and about one million pages of Straits Settlements Records in total that cannot be searched today.
These records date back to the 1800s, and include letters written during Sir Stamford Raffles’ administration in Singapore.
“The key challenge is that they were written in elaborate cursive penmanship which is not machine-readable,” said Dr Yaacob, adding that the knowledge and wisdom of the public can be tapped on to make these documents more accessible.
Mr Arthur Fong (West Coast GRC) had asked how the Government could get young people interested in history, and Dr Yaacob said this initiative was something they would enjoy.
Portal users must first log in using their existing Facebook, Google or National Library Board accounts. Contributions will be saved in users’ profiles, automatically created upon signing in.
Transcript contributions on the portal work in a similar way to Wikipedia: contributed text is published on the portal immediately.
However, the National Archives will take up to three days to review photo caption contributions. Approved captions will be uploaded on its website at www.nas.gov.sg/archivesonline….(More)”
How Open Is University Data?
Daniel Castro at GovTech: “Many states now support open data, or data that’s made freely available without restriction in a nonproprietary, machine-readable format, to increase government transparency, improve public accountability and participation, and unlock opportunities for civic innovation. To date, 10 states have adopted open data policies, via executive order or legislation, and 24 states have built open data portals. But while many agencies have joined the open data movement, state colleges and universities have largely ignored this opportunity. To remedy this, policymakers should consider how to extend open data policies to state colleges and universities.
There are many potential benefits of open data for higher education. First, it can help prospective students and their parents better understand the value of different degree programs. One way to control rising higher ed costs is to create more informed consumers. The feds are already pushing for such changes. President Obama and Education Secretary Arne Duncan called for schools to make more information publicly available about the costs of obtaining a college degree, and the White House launched the College Scorecard, an online tool to compare data about the average tuition cost, size of loan payments and loan default rate for different schools.
But students deserve more detailed information. Prospective students should be able to decide where to attend and what to study based on historical data like program costs, percentage of students completing the program and how long they take to do so, and what kind of earning power they have after graduating.
Second, open data can aid better fiscal oversight and accountability of university operations. In 2014, states provided about $76 billion in support for higher ed, yet few colleges and universities have adopted open data policies to increase the transparency of their budgets. Contrast this with California cities like Oakland, Palo Alto and Los Angeles, which created online tools to let others explore and visualize their budgets. Additional oversight, including from the public, could help reduce fraud, waste and abuse in higher education, save taxpayers money and create more opportunities for public participation in state budgeting.
Third, open data can be a valuable resource for producing innovations that make universities a better place to work and study. Large campuses are basically small cities, and many cities have found open data useful for improving public safety and optimizing transportation services. Universities hold much untapped data: course catalogs, syllabi, bus schedules, campus menus, campus directories, faculty evaluations, etc. Creating portals to release these data sets and building application programming interfaces to access this information would give developers direct access to data that students, faculty, alumni and other stakeholders could use to build apps and services to improve the college experience….(More)”
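To make the suggestion of “building application programming interfaces” concrete, here is a minimal sketch of the kind of read-only endpoint a campus portal might expose over one of those data sets. It assumes Flask and an illustrative in-memory course catalog; the routes and fields are hypothetical, not an existing university API.

```python
# Minimal sketch of a read-only open-data API for a campus data set.
# Flask and the in-memory "catalog" are illustrative choices; a real
# portal would serve data from its systems of record and publish docs.
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical course-catalog records.
CATALOG = [
    {"course": "CS 101", "title": "Intro to Programming", "credits": 3},
    {"course": "HIST 210", "title": "Urban History", "credits": 4},
]

@app.route("/api/v1/courses")
def list_courses():
    """Return the full course catalog as machine-readable JSON."""
    return jsonify(CATALOG)

@app.route("/api/v1/courses/<course_id>")
def get_course(course_id):
    """Return a single course, matching on its catalog code."""
    for record in CATALOG:
        if record["course"].replace(" ", "").lower() == course_id.lower():
            return jsonify(record)
    return jsonify({"error": "course not found"}), 404

if __name__ == "__main__":
    app.run(debug=True)
```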
Tweets Can Predict Health Insurance Exchange Enrollment
PennMedicine: “An increase in Twitter sentiment (the positivity or negativity of tweets) is associated with an increase in state-level enrollment in the Affordable Care Act’s (ACA) health insurance marketplaces — a phenomenon that points to the use of the social media platform as a real-time gauge of public opinion and provides a way for marketplaces to quickly identify enrollment changes and emerging issues. Although Twitter has been previously used to measure public perception on a range of health topics, this study, led by researchers at the Perelman School of Medicine at the University of Pennsylvania and published online in the Journal of Medical Internet Research, is the first to examine its relationship with enrollment in the new national health insurance marketplaces.
The study examined 977,303 ACA and “Obamacare”-related tweets — along with those directed toward the Twitter handle for HealthCare.gov and the 17 state-based marketplace Twitter accounts — in March 2014, then tested the correlation of Twitter sentiment with marketplace enrollment by state. Tweet sentiment was determined using the National Research Council (NRC) sentiment lexicon, which contains more than 54,000 words with corresponding sentiment weights ranging from positive to negative. For example, the word “excellent” has a positive sentiment weight, and is more positive than the word “good,” while the word “awful” is negative. Using this lexicon, researchers found that a 0.10 increase in the sentiment of tweets was associated with a nine percent increase in health insurance marketplace enrollment at the state level. While a 0.10 increase may seem small, these numbers indicate a significant correlation between Twitter sentiment and enrollment, based on a continuum of sentiment scores examined across nearly a million tweets.
“The correlation between Twitter sentiment and the number of eligible individuals who enrolled in a marketplace plan highlights the potential for Twitter to be a real-time monitoring strategy for future enrollment periods,” said first author Charlene A. Wong, MD, a Robert Wood Johnson Foundation Clinical Scholar and Fellow in Penn’s Leonard Davis Institute of Health Economics. “This would be especially valuable for quickly identifying emerging issues and making adjustments, instead of having to wait weeks or months for that information to be released in enrollment reports, for example.”…(More)”
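The lexicon-based approach the study describes — each word carries a sentiment weight, and a tweet’s score is derived from the words it contains — can be sketched in a few lines. The tiny lexicon and weights below are made-up stand-ins for the roughly 54,000-word NRC lexicon, and the averaging scheme is one simple choice, not necessarily the study’s exact scoring method.

```python
# Minimal sketch of lexicon-based sentiment scoring of tweets.
# The lexicon below is a made-up stand-in for the NRC lexicon,
# which assigns each of ~54,000 words a sentiment weight.
import re

LEXICON = {          # hypothetical weights, positive > 0 > negative
    "excellent": 0.9,
    "good": 0.5,
    "awful": -0.8,
    "confusing": -0.4,
}

def tweet_sentiment(text):
    """Average the sentiment weights of the tweet's lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    weights = [LEXICON[w] for w in words if w in LEXICON]
    return sum(weights) / len(weights) if weights else 0.0

tweets = [
    "Signing up was excellent and the plans look good",
    "The enrollment site is awful and confusing",
]
state_score = sum(tweet_sentiment(t) for t in tweets) / len(tweets)
print(round(state_score, 3))  # average sentiment for this batch of tweets
```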
Encyclopedia of Social Network Analysis and Mining
“The Encyclopedia of Social Network Analysis and Mining (ESNAM) is the first major reference work to integrate fundamental concepts and research directions in the areas of social networks and applications to data mining. While ESNAM reflects the state of the art in social network research, the field had its start in the 1930s, when fundamental issues in social network research were broadly defined. The communities studied then were limited to relatively small numbers of nodes (actors) and links. More recently, the advent of electronic communication, and in particular of online communities, has created social networks of hitherto unimaginable size. People around the world are directly or indirectly connected by popular social networks established using web-based platforms rather than by physical proximity.
Reflecting the interdisciplinary nature of this unique field, the essential contributions of diverse disciplines, from computer science, mathematics, and statistics to sociology and behavioral science, are described among the 300 authoritative yet highly readable entries. Students will find a world of information and insight behind the familiar façade of the social networks in which they participate. Researchers and practitioners will benefit from a comprehensive perspective on the methodologies for analysis of constructed networks, and the data mining and machine learning techniques that have proved attractive for sophisticated knowledge discovery in complex applications. Also addressed is the application of social network methodologies to other domains, such as web networks and biological networks….(More)”
Philadelphia’s Newly Upgraded Open Data Portal
Michael Grass at Government Executive: “If you’re looking for streets where vending is prohibited in the city of Philadelphia, the city’s newly upgraded open data portal has that information. If you’re looking for information on reported bicycle thefts, the city’s open data portal has that information, too. Same goes for the city’s budget.
Philadelphia’s recently relaunched open data portal, OpenDataPhilly, has 264 data sets, applications and APIs available for the public to access and use. Much of that information comes from municipal sources.
“The redesign of OpenDataPhilly will increase access to available data, thereby enabling our citizens to become more engaged and knowledgeable and our government more accountable,” Mayor Michael Nutter said in a statement last month.
But Philadelphia’s open data portal isn’t just designed to unlock datasets at City Hall.
The city’s universities, cultural and non-profit organizations and commercial entities are part of the portal as well. Portal users interested in historic maps of the city can access the Philadelphia GeoHistory Network, a project of Philadelphia’s Athenaeum Museum, which maintains a tool where layers of historic maps can be overlaid on an interactive Google map.
You can even find a list of current happy hour specials, courtesy of DrinkPhilly….(More)”
“Data on the Web” Best Practices
W3C First Public Working Draft: “…The best practices described below have been developed to encourage and enable the continued expansion of the Web as a medium for the exchange of data. The growth of open data by governments across the world [OKFN-INDEX], the increasing publication of research data encouraged by organizations like the Research Data Alliance [RDA], the harvesting and analysis of social media, crowd-sourcing of information, the provision of important cultural heritage collections such as at the Bibliothèque nationale de France [BNF] and the sustained growth in the Linked Open Data Cloud [LODC], provide some examples of this phenomenon.
In broad terms, data publishers aim to share data either openly or with controlled access. Data consumers (who may also be producers themselves) want to be able to find and use data, especially if it is accurate, regularly updated and guaranteed to be available at all times. This creates a fundamental need for a common understanding between data publishers and data consumers. Without this agreement, data publishers’ efforts may be incompatible with data consumers’ desires.
Publishing data on the Web creates new challenges, such as how to represent, describe and make data available in a way that makes it easy to find and to understand. In this context, it becomes crucial to provide guidance to publishers that will improve consistency in the way data is managed, promote the re-use of data, and foster trust in the data among developers, whatever technology they choose to use, increasing the potential for genuine innovation.
This document sets out a series of best practices that will help publishers and consumers face the new challenges and opportunities posed by data on the Web.
Best practices cover different aspects related to data publishing and consumption, such as data formats, data access, data identification and metadata. In order to delimit the scope and elicit the required features for Data on the Web Best Practices, the DWBP working group compiled a set of use cases [UCR] that represent scenarios of how data is commonly published on the Web and how it is used. The set of requirements derived from these use cases was used to guide the development of the best practices.
The Best Practices proposed in this document are intended to serve a more general purpose than the practices suggested in Best Practices for Publishing Linked Data [LD-BP], since they are domain-independent and, whilst they recommend the use of Linked Data, they also promote best practices for data on the web in formats such as CSV and JSON. The Best Practices related to the use of vocabularies incorporate practices that stem from Best Practices for Publishing Linked Data where appropriate….(More)
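As a rough illustration of two practices the draft discusses — publishing data in an open, machine-readable format and describing it with metadata — the sketch below writes a small CSV file alongside a simple JSON metadata record. The metadata keys loosely echo common dataset-description vocabularies and are illustrative, not the draft’s normative terms.

```python
# Minimal sketch: publish a data set as CSV plus a machine-readable
# metadata record. The metadata keys (title, license, modified,
# distribution) are illustrative, not the W3C draft's vocabulary.
import csv
import json
from datetime import date

rows = [
    {"year": 2013, "budget_usd": 1_200_000},
    {"year": 2014, "budget_usd": 1_350_000},
]

# Write the data itself in an open, machine-readable format (CSV).
with open("budget.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["year", "budget_usd"])
    writer.writeheader()
    writer.writerows(rows)

# Describe the data set with a simple metadata record (JSON).
metadata = {
    "title": "Example agency budget",
    "description": "Annual budget totals, published as open data.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "modified": date.today().isoformat(),
    "distribution": [{"format": "text/csv", "downloadURL": "budget.csv"}],
}
with open("budget.json", "w") as f:
    json.dump(metadata, f, indent=2)
```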
States Use Big Data to Nab Tax Fraudsters
Governing: “It’s tax season again. For most of us, that means undergoing the laborious and thankless task of assembling financial records and calculating taxes for state and federal returns. But for a small group of us, tax season is profit season. It’s the time of year when fraudsters busy themselves with stealing identities and electronically submitting fraudulent tax returns for refunds.
Nobody knows for sure just how much tax return fraud is committed, but the amount is rising fast. According to the U.S. Treasury, the number of identified fraudulent federal returns increased by 40 percent from 2011 to 2012, an increase worth more than $4 billion. Ten years ago, New York state stopped refunds on 50,000 fraudulently filed tax returns. Last year, the number of stopped refunds was 250,000, according to Nonie Manion, executive deputy commissioner for the state’s Department of Taxation and Finance….
To combat the problem, state revenue and tax agencies are using software programs to sift through mounds of data and detect patterns that would indicate when a return is not valid. Just about every state with a tax fraud detection program already compares tax return data with information from other state agencies and private firms to spot incorrect mailing addresses and stolen identities. Because so many returns are filed electronically, fraud spotting systems look for suspicious Internet protocol (IP) addresses. For example, tax auditors in New York noticed that similar IP addresses in Fort Lauderdale, Fla., were submitting a series of returns for refunds. When the state couldn’t match the returns with any employer data, they were flagged for further scrutiny and ultimately found to be fraudulent.
High-tech analytics is one way states keep up in the war on fraud. Another is accurate data. The third component is well-trained staff. But it takes time and money to put together the technology and the expertise to combat the growing sophistication of fraudsters….(More)”
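The IP-address pattern New York’s auditors noticed — clusters of refund returns arriving from the same address — is the kind of signal a first-pass screen can flag automatically for human review. The sketch below is a minimal, hypothetical illustration; the threshold and record fields are assumptions, not any state’s actual detection rules.

```python
# Minimal sketch of a first-pass fraud screen: flag refund returns filed
# from IP addresses that submit unusually many returns, then hold them
# for manual review (e.g., checking against employer wage data).
# The threshold and record fields are illustrative assumptions.
from collections import Counter

returns = [
    {"return_id": 1, "ip": "203.0.113.7", "refund": 4200},
    {"return_id": 2, "ip": "203.0.113.7", "refund": 3900},
    {"return_id": 3, "ip": "203.0.113.7", "refund": 4100},
    {"return_id": 4, "ip": "198.51.100.2", "refund": 650},
]

MAX_RETURNS_PER_IP = 2  # hypothetical review threshold

counts = Counter(r["ip"] for r in returns)
flagged = [r for r in returns if counts[r["ip"]] > MAX_RETURNS_PER_IP]

for r in flagged:
    print(f"Hold return {r['return_id']} from {r['ip']} for manual review")
```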
New research project to map the impact of open budget data
Jonathan Gray at Open Knowledge: “…a new research project to examine the impact of open budget data, undertaken as a collaboration between Open Knowledge and the Digital Methods Initiative at the University of Amsterdam, supported by the Global Initiative for Financial Transparency (GIFT).
The project will include an empirical mapping of who is active around open budget data across the world, and what the main issues, opportunities and challenges are according to different actors. On the basis of this mapping, it will provide a review of the various definitions and conceptions of open budget data, arguments for why it matters, best practices for publication and engagement, as well as applications and outcomes in different countries around the world.
As well as drawing on Open Knowledge’s extensive experience and expertise around open budget data (through projects such as Open Spending), it will utilise innovative tools and methods developed at the University of Amsterdam to harness evidence from the web, social media and collections of documents to inform and enrich our analysis.
As part of this project we’re launching a collaborative bibliography of existing research and literature on open budget data and associated topics which we hope will become a useful resource for other organisations, advocates, policy-makers, and researchers working in this area. If you have suggestions for items to add, please do get in touch.
This project follows on from other research projects we’ve conducted around this area – including on data standards for fiscal transparency, on technology for transparent and accountable public finance, and on mapping the open spending community….(More)”
CrowdFlower Launches Open Data Project
Anthony Ha at Techcrunch: “Crowdsourcing company CrowdFlower allows businesses to tap into a distributed workforce of 5 million contributors for basic tasks like sentiment analysis. Today it’s releasing some of that data to the public through its new Data for Everyone initiative…. The hope is to turn CrowdFlower into a central repository where open data can be found by researchers and entrepreneurs. (Factual was another startup trying to become a hub for open data, though in recent years it’s become more focused on gathering location data to power mobile ads.)…
As for the data that’s available now, …There’s a lot of Twitter sentiment analysis covering things like attitudes towards brands and products, yogurt (?), and climate change. Among the more recent data sets, I was particularly taken with the gender breakdown of who’s been on the cover of Time magazine and, yes, the analysis of who thought the dress (you know the one) was gold and white versus blue and black…. (More)”