Do We Need to Educate Open Data Users?


Tony Hirst at IODC: “Whilst promoting the publication of open data is a key, indeed necessary, ingredient in driving the global open data agenda, promoting initiatives that support the use of open data is perhaps an even more pressing need….

This, then, is the first issue we need to address: improving basic levels of literacy in interpreting – and manipulating (for example, sorting and grouping) – simple tables and charts. Sensemaking, in other words: what does the chart you’ve just produced actually say? What story does it tell? And there’s an added benefit that arises from learning to read and critique charts better – it makes you better at creating your own.
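To make that concrete, here is a minimal sketch of the kind of sorting and grouping being talked about, written in Python with pandas. The table of spending figures is invented purely for illustration and is not drawn from any real open dataset.

```python
import pandas as pd

# Hypothetical spending records, invented for illustration.
spend = pd.DataFrame({
    "department": ["Health", "Education", "Health", "Transport", "Education"],
    "amount": [1200, 800, 450, 300, 950],
})

# Sort the rows by amount, largest first.
print(spend.sort_values("amount", ascending=False))

# Group by department and total the spending within each group.
print(spend.groupby("department")["amount"].sum())
```

Being able to read what those two outputs say, which department spends most and how the rows have been rearranged, is exactly the sort of table literacy at issue.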

Alongside reading stories from data comes the reason for telling the story: putting the data to work. How does “data” help you make a decision, or track the impact of a particular intervention? (Your original question should also have informed the data you searched for in the first place.) Here we need to develop basic skills in how to actually use data, from finding anomalies that allow us to hold publishers to account, to using the data as part of a positive advocacy campaign.
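As a rough illustration of what “finding anomalies” might look like in practice, the sketch below flags values that fall well outside the interquartile range of a hypothetical series of payments. The figures and the 1.5×IQR threshold are assumptions made for the sake of the example, not a recommended method for any particular dataset.

```python
import pandas as pd

# Hypothetical payment figures, invented for illustration.
payments = pd.Series([120, 135, 110, 128, 5400, 122, 131])

# Flag values far outside the interquartile range as candidates to query.
q1, q3 = payments.quantile(0.25), payments.quantile(0.75)
iqr = q3 - q1
outliers = payments[(payments < q1 - 1.5 * iqr) | (payments > q3 + 1.5 * iqr)]

print(outliers)  # the 5400 entry is surfaced as something to ask the publisher about
```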

After a quick read, on site, of some of the stories the data might have to tell, there may be a need for further analysis or more elaborate visualization work. At this point, a range of technical craft skills comes into play, as well as statistical knowledge.

Many openly published datasets just aren’t that good – they’re “dirty”, full of misspellings, missing data, and things in the wrong place or the wrong format, even if the data they do contain is true. A significant amount of time that should be spent analyzing the data instead gets spent cleaning the dataset and getting it into a form that can be worked with. I would argue that a data technician, with a wealth of craft knowledge about how to repair what is essentially a broken dataset, can play an important timesaving role here, getting the data into a state where an analyst can actually start to do their job.
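The sketch below gives a flavour of the routine repairs involved, again in Python with pandas; the column names, misspellings, and missing values are all invented to stand in for the kinds of problems described above.

```python
import pandas as pd

# A deliberately "dirty" table: inconsistent spellings, a missing value,
# and numbers stored as text (including an "n/a" placeholder).
raw = pd.DataFrame({
    "region": ["North", "north ", "Nrth", "South", None],
    "value": ["10", "12", "n/a", "9", "11"],
})

# Normalise stray whitespace and casing, then fix a known misspelling.
raw["region"] = raw["region"].str.strip().str.title().replace({"Nrth": "North"})

# Coerce the value column to numbers; "n/a" becomes NaN instead of an error.
raw["value"] = pd.to_numeric(raw["value"], errors="coerce")

# Drop rows that are still unusable after the repairs.
clean = raw.dropna()
print(clean)
```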

But at the same time, there is a range of tools and techniques that can help the everyday user improve the quality of their data. Many of these tools require an element of programming knowledge, but less than you might at first think. In the Open University/FutureLearn MOOC “Learn to Code for Data Analysis” we use an interactive notebook style of computing to show how you can use code literally one line at a time to perform powerful data cleaning, analysis, and visualization operations on a range of open datasets, including data from the World Bank and Comtrade.
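The snippet below is a generic sketch of that one-line-at-a-time notebook style, not material taken from the course itself; the country and GDP-like figures are invented, and the final line assumes matplotlib is available for pandas to plot with.

```python
import pandas as pd

# Invented figures standing in for an indicator such as GDP.
df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "gdp": [1.2, 3.4, 2.1, 1.1, 3.6, 2.3],
})

# In a notebook, each of these lines can be run in its own cell and its
# result inspected before moving on to the next step.
df.head()                                                        # peek at the data
df.describe()                                                    # quick summary statistics
df.groupby("country")["gdp"].mean()                              # one-line aggregation
df.pivot(index="year", columns="country", values="gdp").plot()   # one-line chart
```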

Here, then, is yet another area where skills development may be required: statistical literacy. At its heart, statistics simply provide us with a range of tools for comparing sets of numbers. But knowing what comparisons to make, or the basis on which particular comparisons can be made; knowing what can be said about those comparisons or how they might be interpreted – in short, understanding what story the stats appear to be telling – can quickly become bewildering. Just as we need to improve the sensemaking skills associated with reading charts, so too we need to develop skills in making sense of statistics, even if we are not actually producing those statistics ourselves.
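As a small, concrete example of “comparing sets of numbers”, the sketch below compares two invented samples: their means and spreads, plus a two-sample t-test via SciPy. Reading the resulting p-value sensibly, rather than producing it, is precisely the literacy being argued for; the figures themselves are assumptions made up for illustration.

```python
from statistics import mean, stdev

from scipy import stats

# Hypothetical measurements taken before and after some intervention.
before = [12.1, 11.8, 12.4, 12.0, 11.9]
after = [12.9, 13.1, 12.7, 13.0, 12.8]

# Simple descriptive comparison: central value and spread of each set.
print(mean(before), stdev(before))
print(mean(after), stdev(after))

# A two-sample t-test asks whether a difference this large could plausibly
# arise by chance if the two sets really came from the same population.
result = stats.ttest_ind(before, after)
print(result.statistic, result.pvalue)
```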

As more data gets published, there are more opportunities for more people to make use of that data. In many cases, what’s likely to hold back that final use is a skills gap: primary among these are the skills required to interpret simple datasets and the statistics associated with them, and to make decisions or track progress based on that interpretation. However, the path from the originally published open dataset to the statistics or visualizations used by end users may also be a winding one, requiring skills not only in analyzing data and uncovering – and then telling – the stories it contains, but also in more mundane technical concerns such as actually accessing, and cleaning, dirty datasets….(More)”