A process of making datasets raw in three steps: reformatting, cleaning, and ungrounding (Denis and Goëta).

Hundreds of thousands of datasets are now available through numerous channels in both the public and private domains. Depending on their stage of processing, these datasets can be categorized as either raw data or processed data. According to an Open Government Data principle, raw data (or primary data) “are published as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.” Processed data, by contrast, has undergone some form of alteration, such as categorization, codification, or aggregation.

Much of the data that is made publicly available comes in processed form. For example, population, trade, and budget data are often presented in aggregate, which prevents researchers from seeing the stories behind the numbers, such as how patterns or trends differ once gender, location, or other variables are taken into account. A rawification process is therefore often needed before a dataset can support more detailed and valuable secondary analysis.

Jérôme Denis and Samuel Goëta define ‘rawification’ as a process of reformatting, cleaning, and ungrounding data in order to obtain truly ‘raw’ datasets.

According to Denis and Goëta, reformatting data means making sure that data that has been opened can also be easily read by its users. This is usually achieved by converting the data into a format that most processing programs can read and manipulate. One of the most commonly used formats is CSV (comma-separated values).
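As a rough illustration of the reformatting step, the sketch below converts a small JSON array of flat records into CSV text using Python's standard library. The function name `json_records_to_csv` and the sample fields are assumptions made for this example and do not come from Denis and Goëta.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text (hypothetical helper)."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so that no field is dropped.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Fields missing from a record are written as empty cells.
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '[{"name": "A", "year": 2015}, {"name": "B", "city": "X"}]'
    print(json_records_to_csv(sample))
```

Real published datasets are often nested or stored in spreadsheet or PDF form, so an actual reformatting step would usually need more than this.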

The next step in a rawification process is cleaning, which means correcting mistakes within the datasets, including but not limited to redundancies and incoherence. In many cases, a dataset contains multiple entries for the same item. For example, ‘New York University’ and ‘NYU’ might be interpreted as two different entities, as might ‘the GovLab’ and ‘the Governance Lab’. Cleaning helps address issues like these.
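The duplicate-entity problem described above can be sketched as a lookup against a hand-maintained alias table. The `ALIASES` table and the function names below are hypothetical, and a real cleaning pass would typically need a much larger table or fuzzy matching.

```python
# Hypothetical alias table mapping known variants to one canonical name.
ALIASES = {
    "nyu": "New York University",
    "new york university": "New York University",
    "the govlab": "the Governance Lab",
    "the governance lab": "the Governance Lab",
}


def canonical_name(name: str) -> str:
    """Return the canonical form of a name, or the tidied name if unknown."""
    tidy = " ".join(name.split())  # collapse stray whitespace
    return ALIASES.get(tidy.lower(), tidy)


def deduplicate(names):
    """Map names to canonical forms and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    for name in names:
        canon = canonical_name(name)
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result


if __name__ == "__main__":
    entries = ["New York University", "NYU", "the GovLab", "the Governance Lab"]
    print(deduplicate(entries))
```

An explicit table keeps every merge decision visible and reviewable, at the cost of having to list each variant by hand.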

The final step in a rawification process is ungrounding, which means removing any ties or links to previous uses of the data. Such ties include color coding, comments, and subcategories. This way the datasets can be purely raw and free of associations and bias carried over from earlier use.
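One simple way to picture ungrounding is as dropping the columns that only record how a previous user annotated the data. The sketch below assumes those annotations live in columns named `comment`, `color_code`, and `subcategory`. These names are hypothetical, and in practice such ties often sit in spreadsheet formatting rather than in columns.

```python
import csv
import io

# Hypothetical names of columns that carry traces of earlier use.
ANNOTATION_COLUMNS = frozenset({"comment", "color_code", "subcategory"})


def unground_csv(csv_text: str, drop=ANNOTATION_COLUMNS) -> str:
    """Return CSV text with the annotation columns removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep every column whose lowercased name is not in the drop set.
    kept = [name for name in (reader.fieldnames or []) if name.lower() not in drop]

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # dropped columns are ignored, not written
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "item,amount,comment,color_code\nroads,100,check this,red\n"
    print(unground_csv(sample))
```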

Opening up data is a clear step toward increasing public access to information held within institutions. However, to ensure that the data is useful to those who access it, a rawification process will likely be necessary.

Additional resources: