Index: Secondary Uses of Personal Data


By Alexandra Shaw, Andrew Zahuranec, Andrew Young, Stefaan Verhulst

The Living Library Index–inspired by the Harper’s Index–provides important statistics and highlights global trends in governance innovation. This installment focuses on public perceptions regarding secondary uses of personal data (or the re-use of data initially collected for a different purpose). It provides a summary of societal perspectives toward personal data usage, sharing, and control. It is not meant to be comprehensive–rather, it intends to illustrate conflicting, and often confusing, attitudes toward the re-use of personal data. 

Please share any additional, illustrative statistics on data, or other issues at the nexus of technology and governance, with us at info@thelivinglib.org

Data ownership and control 

  • Percentage of Americans who say it is “very important” they control information collected about them: 74% – 2016
  • Americans who think that today’s privacy laws are not good enough at protecting people’s privacy online: 68% – 2016
  • Americans who say they have “a lot” of control over how companies collect and use their information: 9% – 2015
  • In a survey of 507 online shoppers, the share of respondents who indicated they don’t want brands tracking their location: 62% – 2015
  • In a survey of 507 online shoppers, the share who “prefer offers that are targeted to where they are and what they are doing:” 60% – 2015
  • Number of surveyed American consumers willing to provide data to corporations under the following conditions: 
    • “Data about my social concerns to better connect me with non-profit organizations that advance those causes:” 19% – 2018
    • “Data about my DNA to help me uncover any hereditary illnesses:” 21% – 2018
    • “Data about my interests and hobbies to receive relevant information and offers from online sellers:” 32% – 2018
    • “Data about my location to help me find the fastest route to my destination:” 40% – 2018
    • “My email address to receive exclusive offers from my favorite brands:”  56% – 2018  

Consumer Attitudes 

  • Academic study participants willing to donate personal data to research if it could lead to public good: 60% – 2014
  • Academic study participants willing to share personal data for research purposes in the interest of public good: 25% – 2014
  • Percentage who expect companies to “treat [them] like an individual, not as a member of some segment like ‘millennials’ or ‘suburban mothers:’” 74% – 2018 
    • Percentage who believe that brands should understand a “consumer’s individual situation (e.g. marital status, age, location, etc.)” when they’re being marketed to: 70% – 2018
    • Number who are “more annoyed” by companies now compared to 5 years ago: 40% – 2018
    • Percentage worried their data is shared across companies without their permission: 88% – 2018
    • Amount worried about a brand’s ability to track their behavior while on the brand’s website, app, or neither: 75% – 2018
  • Consumers globally who expect brands to anticipate needs before they arise: 33%  – 2018 
  • Surveyed residents of the United Kingdom who identify as:
    • “Data pragmatists” willing to share personal data “under the right circumstances:” 58% – 2017
    • “Fundamentalists,” who would not share personal data for better services: 24% – 2017
    • Respondents who think data sharing is part of participating in the modern economy: 62% – 2018
    • Respondents who believe that data sharing benefits enterprises more than consumers: 75% – 2018
    • People who want more control over their data that enterprises collect: 84% – 2018
    • Percentage “unconcerned” about personal data protection: 18% – 2018
  • Percentage of Americans who think that government should do more to regulate large technology companies: 55% – 2018
  • Registered American voters who trust broadband companies with personal data “a great deal” or “a fair amount”: 43% – 2017
  • Americans who report experiencing a major data breach: 64% – 2017
  • Number of Americans who believe that their personal data is less secure than it was 5 years ago: 49% – 2019
  • Amount of surveyed American citizens who consider trust in a company an important factor for sharing data: 54% – 2018

Convenience

Microsoft’s 2015 Consumer Data Value Exchange Report attempts to understand consumer attitudes on the exchange of personal data across the global markets of Australia, Brazil, Canada, Colombia, Egypt, Germany, Kenya, Mexico, Nigeria, Spain, South Africa, United Kingdom and the United States. From their survey of 16,500 users, they find:

  • The most popular incentives for sharing data are: 
    • Cash rewards: 64% – 2015
    • Significant discounts: 49% – 2015
    • Streamlined processes: 29% – 2015
    • New ideas: 28% – 2015
  • Respondents who would prefer to see more ads to get new services: 34% – 2015
  • Respondents willing to share search terms for a service that enabled fewer steps to get things done: 70% – 2015 
  • Respondents willing to share activity data for such an improvement: 82% – 2015
  • Respondents willing to share their gender for “a service that inspires something new based on others like them:” 79% – 2015

A 2015 Pew Research Center survey presented Americans with several data-sharing scenarios related to convenience. Participants could respond: “acceptable,” “it depends,” or “not acceptable” to the following scenarios: 

  • Share health information to get access to personal health records and arrange appointments more easily:
    • Acceptable: 52% – 2015
    • It depends: 20% – 2015
    • Not acceptable: 26% – 2015
  • Share data for discounted auto insurance rates: 
    • Acceptable: 37% – 2015
    • It depends: 16% – 2015
    • Not acceptable: 45% – 2015
  • Share data for free social media services: 
    • Acceptable: 33% – 2015
    • It depends: 15% – 2015
    • Not acceptable: 51% – 2015
  • Share data on smart thermostats for cheaper energy bills: 
    • Acceptable: 33% – 2015
    • It depends: 15% – 2015
    • Not acceptable: 51% – 2015

Other Studies

  • Surveyed banking and insurance customers who would exchange personal data for:
    • Targeted auto insurance premiums: 64% – 2019
    • Better life insurance premiums for healthy lifestyle choices: 52% – 2019 
  • Surveyed banking and insurance customers willing to share data specifically related to income, location and lifestyle habits to: 
    • Secure faster loan approvals: 81.3% – 2019
    • Lower the chances of injury or loss: 79.7% – 2019 
    • Receive discounts on non-insurance products or services: 74.6% – 2019
    • Receive text alerts related to banking account activity: 59.8% – 2019 
    • Get saving advice based on spending patterns: 56.6% – 2019
  • In a survey of over 7,000 members of the public around the globe, respondents indicated:
    • “Smartphone and tablet apps used for navigation, chat, and news that can access your contacts, photos, and browsing history” are “creepy:” 16% – 2016
    • Emailing a friend about a trip to Paris and receiving advertisements for hotels, restaurants and excursions in Paris is “creepy:” 32% – 2016
    • A free fitness-tracking device that monitors your well-being and sends a monthly report to you and your employer is “creepy:” 45% – 2016
    • A telematics device that allows emergency services to track your vehicle is “creepy:” 78% – 2016
  • The number of British residents who do not want to work with virtual agents of any kind: 48% – 2017
  • Americans who disagree that “if companies give me a discount, it is a fair exchange for them to collect information about me without my knowing”: 91% – 2015

Data Brokers, Intermediaries, and Third Parties 

  • Americans who consider it acceptable for a grocery store to offer a free loyalty card in exchange for selling their shopping data to third parties: 47% – 2016
  • Number of people who know that “searches, site visits and purchases” are reviewed without consent:  55% – 2015
  • The number of people in 1991 who wanted companies to ask them for permission first before collecting their personal information and selling that data to intermediaries: 93% – 1991
    • Number of Americans who “would be very concerned if the company at which their data were stored sold it to another party:” 90% – 2008
    • Percentage of Americans who think it’s unacceptable for their grocery store to share their shopping data with third parties in exchange for a free loyalty card: 32% – 2016
  • Percentage of Americans who think that government needs to do more to regulate advertisers: 64% – 2016
    • Number of Americans who “want to have control over what marketers can learn about” them online: 84% – 2015
    • Percentage of Americans who think they have no power over marketers to figure out what they’re learning about them: 58% – 2015
  • Registered American voters who are “somewhat uncomfortable” or “very uncomfortable” with companies like Internet service providers or websites using personal data to recommend stories, articles, or videos:  56% – 2017
  • Registered American voters who are “somewhat uncomfortable” or “very uncomfortable” with companies like Internet service providers or websites selling their personal information to third parties for advertising purposes: 64% – 2017

Personal Health Data

The Robert Wood Johnson Foundation’s 2014 Health Data Exploration Project Report analyzes attitudes about personal health data (PHD). PHD is self-tracking data related to health that is traceable through wearable devices and sensors. The three major stakeholder groups involved in using PHD for public good are users, companies that track the users’ data, and researchers. 

  • Overall Respondents:
    • Percentage who believe anonymity is “very” or “extremely” important: 67% – 2014
    • Percentage who “probably would” or “definitely would” share their personal data with researchers: 78% – 2014
    • Percentage who believe that they own—or should own—all the data about them, even when it is indirectly collected: 54% – 2014
    • Percentage who think they share or ought to share ownership with the company: 30% – 2014
    • Percentage who think companies alone own or should own all the data about them: 4% – 2014
    • Percentage for whom data ownership “is not something I care about”: 13% – 2014
    • Percentage who indicated they wanted to own their data: 75% – 2014 
    • Percentage who would share data only if “privacy were assured:” 68% – 2014
    • People who would supply data regardless of privacy or compensation: 27% – 2014
      • Percentage of participants who mentioned privacy, anonymity, or confidentiality when asked under what conditions they would share their data:  63% – 2014
      • Percentage who would be “more” or “much more” likely to share data for compensation: 56% – 2014
      • Percentage who indicated compensation would make no difference: 38% – 2014
      • Amount opposed to commercial  or profit-making use of their data: 13% – 2014
    • Percentage of people who would only share personal health data with a guarantee of:
      • Privacy: 57% – 2014
      • Anonymization: 90% – 2014
  • Surveyed Researchers: 
    • Percentage who agree or strongly agree that self-tracking data would help provide more insights in their research: 89% – 2014
    • Percentage who say PHD could answer questions that other data sources could not: 95% – 2014
    • Percentage who have used public datasets: 57% – 2014
    • Percentage who have paid for data for research: 19% – 2014
    • Percentage who have used self-tracking data before for research purposes: 46% – 2014
    • Percentage who have worked with application, device, or social media companies: 23% – 2014
    • Percentage who “somewhat disagree” or “strongly disagree” there are barriers that cannot be overcome to using self-tracking data in their research: 82% – 2014 

SOURCES: 

“2019 Accenture Global Financial Services Consumer Study: Discover the Patterns in Personality”, Accenture, 2019. 

“Americans’ Views About Data Collection and Security”, Pew Research Center, 2015. 

“Data Donation: Sharing Personal Data for Public Good?”, ResearchGate, 2014.

“Data privacy: What the consumer really thinks”, Acxiom, 2018.

“Exclusive: Public wants Big Tech regulated”, Axios, 2018.

“Consumer data value exchange”, Microsoft, 2015.

“Crossing the Line: Staying on the right side of consumer privacy”, KPMG International Cooperative, 2016.

“How do you feel about the government sharing our personal data? – livechat”, The Guardian, 2017. 

“Personal data for public good: using health information in medical research”, The Academy of Medical Sciences, 2006. 

“Personal Data for the Public Good: New Opportunities to Enrich Understanding of Individual and Population Health”, Robert Wood Johnson Foundation, Health Data Exploration Project, Calit2, UC Irvine and UC San Diego, 2014. 

“Pew Internet and American Life Project: Cloud Computing Raises Privacy Concerns”, Pew Research Center, 2008. 

“Poll: Little Trust That Tech Giants Will Keep Personal Data Private”, Morning Consult & Politico, 2017. 

“Privacy and Information Sharing”, Pew Research Center, 2016. 

“Privacy, Data and the Consumer: What US Thinks About Sharing Data”, MarTech Advisor, 2018. 

“Public Opinion on Privacy”, Electronic Privacy Information Center, 2019. 

“Selligent Marketing Cloud Study Finds Consumer Expectations and Marketer Challenges are Rising in Tandem”, Selligent Marketing Cloud, 2018. 

“The Data-Sharing Disconnect: The Impact of Context, Consumer Trust, and Relevance in Retail Marketing”, Boxever, 2015.

“Microsoft Research reveals understanding gap in the brand-consumer data exchange”, Microsoft Research, 2015.

“Survey: 58% will share personal data under the right circumstances”, Marketing Land: Third Door Media, 2019. 

“The state of privacy in post-Snowden America”, Pew Research Center, 2016. 

“The Tradeoff Fallacy: How Marketers Are Misrepresenting American Consumers And Opening Them Up to Exploitation”, University of Pennsylvania, 2015.

Index: The Data Universe 2019


By Michelle Winowatan, Andrew J. Zahuranec, Andrew Young, Stefaan Verhulst, Max Jun Kim

The Living Library Index – inspired by the Harper’s Index – provides important statistics and highlights global trends in governance innovation. This installment focuses on the data universe.

Please share any additional, illustrative statistics on data, or other issues at the nexus of technology and governance, with us at info@thelivinglib.org

Internet Traffic:

  • Percentage of the world’s population that uses the internet: 51.2% (3.9 billion people) – 2018
  • Number of searches processed worldwide by Google every year: at least 2 trillion – 2016
  • Website traffic worldwide generated through mobile phones: 52.2% – 2018
  • The total number of mobile subscriptions in the first quarter of 2019: 7.9 billion (addition of 44 million in quarter) – 2019
  • Amount of mobile data traffic worldwide: nearly 30 billion GB – 2018
  • Data category with highest traffic worldwide: video (60%) – 2018
  • Global average of data traffic per smartphone per month: 5.6 GB – 2018
    • North America: 7 GB – 2018
    • Latin America: 3.1 GB – 2018
    • Western Europe: 6.7 GB – 2018
    • Central and Eastern Europe: 4.5 GB – 2018
    • North East Asia: 7.1 GB – 2018
    • Southeast Asia and Oceania: 3.6 GB – 2018
    • India, Nepal, and Bhutan: 9.8 GB – 2018
    • Middle East and Africa: 3.0 GB – 2018
  • Time between the creation of each new bitcoin block: 9.27 minutes – 2019

Streaming Services:

  • Total hours of video streamed by Netflix users every minute: 97,222 – 2017
  • Hours of YouTube watched per day: over 1 billion – 2018
  • Number of tracks uploaded to Spotify every day: Over 20,000 – 2019
  • Number of Spotify’s monthly active users: 232 million – 2019
  • Spotify’s total subscribers: 108 million – 2019
  • Spotify’s hours of content listened to: 17 billion – 2019
  • Total number of songs on Spotify’s catalog: over 30 million – 2019
  • Apple Music’s total subscribers: 60 million – 2019
  • Total number of songs on Apple Music’s catalog: 45 million – 2019

Social Media:

Calls and Messaging:

Retail/Financial Transaction:

  • Number of packages shipped by Amazon in a year: 5 billion – 2017
  • Total value of payments processed by Venmo in a year: USD 62 billion – 2019
  • Based on an independent analysis of public transactions on Venmo in 2017:
  • Based on a non-representative survey of 2,436 US consumers between the ages of 21 and 72 on P2P platforms:
    • The average volume of transactions handled by Venmo: USD 64.2 billion – 2019
    • The average volume of transactions handled by Zelle: USD 122.0 billion – 2019
    • The average volume of transactions handled by PayPal: USD 141.8 billion – 2019 
    • Platform with the highest percent adoption among all consumers: PayPal (48%) – 2019 

Internet of Things:

Sources:

Bilingual


/baɪˈlɪŋgwəl/

Practitioners across disciplines who possess both domain knowledge and data science expertise.

The Governance Lab (GovLab) just launched the 100 Questions Initiative, “an effort to identify the most important societal questions whose answers can be found in data and data science if the power of data collaboratives is harnessed.”

The initiative will seek to identify questions that could help unlock the potential of data and data science in solving various global and domestic issues, including but not limited to, climate change, economic inequality, and migration. These questions will be sourced from individuals who have expertise in both a public issue and data science or what The GovLab calls “bilinguals.”

Tom Kalil, the Chief Innovation Officer at Schmidt Futures, argues that the emergent use of data science and machine learning in the public sector will increase the demand for individuals “who speak data science and social sector.”

Similarly, within the business context, David Meer wrote that “being bilingual isn’t just a matter of native English speakers learning how to conjugate verbs in French or Spanish. Rather, it’s important that businesses cultivate talent that can simultaneously speak the language of advanced data analysis and nuts-and-bolts business operations. As data analysis becomes a more prevalent and powerful lever for strategy and growth, organizations increasingly need bilinguals to form the bridge between the work of advanced data scientists and business decision makers.”

For more info, visit www.the100questions.org

Digital Serfdom


/ˈdɪʤətəl ˈsɜrfdəm/

A condition where consumers give up their personal and private information in order to be able to use a particular product or service.

Serfdom was a system of forced labor in feudal societies, and it was very common in Europe during the Middle Ages. In this system, serfs or peasants performed a variety of labor for their lords in exchange for protection from bandits and a small piece of land that they could cultivate for themselves. Serfs were also required to pay some form of tax, often in chickens or crops yielded from their piece of land.

Hassan Khan, writing in The Next Web, argues that the decline of property ownership indicates that we are living in digital serfdom. In his article he says:

“The percentage of households without a car is increasing. Ride-hailing services have multiplied. Netflix boasts over 188 million subscribers. Spotify gains ten million paid members every five to six months.

“The model of “impermanence” has become the new normal. But there’s still one place where permanence finds its home, with over two billion active monthly users, Facebook has become a platform of record for the connected world. If it’s not on social media, it may as well have never happened.”

Joshua A. T. Fairfield elaborates on this phenomenon in his book Owned: Property, Privacy, and the New Digital Serfdom. Fairfield discusses his book in an article in The Conversation, stating that:

“The issue of who gets to control property has a long history. In the feudal system of medieval Europe, the king owned almost everything, and everyone else’s property rights depended on their relationship with the king. Peasants lived on land granted by the king to a local lord, and workers didn’t always even own the tools they used for farming or other trades like carpentry and blacksmithing.

[…]

“Yet the expansion of the internet of things seems to be bringing us back to something like that old feudal model, where people didn’t own the items they used every day. In this 21st-century version, companies are using intellectual property law – intended to protect ideas – to control physical objects consumers think they own.”

In other words, Fairfield is suggesting that the devices and services that we use (iPhones, Fitbits, Roombas, digital door locks, Spotify, Uber, and many more) are constantly capturing data about our behaviors. By using these products, consumers have no choice but to trade their personal data in order to access the full functionality of these devices or services. This data is used by private corporations for targeted advertising, among other purposes. This system of digital serfdom binds consumers to private corporations that dictate the terms of use for their products or services.

Janet Burns wrote about Alex Rosenblat’s UBERLAND: How Algorithms Are Rewriting The Rules Of Work and gave some examples of how algorithms use personal data to manipulate consumers’ behaviors:

“For example, algorithms in control of assigning and pricing rides have often surprised drivers and riders, quietly taking into account other traffic in the area, regionally adjusted rates, and data on riders and drivers themselves.

“In recent years, we’ve seen similar adjustments happen behind the scenes in online shopping, as UBERLAND points out: major retailers have tweaked what price different customers see for the same item based on where they live, and how feasibly they could visit a brick-and-mortar store for it.”

To conclude, an excerpt from Fairfield’s book cautions: 

“In the coming decade, if we do not take back our ownership rights, the same will be said of our self-driving cars and software-enabled homes. We risk becoming digital peasants, owned by software and advertising companies, not to mention overreaching governments.”

Sources and Further Readings:

Self-Sovereign Identity


/sɛlf-ˈsɑvrən aɪˈdɛntəti/

A decentralized identification mechanism that gives individuals control over what, when, and to whom their personal information is shared.

An identification document (ID) is a crucial part of every individual’s life, in that it is often a prerequisite for accessing a variety of services—ranging from creating a bank account to enrolling children in school to buying alcoholic beverages to signing up for an email account to voting in an election—and also a proof of simply being. This system poses fundamental problems, which a field report by The GovLab on Blockchain and Identity frames as follows:

“One of the central challenges of modern identity is its fragmentation and variation across platform and individuals. There are also issues related to interoperability between different forms of identity, and the fact that different identities confer very different privileges, rights, services or forms of access. The universe of identities is vast and manifold. Every identity in effect poses its own set of challenges and difficulties—and, of course, opportunities.”

A report published by New America echoes this point, arguing that:

“Societally, we lack a coherent approach to regulating the handling of personal data. Users share and generate far too much data—both personally identifiable information (PII) and metadata, or “data exhaust”—without a way to manage it. Private companies, by storing an increasing amount of PII, are taking on an increasing level of risk. Solution architects are recreating the wheel, instead of flying over the treacherous terrain we have just described.”

Self-sovereign identity (SSI) is often presented as a solution to the identity problems described above. Identity Woman, a researcher and advocate for SSI, goes even further by arguing that generating “a digital identity that is not under the control of a corporation, an organization or a government” is essential “in pursuit of social justice, deep democracy, and the development of new economies that share wealth and protect the environment.”

To inform the analysis of blockchain-based Self-Sovereign Identity (SSI), The GovLab report argues that identity is “a process, not a thing” and breaks it into a five-stage lifecycle: provisioning, administration, authentication, authorization, and auditing/monitoring. At each stage, identification serves a unique function and poses different challenges.

With SSI, individuals have full control over how their personal information is shared, who gets access to it, and when. The New America report summarizes the potential of SSI in the following paragraphs:

“We believe that the great potential of SSI is that it can make identity in the digital world function more like identity in the physical world, in which every person has a unique and persistent identity which is represented to others by means of both their physical attributes and a collection of credentials attested to by various external sources of authority.”

[…]

“SSI, in contrast, gives the user a portable, digital credential (like a driver’s license or some other document that proves your age), the authenticity of which can be securely validated via cryptography without the recipient having to check with the authority that issued it. This means that while the credential can be used to access many different sites and services, there is no third-party broker to track the services to which the user is authenticating. Furthermore, cryptographic techniques called “zero-knowledge proofs” (ZKPs) can be used to prove possession of a credential without revealing the credential itself. This makes it possible, for example, for users to prove that they are over the age of 21 without having to share their actual birth dates, which are both sensitive information and irrelevant to a binary, yes-or-no ID transaction.”

Some case studies on the application of SSI in the real world presented on The GovLab Blockchange website include a government-issued self-sovereign ID using blockchain technology in the city of Zug in Switzerland; a mobile election voting platform, secured via smart biometrics, real-time ID verification and the blockchain for irrefutability piloted in West Virginia; and a blockchain-based land and property transaction/registration in Sweden.

Nevertheless, the authors caution against the hype surrounding this new and emerging technology:

“At their core, blockchain technologies offer new capacity for increasing the immutability, integrity, and resilience of information capture and disclosure mechanisms, fostering the potential to address some of the information asymmetries described above. By leveraging a shared and verified database of ledgers stored in a distributed manner, blockchain seeks to redesign information ecosystems in a more transparent, immutable, and trusted manner. Solving information asymmetries may turn out to be the real contribution of blockchain, and this—much more than the current enthusiasm over virtual currencies—is the real reason to assess its potential.

“It is important to emphasize, of course, that blockchain’s potential remains just that for the moment—only potential. Considerable hype surrounds the emerging technology, and much remains to be done and many obstacles to overcome if blockchain is to achieve the enthusiasts’ vision of “radical transparency.”

Further readings:

Grey Data


/greɪ ˈdeɪtə/

Data accumulated by an institution for operational purposes that does not fall under any traditional data protection policies.

Organizations across all sectors accumulate massive amounts of data simply by virtue of operating, and universities are no exception. In a paper, Christine L. Borgman categorizes these as grey data and suggests that universities should take the lead in demonstrating stewardship of them. Such data include student applications, faculty dossiers, registrar records, ID card data, security camera footage, and many others.

“Some of these data are collected for mandatory reporting obligations such as enrollments, diversity, budgets, grants, and library collections. Many types of data about individuals are collected for operational and design purposes, whether for instruction, libraries, travel, health, or student services.”

(Borgman, p. 380)

Grey data typically does not fall under traditional data protection regimes such as the Health Insurance Portability and Accountability Act (HIPAA), the Family Educational Rights and Privacy Act (FERPA), or Institutional Review Boards. Consequently, how these data may be used (or misused) is much debated. Borgman points out that universities have been “exploiting these data for research, learning analytics, faculty evaluation, strategic decisions, and other sensitive matters.” On top of this, for-profit companies “are besieging universities with requests for access to data or for partnerships to mine them.”

Recognizing both the value of data and the risks arising from the accumulation of grey data, Borgman proposes a model of Data Stewardship by drawing on the practices of data protection at the University of California which concern information security, data governance, and cyber risk.

This model is an example of a good Data Stewardship practice that the GovLab is advocating amidst the rise of public-private collaboration in leveraging data for public good.

The GovLab’s Data Stewards website presents the need for such practice as follows:

“With these new practices of data collaborations come the need to reimagine roles and responsibilities to steer the process of using private data, and the insights it can generate, to address some of society’s biggest questions and challenges: Data Stewards.

“Today, establishing and sustaining these new collaborative and accountable approaches requires significant and time-consuming effort and investment of resources for both data holders on the supply side, and institutions that represent the demand. By establishing Data Stewardship as a function, recognized within the private sector as a valued responsibility, the practice of Data Collaboratives can become more predictable, scaleable, sustainable and de-risked.”

Sources and Further Readings:

Rawification


/rɑwəfɪˈkeɪʃən/

A process of making datasets raw in three steps: reformatting, cleaning, and ungrounding (Denis and Goëta).

Hundreds of thousands of datasets are now made available via numerous channels from both public and private domains. Based on the stage of processing, these datasets can be categorized as either raw data or processed data. According to an Open Government Data principle, raw data (or primary data) “are published as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.” Processed data, by contrast, has been through some form of alteration, categorization, codification, aggregation, or similar processing.

A large share of publicly available data comes in processed form. For example, population, trade, and budget data are often presented in aggregated forms, preventing researchers from understanding the underlying stories behind these data, such as differences in patterns or trends when gender, location, or other variables are taken into account. A rawification process is therefore often needed for a dataset to be useful for more detailed secondary analysis.

Jérôme Denis and Samuel Goëta define ‘rawification’ as a process of reformatting, cleaning, and ungrounding data in order to obtain truly ‘raw’ datasets.

According to Denis and Goëta, reformatting data means making sure that data that has been opened can also be easily read by its users. This is usually achieved by converting the data into a format that most processing programs can read and manipulate. One of the most commonly used formats is CSV (Comma Separated Values).

The next step in a rawification process is cleaning. At this stage, cleaning means correcting mistakes within the datasets, including but not limited to redundancies and incoherence. In many cases, datasets have multiple entries for the same item: ‘New York University’ and ‘NYU’ might be interpreted as two different entities, as might ‘the GovLab’ and ‘the Governance Lab’. Cleaning helps address issues like this.

The final step in a rawification process is ungrounding, which means taking out any ties or links from previous data use. Such ties include color coding, comments, and subcategories. This way the datasets can be purely raw and free of all associations and bias.
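
To make the three steps concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Denis and Goëta's method: the column names (`institution`, `enrollment`, `comment`, `color_code`, `subcategory`), the alias table, and the function names are all hypothetical. The sketch also applies ungrounding before cleaning and writes the CSV last, purely so that rows differing only in leftover annotations collapse into one.

```python
import csv
import io

# Hypothetical alias table: maps variant spellings to one canonical name.
ALIASES = {
    "nyu": "New York University",
    "new york university": "New York University",
    "the govlab": "the Governance Lab",
    "the governance lab": "the Governance Lab",
}

# Hypothetical columns that only record an earlier use of the data.
GROUNDING_COLUMNS = {"comment", "color_code", "subcategory"}


def unground(rows):
    """Drop columns that tie the dataset to a previous use."""
    return [
        {key: value for key, value in row.items() if key not in GROUNDING_COLUMNS}
        for row in rows
    ]


def clean(rows):
    """Unify entity names, then drop rows that became exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["institution"].strip()
        row = {**row, "institution": ALIASES.get(name.lower(), name)}
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(row)
    return cleaned


def reformat(rows):
    """Serialize the rows as CSV text that most tools can read."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rawify(rows):
    """Run all three steps on a list of dict rows and return CSV text."""
    return reformat(clean(unground(rows)))


if __name__ == "__main__":
    sample = [
        {"institution": "NYU", "enrollment": "100", "comment": "verify"},
        {"institution": "New York University", "enrollment": "100", "comment": ""},
        {"institution": "the GovLab", "enrollment": "5", "comment": "pilot"},
    ]
    print(rawify(sample))
```

In this sample, the two New York University rows collapse into one once the comment column is removed and the names are unified. A real dataset would need a much larger alias table, or fuzzy matching, and a judgment call about which columns count as traces of earlier use.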

Opening up data is a clear step for increasing public access to information held within institutions. However, in order to ensure the utility of that data for those accessing it, a rawification process will likely be necessary.

Additional resources: