Selected Readings on Open Data and Generative AI


By: María Esther Cervantes, Hannah Chafetz, Sampriti Saxena, & Stefaan G. Verhulst

Generative AI tools are increasingly used across sectors, including government. However, there is limited research on how these tools could impact open data policies and programs. What are the opportunities for generative AI and open data? What are the risks? Could generative AI transform the role of statistical agencies? Is there a need for a global charter to govern generative AI? 

Towards this end, in May 2023, The GovLab’s Open Data Policy Lab (a collaboration between The GovLab and Microsoft) hosted a panel discussion on the intersections of generative AI and open data and the ways in which generative AI could alter our existing conception of a third wave of open data. Building on the takeaways from this discussion, below we provide a curated list of annotated readings (listed alphabetically) on these topics. 

These selected readings focus on three main areas: (1) the opportunities and risks of applying generative AI to open data, (2) generative AI governance models and discussions, and (3) the new role of national statistical agencies in the advent of these technologies. Given the speed at which these technologies are changing, we incorporate a wide variety of sources such as journal articles, reports from international organizations and think tanks, and blog posts. 

We found several common themes across these readings. First, there is general consensus that generative AI tools can provide value for open data and National Statistical Offices, whether by increasing data discovery, accessibility, or stakeholder collaboration. However, privacy, security, and safety risks remain prevalent and must be weighed against these benefits. Second, there is a lack of common standards or policies specific to generative AI. There are concerns that without a common language or standardization, algorithms may be misconstrued across borders. Third, governments are recommending synthetic data as a way to minimize the privacy concerns associated with open data. If done responsibly, generative AI could help produce synthetic data at a larger scale. Lastly, governments around the world do not all have the same capabilities and resources for applying generative AI in their work. Countries that lag behind on these capabilities may face greater challenges and risks when trying to incorporate generative AI into their public services.

*****

Alam, Zaidul. “Harnessing the Power of Generative AI in a World of Open Government Data.” LinkedIn Blog, June 15, 2023.

  • In this LinkedIn article, the author discusses the opportunities to leverage Open Government Data (specifically, census data) for generative AI.
  • The author explains that Open Data and generative AI could be merged in several ways, including: increasing interactions between citizens and governments, developing tools to engage with public institutions, and answering search queries about domain-specific data (e.g., health data). 
  • The author provides an example of how census data from the Australian Bureau of Statistics (ABS) and AI applications could be merged: “By leveraging data APIs from the ABS and other similar institutions globally, Census Chat GPT could generate real-time, data-driven insights about demographic trends, socio-economic disparities, housing statistics, and more.”
  • There are many possible intersections between generative AI and Open Government Data: “In the future, we could see more sophisticated applications of generative AI to government open data. For example, AI could be used to generate comprehensive city planning scenarios based on urban development data, or to create personalized learning plans for students based on education data. Governments could also develop AI ‘public assistants’ that can explain complex legislation, provide real-time updates on policy changes, or guide citizens through bureaucratic procedures. Such AI assistants could democratize access to public information, reduce administrative burdens, and enhance civic engagement.”
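
  The “Census Chat GPT” idea quoted above can be sketched in miniature. The sketch below is hypothetical: the tiny in-memory dataset stands in for a statistical agency’s data API, and the templated answer stands in for a generative model’s response. Only the retrieve-then-ground pattern is the point.

```python
# Hypothetical sketch of a "census chat" assistant over open data.
# The dataset, field names, and templated answer are illustrative
# stand-ins; a real system would query a statistical agency's data API
# and pass the retrieved rows to a generative model.

CENSUS_ROWS = [
    {"region": "North", "year": 2021, "population": 510_000},
    {"region": "South", "year": 2021, "population": 830_000},
]

def retrieve(question: str, rows: list[dict]) -> list[dict]:
    """Naive keyword retrieval: keep rows whose region appears in the question."""
    q = question.lower()
    return [r for r in rows if r["region"].lower() in q]

def answer(question: str) -> str:
    """Ground the (stubbed) generative step in retrieved open data."""
    hits = retrieve(question, CENSUS_ROWS)
    if not hits:
        return "No matching open data found."
    # A real implementation would send `hits` plus the question to an LLM;
    # here we simply template the grounded facts.
    facts = "; ".join(f"{r['region']} {r['year']}: {r['population']:,}" for r in hits)
    return f"Based on the open census data: {facts}"

print(answer("What was the population of the North region?"))
```

  Grounding the generative step in retrieved open data, rather than asking the model to answer from memory, is what keeps such an assistant’s output anchored to the published statistics.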

Boom, Cedric, and Michael Reusens. Changing Data Sources in the Age of Machine Learning for Official Statistics, 2023. https://doi.org/10.48550/arXiv.2306.04338.

  • This paper gives an overview of the main risks, liabilities and uncertainties associated with changing data sources in the context of machine learning for official statistics. 
  • The use of machine learning for official statistics has the potential to provide more timely, accurate, and comprehensive insight into a wide range of topics. By leveraging the vast amounts of data generated by individuals and entities on a daily basis, statistical agencies can gain a more nuanced understanding of trends and patterns. However, this approach carries risks: mainly, concerns about data quality, privacy, and security, and a need for technical skills and infrastructure in government. 
  • Machine learning can be used to complement or even replace official statistics, and its ability to nowcast and forecast is an extremely valuable addition. By incorporating machine learning into official statistical production, one can benefit from the strengths of both approaches and make more informed decisions based on the most current and accurate data.
  • National statistics agencies are used to having their data completely under their control, but using external data sources to power innovative statistics can become problematic. Establishing proper protocols and procedures for external data management is therefore necessary. 

Goasduff, Laurence. “Is Synthetic Data the Future of AI? Q&A with Alexander Linden.” Gartner Interview, November 20, 2022.

  • In this interview with Alexander Linden, a VP Analyst at Gartner, he talks about the potential of synthetic data as a complement to open data to drive the development of more accurate AI models. 
  • He says, “Synthetic data can increase the accuracy of machine learning models. Real-world data is happenstance and does not contain all permutations of conditions or events possible in the real world. Synthetic data can counter this by generating data at the edges, or for conditions not yet seen.”
  • While synthetic data may offer a way to address biases and issues of quality in open data, Linden emphasizes the importance of transparency and explainability when it comes to the models creating and using synthetic data. 

Loukis, Euripidis, Stuti Saxena, Nina Rizun, Maria Ioanna Maratsi, Mohsan Ali, and Charalampos Alexopoulos. “ChatGPT Application Vis-a-Vis Open Government Data (OGD): Capabilities, Public Values, Issues and a Research Agenda.” In Electronic Government, edited by Ida Lindgren, Csaba Csáki, Evangelos Kalampokis, Marijn Janssen, Gabriela Viale Pereira, Shefali Virkar, Efthimios Tambouris, and Anneke Zuiderwijk, 95–110. Lecture Notes in Computer Science. Cham: Springer Nature Switzerland, 2023. https://doi.org/10.1007/978-3-031-41138-0_7.

  • In this paper, the authors analyze the opportunities and risks of using ChatGPT for Open Government Data from an Affordances Theory perspective. Through 12 expert interviews, the authors develop a series of research agendas to accelerate the understanding of how ChatGPT could impact Open Government Data. 
  • ChatGPT could have a positive impact on Open Government Data in several ways. These include: increasing user engagement, awareness, and accessibility; helping develop new Open Government strategies; offering new ways for data discovery through government chatbots; and balancing the supply and demand of Open Government Data. Additionally, from a public values perspective, ChatGPT could provide service-related and professionalism-related values for Open Government Data. It could help design user-driven Open Government Data initiatives and lower barriers to accessing Open Government Data amongst different stakeholders (e.g., citizens), increasing transparency around government initiatives. 
  • The authors point to several issues that ChatGPT could pose for Open Government Data such as unknowingly collecting personal information from registered users and inaccurate summaries of Open Government Data from ChatGPT. Also, the lack of governance frameworks could lead to larger problems such as inadequate results, cybersecurity issues, and algorithmic biases caused by language differences across countries. 
  • In order to harness the value of ChatGPT for Open Government Data, additional research is needed on how ChatGPT could be used to increase use and value generation from Open Government Data, how ChatGPT could benefit the publishing of Open Government Data, and the potential issues of ChatGPT for Open Government Data. 

Sallier, Kenza, and Kate Burnett-Isaacs. “Unlocking the Power of Data Synthesis with the Starter Guide on Synthetic Data for Official Statistics.” Statistics Canada, March 10, 2023.

  • In this piece, Statistics Canada provides a set of guidelines for National Statistics Offices to use when leveraging synthetic data. 
  • Drawing on the UNECE report as a guide, the piece explains that using synthetic data can help increase access to statistical data in a privacy-compliant manner. It can help with publishing data, testing analysis, education, and testing software. Additionally, it explains the three main ways in which synthetic data can be generated: sequential modeling, simulated data, and deep learning methods. 
  • The article provides an overview of the pros and cons of using Generative Adversarial Networks to create synthetic data for National Statistics Offices.
    • Pros: “GANs have been used in NSOs to generate continuous, discrete and textual datasets, while ensuring that the underlying distribution and patterns of the original data are preserved. Furthermore, recent research has been focused on the generation of free-text data which can be convenient in situations where models need to be developed to classify text data.”
    • Cons: “GANs can be seen as too complex to understand, explain or implement where there is only a minimal knowledge of neural networks. There is often a criticism associated with neural networks as lacking in transparency. The method is time consuming and has a high demand for computational resources. GANs may suffer from mode collapse, and lack of diversity, although newer variations of the algorithm seem to remedy these issues. Modelling discrete data can be difficult for GAN models.”
  • In sum, the article explains that synthetic data can provide benefits for National Statistics Offices and Generative Adversarial Networks can help produce the synthetic data. However, those undertaking the initiative need to balance the many associated risks. 
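
  As a rough illustration of the sequential-modeling approach named in the guide (not Statistics Canada’s actual method), one can synthesize variables one at a time, each drawn conditionally on the values already generated. The toy microdata below is invented for the example.

```python
# Minimal toy illustration of the "sequential modeling" idea: synthesize
# one variable at a time, each conditional on the previously generated
# ones, so that the synthetic records preserve the joint patterns of the
# (here invented) confidential microdata without reproducing any record.
import random
from collections import defaultdict

random.seed(42)

# Toy "confidential" microdata: (region, income_band) pairs.
real = [("urban", "high"), ("urban", "high"), ("urban", "low"),
        ("rural", "low"), ("rural", "low"), ("rural", "high")]

# Step 1: empirical distribution of the first variable.
regions = [region for region, _ in real]

# Step 2: conditional distribution of the second variable given the first.
income_given_region = defaultdict(list)
for region, income in real:
    income_given_region[region].append(income)

def synthesize(n: int) -> list[tuple[str, str]]:
    """Draw region first, then income conditional on that region."""
    out = []
    for _ in range(n):
        region = random.choice(regions)
        income = random.choice(income_given_region[region])
        out.append((region, income))
    return out

synthetic = synthesize(1000)
print(synthetic[:3])
```

  Real sequential synthesis replaces these empirical draws with fitted models per variable, and, as the article stresses, must still be checked for disclosure risk before release.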

Ziesche, Soenke. “Open Data for AI: What Now?” UNESCO Digital Library, 2023. 

  • This report summarizes UNESCO’s guidelines for Member States in opening up data for AI systems. 
  • The report explains that an enormous amount of data is already being collected through automated systems (a trend accelerated during the COVID-19 pandemic). This data is often too large to be processed manually. AI and data science methods have the capacity to discover new information from these large data sources. 
  • The report is divided into 3 phases: the preparation phase, the opening data phase, and the follow-up phase for data re-use: “The preparation phase guides Member States in preparing for opening their data, and includes the following suggested steps: drafting an open data policy, gathering and collecting high quality data, developing open data capacities and making the data AI-ready. The opening of the data phase consists of the following steps: selecting datasets to be opened, opening the datasets legally, opening the datasets technically, and creating an open-data-driven culture. The follow-up for reuse and sustainability phase consists of the following steps: supporting citizen engagement, supporting international engagement, supporting beneficial AI engagement, and maintaining high quality data.”

*****

We plan to explore these topics further over the coming months. Professionals interested in collaborating with The GovLab on these topics can contact Stefaan Verhulst, Co-Founder & Chief Research and Development Officer at [email protected].

Stay up-to-date on the latest developments of this work by signing up for the Data Stewards Network Newsletter.

Learn more about the Open Data Policy Lab by visiting our website: https://opendatapolicylab.org/.

Unlocking AI’s Potential for Everyone


Article by Diane Coyle: “…But while some policymakers do have deep knowledge about AI, their expertise tends to be narrow, and most other decision-makers simply do not understand the issue well enough to craft sensible policies. Owing to this relatively low knowledge base and the inevitable asymmetry of information between regulators and regulated, policy responses to specific issues are likely to remain inadequate, heavily influenced by lobbying, or highly contested.

So, what is to be done? Perhaps the best option is to pursue more of a principles-based policy. This approach has already gained momentum in the context of issues like misinformation and trolling, where many experts and advocates believe that Big Tech companies should have a general duty of care (meaning a default orientation toward caution and harm reduction).

In some countries, similar principles already apply to news broadcasters, who are obligated to pursue accuracy and maintain impartiality. Although enforcement in these domains can be challenging, the upshot is that we do already have a legal basis for eliciting less socially damaging behavior from technology providers.

When it comes to competition and market dominance, telecoms regulation offers a serviceable model with its principle of interoperability. People with competing service providers can still call each other because telecom companies are all required to adhere to common technical standards and reciprocity agreements. The same is true of ATMs: you may incur a fee, but you can still withdraw cash from a machine at any bank.

In the case of digital platforms, a lack of interoperability has generally been established by design, as a means of locking in users and creating “moats.” This is why policy discussions about improving data access and ensuring access to predictable APIs have failed to make any progress. But there is no technical reason why some interoperability could not be engineered back in. After all, Big Tech companies do not seem to have much trouble integrating the new services that they acquire when they take over competitors.

In the case of LLMs, interoperability probably could not apply at the level of the models themselves, since not even their creators understand their inner workings. However, it can and should apply to interactions between LLMs and other services, such as cloud platforms…(More)”.

City CIOs urged to lay the foundations for generative AI


Article by Sarah Wray: “The London Office of Technology and Innovation (LOTI) has produced a collection of guides to support local authorities in using generative artificial intelligence (genAI) tools such as ChatGPT, Bard, Midjourney and Dall-E.

The resources include a guide for local authority leaders and another aimed at all staff, as well as a guide designed specifically for council Chief Information Officers (CIOs), which was developed with AI software company Faculty.

Sam Nutt, Researcher and Data Ethicist at LOTI, a membership organisation for over 20 boroughs and the Greater London Authority, told Cities Today: “Generative AI won’t solve every problem for local governments, but it could be a catalyst to transform so many processes for how we work.

“On the one hand, personal assistants integrated into programmes like Word, Excel or PowerPoint could massively improve officer productivity. On another level there is a chance to reimagine services and government entirely, thinking about how gen AI models can do so many tasks with data that we couldn’t do before, and allow officers to completely change how they spend their time.

“There are both opportunities and challenges, but the key message on both is that local governments should be ambitious in using this ‘AI moment’ to reimagine and redesign our ways of working to be better at delivering services now and in the future for our residents.”

As an initial step, local governments are advised to provide training and guidelines for staff. Some have begun to implement these steps, including US cities such as Boston, Seattle, and San Jose.

Nutt stressed that generative AI policies are useful but not a silver bullet for governance and that they will need to be revisited and updated regularly as technology and regulations evolve…(More)”.

How citywide data strategies can connect the dots, drive results


Blog by Bloomberg Cities Network: “Data is more central than ever to improving service delivery, managing performance, and identifying opportunities that better serve residents. That’s why a growing number of cities are adding a new tool to their arsenal—the citywide data strategy—to provide teams with a holistic view of data efforts and then lay out a roadmap for scaling successful approaches throughout city hall.

These comprehensive strategies are increasingly “critical to help mayors reach their visions,” according to Amy Edwards Holmes, executive director of The Bloomberg Center for Government Excellence at Johns Hopkins University, which is helping dozens of cities across the Americas up their data games as part of the Bloomberg Philanthropies City Data Alliance (CDA).

Bloomberg Cities spoke with experts in the field and leaders in pioneering cities to learn more about the importance of citywide data strategies and how they can help:

  • Turn “pockets of promise” into citywide strengths;
  • Build upon and consolidate other citywide strategic efforts; 
  • Improve performance management and service delivery;
  • Align staff data capabilities with city needs;
  • Drive lasting cultural change through leadership commitment…(More)”.

Creating Action with Data: Using Data to Increase Equity in Urban Development


Report by Justin Kollar, Niko McGlashan, and Sarah Williams: “The use of data in urban development is controversial because of the numerous examples showing its use to reinforce inequality rather than inclusion. From the development of Home Owners Loan Corporation (HOLC) maps, which excluded many minority communities from mortgages, to zoning laws used to reinforce structural racism, data has been used by those in power to elevate some while further marginalizing others. Yet data can achieve the opposite outcome by exposing inequity, encouraging dialogue and debate, making developers and cities more accountable, and ultimately creating new digital tools to make development processes more inclusive. Using data for action requires that we build teams to ask and answer the right questions, collect the right data, analyze the data ingeniously, ground-truth the results with communities, and share the insights with broader groups so they can take informed action. This paper looks at the development of two recent approaches in New York and Seattle to measure equity in urban development. We reflect on these approaches through the lens of data action principles (Williams 2020). Such reflections can highlight the challenges and opportunities for furthering the measurement and achievement of equitable development by other groups, such as real estate developers and community organizations, who seek to create positive social impact through their activities…(More)”.

EU leadership in trustworthy AI: Guardrails, Innovation & Governance


Article by Thierry Breton: “As mentioned in President von der Leyen’s State of the Union letter of intent, Europe should lead global efforts on artificial intelligence, guiding innovation, setting guardrails and developing global governance.

First, on innovation: we will launch the EU AI Start-Up Initiative, leveraging one of Europe’s biggest assets: its public high-performance computing infrastructure. We will identify the most promising European start-ups in AI and give them access to our supercomputing capacity.

I have said it before: AI is a combination of data, computing and algorithms. To train and finetune the most advanced foundation models, developers need large amounts of computing power.

Europe is a world leader in supercomputing through its European High-Performance Computing Joint Undertaking (EuroHPC). Soon, Europe will have its first exascale supercomputers, JUPITER in Germany and JULES VERNE in France (able to perform a quintillion -that means a billion billion- calculations per second), in addition to various existing supercomputers (such as LEONARDO in Italy and LUMI in Finland).

Access to Europe’s supercomputing infrastructure will help start-ups bring down the training time for their newest AI models from months or years to days or weeks. And it will help them lead the development and scale-up of AI responsibly and in line with European values.

This goes together with our broader efforts to support AI innovation across the value chain – from AI start-ups to all those businesses using AI technologies in their industrial ecosystems. This includes our Testing and Experimentation Facilities for AI (launched in January 2023), our Digital Innovation Hubs, the development of regulatory sandboxes under the AI Act, our support for the European Partnership on AI, Data and Robotics, and the cutting-edge research supported by Horizon Europe.

Second, guardrails for AI: Europe has pioneered clear rules for AI systems through the EU AI Act, the world’s first comprehensive regulatory framework for AI. My teams are working closely with the Parliament and Council to support the swift adoption of the EU AI Act. This will give citizens and businesses confidence in AI developed in Europe, knowing that it is safe and respects fundamental rights and European values. And it serves as an inspiration for global rules and principles for trustworthy AI.

As reiterated by President von der Leyen, we are developing an AI Pact that will convene AI companies, help them prepare for the implementation of the EU AI Act and encourage them to commit voluntarily to applying the principles of the Act before its date of applicability.

Third, governance: with the AI Act and the Coordinated Plan on AI, we are working towards a governance framework for AI, which can be a centre of expertise, in particular on large foundation models, and promote cooperation, not only between Member States, but also internationally…(More)”

AI often mangles African languages. Local scientists and volunteers are taking it back to school


Article by Sandeep Ravindran: “Imagine joyfully announcing to your Facebook friends that your wife gave birth, and having Facebook automatically translate your words to “my prostitute gave birth.” Shamsuddeen Hassan Muhammad, a computer science Ph.D. student at the University of Porto, says that’s what happened to a friend when Facebook’s English translation mangled the nativity news he shared in his native language, Hausa.

Such errors in artificial intelligence (AI) translation are common with African languages. AI may be increasingly ubiquitous, but if you’re from the Global South, it probably doesn’t speak your language.

That means Google Translate isn’t much help, and speech recognition tools such as Siri or Alexa can’t understand you. All of these services rely on a field of AI known as natural language processing (NLP), which allows AI to “understand” a language. The overwhelming majority of the world’s 7000 or so languages lack data, tools, or techniques for NLP, making them “low-resourced,” in contrast with a handful of “high-resourced” languages such as English, French, German, Spanish, and Chinese.

Hausa is the second most spoken African language, with an estimated 60 million to 80 million speakers, and it’s just one of more than 2000 African languages that are mostly absent from AI research and products. The few products available don’t work as well as those for English, notes Graham Neubig, an NLP researcher at Carnegie Mellon University. “It’s not the people who speak the languages making the technology.” More often the technology simply doesn’t exist. “For example, now you cannot talk to Siri in Hausa, because there is no data set to train Siri,” Muhammad says.

He is trying to fill that gap with a project he co-founded called HausaNLP, one of several launched within the past few years to develop AI tools for African languages…(More)”.

The Adoption and Implementation of Artificial Intelligence Chatbots in Public Organizations: Evidence from U.S. State Governments


Paper by Tzuhao Chen, Mila Gascó-Hernandez, and Marc Esteve: “Although the use of artificial intelligence (AI) chatbots in public organizations has increased in recent years, three crucial gaps remain unresolved. First, little empirical evidence has been produced to examine the deployment of chatbots in government contexts. Second, existing research does not distinguish clearly between the drivers of adoption and the determinants of success and, therefore, between the stages of adoption and implementation. Third, most current research does not use a multidimensional perspective to understand the adoption and implementation of AI in government organizations. Our study addresses these gaps by exploring the following question: what determinants facilitate or impede the adoption and implementation of chatbots in the public sector? We answer this question by analyzing 22 state agencies across the U.S.A. that use chatbots. Our analysis identifies ease of use and relative advantage of chatbots, leadership and innovative culture, external shock, and individual past experiences as the main drivers of the decisions to adopt chatbots. Further, it shows that different types of determinants (such as knowledge-base creation and maintenance, technology skills and system crashes, human and financial resources, cross-agency interaction and communication, confidentiality and safety rules and regulations, citizens’ expectations, and the COVID-19 crisis) impact differently the adoption and implementation processes and, therefore, determine the success of chatbots in a different manner. Future research could focus on the interaction among different types of determinants for both adoption and implementation, as well as on the role of specific stakeholders, such as IT vendors…(More)”.

Social approach to the transition to smart cities


Report by the European Parliamentary Research Services (EPRS): “This study explores the main impacts of the smart city transition on our cities and, in particular, on citizens and territories. In our research, we start from an analysis of smart city use cases to identify a set of key challenges, and elaborate on the main accelerating factors that may amplify or contain their impact on particular groups and territories. We then present an account of best practices that can help mitigate or prevent such challenges, and make some general observations on their scalability and replicability. Finally, based on an analysis of EU regulatory frameworks and a mapping of current or upcoming initiatives in the domain of smart city innovation, capacity-building and knowledge capitalisation, we propose six policy options to inform future policy-making at EU level to support a more inclusive smart city transition…(More)”.

Who Wrote This? How AI and the Lure of Efficiency Threaten Human Writing


Book by Naomi S. Baron: “Would you read this book if a computer wrote it? Would you even know? And why would it matter?

Today’s eerily impressive artificial intelligence writing tools present us with a crucial challenge: As writers, do we unthinkingly adopt AI’s time-saving advantages or do we stop to weigh what we gain and lose when heeding its siren call? To understand how AI is redefining what it means to write and think, linguist and educator Naomi S. Baron leads us on a journey connecting the dots between human literacy and today’s technology. From nineteenth-century lessons in composition, to mathematician Alan Turing’s work creating a machine for deciphering war-time messages, to contemporary engines like ChatGPT, Baron gives readers a spirited overview of the emergence of both literacy and AI, and a glimpse of their possible future. As the technology becomes increasingly sophisticated and fluent, it’s tempting to take the easy way out and let AI do the work for us. Baron cautions that such efficiency isn’t always in our interest. As AI plies us with suggestions or full-blown text, we risk losing not just our technical skills but the power of writing as a springboard for personal reflection and unique expression.

Funny, informed, and conversational, Who Wrote This? urges us as individuals and as communities to make conscious choices about the extent to which we collaborate with AI. The technology is here to stay. Baron shows us how to work with AI and how to spot where it risks diminishing the valuable cognitive and social benefits of being literate…(More)”.