DC launched an AI tool for navigating the city’s open data


Article by Kaela Roeder: “In a move echoing local governments’ increasing attention toward generative artificial intelligence across the country, the nation’s capital now aims to make navigating its open data easier through a new public beta pilot.

DC Compass, launched in March, uses generative AI to answer user questions and create maps from open data sets, ranging from the district’s population to what different trees are planted in the city. The Office of the Chief Technology Officer (OCTO) partnered with the geographic information system (GIS) technology company Esri, which has an office in Vienna, Virginia, to create the new tool.

This debut follows Mayor Muriel Bowser’s signing of DC’s AI Values and Strategic Plan in February. The order requires agencies to assess if using AI is in alignment with the values it sets forth, including that there’s a clear benefit to people; a plan for “meaningful accountability” for the tool; and transparency, sustainability, privacy and equity at the forefront of deployment.

These values are key when launching something like DC Compass, said Michael Rupert, the interim chief technology officer for digital services at the Office of the Chief Technology Officer.

“The way Mayor Bowser rolled out the mayor’s order and this value statement, I think gives residents and businesses a little more comfort that we aren’t just writing a check and seeing what happens,” Rupert said. “That we’re actually methodically going about it in a responsible way, both morally and fiscally.”…(More)”.


DC COMPASS IN ACTION. (SCREENSHOT/COURTESY OCTO)

The Potential of Artificial Intelligence for the SDGs and Official Statistics


Report by PARIS21: “Artificial Intelligence (AI) and its impact on people’s lives is growing rapidly. AI is already leading to significant developments from healthcare to education, which can contribute to the efficient monitoring and achievement of the Sustainable Development Goals (SDGs), a call to action to address the world’s greatest challenges. AI is also raising concerns because, if not addressed carefully, its risks may outweigh its benefits. As a result, AI is garnering increasing attention from National Statistical Offices (NSOs) and the official statistics community as they are challenged to produce more comprehensive, timely, and high-quality data for decision-making with limited resources in a rapidly changing world of data and technologies and in light of complex and converging global issues from pandemics to climate change. This paper has been prepared as an input to the “Data and AI for Sustainable Development: Building a Smarter Future” Conference, organized in partnership with The Partnership in Statistics for Development in the 21st Century (PARIS21), the World Bank and the International Monetary Fund (IMF). Building on case studies that examine the use of AI by NSOs, the paper presents the benefits and risks of AI with a focus on NSO operations related to sustainable development. The objective is to spark discussions and to initiate a dialogue around how AI can be leveraged to inform decisions and take action to better monitor and achieve sustainable development, while mitigating its risks…(More)”.

Counting Feminicide: Data Feminism in Action


Book by Catherine D’Ignazio: “What isn’t counted doesn’t count. And mainstream institutions systematically fail to account for feminicide, the gender-related killing of women and girls, including cisgender and transgender women. Against this failure, Counting Feminicide brings to the fore the work of data activists across the Americas who are documenting such murders—and challenging the reigning logic of data science by centering care, memory, and justice in their work. Drawing on Data Against Feminicide, a large-scale collaborative research project, Catherine D’Ignazio describes the creative, intellectual, and emotional labor of feminicide data activists who are at the forefront of a data ethics that rigorously and consistently takes power and people into account.

Individuals, researchers, and journalists—these data activists scour news sources to assemble spreadsheets and databases of women killed by gender-related violence, then circulate those data in a variety of creative and political forms. Their work reveals the potential of restorative/transformative data science—the use of systematic information to, first, heal communities from the violence and trauma produced by structural inequality and, second, envision and work toward the world in which such violence has been eliminated. Specifically, D’Ignazio explores the possibilities and limitations of counting and quantification—reducing complex social phenomena to convenient, sortable, aggregable forms—when the goal is nothing short of the elimination of gender-related violence.

Counting Feminicide showcases the incredible power of data feminism in practice, in which each murdered woman or girl counts, and, in being counted, joins a collective demand for the restoration of rights and a transformation of the gendered order of the world…(More)”.

Generative AI in Journalism


Report by Nicholas Diakopoulos et al: “The introduction of ChatGPT by OpenAI in late 2022 captured the imagination of the public—and the news industry—with the potential of generative AI to upend how people create and consume media. Generative AI is a type of artificial intelligence technology that can create new content, such as text, images, audio, video, or other media, based on the data it has been trained on and according to written prompts provided by users. ChatGPT is the chat-based user interface that made the power and potential of generative AI salient to a wide audience, reaching 100 million users within two months of its launch.

Although similar technology had been around, by late 2022 it was suddenly working, spurring its integration into various products and presenting not only a host of opportunities for productivity and new experiences but also some serious concerns about accuracy, provenance and attribution of source information, and the increased potential for creating misinformation.

This report serves as a snapshot of how the news industry has grappled with the initial promises and challenges of generative AI towards the end of 2023. The sample of participants reflects how some of the more savvy and experienced members of the profession are reacting to the technology.

Based on participants’ responses, the researchers found that generative AI is already changing work structure and organization, even as it triggers ethical concerns around its use. Here are some key takeaways:

  • Applications in News Production. The most predominant current use cases for generative AI include various forms of textual content production, information gathering and sensemaking, multimedia content production, and business uses.
  • Changing Work Structure and Organization. There are a host of new roles emerging to grapple with the changes introduced by generative AI including for leadership, editorial, product, legal, and engineering positions.
  • Work Redesign. There is an unmet opportunity to design new interfaces to support journalistic work with generative AI, in particular to enable the human oversight needed for the efficient and confident checking and verification of outputs…(More)”.

How Copyright May Destroy Our Access To The World’s Academic Knowledge


Article by Glyn Moody: “The shift from analogue to digital has had a massive impact on most aspects of life. One area where that shift has the potential for huge benefits is in the world of academic publishing. Academic papers are costly to publish and distribute on paper, but in a digital format they can be shared globally for almost no cost. That’s one of the driving forces behind the open access movement. But as Walled Culture has reported, resistance from the traditional publishing world has slowed the shift to open access, and undercut the benefits that could flow from it.

That in itself is bad news, but new research from Martin Paul Eve (available as open access) shows that the way the shift to digital has been managed by publishers brings with it a new problem. For all their flaws, analogue publications have the great virtue that they are durable: once a library has a copy, it is likely to be available for decades, if not centuries. Digital scholarly articles come with no such guarantee. The Internet is constantly in flux, with many publishers and sites closing down each year, often without notice. That’s a problem when sites holding archival copies of scholarly articles vanish, making it harder, perhaps impossible, to access important papers. Eve explored whether publishers were placing copies of the articles they published in key archives. Ideally, digital papers would be available in multiple archives to ensure resilience, but the reality is that very few publishers did this. Ars Technica has a good summary of Eve’s results:

When Eve broke down the results by publisher, less than 1 percent of the 204 publishers had put the majority of their content into multiple archives. (The cutoff was 75 percent of their content in three or more archives.) Fewer than 10 percent had put more than half their content in at least two archives. And a full third seemed to be doing no organized archiving at all.

At the individual publication level, under 60 percent were present in at least one archive, and over a quarter didn’t appear to be in any of the archives at all. (Another 14 percent were published too recently to have been archived or had incomplete records.)…(More)”.

The Unintended Consequences of Data Standardization


Article by Cathleen Clerkin: “The benefits of data standardization within the social sector—and indeed just about any industry—are multiple, important, and undeniable. Access to the same type of data over time lends the ability to track progress and increase accountability. For example, over the last 20 years, my organization, Candid, has tracked grantmaking by the largest foundations to assess changes in giving trends. The data allowed us to demonstrate philanthropy’s disinvestment in historically Black colleges and universities. Data standardization also creates opportunities for benchmarking—allowing individuals and organizations to assess how they stack up to their colleagues and competitors. Moreover, large amounts of standardized data can help predict trends in the sector. Finally—and perhaps most importantly to the social sector—data standardization invariably reduces the significant reporting burdens placed on nonprofits.

Yet, for all of its benefits, data is too often proposed as a universal cure that will allow us to unequivocally determine the success of social change programs and processes. The reality is far more complex and nuanced. Left unchecked, the unintended consequences of data standardization pose significant risks to achieving a more effective, efficient, and equitable social sector…(More)”.

Data Authenticity, Consent, and Provenance for AI Are All Broken: What Will It Take to Fix Them?


Article by Shayne Longpre et al: “New AI capabilities are owed in large part to massive, widely sourced, and underdocumented training data collections. Dubious collection practices have spurred crises in data transparency, authenticity, consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy AI systems. In response, AI regulation is emphasizing the need for training data transparency to understand AI model limitations. Based on a large-scale analysis of the AI training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible AI development practices. We explain why existing tools for data authenticity, consent, and documentation alone are unable to solve the core problems facing the AI community, and outline how policymakers, developers, and data creators can facilitate responsible AI development, through universal data provenance standards…(More)”.

Why data about people are so hard to govern


Paper by Wendy H. Wong, Jamie Duncan, and David A. Lake: “How data on individuals are gathered, analyzed, and stored remains largely ungoverned at both domestic and global levels. We address the unique governance problem posed by digital data to provide a framework for understanding why data governance remains elusive. Data are easily transferable and replicable, making them a useful tool. But this characteristic creates massive governance problems for all of us who want to have some agency and choice over how (or if) our data are collected and used. Moreover, data are co-created: individuals are the object from which data are culled by an interested party. Yet, any data point has a marginal value of close to zero and thus individuals have little bargaining power when it comes to negotiating with data collectors. Relatedly, data follow the rule of winner take all—the parties that have the most can leverage that data for greater accuracy and utility, leading to natural oligopolies. Finally, data’s value lies in combination with proprietary algorithms that analyze and predict the patterns. Given these characteristics, private governance solutions are ineffective. Public solutions will also likely be insufficient. The imbalance in market power between platforms that collect data and individuals will be reproduced in the political sphere. We conclude that some form of collective data governance is required. We examine the challenges to data governance by looking at a public effort, the EU’s General Data Protection Regulation; a private effort, Apple’s “privacy nutrition labels” in its App Store; and a collective effort, the First Nations Information Governance Centre in Canada…(More)”

Creating an Integrated System of Data and Statistics on Household Income, Consumption, and Wealth: Time to Build


Report by the National Academies: “Many federal agencies provide data and statistics on inequality and related aspects of household income, consumption, and wealth (ICW). However, because the information provided by these agencies is often produced using different concepts, underlying data, and methods, the resulting estimates of poverty, inequality, mean and median household income, consumption, and wealth, as well as other statistics, do not always tell a consistent or easily interpretable story. Measures also differ in their accuracy, timeliness, and relevance, so that it is difficult to address such questions as the effects of the Great Recession on household finances or of the Covid-19 pandemic and the ensuing relief efforts on household income and consumption. The presence of multiple, sometimes conflicting statistics at best muddies the waters of policy debates and, at worst, enables advocates with different policy perspectives to cherry-pick their preferred set of estimates. Achieving an integrated system of relevant, high-quality, and transparent household ICW data and statistics should go far to reduce disagreement about who has how much, and from what sources. Further, such data are essential to advance research on economic wellbeing and to ensure that policies are well targeted to achieve societal goals…(More)”.

Objectivity vs affect: how competing forms of legitimacy can polarize public debate in data-driven public consultation


Paper by Alison Powell: “How do data and objectivity become politicized? How do processes intended to include citizen voices instead push them into social media that intensify negative expression? This paper examines the possibility and limits of ‘agonistic data practices’ (Crooks & Currie, 2021) examining how data-driven consultation practices create competing forms of legitimacy for quantifiable knowledge and affective lived experience. Drawing on a two-year study of a private Facebook group self-presenting as a supportive space for working-class people critical of the development of ‘low-traffic neighbourhoods’ (LTNs), the paper reveals how the dynamics of ‘affective polarization’ associated the use of data with elite and exclusionary politics. Participants addressed this by framing their online contributions as ‘vernacular data’ and also by associating numerical data with exclusion and inequality. Over time the strong statements of feeling began to support content of a conspiratorial nature, reflected at the social level of discourse in the broader media environment where stories of strong feeling gain legitimacy in right-wing sources. The paper concludes that ideologies of dataism and practices of datafication may create conditions for political extremism to develop when the potential conditions of ‘agonistic data practices’ are not met, and that consultation processes must avoid overly valorizing data and calculable knowledge if they wish to retain democratic accountability…(More)”.