Researchers warn that individuals could disrupt the accuracy of AI chatbots by deliberately contaminating the datasets these systems rely on, all at minimal cost.
As it stands, AI chatbots already exhibit biases and deficiencies attributable to the flawed data on which they are trained. The researchers’ investigation, described on Business Insider, revealed that malicious actors could deliberately introduce “poisoned” data into these datasets, using methods that are relatively inexpensive and require little technical expertise.
A recent study by AI researchers found that, with as little as $60, individuals could manipulate the datasets used to train generative AI tools such as ChatGPT, data these systems depend on to produce accurate responses.
These AI systems, whether chatbots or image generators, draw on vast amounts of data scraped from across the internet to generate sophisticated responses and images.
Florian Tramèr, an associate professor of computer science at ETH Zurich, noted that this approach is what makes chatbots so capable. However, he also underscored the inherent risk of training AI tools on data that may be inaccurate.
This reliance on potentially flawed data sources contributes to the prevalence of biases and inaccuracies in AI chatbots. Given the abundance of misinformation on the internet, these systems are susceptible to incorporating erroneous information into their responses, further undermining their reliability and trustworthiness.
Through their investigation, researchers discovered that even a “low-resourced attacker,” armed with modest financial resources and sufficient technical expertise, could manipulate a relatively small portion of data to substantially influence the behavior of a large language model, causing it to produce inaccurate responses.
Examining two distinct attack methods, Tramèr and his colleagues explored the potential of poisoning data through the acquisition of expired domains and manipulation of Wikipedia content.
For instance, one avenue for hackers to poison the data involves purchasing expired domains, which can be obtained for as little as $10 annually for each URL, and then disseminating any desired information on these websites.
According to Tramèr’s paper, an attacker could effectively control and contaminate at least 0.01% of a dataset by investing as little as $60 in purchasing domains. This equates to potentially influencing tens of thousands of images within the dataset.
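To put those numbers in perspective, here is a rough back-of-the-envelope sketch of the arithmetic; the dataset size used below is an assumed figure for illustration, not one taken from the study.

```python
# Illustrative arithmetic based on the figures cited in the article.
# The dataset size is an assumption, not a number from the study.
DOMAIN_COST_PER_YEAR = 10      # USD per expired domain, as cited
BUDGET = 60                    # USD, the attacker's total spend
DATASET_SIZE = 400_000_000     # assumed image-URL count for a web-scale dataset
POISONED_FRACTION = 0.0001     # 0.01% of the dataset, as cited

domains_purchased = BUDGET // DOMAIN_COST_PER_YEAR
poisoned_examples = int(DATASET_SIZE * POISONED_FRACTION)

print(f"{domains_purchased} domains bought for ${BUDGET}")
print(f"~{poisoned_examples:,} poisoned examples (0.01% of the dataset)")
# -> 6 domains bought for $60
# -> ~40,000 poisoned examples (0.01% of the dataset)
```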
The team also explored an alternative attack strategy, focusing on the manipulation of data within Wikipedia. Given that Wikipedia serves as a “crucial component of the training datasets” for language models, Tramèr emphasized its significance in this context.
According to the paper, Wikipedia prohibits direct scraping of its content and instead offers periodic “snapshots” of its pages for download. These snapshots are taken at regular intervals that are publicly advertised on Wikipedia’s website, making their timing predictable.
Tramèr’s team outlined a relatively straightforward attack based on strategically timed edits to Wikipedia pages. Because the snapshot schedule is predictable, a malicious actor could time edits so that the altered pages are captured in a new snapshot before moderators have a chance to revert the changes.
This method allows for the surreptitious insertion of manipulated information into Wikipedia pages, potentially influencing the content used to train language models without raising immediate suspicion.
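To make the timing logic concrete, the following is a minimal conceptual sketch of how such an edit could be scheduled; the snapshot interval, moderator response time, and function names are hypothetical and not drawn from the researchers’ code.

```python
from datetime import datetime, timedelta

# Conceptual illustration of the timing logic described above; the schedule,
# revert delay, and function names are hypothetical assumptions.

def next_snapshot_time(now: datetime, interval: timedelta, anchor: datetime) -> datetime:
    """Return the next publicly advertised snapshot time after `now`."""
    periods = (now - anchor) // interval + 1
    return anchor + periods * interval

def plan_edit(now: datetime, interval: timedelta, anchor: datetime,
              typical_revert_delay: timedelta) -> datetime:
    """Pick an edit time close enough to the snapshot that moderators are
    unlikely to revert the change before the page is captured."""
    snapshot = next_snapshot_time(now, interval, anchor)
    return snapshot - typical_revert_delay / 2  # edit inside the revert window

if __name__ == "__main__":
    anchor = datetime(2024, 1, 1)            # assumed schedule anchor
    interval = timedelta(days=14)            # assumed snapshot interval
    revert_delay = timedelta(hours=2)        # assumed median moderator response
    edit_at = plan_edit(datetime(2024, 3, 10), interval, anchor, revert_delay)
    print("Edit shortly before the snapshot, at:", edit_at)
```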
Tramèr suggests that at least 5% of edits orchestrated by an attacker would make it into the snapshots, and he said the actual success rate would likely be higher.
Following their analysis, Tramèr’s team shared their findings with Wikipedia and proposed measures to enhance security, such as introducing randomness into the timing of webpage snapshots to remove the predictability that attackers could exploit.
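A minimal sketch of what that randomization could look like, assuming an illustrative jitter window (the values and function name are hypothetical):

```python
import random
from datetime import datetime, timedelta

# Sketch of the proposed mitigation: shift snapshot times by an unpredictable
# offset so attackers cannot know exactly when a page will be captured.
def randomized_snapshot_time(scheduled: datetime,
                             max_jitter: timedelta = timedelta(hours=24)) -> datetime:
    jitter_seconds = random.uniform(0, max_jitter.total_seconds())
    return scheduled + timedelta(seconds=jitter_seconds)

print(randomized_snapshot_time(datetime(2024, 3, 15)))
```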
Written by Alius Noreika