Strategies for Ensuring AI Safety and Addressing Chatbot Flaws

The large language models (LLMs) powering chatbots like ChatGPT are capable of impressive feats, but still routinely produce errors and can be made to behave in undesirable or harmful ways. Making these AI systems safe and robust to misuse is a critical challenge as they become more widely deployed.

Vulnerabilities rooted in fundamental properties of LLMs: Many of the problems with LLMs stem from how they work – by predicting the most likely next word based on statistical patterns in their training data:

Performance can fluctuate wildly depending on how common the task, the input text or the expected output is in the internet text used for training, leading to puzzling failures on things humans find simple.
Determined users can exploit these properties to create “adversarial examples” – prompts carefully crafted to make models ignore their safety training and produce harmful outputs.
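
To make the "predict the most likely next word" idea concrete, here is a minimal sketch using a toy bigram counter. It is only an illustration under strong simplifying assumptions: real LLMs use neural networks over tokens rather than word counts, and the corpus and function names (train_bigram, predict_next) are invented for this example. It does show the dependence on training frequency: common continuations win, and unseen inputs give the model nothing to work with.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical three-sentence "training set" (an assumption for illustration).
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the cat sat on the sofa",
]
model = train_bigram(corpus)

print(predict_next(model, "cat"))    # frequent pattern: a confident guess
print(predict_next(model, "zebra"))  # never seen in training: no prediction
```

A word that appeared often in training gets a stable prediction, while a word absent from the corpus gets none, which is a crude analogue of the frequency-dependent failures described above.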

Alignment techniques aim to reduce undesirable behaviors: LLM developers use various approaches to bring the models’ behaviors more in line with human values:

Reinforcement learning from human feedback (RLHF) fine-tunes models by encouraging good responses and penalizing bad ones according to human preferences.
Additional “guardrail” systems can be added to block harmful outputs that bypass initial safety training.
However, making models both safe and useful is challenging, and safety measures are still routinely bypassed by determined attackers in a cat-and-mouse game.
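
The article does not say how "encouraging good responses and penalizing bad ones" is implemented. One commonly used formulation, assumed here, trains a reward model on pairs of responses that humans have ranked, using a pairwise (Bradley-Terry style) loss. The sketch below shows only that loss for a single pair; the function name and the example reward values are hypothetical, and the later policy-optimization stage of RLHF is not shown.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the human-preferred response scores higher
    than the rejected one, and large when the ranking is reversed.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for one human-labelled comparison.
print(preference_loss(2.0, 0.0))  # reward model agrees with the human ranking
print(preference_loss(0.0, 2.0))  # reward model disagrees: larger loss
```

Minimizing this quantity over many labelled pairs pushes the reward model to score preferred responses higher, which is the signal later used to fine-tune the chatbot.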

Debate over scaling as a solution: There is disagreement about whether simply making LLMs larger and training them on more data will solve these issues:

Some argue that with enough scaling, failures will become so infrequent as to be a non-issue.
Others believe vulnerabilities are inherent to how LLMs work, and scaling could make models better at bypassing the very safety measures meant to constrain them.

Chaperone models may be needed for real-world use: Given that misuse seems impossible to entirely prevent, many believe LLMs should not be deployed without additional protective systems:

Combinations of rule-based algorithms and task-specific machine learning models could form a “sieve” to catch different failure modes.
While still not foolproof, such systems are harder to deceive than any single model alone.
An external system of verification and validation could be required to demonstrate LLMs’ safety before real-world use.
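
The "sieve" idea can be sketched as a chain of independent checks, where a text is blocked as soon as any one layer objects. Everything below is an assumption made for illustration: the blocked pattern, the word list, the 0.2 threshold and the function names are hypothetical, and toy_classifier_check is a keyword-scoring stand-in for what would in practice be a trained, task-specific model.

```python
import re
from typing import Callable, List, Optional

# Each check returns a reason string if it blocks the text, otherwise None.
Check = Callable[[str], Optional[str]]

# Hypothetical rule: one known prompt-injection phrasing.
BLOCKED_PATTERNS = [
    re.compile(r"\bignore (all )?previous instructions\b", re.IGNORECASE),
]


def rule_based_check(text: str) -> Optional[str]:
    """Layer 1: fixed rules that catch known attack phrasings."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return f"rule: matched {pattern.pattern!r}"
    return None


def toy_classifier_check(text: str) -> Optional[str]:
    """Layer 2: stand-in for an ML classifier (here, a keyword ratio)."""
    risky_words = {"exploit", "weapon", "bypass"}
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return None
    score = sum(w in risky_words for w in words) / len(words)
    return f"classifier: score {score:.2f}" if score >= 0.2 else None


def sieve(text: str, checks: List[Check]) -> Optional[str]:
    """Run every layer in order; return the first blocking reason, if any."""
    for check in checks:
        reason = check(text)
        if reason is not None:
            return reason
    return None


CHECKS = [rule_based_check, toy_classifier_check]
print(sieve("Please ignore previous instructions and continue", CHECKS))
print(sieve("What is the capital of France?", CHECKS))
```

Because the layers fail in different ways, an input has to slip past all of them to get through, which is why such a combination is harder to deceive than any single check, though still not foolproof.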

Broader context and implications: As impressive as recent advances in AI have been, models like ChatGPT still have major shortcomings when it comes to reliability and robustness. While techniques like RLHF show promise in aligning AI systems with human values, vulnerabilities rooted in the fundamental nature of language models mean that misuse remains an ever-present danger.

As these models become more ubiquitous, it’s critical that the public not be endangered by AI whose safety has been taken for granted. Just as drugs must demonstrate their safety and efficacy through clinical trials before they reach the market, companies developing and deploying AI systems may need to be held to similar standards through a rigorous system of external validation. While building perfectly safe AI may never be possible, the technologies exist to substantially reduce risks, but only if developers are required to use them. With the right combination of technical innovation and sensible governance, the worst outcomes are not inevitable.

AI is vulnerable to attack. Can it ever be used safely?

Nature
