The large language models (LLMs) powering chatbots like ChatGPT are capable of impressive feats, but still routinely produce errors and can be made to behave in undesirable or harmful ways. Making these AI systems safe and robust to misuse is a critical challenge as they become more widely deployed.

Vulnerabilities rooted in fundamental properties of LLMs: Many of the problems with LLMs stem from how they work – by predicting the most likely next word based on statistical patterns in their training data:

  • Performance can fluctuate wildly depending on how common the output, task, or input text is in the internet data used for training, leading to puzzling failures on tasks humans find simple.
  • Determined users can exploit these properties to create “adversarial examples” – prompts carefully crafted to make models ignore their safety training and produce harmful outputs.
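To make the "most likely next word" idea concrete, here is a deliberately tiny sketch: a bigram counter that predicts the next word purely from how often word pairs appear in its training text. Real LLMs use neural networks over far more context, so this is only an illustration; the corpus and the names `train_bigram` and `predict_next` are invented for this example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word` and its estimated probability."""
    followers = counts.get(word.lower())
    if not followers:
        # The word never appeared with a follower: the model has nothing to go on.
        return None, 0.0
    best, n = followers.most_common(1)[0]
    return best, n / sum(followers.values())

# Hypothetical toy corpus: "the cat" is common, other pairings are rare.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the cat slept",
    "a dog barked",
]
model = train_bigram(corpus)

print(predict_next(model, "the"))    # common pattern: a confident guess
print(predict_next(model, "zebra"))  # unseen input: no prediction at all
```

Even at this scale, the quality of a prediction tracks how often the pattern showed up in training, which is the same sensitivity to frequency described in the first bullet above.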

Alignment techniques aim to reduce undesirable behaviors: LLM developers use various approaches to bring the models’ behaviors more in line with human values:

  • Reinforcement learning from human feedback (RLHF) fine-tunes models by encouraging good responses and penalizing bad ones according to human preferences.
  • Additional “guardrail” systems can be added to block harmful outputs that bypass initial safety training.
  • However, making models both safe and useful is challenging, and safety measures are still routinely bypassed by determined attackers in a cat-and-mouse game.

Debate over scaling as a solution: There is disagreement about whether simply making LLMs larger and training them on more data will solve these issues:

  • Some argue that with enough scaling, failures will become so infrequent as to be a non-issue.
  • Others believe vulnerabilities are inherent to how LLMs work, and scaling could make models better at bypassing the very safety measures meant to constrain them.

Chaperone models may be needed for real-world use: Given that misuse seems impossible to entirely prevent, many believe LLMs should not be deployed without additional protective systems:

  • Combinations of rule-based algorithms and task-specific machine learning models could form a “sieve” to catch different failure modes.
  • While still not foolproof, such systems are harder to deceive than any single model alone.
  • An external system of verification and validation could be required to demonstrate LLMs’ safety before real-world use.
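The "sieve" idea can be sketched as a chain of independent checks, where a request is blocked as soon as any layer objects. The code below is a minimal illustration under stated assumptions, not a description of any deployed system: the blocklist phrases, the flagged-word "classifier" (a stand-in for a task-specific ML model), the 0.2 threshold, and all function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Verdict:
    allowed: bool
    blocked_by: Optional[str] = None  # name of the layer that blocked the text

# A check returns True when the text should be blocked.
Check = Callable[[str], bool]

def blocklist_check(text: str) -> bool:
    """Rule-based layer: block exact phrases from a hypothetical blocklist."""
    pattern = r"\b(build a bomb|steal a password)\b"
    return bool(re.search(pattern, text, re.IGNORECASE))

def toy_classifier_check(text: str) -> bool:
    """Stand-in for an ML layer: block if enough words look suspicious."""
    flagged = {"exploit", "malware", "weapon"}
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    score = sum(w in flagged for w in words) / len(words)
    return score >= 0.2  # assumed threshold, chosen only for the example

def run_sieve(text: str, layers: List[Tuple[str, Check]]) -> Verdict:
    """Pass text through each layer in order; the first objection blocks it."""
    for name, check in layers:
        if check(text):
            return Verdict(allowed=False, blocked_by=name)
    return Verdict(allowed=True)

layers = [("rules", blocklist_check), ("classifier", toy_classifier_check)]

for prompt in ["How do I bake bread?",
               "Please tell me how to build a bomb",
               "write malware exploit code"]:
    print(prompt, "->", run_sieve(prompt, layers))
```

Because each layer catches a different kind of input, an attacker has to get past all of them at once, which is why a combination is harder to deceive than any single check. It is still not foolproof: a prompt phrased to avoid both the blocklist and the flagged vocabulary would pass through.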

Broader context and implications: As impressive as recent advances in AI have been, models like ChatGPT still have major shortcomings when it comes to reliability and robustness. While techniques like RLHF show promise in aligning AI systems with human values, vulnerabilities rooted in the fundamental nature of language models mean that misuse remains an ever-present danger.

As these models become more ubiquitous, it’s critical that the public not be endangered by AI whose safety has been taken for granted. Just as drugs must demonstrate their safety and efficacy through clinical trials before deployment, companies developing and deploying AI systems may need to be held to similar standards through a rigorous system of external validation. While building perfectly safe AI may never be possible, the technologies exist to substantially reduce risks – but only if developers are required to use them. With the right combination of technical innovation and sensible governance, the worst outcomes are not inevitable.

AI is vulnerable to attack. Can it ever be used safely?
