The large language models (LLMs) powering chatbots like ChatGPT are capable of impressive feats, but still routinely produce errors and can be made to behave in undesirable or harmful ways. Making these AI systems safe and robust to misuse is a critical challenge as they become more widely deployed.
Vulnerabilities rooted in fundamental properties of LLMs: Many of the problems with LLMs stem from how they work – by predicting the most likely next word based on statistical patterns in their training data:
- Performance can fluctuate wildly depending on how common the output, task, or input text is in the web data used for training, leading to puzzling failures on tasks humans find simple (a toy illustration follows this list).
- Determined users can exploit these properties to create “adversarial examples” – prompts carefully crafted to make models ignore their safety training and produce harmful outputs.
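To make the "next-word prediction" point concrete, here is a deliberately tiny, hypothetical sketch: a bigram counter rather than a real LLM, but it shows why predictions are only as good as the statistics available for a given context, and why behavior on rare inputs can be puzzling.

```python
from collections import Counter, defaultdict

# Tiny "training corpus" standing in for internet-scale text.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the statistically most likely next word, if any was seen."""
    counts = next_word_counts.get(word)
    if not counts:
        # No statistics for this context: behavior is essentially arbitrary.
        return "<no idea>"
    return counts.most_common(1)[0][0]

print(predict_next("the"))   # "cat" -- common pattern, confident prediction
print(predict_next("fish"))  # "<no idea>" -- rare context, no signal to draw on
```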
Alignment techniques aim to reduce undesirable behaviors: LLM developers use various approaches to bring the models’ behaviors more in line with human values:
- Reinforcement learning from human feedback (RLHF) fine-tunes models by encouraging good responses and penalizing bad ones according to human preferences (a simplified sketch of the underlying preference signal appears after this list).
- Additional “guardrail” systems can be added to block harmful outputs that bypass initial safety training.
- However, making models both safe and useful is challenging, and safety measures are still routinely bypassed by determined attackers in a cat-and-mouse game.
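As a rough illustration of the preference signal behind RLHF, the sketch below implements a standard pairwise (Bradley-Terry style) loss of the kind used to train reward models. The function name and reward values are illustrative, not taken from any particular system.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it gets them backwards.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(preference_loss(2.0, -1.0), 3))  # ~0.049: preferred answer scored higher
print(round(preference_loss(-1.0, 2.0), 3))  # ~3.049: preferences violated, high loss
```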
Debate over scaling as a solution: There is disagreement about whether simply making LLMs larger and training them on more data will solve these issues:
- Some argue that with enough scaling, failures will become so infrequent as to be a non-issue.
- Others believe vulnerabilities are inherent to how LLMs work, and scaling could make models better at bypassing the very safety measures meant to constrain them.
Chaperone models may be needed for real-world use: Given that misuse seems impossible to entirely prevent, many believe LLMs should not be deployed without additional protective systems:
- Combinations of rule-based algorithms and task-specific machine learning models could form a "sieve" to catch different failure modes (see the sketch after this list).
- While still not foolproof, such systems are harder to deceive than any single model alone.
- An external system of verification and validation could be required to demonstrate LLMs’ safety before real-world use.
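A minimal sketch of the "sieve" idea, with hypothetical rules and a placeholder classifier: each layer screens a candidate output independently, and the output is only released if every layer approves. No single check is foolproof, but deceiving all of them at once is harder than deceiving one.

```python
import re
from typing import Callable, List

# Layer 1: cheap rule-based screening (illustrative patterns only).
BLOCKED_PATTERNS = [r"\bhow to make a weapon\b", r"\bstolen credit card\b"]

def rule_based_filter(text: str) -> bool:
    return not any(re.search(p, text, re.IGNORECASE) for p in BLOCKED_PATTERNS)

# Layer 2: stand-in for a task-specific ML classifier.
def toxicity_check(text: str) -> bool:
    # A real deployment would call a trained model here; this placeholder
    # simply treats all-caps shouting as a proxy for a toxicity score.
    return not text.isupper()

def passes_sieve(text: str, layers: List[Callable[[str], bool]]) -> bool:
    """Release the output only if every layer in the sieve approves it."""
    return all(layer(text) for layer in layers)

candidate = "Here is a summary of today's weather forecast."
print(passes_sieve(candidate, [rule_based_filter, toxicity_check]))  # True
```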
Broader context and implications: As impressive as recent advances in AI have been, models like ChatGPT still have major shortcomings when it comes to reliability and robustness. While techniques like RLHF show promise in aligning AI systems with human values, vulnerabilities rooted in the fundamental nature of language models mean that misuse remains an ever-present danger.
As these models become more ubiquitous, it’s critical that the public not be endangered by AI whose safety has been taken for granted. Just as drugs must demonstrate their safety and efficacy through clinical trials before deployment, companies developing and deploying AI systems may need to be held to similar standards through a rigorous system of external validation. While building perfectly safe AI may never be possible, the technologies exist to substantially reduce risks – but only if developers are required to use them. With the right combination of technical innovation and sensible governance, the worst outcomes are not inevitable.