Despite their superhuman performance at Go, top Go AIs harbor exploitable algorithmic flaws that allow even novice human players to defeat them using simple, unorthodox strategies.
Attempting to harden AI defenses: Researchers at MIT and FAR AI tested three methods to improve the “worst-case” performance of the open-source Go engine KataGo against adversarial attacks:
- Fine-tuning KataGo with more examples of the exploitative cyclic strategies initially showed promise, but a fine-tuned attacker quickly found a variation that reduced KataGo’s win rate to just 9%.
- An iterative “arms race” training approach, where defensive models tried to plug holes discovered by adversarial models, resulted in a final defensive model winning only 19% of games against a novel attacking variation (a toy sketch of this attack-patch loop follows this list).
- Using vision transformers to avoid potential biases in KataGo’s convolutional neural networks also failed, with the defensive model winning only 22% of games against a human-replicable cyclic attack.
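For a concrete picture of the arms-race pattern, here is a minimal toy sketch in Python. It is not the authors’ training pipeline: the agents are reduced to a single numeric “robustness” score, and every helper function and number is an invented stand-in, so only the overall control flow (train an attacker, evaluate the defender, patch, repeat) mirrors the approach described above.

```python
"""Toy sketch of an iterative "arms race" between an attacker and a defender.

Nothing here comes from the paper or the KataGo codebase: agents are a single
numeric "robustness" score and the helpers are invented stand-ins, so only the
attack -> evaluate -> patch -> repeat loop mirrors the described approach.
"""

import random


def train_attacker(defender_robustness: float) -> float:
    """Stand-in for training an adversarial policy: the exploit it finds is
    usually close to, and sometimes beyond, the defender's current robustness."""
    return defender_robustness + random.uniform(-0.05, 0.15)


def defender_win_rate(defender_robustness: float, exploit_strength: float) -> float:
    """Stand-in for evaluation games: the defender only wins reliably if its
    robustness already covers the newly discovered exploit (numbers arbitrary)."""
    return 0.95 if defender_robustness >= exploit_strength else 0.20


def patch_defender(defender_robustness: float, exploit_strength: float) -> float:
    """Stand-in for fine-tuning on games the attacker won: robustness rises to
    cover that specific exploit, but not exploits that have yet to be found."""
    return max(defender_robustness, exploit_strength)


def arms_race(rounds: int = 5) -> None:
    defender = 1.0
    for i in range(rounds):
        exploit = train_attacker(defender)           # attacker probes for a new hole
        rate = defender_win_rate(defender, exploit)  # evaluate before patching
        print(f"round {i}: defender win rate vs. new exploit = {rate:.0%}")
        defender = patch_defender(defender, exploit) # plug the hole found this round


if __name__ == "__main__":
    arms_race()
```

The point of the loop structure is the asymmetry the study observed: each patch only covers exploits already found, while the next attacker is free to search for one the defender has never seen.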
Implications for AI robustness: The research highlights the importance of evaluating “worst-case” performance in AI systems, even when “average-case” performance seems superhuman (a small example of the distinction follows the list below):
- The exploitable holes in KataGo demonstrate that otherwise “weak” adversaries can find vulnerabilities that cause the system to break down, despite its ability to dominate human players on average.
- This principle extends to other AI domains, such as large language models failing at simple math problems or visual AI struggling with basic geometric shapes, despite excelling at more complex tasks.
- Improving worst-case scenarios is crucial for avoiding embarrassing public mistakes when deploying AI systems.
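One way to make the average-case versus worst-case distinction concrete is to report the minimum win rate over a pool of opponents alongside the mean. The snippet below is a generic illustration, not the paper’s evaluation code; the opponent names and win rates are invented for the example.

```python
# Average-case vs. worst-case evaluation over a pool of opponents.
# Opponent names and win rates are invented for illustration only.
win_rates = {
    "human amateur": 1.00,
    "human professional": 0.98,
    "prior Go engine": 0.95,
    "cyclic adversary": 0.10,  # an adversarial policy targeting one blind spot
}

average_case = sum(win_rates.values()) / len(win_rates)
worst_case = min(win_rates.values())
worst_opponent = min(win_rates, key=win_rates.get)

print(f"average-case win rate: {average_case:.0%}")
print(f"worst-case win rate:   {worst_case:.0%} (vs. {worst_opponent})")
# A system can look superhuman on average while still losing badly to the
# one opponent that targets its specific weakness.
```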
Challenges and future directions: The study suggests that determined adversaries can discover new vulnerabilities in AI algorithms faster than the algorithms can evolve to fix them, a problem likely to be worse in domains less controlled than Go:
- While the researchers’ methods could not prevent new attacks, they successfully defended against previously identified exploits, indicating that training against a sufficiently large corpus of attacks might eventually lead to full defense.
- However, the difficulty of solving this issue in a simple domain like Go suggests that patching similar vulnerabilities in more complex AI systems, such as ChatGPT jailbreaks, may be even more challenging in the near term.
- The researchers propose future work on making AI systems more robust to worst-case scenarios, arguing this could be as valuable as pursuing new superhuman capabilities.
Analyzing deeper: This research underscores the ongoing challenges in creating truly robust AI systems that are resilient against adversarial attacks and worst-case scenarios. As AI continues to advance and be deployed in increasingly critical domains, addressing these vulnerabilities will be essential to ensure the reliability and safety of AI-powered systems. The findings also raise questions about the extent to which AI can be fully hardened against exploitation, given the speed at which adversaries can discover new holes compared to the pace of defensive adaptation. Striking a balance between pushing the boundaries of AI capabilities and ensuring robust performance in all scenarios will likely remain a key focus for researchers and developers in the field.
“Superhuman” Go AIs still have trouble defending against these simple exploits