Anthropic unveils next-generation AI models and groundbreaking computer use capability: Anthropic has announced significant upgrades to its AI models, including an enhanced Claude 3.5 Sonnet and a new Claude 3.5 Haiku, along with a revolutionary computer use feature in public beta.
Upgraded Claude 3.5 Sonnet: A leap in AI-powered coding: The new version of Claude 3.5 Sonnet demonstrates substantial improvements across various benchmarks, with particular emphasis on coding and tool use tasks.
- Performance on SWE-bench Verified increased from 33.4% to 49.0%, surpassing all publicly available models, including specialized systems for agentic coding.
- TAU-bench scores improved from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the more challenging airline domain.
- These advancements come at no additional cost or speed tradeoff compared to the previous version.
Industry feedback and real-world applications: Early adopters have reported significant improvements in AI-powered software development processes.
- GitLab observed up to 10% stronger reasoning across use cases with no added latency.
- Cognition noted substantial improvements in coding, planning, and problem-solving compared to the previous version.
- The Browser Company found Claude 3.5 Sonnet outperformed all previously tested models for automating web-based workflows.
Introducing Claude 3.5 Haiku: Balancing performance and efficiency: The new Claude 3.5 Haiku model offers improved capabilities at the same cost and speed as its predecessor.
- Claude 3.5 Haiku surpasses even Claude 3 Opus, the largest model in the previous generation, on many intelligence benchmarks.
- It scores 40.6% on SWE-bench Verified, outperforming many agents using publicly available state-of-the-art models.
- The model is well-suited for user-facing products, specialized sub-agent tasks, and generating personalized experiences from large datasets.
Pioneering computer use capability: Anthropic has introduced a groundbreaking feature allowing Claude to interact with computer interfaces like a human user.
- The new API enables Claude to perceive and interact with computer interfaces, translating instructions into computer commands.
- On OSWorld, which evaluates AI models’ ability to use computers like people, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category, significantly higher than the next-best AI system’s score of 7.8%.
- When given more steps to complete tasks, Claude’s score improved to 22.0%.
Responsible development and deployment: Anthropic emphasizes a proactive approach to safety and responsible AI development.
- New classifiers have been developed to identify when computer use is being employed and to detect potential harm.
- Joint pre-deployment testing was conducted with the US AI Safety Institute (US AISI) and the UK Safety Institute (UK AISI).
- The ASL-2 Standard, as outlined in Anthropic’s Responsible Scaling Policy, remains appropriate for the upgraded Claude 3.5 Sonnet model.
Looking ahead: Implications and future developments: The introduction of these new models and capabilities represents a significant step forward in AI technology, with potential for wide-ranging applications across industries.
- The computer use feature, while still in its early stages, opens up new possibilities for automating complex tasks and workflows.
- Anthropic encourages developers to explore these new capabilities and provide feedback to help refine and improve the technology.
- The company acknowledges that the computer use capability is still imperfect and recommends starting with low-risk tasks during the exploration phase.
Balancing innovation and responsibility: As AI systems become increasingly capable, Anthropic’s approach highlights the importance of responsible development and deployment.
- The introduction of computer use capabilities raises new considerations for potential misuse, such as spam, misinformation, or fraud.
- Anthropic’s proactive safety measures and collaboration with external experts demonstrate a commitment to addressing potential risks associated with advanced AI systems.
- The public beta release of the computer use feature allows for real-world testing and feedback, which will be crucial for understanding both the potential and implications of this technology.
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku