Llama 3.1 405B on Cerebras is by far the fastest frontier model in the world

The race for faster, more efficient AI language model inference has reached a new milestone: Cerebras has achieved unprecedented speeds running Meta’s Llama 3.1 405B model, a significant advance in frontier AI performance.

Record-breaking performance: Cerebras has achieved a processing speed of 969 tokens per second with Llama 3.1 405B, surpassing previous limitations of frontier models.

  • The speed is 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet
  • Time-to-first-token latency has been reduced to just 240 milliseconds, significantly improving user experience
  • The system supports a 128K context length while maintaining full model accuracy with 16-bit weights
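These figures translate directly into end-to-end response times. A back-of-the-envelope sketch using only the numbers quoted above (240 ms time-to-first-token, 969 tokens/s sustained generation; real-world latency will vary with load and prompt length):

```python
# Rough response-time estimate from the quoted Cerebras figures.
TTFT_S = 0.240       # time to first token, in seconds
TOKENS_PER_S = 969   # sustained output throughput

def response_time_s(output_tokens: int) -> float:
    """Approximate wall-clock time to stream a complete response."""
    return TTFT_S + output_tokens / TOKENS_PER_S

# A 1,000-token answer would stream in roughly 1.3 seconds.
print(f"{response_time_s(1000):.2f} s")
```

By this estimate, even long answers complete in low single-digit seconds, which is what makes the real-time voice and video use cases discussed below plausible.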

Technical specifications and pricing: Cerebras Inference has established new benchmarks across multiple performance metrics while offering competitive pricing.

  • The service will be generally available in Q1 2025
  • Pricing is set at $6 per million input tokens and $12 per million output tokens, 20% lower than major cloud providers
  • Performance remains strong even with extended input prompts of 100,000 tokens, achieving 539 tokens per second
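At the quoted rates ($6 per million input tokens, $12 per million output tokens), per-request cost is simple arithmetic. A minimal sketch with illustrative token counts (the request sizes here are examples, not figures from Cerebras):

```python
# Per-request cost at Cerebras's quoted Llama 3.1 405B pricing.
INPUT_PRICE = 6.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 12.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the quoted rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# e.g. a 100,000-token prompt with a 1,000-token answer:
print(f"${request_cost(100_000, 1_000):.3f}")
```

Even a full 100K-token prompt costs well under a dollar per request at these rates.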

Competitive landscape: The breakthrough addresses a longstanding challenge in the AI industry where developers had to choose between speed and model sophistication.

  • Previous GPU, ASIC, and cloud solutions for frontier models like GPT-4o and Claude 3.5 Sonnet never exceeded 200 tokens per second
  • Cerebras’s performance is 8x faster than SambaNova and 75x faster than AWS
  • The system is the only non-GPU vendor to complete long-context benchmarks successfully

Real-world implications: The increased processing speed and reduced latency have significant practical applications.

  • Customers switching from GPT-4 report a 75% reduction in total latency
  • The improvements particularly benefit voice and video AI applications requiring real-time interaction
  • The system can generate entire pages of text, code, and mathematical content rapidly

Future trajectory: The integration of open-source models with advanced inference technology suggests a shifting landscape in AI deployment.

  • Llama 3.1 405B’s performance on Cerebras demonstrates the potential of open-source AI models
  • The platform’s success with voice, video, and reasoning applications indicates broader applications for low-latency AI processing
  • Customer trials are currently underway, suggesting imminent real-world implementation and validation

Looking ahead: While these improvements represent significant technical achievements, the true test will be the adoption and integration of this technology in practical applications, particularly in scenarios where real-time processing is crucial for user experience and functionality.

