Llama 3.1 405B on Cerebras is by far the fastest frontier model in the world

The race for faster and more efficient AI language model processing has reached a new milestone with Cerebras achieving unprecedented speeds for Meta’s Llama 3.1 405B model, marking a significant advancement in frontier AI performance.

Record-breaking performance: Cerebras has achieved a processing speed of 969 tokens per second with Llama 3.1 405B, surpassing previous limitations of frontier models.

  • The speed is 12x faster than GPT-4o and 18x faster than Claude 3.5 Sonnet
  • Time-to-first-token latency has been reduced to just 240 milliseconds, significantly improving user experience
  • The system supports a 128K context length while maintaining full model accuracy with 16-bit weights
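
Taken together, the headline figures imply sub-second responses for typical outputs. A back-of-envelope sketch, assuming the 969 tokens-per-second rate holds steadily after the first token (the article does not break down per-token timing):

```python
# Rough generation-time estimate from the reported figures.
# Assumes the 969 tok/s decode rate is sustained after the first token.
TTFT_S = 0.240        # reported time-to-first-token, in seconds
TOKENS_PER_S = 969    # reported decode throughput

def estimated_response_time(output_tokens: int) -> float:
    """Approximate wall-clock seconds to generate `output_tokens` tokens."""
    return TTFT_S + output_tokens / TOKENS_PER_S

# A ~1,000-token answer (roughly a full page of text):
print(round(estimated_response_time(1000), 2))  # ~1.27 s
```

By this estimate, even a full page of output arrives in well under two seconds, which is what makes the real-time voice and video use cases discussed below plausible.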

Technical specifications and pricing: Cerebras Inference has established new benchmarks across multiple performance metrics while offering competitive pricing.

  • The service will be generally available in Q1 2025
  • Pricing is set at $6 per million input tokens and $12 per million output tokens, 20% lower than major cloud providers
  • Performance remains strong even with extended input prompts of 100,000 tokens, achieving 539 tokens per second
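
At those rates, per-request costs are easy to estimate. A minimal sketch, using the published prices and an illustrative request size (the 100,000-token prompt from the benchmark above paired with a hypothetical 1,000-token reply):

```python
# Per-request cost estimate at the announced Cerebras Inference pricing.
INPUT_PRICE_PER_M = 6.0    # USD per million input tokens
OUTPUT_PRICE_PER_M = 12.0  # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request at the published rates."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M \
         + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# 100,000-token prompt (the long-context case above) with a 1,000-token reply:
print(round(request_cost(100_000, 1_000), 3))  # 0.612 USD
```

Even a near-maximal-context request comes in at well under a dollar, which underpins the claim of undercutting major cloud providers.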

Competitive landscape: The breakthrough addresses a longstanding challenge in the AI industry where developers had to choose between speed and model sophistication.

  • Previous GPU, ASIC, and cloud solutions for frontier models like GPT-4o and Claude 3.5 Sonnet never exceeded 200 tokens per second
  • Cerebras’s performance is 8x faster than SambaNova and 75x faster than AWS
  • Cerebras is the only non-GPU vendor to complete long-context benchmarks successfully

Real-world implications: The increased processing speed and reduced latency have significant practical applications.

  • Customers switching from GPT-4 report a 75% reduction in total latency
  • The improvements particularly benefit voice and video AI applications requiring real-time interaction
  • The system can generate entire pages of text, code, and mathematical content rapidly

Future trajectory: The integration of open-source models with advanced inference technology suggests a shifting landscape in AI deployment.

  • Llama 3.1 405B’s performance on Cerebras demonstrates the potential of open-source AI models
  • The platform’s success with voice, video, and reasoning applications indicates broader applications for low-latency AI processing
  • Customer trials are currently underway, suggesting imminent real-world implementation and validation

Looking ahead: While these improvements represent significant technical achievements, the true test will be the adoption and integration of this technology in practical applications, particularly in scenarios where real-time processing is crucial for user experience and functionality.

Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference
