Grok 3 Unveiled: xAI’s leap in AI innovation

Grok 3 Shatters AI Benchmarks: xAI's Latest Model Sets New Industry Standards with Unprecedented 1400+ Arena Score

xAI, led by Elon Musk, has launched Grok 3, claiming it to be the world’s most advanced AI model. Following its live demo, the model has set new AI performance benchmarks—most notably becoming the first to exceed a score of 1400 on Chatbot Arena. Grok 3 outperforms established competitors like OpenAI’s GPT-4o and Google’s Gemini 2 Pro in reasoning, coding, and problem-solving capabilities. The tech community has praised this achievement, particularly noting xAI’s swift progress as a newcomer in the AI field. Within xAI, the model’s development has generated significant excitement, with both team members and industry observers anticipating its future applications and potential to advance AI technology further.

Key Insights from the Grok 3 Video

According to Elon, Grok 3 is an order of magnitude more capable than Grok 2.

Compute capacity was doubled in 92 days!

Total GPUs: 200K

All of this compute was used to improve Grok, which has led to Grok 3.

Grok 3’s training was ten times more extensive than Grok 2’s. While its initial pretraining phase concluded in early January, the model continues to undergo training.

Here are the benchmark numbers:

Grok 3 significantly outperforms other models in its category, such as Gemini 2 Pro and GPT-4o. Even Grok 3 mini is competitive.

Results of early Grok 3 in the Chatbot Arena (LMSYS)

It exceeded an Elo score of 1400, which no other model had achieved.
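To give a feel for what an Elo-style score means, here is a minimal sketch of the classic Elo expected-score formula. It assumes the standard 400-point scale; the Arena leaderboard's actual rating computation may differ, and the ratings below are hypothetical values chosen for illustration, not leaderboard data.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that model A is preferred over model B under the classic Elo formula.

    Assumption: the textbook 400-point logistic scale, not necessarily
    the exact method the Arena leaderboard uses.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Hypothetical ratings for illustration only.
    print(round(expected_win_rate(1400, 1400), 3))  # equal ratings: 0.5
    print(round(expected_win_rate(1400, 1300), 3))  # 100-point lead: 0.64
```

Under this formula a 100-point lead corresponds to being preferred in roughly 64% of head-to-head comparisons, so small rating gaps at the top of a leaderboard translate into modest differences in win rate.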

The model score keeps improving.

Grok 3 also has reasoning capabilities!

The Grok team has been testing these capabilities, which they unlocked using reinforcement learning (RL).

The model performs well, especially at coding.

Grok 3 Reasoning performance:

The results correspond to the beta version of Grok-3 Reasoning.

It outperforms o1 and DeepSeek-R1 when given more test-time compute (allowing it to think longer).
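One common way to spend more test-time compute is to sample several answers and keep the most frequent one (often called majority voting or self-consistency). The sketch below illustrates that general idea only; it is not a description of how Grok 3 uses extra compute. The `noisy_solver` function is a hypothetical stand-in for a model call, and its 60% accuracy is an invented number.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[random.Random], str],
                  n_samples: int, seed: int = 0) -> str:
    """Draw n_samples answers and return the most frequent one."""
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def noisy_solver(rng: random.Random) -> str:
    """Hypothetical model stand-in: right ("42") about 60% of the time,
    otherwise a random single-digit wrong answer."""
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 9))


if __name__ == "__main__":
    # More samples means more compute and a more stable final answer.
    for n in (1, 5, 101):
        print(n, majority_vote(noisy_solver, n_samples=n))
```

A single sample is wrong about 40% of the time in this toy setup, while the vote over many samples almost always settles on the most likely answer, which is the intuition behind letting a model "think longer".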

The Grok 3 mini reasoning model is also very capable.

More on DeepSearch:

  • the model can think deeply about user intent, what facts to consider, and how many websites to browse
  • it can cross-validate different sources

DeepSearch also exposes the steps that it takes to conduct the search itself.

What others are saying on X:

An early version of Grok-3 (codename “chocolate”) has claimed the #1 spot in Arena!

Grok-3 has achieved two major milestones:

  • First model ever to break the 1400 score barrier
  • #1 ranking across all categories—an increasingly challenging feat
