Grok 3 Unveiled: xAI’s leap in AI innovation
Grok 3 Shatters AI Benchmarks: xAI's Latest Model Sets New Industry Standards with Unprecedented 1400+ Arena Score
xAI, led by Elon Musk, has launched Grok 3, claiming it to be the world’s most advanced AI model. Following its live demo, the model has set new AI performance benchmarks—most notably becoming the first to exceed a score of 1400 on Chatbot Arena. Grok 3 outperforms established competitors like OpenAI’s GPT-4o and Google’s Gemini 2 Pro in reasoning, coding, and problem-solving capabilities. The tech community has praised this achievement, particularly noting xAI’s swift progress as a newcomer in the AI field. Within xAI, the model’s development has generated significant excitement, with both team members and industry observers anticipating its future applications and potential to advance AI technology further.
Key Insights from the Grok 3 Video
According to Elon, Grok 3 is an order of magnitude more capable than Grok 2.

Compute capacity was doubled in 92 days!
Total GPUs: 200K
All of this compute was used to improve Grok, which has led to Grok 3.

Grok 3’s training was ten times more extensive than Grok 2’s. While its initial pretraining phase concluded in early January, the model continues to undergo training.

On the benchmark numbers:
Grok 3 significantly outperforms other models in its category, such as Gemini 2 Pro and GPT-4o. Even Grok 3 mini proves competitive.

Results of early Grok 3 in the Chatbot Arena (LMSYS)
It reached an Elo score above 1400, which no other model had achieved.
The model's score keeps improving.
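
To give a sense of what a rating gap means in practice, here is a minimal sketch of the classic Elo expected-score formula. It assumes the standard logistic form with a 400-point scale; the function name and the 1300-rated comparison model are illustrative assumptions, and Chatbot Arena's actual rating computation may differ in its details.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the classic Elo model.

    A win counts as 1, a tie as 0.5, a loss as 0. Assumes the usual
    400-point logistic scale (an assumption, not taken from the article).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical comparison: a 1400-rated model against a 1300-rated one.
p = elo_expected_score(1400, 1300)
print(f"Expected score for the 1400-rated model: {p:.3f}")
```

Under these assumptions, a 100-point lead corresponds to an expected score of roughly 0.64, so even a visible gap at the top of the leaderboard translates into a fairly modest head-to-head margin.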

Grok 3 also has reasoning capabilities!
The Grok team has been testing these capabilities, which they unlocked using reinforcement learning (RL).
The model is good, especially in coding.

Grok 3 Reasoning performance:
The results correspond to the beta version of Grok-3 Reasoning.
It outperforms o1 and DeepSeek-R1 when given more test-time compute (allowing it to think longer).
The Grok 3 mini reasoning model is also very capable.
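
The article does not describe how Grok 3 spends additional test-time compute. As a generic illustration only, the sketch below shows majority voting over several sampled answers (often called self-consistency), which is one common way to trade extra inference compute for accuracy and is not necessarily what xAI does. The solver here is a hypothetical stand-in for a model call, with made-up probabilities, and is not an xAI API.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most frequent one."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_noisy_solver(correct: str, p_correct: float,
                      rng: random.Random) -> Callable[[], str]:
    """Hypothetical stand-in for a model: returns the correct answer with
    probability p_correct, otherwise one of several wrong answers."""
    wrong = ["41", "43", "44"]

    def solver() -> str:
        return correct if rng.random() < p_correct else rng.choice(wrong)

    return solver


def accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is the correct one."""
    rng = random.Random(seed)
    solver = make_noisy_solver("42", 0.5, rng)
    hits = sum(majority_vote(solver, n_samples) == "42" for _ in range(trials))
    return hits / trials


print("1 sample  :", accuracy(1))
print("15 samples:", accuracy(15))
```

With these made-up numbers, voting over more samples should recover the correct answer more often than a single sample does, at the cost of proportionally more model calls.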

More on DeepSearch:
- the model can think deeply about user intent
- it decides what facts to consider
- it decides how many websites to browse
- it can cross-validate different sources

DeepSearch also exposes the steps that it takes to conduct the search itself.

What others are saying on X:
An early version of Grok-3 (codename “chocolate”) has claimed the #1 spot in Arena!
Grok-3 has achieved two major milestones:
- First model ever to break the 1400 score barrier
- #1 ranking across all categories—an increasingly challenging feat
I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check.
Thinking
✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settler's of Catan… pic.twitter.com/qIrUAN1IfD
— Andrej Karpathy (@karpathy) February 18, 2025
BREAKING: @xAI early version of Grok-3 (codename “chocolate”) is now #1 in Arena! 🏆
Grok-3 is:
– First-ever model to break 1400 score!
– #1 across all categories, a milestone that keeps getting harder to achieve
Huge congratulations to @xAI on this milestone! View thread 🧵… https://t.co/p8z8lccNd5 pic.twitter.com/hShGy8ZN1o
— lmarena.ai (formerly lmsys.org) (@lmarena_ai) February 18, 2025