Andrej Karpathy – Outsource your thinking, but you can’t outsource your understanding
Here’s what Andrej Karpathy just figured out that everyone else is still dancing around: we’re not in an era of “better models.” We’re in a different era of computing altogether.
And the difference between understanding that and not understanding it is the difference between being a vibe coder and being an agentic engineer.
Last October, Karpathy had a realization. AI hadn’t merely stayed ChatGPT-adjacent; it had fundamentally shifted. Coherent agentic workflows started to actually work. And he’s spent the last three months living in side projects, vibe coding, exploring what’s actually possible.
What he found is a framework that explains everything.
Software 1.0 was explicit rules. You wrote code. If statements and loops. Hard-coded logic.
Software 2.0 was learned weights. You created datasets. You trained neural networks. Programming became arranging data and defining objectives.
Software 3.0 is context. Your prompt and what’s in the context window are your program. The LLM is the interpreter. You’re no longer writing code—you’re writing instructions for an entity to interpret and execute.
This is not a faster version of Software 2.0. It’s a different kind of computing entirely.
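The contrast between the paradigms can be made concrete in a few lines. Below is a minimal sketch comparing a Software 1.0 rule-based classifier with a Software 3.0 version where the prompt itself is the program; `call_llm` is a hypothetical stand-in for any chat-completion API, not a real library call.

```python
# Software 1.0: the program is explicit logic the author wrote by hand.
def classify_sentiment_v1(text: str) -> str:
    """Hard-coded rules: brittle, but every behavior is spelled out."""
    negative_words = {"bad", "awful", "broken"}
    if any(word in negative_words for word in text.lower().split()):
        return "negative"
    return "positive"


# Software 3.0: the "program" is the prompt plus whatever sits in the
# context window, and an LLM acts as the interpreter. `call_llm` is a
# hypothetical stand-in for any chat-completion API.
def classify_sentiment_v3(text: str, call_llm) -> str:
    prompt = (
        "Classify the sentiment of the following text as exactly "
        "'positive' or 'negative'.\n\n" + text
    )
    return call_llm(prompt).strip().lower()
```

Note where the logic lives in each version: in 1.0 it is in the branches; in 3.0 it is in the English instructions, and the model fills in everything the rules would have spelled out.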
The proof is everywhere if you know what to look for. Andrej points to OpenClaw’s installation. Normally you’d write a shell script. But shell scripts get complex. They balloon across platforms. So OpenClaw moved to Software 3.0: copy-paste this text into your agent and it’ll install. The agent reads your environment, debugs in the loop, makes intelligent decisions. It’s more powerful because you don’t need to spell out every detail. The agent has intelligence.
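The “debugs in the loop” pattern can be sketched as a simple observe-act cycle. Everything here is illustrative: `propose_next_command` stands in for an LLM call that reads the transcript and proposes the next shell command (or a sentinel when done), and no real installer works exactly this way.

```python
import subprocess

def agentic_install(goal: str, propose_next_command, max_steps: int = 10) -> bool:
    """Agent-driven install loop: observe the environment, act, read the
    output, adapt. `propose_next_command` is a hypothetical LLM call that
    returns the next shell command, or "DONE" when the goal is achieved."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        cmd = propose_next_command(transcript)
        if cmd == "DONE":
            return True
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Feed stdout/stderr back so the agent can debug in the loop,
        # instead of a fixed script failing blind on an unanticipated platform.
        transcript += f"$ {cmd}\n{result.stdout}{result.stderr}\n"
    return False
```

The shell-script version has to enumerate every platform case up front; this version pushes those decisions to an intelligent reader of the error output.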
Same with MenuGen. The conventional version: build an app that OCRs the menu, generates images for each item, stitches them together, deploys on Vercel. The Software 3.0 version: give a photo to Gemini and ask it to use Nano Banana to overlay the pictures. Done. One sentence. The entire app is no longer necessary.
This sounds like it should be obvious in hindsight. But it’s not. Most teams are still building Software 2.0 solutions and calling it optimization. They’re solving problems with the paradigm they know, not the paradigm that’s available.
The shift requires understanding what these systems actually are. Not magically intelligent entities. Not slight upgrades to previous technology. But something genuinely different—a programmable substrate where your instructions and context are the code.
Once you see it, you can’t unsee it. And a lot of things stop making sense.
The Verifiability Axis Explains Everything
Why Models Peak at Code and Stumble at Common Sense
Andrej spent time writing about something called verifiability. And it’s the key to understanding the jagged landscape of LLM capability.
Here’s the thing: frontier labs train these models with reinforcement learning. They give models verification rewards. Code runs or it doesn’t. Math is correct or it isn’t. So models get really good at things that have clear, unambiguous feedback loops.
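A verification reward of this kind is easy to sketch. Assuming the simplest possible setup, where a candidate program is run against unit tests in a subprocess, the reward is binary: the code passes or it doesn’t.

```python
import os
import subprocess
import sys
import tempfile

def code_reward(candidate_source: str, test_source: str) -> float:
    """Binary verification reward: 1.0 if the candidate passes the tests,
    0.0 otherwise. An unambiguous signal like this is what makes code and
    math easy domains to improve with reinforcement learning."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_source + "\n" + test_source + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=10
        )
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)
```

Try writing the equivalent reward function for “is this paragraph tasteful?” and the jagged capability landscape stops being mysterious.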
And the labs concentrate on code and math because those are economically valuable. Chess got dramatically better from GPT-3.5 to GPT-4 not because of fundamental advances; it got better because someone at OpenAI added a ton of chess data to the training set. The capability spiked because of the data distribution.
This reveals something crucial: models are shaped by what the labs have chosen to reward. They peak at verifiable domains. Code, math, and adjacent spaces. They stagnate everywhere else.
Which is why state-of-the-art Claude Opus can refactor a 100,000-line codebase and find zero-day vulnerabilities. And simultaneously tell you to walk 50 meters to a car wash instead of driving.
How is this possible? Same entity. Two different universes of capability. One is in the RL circuits. One is massively out of distribution.
What This Means: Your Hiring Is Broken
If you’re hiring for agentic engineers, you can’t use algorithmic puzzles. You’re testing for the wrong thing. You need to give someone a big project, say “build a Twitter clone,” and then have agents try to break it. The engineering discipline isn’t algorithmic thinking. It’s specifying what matters, coordinating agents that have no judgment, and staying in the loop.
Also: understand where your domain sits on the verifiability spectrum. If you’re building in code or math, agents will shine. If you’re building in aesthetic or judgment domains, you need to stay in charge.
The MenuGen Moment
Software 3.0 Isn’t a Faster Software 2.0
Andrej built MenuGen as a full-stack app. Photo uploader, OCR pipeline, image generation, stitching logic, Vercel deployment. It works. It does what it’s supposed to do.
Then he saw the Software 3.0 version: one sentence to an LLM. Upload a photo. Render pictures of the menu items. Done.
And he called his own app “spurious.” Not because it’s broken. But because it shouldn’t exist in a world where Software 3.0 is possible.
This is the hardest mental shift for engineers. You spent years learning to optimize within a paradigm. To do more with less code. To abstract and refactor. That skill doesn’t disappear, but the paradigm changed.
In Software 3.0, the work moves from “how do I build this” to “what’s the minimal set of instructions I give to an LLM to get the output I want.” The LLM does the heavy lifting. Your code is almost invisible.
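The collapse from pipeline to prompt can be sketched in one function. The `client` here is assumed to be an image-capable LLM client exposing a `generate(prompt=..., image=...)` method; the name and signature are hypothetical, not any particular vendor’s SDK.

```python
def menugen_v3(client, menu_photo: bytes) -> bytes:
    """The whole MenuGen pipeline (OCR, per-item image generation,
    stitching, deployment) collapsed into one multimodal request.
    `client` is a hypothetical image-capable LLM client."""
    return client.generate(
        prompt=(
            "Here is a photo of a restaurant menu. For each item, render "
            "an appetizing picture and overlay the pictures onto the menu "
            "layout. Return the composited image."
        ),
        image=menu_photo,
    )
```

Everything the multi-file app expressed in code is now expressed in the instruction; the model supplies the pipeline.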
Andrej talks about neural networks where he’s forgotten all the API details. keepdims vs keepdim. reshape vs permute. transpose. He doesn’t remember anymore because he doesn’t have to. The agent handles it. He handles the taste, the design, the “this needs to be a unique user ID we tie everything to.” The conceptual architecture. The human stuff.
The agent fills in the blanks with perfect recall.
What This Means: Stop Building What Shouldn’t Exist
Look at your codebase. How much of it is solving problems that an agent could solve if you just described them well enough? The MenuGen test is brutal but useful: if an LLM can do it in one sentence, your multi-file app might be spurious too.
This doesn’t mean “delete everything.” It means rethink. What parts require human judgment? Put those first. What parts are deterministic or well-defined? Let the agent handle them.
Vibe Coding vs Agentic Engineering
Raising the Floor vs Maintaining the Bar
Andrej makes a crucial distinction that most people miss.
Vibe coding is amazing. It raises the floor for everyone. A junior developer, a non-engineer, someone with an idea—they can now make something work. They can prompt an agent and iterate. That’s beautiful.
But agentic engineering is something else entirely. It’s about preserving the quality bar that existed before. You’re not allowed to introduce vulnerabilities because you vibe coded. You’re still responsible for your software. You’re just faster.
Vibe coding is democratic. Agentic engineering is professional.
And the ceiling on agentic engineering is absurdly high. Andrej doesn’t think 10x engineers have disappeared. He thinks they’re magnified. The people who understand how to coordinate agents, who stay in touch with what’s being built, who maintain taste and judgment are shipping things that shouldn’t be possible.
This is the tension. Vibe coding says “everyone can build.” Agentic engineering says “building fast while maintaining quality is a discipline.”
Most teams are trying to do vibe coding at professional scale. That’s why they have bugs. That’s why their features are incomplete. They haven’t learned to engineer with agents.
What This Means: Your Team Needs a New Discipline
Agentic engineering isn’t just faster coding. It’s a different skill set. You need people who can write exceptionally clear specs (agents follow them literally), understand the taste and judgment parts (that remain human), know when to stay in the loop and when to delegate, can debug agent mistakes intelligently, and understand the jagged landscape (what agents are good at vs what they’re not).
Also: the hiring bar goes up. It’s not about solving puzzles faster. It’s about building bigger things, faster, without sacrificing quality.
Ghosts Not Animals
These Aren’t Creatures. They’re Statistical Simulations
Andrej wrote something that stuck with him: we’re summoning ghosts, not building animals. This matters more than it sounds.
An animal has intrinsic motivation, curiosity, a drive to explore. If you yell at it, it reacts. It has empowerment—a sense of agency.
An LLM has none of this. It’s statistical simulation circuits. Pre-training is statistics; reinforcement learning is bolted on top. What looks like motivation is sampling from a distribution.
Why does this framing matter? Because it changes how you think about using them.
If you think they’re animals, you might try to motivate them. To inspire them. To treat them like interns with feelings. But they’re not. They’re pure statistical processes. Throw them at a domain they were RL’d on, they fly. Throw them at something out of distribution, they struggle.
This connects directly back to the jagged intelligence problem. The jaggedness isn’t because they’re “almost intelligent.” It’s because they’re statistical circuits shaped by what they were trained on. Nothing more. Nothing less.
Understanding this changes what you ask of them. You’re not negotiating with an entity. You’re interfacing with a substrate.
What This Means: Stop Anthropomorphizing
The way you talk to agents should change. Don’t “work with” them like they’re thinking partners. Interface with them. Describe your domain clearly. Be precise about what success looks like. Understand where they’ll succeed and where they’ll fail.
Also: realize that they’ll make mistakes that make no sense to you. Not because they’re being obtuse. Because they’re outside the RL circuits.
Agent-Native Infrastructure Is the Real Frontier
Everything Is Written for Humans. Everything Needs to Be Rewritten
Here’s Andrej’s favorite pet peeve: documentation. Almost all of it is written for humans. “Go to this URL.” “Click this button.” “Configure your DNS.”
But the real frontier isn’t making better agents. It’s making infrastructure that agents can natively understand.
Documentation needs to be rewritten as “here’s what to copy-paste to your agent.” Systems need to expose APIs that agents can call directly. Deployment pipelines need to be agent-legible.
Andrej built MenuGen and the actual work wasn’t writing code. It was fighting with Vercel, configuring DNS, logging into settings panels. All designed for humans. All painful for agents.
Imagine instead: tell an LLM “build MenuGen and deploy it.” The agent navigates a world where everything speaks its language. Sensors and actuators everywhere. Data structures optimized for LLM understanding, not human reading.
That’s the frontier.
The companies that win won’t be the ones with the best agents. They’ll be the ones who’ve redesigned their infrastructure to be agent-native. Verb-friendly APIs. Clear, legible data structures. Deployment automated for intelligent systems, not manual configuration.
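What might an agent-native surface look like? A hedged sketch, with entirely hypothetical verb names and schema: instead of a settings panel a human clicks through, the platform publishes a machine-readable manifest of actions an agent can discover and call.

```python
import json

# Hypothetical illustration: the service name, verbs, and schema are all
# invented. The point is the shape: a self-describing manifest an agent
# can read and act on, replacing docs that say "go to this URL and click
# Settings".
AGENT_MANIFEST = {
    "service": "example-deploy-platform",
    "verbs": {
        "deploy": {
            "description": "Deploy a project from a git URL.",
            "params": {"repo_url": "string", "env": "object"},
        },
        "configure_dns": {
            "description": "Point a domain at a deployment.",
            "params": {"domain": "string", "deployment_id": "string"},
        },
    },
}

def describe_for_agent() -> str:
    """Serialize the manifest as text to paste into an agent's context."""
    return json.dumps(AGENT_MANIFEST, indent=2)
```

The design choice is the inversion: the documentation and the API are the same artifact, legible to the thing that will actually operate the system.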
What This Means: Start Redesigning for Agents
When you build the next feature, ask: can an agent understand this? Can an agent navigate it? Or have I just built another human interface that now requires an LLM to translate?
Also: watch for companies that are rebuilding infrastructure to be agent-native. That’s where the productivity gains actually live.
What This Means For You
One: You’re either in the Software 2.0 paradigm or you’ve moved to Software 3.0. Spot which world your code lives in. A lot of it probably shouldn’t exist.
Two: Understand verifiability. If you’re in a domain where outputs are easy to verify, agents will get really good fast. If you’re not, you need humans in the loop. Plan accordingly.
Three: Hire for agentic engineering, not algorithmic prowess. The skill is coordination and taste, not puzzle-solving speed. Test by building big things, not by solving small problems fast.
Four: Your infrastructure is written for humans. It’ll cost you. Start thinking about what agent-native looks like for your systems.
Five: Understand that taste and judgment still matter. Maybe more now than before. That’s the scarcest skill.
Three Questions We Think You Should Be Asking Yourself
What parts of your codebase are spurious—solving problems that shouldn’t need to exist in Software 3.0? (Probably most of it. That’s not a failure. That’s the frontier.)
Where does your domain sit on the verifiability spectrum? (High: code, math. Low: taste, judgment, aesthetics. Plan your agent strategy around this.)
Is your infrastructure written for humans or for agents? (It’s probably humans. That’s why you’re still manually configuring things.)
“You can outsource your thinking but you can’t outsource your understanding. Something has to direct the thinking and processing, and that’s still fundamentally constrained by understanding.”

— Andrej Karpathy