The AI Industry Is Building the USS Enterprise. What You Need Is a Minivan.

Anthropic’s frontier model finds 27-year-old kernel vulnerabilities. OpenAI is pitching Congress for $600 billion. Google just shipped Gemma 4 under Apache 2.0. A Chinese lab built an autonomous coder that runs eight hours without human help — on sanctioned chips, for one-fifth the price. Every company in AI is competing on intelligence. Meanwhile, the partner at a 40-person law firm in Denver just wants his contract review to work the same way on Thursday as it did on Tuesday. The AI industry has a hundred companies building the starship. Nobody is building the minivan. And the minivan is where the money is.
THE NUMBER: $2 to $1 — McKinsey’s recommended ratio of change management spending to technology spending for agentic AI deployment. For every dollar you spend on the model, spend two dollars making it actually work inside your business. That’s not a technology problem. That’s a plumbing problem. And nobody’s budgeting for plumbing because plumbing doesn’t get you an $852 billion valuation. Two-thirds of organizations running agentic AI haven’t moved beyond pilots. The bottleneck isn’t intelligence. It’s integration. The models are the cheapest part of the stack. Everything around them is where the money goes — and where nobody’s looking.
Clayton Christensen called it performance overshooting. When the product exceeds what the customer needs, competition shifts from performance to reliability, integration, and convenience. The 5.25-inch drive didn’t outperform the 8-inch drive. It just fit in a smaller box. The minivan doesn’t outrun the starship. It gets the kids to school. And AI just hit the moment where the frontier has overshot the market.
This week, Anthropic’s Claude Mythos Preview scored 93.9% on SWE-bench, developed two FreeBSD kernel exploits in four hours, and found a 27-year-old bug in OpenBSD that survived five million automated scans. It’s the most capable AI model ever built. It is also completely irrelevant to the needs of approximately 99% of the businesses deploying AI right now.
Those businesses don’t need a model that can hack Firefox 181 different ways. They need one that publishes to their CMS with the right formatting. They need an agent that talks to their CRM without breaking when the API updates. They need the output from Tuesday’s workflow to still work on Thursday — even if Anthropic shipped a new version at 2 AM on Wednesday.
That’s the gap. And it’s enormous. And it’s where the next trillion dollars in AI value actually lives. Not in the model. In the plumbing.
The Christensen Moment
🧠 I need to say something that might sound counterintuitive on a day when every AI newsletter is covering benchmark records: the models are already good enough.
I’ve worked inside the largest investment banks and money managers. I’ve sat in the corporate boardrooms and the law firms. The capabilities of today’s frontier models — not Mythos, not the unreleased next thing, but the models you can buy right now for $20 a month — exceed the needs of all of them. Every single one.
Claude Opus 4.6 can draft a merger agreement, analyze a 10-K, build a financial model, write a research report, review code, summarize a deposition, and structure a board presentation. It can do all of this better than most of the junior employees who were doing it last year. So can GPT-5.4. So can Gemini 3.1. The gap between the models has compressed to the point where the choice barely matters.
And yet. Ask ChatGPT how many R’s are in “strawberry.” As of GPT 5.2 in December 2025, it still says two. Only after you tell it the answer is wrong does it land on three, the answer a five-year-old provides on the first try. The model that drafts your merger agreement can’t count letters in a word, because it’s not reading the word. It’s predicting tokens. It converts “strawberry” into st-raw-berry, sees two R-containing chunks, and calls it a day with the confidence of a partner billing at $1,200 an hour.
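You can watch the fragmentation happen yourself. Here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer; the exact chunks vary by model and tokenizer version, so treat the split as illustrative of the mechanism, not the one GPT 5.2 actually produces.

```python
# Sketch: how a BPE tokenizer fragments a word before the model ever "sees" it.
# Assumes the open-source `tiktoken` package (pip install tiktoken); the split
# below is illustrative, since each model family uses its own vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models
tokens = enc.encode("strawberry")

for t in tokens:
    chunk = enc.decode([t])
    print(f"token {t} -> chunk {chunk!r}, r-count: {chunk.count('r')}")

# The model predicts over token IDs, not letters: any per-letter question has
# to be reconstructed from chunks it never reads character by character.
```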
Apple published two studies confirming the pattern runs deeper than a party trick. When researchers simply reworded grade-school math problems without changing the logic, GPT-4o’s accuracy dropped from 94.9% to 63.1%. The follow-up study, titled “The Illusion of Thinking,” found that reasoning models don’t just plateau on harder problems — they collapse. And in a detail that should terrify anyone building financial models on these tools: the models sometimes think less on the hardest problems even when given more time to think. As problems get harder, they give up.
A system that can exploit FreeBSD kernel vulnerabilities but can’t reliably count to three. A system that drafts legal briefs but fails reworded fourth-grade arithmetic. That’s not a frontier problem. That’s a plumbing problem. The engine has 10,000 horsepower and the steering wheel comes off in your hand.
And yet the industry keeps racing. Mustafa Suleyman says another 1,000x in effective compute is possible by end of 2028. Reflection AI just raised $2.5 billion to build the first Western frontier open-source model. Z.ai’s GLM-5.1 ran 6,000 tool calls over eight continuous hours on sanctioned Huawei chips, priced at one-fifth of Claude. The race is accelerating. The question nobody’s asking is: toward what?
Christensen’s answer was always the same. When the product overshoots the market, the incumbents keep improving the thing that no longer differentiates, and the disruptor wins on something else entirely — something the incumbents consider beneath them. In disk drives, it was form factor. In steel, it was rebar. In AI, it’s going to be reliability and integration. The boring stuff. The stuff that actually makes a business run.
The company that wins the next phase of AI isn’t the one with the highest benchmark score. It’s the one that makes agentic output arrive at any endpoint — Beehiiv, WordPress, Salesforce, SAP, your firm’s proprietary case management system — formatted correctly, on time, every time. That’s the Stripe of AI. Stripe didn’t make payments interesting. It made them work. And it’s worth $95 billion because every flashy company needs plumbing underneath it.
> Reality Check: The frontier labs aren’t wrong to push capability. Mythos-class models will matter for cybersecurity, drug discovery, and national defense. | Implied: But the $2.9 trillion in enterprise AI value that McKinsey projects isn’t coming from kernel exploits. It’s coming from the 40-person accounting firm that deploys agents that actually integrate with QuickBooks. | What could go wrong: If the labs interpret the market signal as “keep building bigger” and nobody builds the integration layer, the pilot trap becomes permanent — two-thirds of businesses stuck in PowerPoint while the frontier disappears into the distance.
The signal: When every competitor is building the Enterprise, build the minivan. The market for people who need to get to work on time is always bigger than the market for people who need to reach Alpha Centauri.
The Model Worked Perfectly. Everything Got Worse.
💲 Here’s the story that proves the thesis.
AI-powered medical scribes are working exactly as designed. They listen to doctor-patient conversations and generate clinical documentation. The documentation is thorough. The notes are detailed. The coding is precise. And both insurers and providers agree on something they almost never agree on: the tools are increasing healthcare costs.
More thorough notes mean more billable codes. More billable codes mean higher claims. Higher claims mean higher premiums. The AI didn’t hallucinate. It didn’t make an error. It performed its task with remarkable fidelity. And it made the system worse — because nobody designed the deployment for the outcome that actually matters.
This is the template for what goes wrong when you obsess over model capability and ignore deployment architecture. The model is the easy part. The hard part is: what happens when the output hits the real world? What process does it connect to? What incentives does it interact with? What second-order effects does nobody model because everyone’s too busy celebrating the benchmark?
SaaStr published a case study this week that should be required reading. They run on 3 humans and 20+ agents, generating millions in revenue. The headline sounds like an AI success story. The details tell a different one. It took 47 iterations to get a single AI SDR to handle pricing correctly. One agent degraded silently for four months with a stale knowledge base and nobody noticed. They can absorb roughly 1.5 new agents per month before quality slips. The 30-day intensive training period for each agent is more important than which vendor you choose.
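The four-month silent failure is the instructive part, because it is the cheapest one to prevent: a scheduled replay of a fixed probe set would have caught it in days. A minimal sketch, with invented probes and a placeholder `run_agent` callable standing in for whatever fronts your agent:

```python
# Sketch of the tripwire that would have caught a four-month silent failure:
# replay fixed probes on a schedule and alarm when the pass rate drops.
# `run_agent` and the probes are invented placeholders, not SaaStr's setup.
from typing import Callable

PROBES = [
    # (prompt, substring the answer must contain to count as a pass)
    ("What is our current starter-tier price?", "$49"),
    ("Do we offer annual billing?", "yes"),
]

def canary_pass_rate(run_agent: Callable[[str], str]) -> float:
    """Fraction of probes whose answers still contain the expected fact."""
    hits = sum(expected.lower() in run_agent(prompt).lower()
               for prompt, expected in PROBES)
    return hits / len(PROBES)

def check(run_agent: Callable[[str], str], threshold: float = 0.9) -> None:
    rate = canary_pass_rate(run_agent)
    if rate < threshold:
        # Page a human instead of letting the agent degrade quietly.
        raise RuntimeError(f"Agent canary failing: pass rate {rate:.0%}")
```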
The model worked. The deployment is where everything gets hard. And the deployment is where nobody’s investing — because deployment doesn’t get keynotes at developer conferences.
Why this matters for your business: If you’re evaluating AI deployment, stop asking “which model is best?” Start asking “who makes the plumbing work?” The model is a commodity. The integration is the moat. And if you’re spending $1 on technology and $0 on change management, you’re building a house with no pipes.
The Ground Is Moving (And Nobody Sent You a Memo)
🦞 While we’re talking about stability, let’s talk about what happened to Opus this week.
Community evaluations from MyClaw’s 24-hour testing show Claude Opus 4.6’s reasoning depth has dropped by up to 67%. Thinking chains that used to run 2,200 characters are now under 700. The behavior has shifted from deliberate planning to shortcut execution, with measurable increases in errors, reversals, and premature task completion. MyClaw’s recommendation: switch to GPT 5.4 for stable agent performance.
This is Anthropic’s familiar pre-release pattern. Opus 4.5 degraded the same way before 4.6 shipped. The current model gets quietly throttled to make the next model’s launch feel like a generational leap. It’s smart marketing. It’s also a vendor making your production tool measurably worse without telling you.
Now imagine you’re that law firm in Denver. You deployed Claude for contract review in February. It was excellent. Your associates built workflows around it. The output was consistent. You started trusting it. And sometime in the last week, the quality dropped by two-thirds — and nobody at Anthropic sent you an email, updated a changelog, or gave you any indication that the tool you’re relying on is no longer the tool you evaluated.
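No vendor email is coming, but the MyClaw methodology is reproducible in-house: track a cheap behavioral proxy, such as output length on a fixed prompt set, against a recorded baseline. A sketch, assuming `call_model` wraps your vendor’s API and returns raw text; the prompts and the 40% threshold are invented starting points, not calibrated values.

```python
# Sketch: detect vendor-side behavior drift the way MyClaw did, via shrinking
# output on a fixed prompt set. `call_model` is a placeholder for your vendor
# client; prompts and threshold are invented, tune them to your workload.
import statistics
from typing import Callable

FIXED_PROMPTS = [
    "Plan a three-step migration of a billing table to a new schema.",
    "Draft a short indemnification clause for a SaaS agreement.",
]

def mean_output_chars(call_model: Callable[[str], str]) -> float:
    """Average response length across the fixed prompt set."""
    return statistics.mean(len(call_model(p)) for p in FIXED_PROMPTS)

def detect_drift(call_model: Callable[[str], str],
                 baseline_chars: float, max_drop: float = 0.40) -> bool:
    """True if today's mean output length fell more than `max_drop` below baseline."""
    today = mean_output_chars(call_model)
    return today < baseline_chars * (1 - max_drop)
```

Run it daily against a baseline recorded when you first validated the workflow, and a 2,200-to-700-character collapse stops being something you learn about from a newsletter.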
This is what happens when your infrastructure vendor is also in an arms race. Apple doesn’t degrade your iPhone to make the next one look better. Microsoft doesn’t throttle Excel in March so the June update feels transformative. Those companies understood something the AI labs haven’t learned yet: enterprise customers don’t want breakthroughs. They want consistency. They want the thing that works on Thursday the same way it worked on Tuesday.
The AI labs can’t offer that right now. Not because they’re malicious — but because they’re in a weekly sprint where every model release is a press cycle, every benchmark is a marketing event, and the incentive structure rewards novelty over stability. They’re building for the leaderboard. Your business runs on the calendar.
Here’s what every AI lab needs to learn from the company they all claim to be disrupting: Microsoft didn’t win the enterprise because Office was the best software. It won because a .doc file created in 1997 still opens in 2026. Backward compatibility. The most boring, unsexy, utterly essential feature in the history of enterprise software. You can open a 29-year-old spreadsheet in Excel and the formulas still work. That’s not a feature. That’s a promise. And it’s the promise that every AI lab is breaking every week when they ship model updates that change behavior without notice, deprecate features without migration paths, and treat enterprise customers like beta testers who should be grateful for the privilege.
Enterprise customers don’t want breakthroughs. They want incremental improvements and above all else, backward compatibility. That’s why the iPhone 18 looks like the iPhone 1. That’s why AWS has the most boring product announcements in technology and the most reliable infrastructure on Earth. That’s why Windows still runs software written for Windows 95. The boring stuff is the hard stuff. And nobody in AI is doing it.
> Reality Check: Model degradation before a new release is common across all labs, not just Anthropic. | Implied: That doesn’t make it acceptable for enterprise deployments. It means the industry hasn’t built the trust infrastructure that enterprise customers require. | What could go wrong: If businesses build mission-critical workflows on models that change without notice, the first major failure — a legal filing with degraded reasoning, a financial model with new error patterns — becomes a liability event, not just an inconvenience.
The tell: When your vendor’s development cycle is weekly and your business cycle is quarterly, you’re not building on a platform. You’re building on a treadmill.
The Conway Paradox
🔒 And here’s where it gets interesting — because one company has figured out that stability is the real product, and they’re building a moat around it.
Buried in the 512,000 lines of Claude Code leaked last week is something more consequential than source code. It’s an internal project called Conway — a standalone always-on agent environment, separate from chat, with its own proprietary extension format (.cnw.zip), event-triggered wake-ups, browser control, and persistent memory that accumulates over time.
Nate’s Substack published the deepest analysis this week, and the read is sharp: Conway is Anthropic’s bid to become an operating system.
The extension format sits on top of MCP — Anthropic’s own open protocol that every major lab has adopted. MCP is the open foundation. Conway’s .cnw.zip is the proprietary layer on top. This is the Google Play Services playbook. Android is open source. The valuable stuff — Maps, payments, the Play Store — is proprietary. You can technically build without Google’s services. In practice, nobody does.
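For contrast, the open layer underneath is easy to see in code. Below is a minimal MCP server sketch using the official Python SDK; the tool itself is an invented example, not anything from the leak. Anything speaking this protocol is portable across MCP clients, which is precisely what the proprietary .cnw.zip layer on top reportedly is not.

```python
# Minimal sketch of the open MCP layer: one server, one tool. Uses the
# official `mcp` Python SDK (pip install mcp); the CRM tool is a made-up
# illustration, not anything from the Conway leak.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-bridge")

@mcp.tool()
def lookup_account(account_id: str) -> str:
    """Return a one-line summary for a CRM account (stubbed for illustration)."""
    return f"Account {account_id}: active, renewal in 90 days"

if __name__ == "__main__":
    mcp.run()  # speaks the open protocol any MCP-compatible client can call
```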
But here’s where Conway gets genuinely different from any previous platform play, and why it matters for the stability thesis.
Every previous form of tech lock-in was about stuff. Microsoft locked in your files. Salesforce locked in your customer records. Slack locked in your communication history. Stuff is painful to migrate, but it’s possible. There are export tools. There are consultants. The switching cost is measured in months and dollars.
Conway locks in something else: the accumulated model of how you work. Not your files — the patterns the agent learned by watching you use them. Not your calendar — the knowledge that you always reschedule your 2 PM on Thursdays and that meetings with your VP run long. Not your Slack messages — the understanding of which ones you respond to in five minutes and which ones you ignore for three days.
That model doesn’t export. There’s no CSV of “how this person thinks.” There’s no migration path for behavioral context. When you switch away after six months, you don’t just lose an agent. You lose the six months of compounding that made it useful. You’re back to a brilliant stranger.
Anthropic understood the Christensen moment before anyone else. If the market is shifting from intelligence to stability, own the stable layer. Make the persistent memory — the one thing that doesn’t change when the model updates — proprietary. And suddenly the weekly model churn isn’t a liability. It’s a feature. The models keep changing. Conway stays. Conway remembers. Conway is the constant.
It’s brilliant strategy. It’s also the deepest vendor lock-in the technology industry has ever produced. Not file lock-in. Not data lock-in. Cognitive lock-in. And there are zero portability standards for it — because the category didn’t exist until now.
The action item: If you’re an enterprise deploying persistent AI agents, negotiate behavioral context portability into your contract before deployment. Not after. The policies around agent memory export should ship before the product does. Any vendor that won’t discuss this isn’t offering you stability. They’re offering you a trap that feels like stability.
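No portability standard exists for behavioral context yet, so there is nothing official to point your lawyers at. Even so, a hypothetical sketch makes the negotiation concrete; the schema below is invented for illustration, not any vendor’s format, and it names the minimum an enterprise would want back on exit.

```python
# Entirely hypothetical: no behavioral-context portability standard exists
# today, so this is a negotiating sketch, not any vendor's export format.
from dataclasses import dataclass, field

@dataclass
class BehavioralContextRecord:
    subject: str           # user or team the learned pattern belongs to
    pattern: str           # e.g. "reschedules the Thursday 2 PM standing meeting"
    evidence_count: int    # number of observations backing the inference
    confidence: float      # vendor's own confidence score, 0.0 to 1.0
    sources: list[str] = field(default_factory=list)  # calendars, threads, docs it came from

@dataclass
class BehavioralContextExport:
    exported_at: str       # ISO 8601 timestamp of the export
    records: list[BehavioralContextRecord] = field(default_factory=list)
```

A vendor that cannot describe its agent memory in roughly these terms cannot export it either, and that tells you what you need to know before you sign.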
What This Means For You
The AI industry is building the USS Enterprise — warp drives, photon torpedoes, the holodeck, boldly going where no model has gone before. What you need is a minivan. Something that starts every morning. Something that fits the car seats. Something that doesn’t change the steering wheel every time the manufacturer ships an update.
That’s not a criticism of the frontier. Mythos-class models will reshape cybersecurity. The compute scaling Suleyman describes will unlock things we can’t imagine yet. The race matters. But it doesn’t matter for your Q2 deployment. It doesn’t matter for the accounting firm trying to automate its audit workflow. It doesn’t matter for the law firm that needs document review to work the same way this month as it did last month.
The opportunity is in the plumbing. The company that builds the reliable integration layer — the one that takes output from any LLM and delivers it to any endpoint, formatted correctly, on time, every time — captures the value that the frontier labs are leaving on the table. That’s Stripe for AI. That’s the business that every flashy model needs underneath it. And right now, nobody’s building it — because plumbing doesn’t make headlines.
Budget for reality, not for benchmarks. McKinsey’s $2-to-$1 ratio isn’t a suggestion. It’s the minimum. For every dollar on the model, two dollars on making it work inside your actual business — the training, the change management, the integration, the monitoring, the human who catches the failure at 2 AM. If your AI budget is 100% technology and 0% deployment infrastructure, you’re building a house with no pipes.
Demand stability from your vendors — or build it yourself. If your AI vendor’s development cycle is weekly and your business cycle is quarterly, you have a fundamental mismatch. Ask for changelogs. Ask for deprecation notices. Ask for the same SLA you’d demand from any other enterprise vendor. The fact that AI is new doesn’t mean the rules of enterprise software don’t apply. If anything, the instability makes those rules more important, not less.
The frontier is going to keep accelerating. Let it. Your job isn’t to keep up with the starship. Your job is to make sure the minivan runs on Thursday.
Three Questions We Think You Should Be Asking Yourself
Do you know which model version your critical workflows are running on — right now? Opus 4.6 lost up to 67% of its reasoning depth this week and nobody at Anthropic sent a notification. If your contract review, your financial analysis, or your customer communications depend on a specific model’s behavior, do you have a way to detect when that behavior changes? If the answer is no, your quality assurance process has a gap that didn’t exist six months ago — because six months ago, the tools didn’t degrade without warning.
If you had to move your AI workflows to a different provider next quarter, what would you lose? Not your data — that’s exportable. Not your prompts — those are portable. The behavioral context. The accumulated patterns. The six months of the agent learning how your team works. If the answer is “nothing,” you haven’t built deep enough to matter. If the answer is “everything,” you’ve built a dependency you can’t control. The right answer is somewhere in between — and you should know exactly where before someone else decides for you.
Is your AI investment going to the model or to the plumbing? If 90% of your budget is licensing the smartest model available and 10% is making it work inside your business, you have the ratio backwards. The model is a commodity. Three different providers will give you 95% of the same capability for roughly the same price. The integration, the deployment architecture, the change management, the monitoring, the human-in-the-loop verification layer — that’s where the value is. That’s where the competitive advantage lives. And that’s where almost nobody is spending.
Anybody can build a house. But the money’s in the plumbing — because a house without plumbing is not a house.
The AI industry just built the most impressive house in history. The foundation models are extraordinary. The benchmarks are record-setting. The capabilities exceed what any of us imagined five years ago. But walk inside and try to turn on the faucet. Try to flush the toilet. Try to take a shower. The house looks magnificent. The pipes aren’t connected. And until they are, it’s not a house. It’s a showroom.
— Harry and Anthony
Sources
- McKinsey: Why Most Companies Are Still Stuck in the Pilot Trap
- SaaStr: We’ve Deployed 20+ AI Agents — Here Are the 10 Mistakes
- StatNews: Everyone Agrees AI Scribes Are Increasing Healthcare Costs
- MyClaw Newsletter: Opus Reasoning Collapse Accelerates
- Nate’s Substack: 512,000 Lines of Leaked Code Reveal the Lock-In Strategy
- Mustafa Suleyman: AI Development Won’t Hit a Wall Anytime Soon
- Anthropic: Claude Mythos Preview Red Team Report
- Z.ai GLM-5.1 Tops SWE-Bench Pro Under MIT License
- The Neuron: Reflection AI’s $25B Bet to Out-Open-Source DeepSeek
- Implicator.ai: 85% of GenAI Deployments Fly Blind on Token Costs
- Marc Andreessen: AGI Pricing Tiers from $20/month to $10B
- Semafor: Leap of Faith — OpenAI Revenue Projections
- Clayton Christensen: The Innovator’s Dilemma (1997)
- CO/AI: “Trust, But Verify” (April 8)
- CO/AI: “Sam Altman Just Pitched the U.S. Taxpayer as OpenAI’s Next Investor” (April 7)