AI Dev 25 | Aman Khan: Beyond Vibe Checks—Rethinking How We Evaluate AI Agent Performance

A great demo is just the starting point; getting AI agents to perform reliably in production is the real challenge. In his AI Dev 25 talk, Aman Khan, Director of Product at Arize, shared how his team moved beyond simple accuracy checks to build more robust evaluation frameworks for generative AI systems. Drawing from real-world experience, he outlined how to:

– Use LLMs as judges for nuanced evaluation
– Build automated pipelines to catch issues early
– Establish feedback loops and workflows that support rapid iteration without compromising quality

Whether you’re just getting started with agents or scaling them in production, this session offers practical techniques for evaluating and improving agent performance.
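To make the "LLMs as judges" and "automated pipelines" ideas concrete, here is a minimal sketch of what such a check might look like. It is not taken from the talk: the prompt template, the label set, and every name (`Example`, `evaluate`, `stub_judge`, `pass_rate`) are assumptions for illustration, and `stub_judge` is a deterministic stand-in for a real model call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical grading prompt; a real judge prompt would be tuned per task.
JUDGE_TEMPLATE = (
    "You are grading an AI agent's answer.\n"
    "Question: {question}\n"
    "Agent answer: {answer}\n"
    "Reply with exactly one word: correct or incorrect."
)


@dataclass
class Example:
    question: str
    answer: str


def parse_label(raw: str) -> str:
    """Map the judge's free-text reply onto a fixed label set."""
    word = raw.strip().lower().rstrip(".")
    return word if word in ("correct", "incorrect") else "unparseable"


def evaluate(examples: List[Example], judge: Callable[[str], str]) -> List[str]:
    """Ask the judge about each example and return one label per example."""
    labels = []
    for ex in examples:
        prompt = JUDGE_TEMPLATE.format(question=ex.question, answer=ex.answer)
        labels.append(parse_label(judge(prompt)))
    return labels


def pass_rate(labels: List[str]) -> float:
    """Fraction of examples labeled correct (0.0 for an empty run)."""
    return sum(label == "correct" for label in labels) / len(labels) if labels else 0.0


def stub_judge(prompt: str) -> str:
    # Assumption: stands in for an LLM call; replace with a real client.
    return "Correct." if "Agent answer: Paris" in prompt else "incorrect"


examples = [
    Example("What is the capital of France?", "Paris"),
    Example("What is the capital of France?", "Lyon"),
]
labels = evaluate(examples, stub_judge)
rate = pass_rate(labels)
```

In an automated pipeline, a value like `rate` could be compared against a threshold so that a regression fails the build before it reaches production; the threshold itself is a team decision and is left out here.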
