A great demo is just the starting point; getting AI agents to perform reliably in production is the real challenge. In his AI Dev 25 talk, Aman Khan, Director of Product at Arize, shared how his team moved beyond simple accuracy checks to build more robust evaluation frameworks for generative AI systems. Drawing on real-world experience, he outlined how to:

– Use LLMs as judges for nuanced evaluation (a minimal sketch of this pattern appears below)
– Build automated pipelines to catch issues early
– Establish feedback loops and workflows that support rapid iteration without compromising quality

Whether you’re just getting started with agents or scaling them in production, this session offers practical techniques for evaluating and improving agent performance.
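To make the LLM-as-judge idea concrete, here is a minimal sketch of the general pattern, not Arize's implementation from the talk. It assumes an OpenAI-compatible Python client; the model name, grading rubric, and the `AGENT_RESPONSES` dataset are placeholders for illustration.

```python
# LLM-as-judge sketch: a second model grades each agent response against a rubric.
# Assumes the `openai` Python package (>=1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating an AI agent's answer.
Question: {question}
Answer: {answer}

Rate the answer on correctness and relevance.
Respond with exactly one word: "pass" or "fail"."""


def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a judge model to grade a single agent response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic grading
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("pass")


# Hypothetical evaluation set; in practice this would come from logged agent traces.
AGENT_RESPONSES = [
    {"question": "What is the capital of France?", "answer": "Paris."},
    {"question": "Summarize the refund policy.", "answer": "Refunds are issued within 30 days."},
]

if __name__ == "__main__":
    results = [judge(r["question"], r["answer"]) for r in AGENT_RESPONSES]
    print(f"pass rate: {sum(results)}/{len(results)}")
```

Running a judge like this over logged responses in CI is one way to turn the "automated pipeline" idea into an early-warning check: a drop in the pass rate flags regressions before they reach users.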