Sesame’s CTO reveals how they’re building real-time voice AI that talks like humans

The latest episode of Andreessen Horowitz’s AI + a16z podcast features Sesame CTO Ankit Kumar in conversation with a16z partner Anjney Midha about the technical foundations of the company’s voice technology. The discussion offers a rare glimpse into the engineering behind real-time conversational AI and explores how voice interfaces might fundamentally change human-computer interaction as the technology moves from research labs into everyday applications.

The big picture: Sesame’s voice technology represents a significant advancement in AI-powered conversational interfaces, with the company taking the unusual step of open-sourcing key components of their underlying models.

  • Kumar and Midha explore the technical challenges involved in creating voice AI that can maintain natural conversation flow while balancing personality expression with computational efficiency.
  • The discussion highlights how multimodal AI systems must integrate speech recognition, natural language processing, and speech synthesis in real-time to create convincing voice interactions.

Key technical challenges: Developing real-time voice AI requires overcoming several complex engineering hurdles that balance performance with computational constraints.

  • Full-duplex conversation modeling, which allows the AI to both listen and speak simultaneously like humans do, represents a particularly difficult problem that Sesame has addressed in their technology.
  • The team has implemented specific computational optimizations to achieve the low-latency interactions necessary for natural-feeling conversations without requiring excessive processing power.
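To make the full-duplex idea concrete: the defining property is that listening and speaking run concurrently, so the system can detect the user barging in and yield the floor mid-utterance. The podcast does not describe Sesame's implementation; the sketch below is a minimal, hypothetical simulation using Python's asyncio, with stubbed "audio" events standing in for real microphone and speaker streams.

```python
import asyncio

async def listen(mic_events, interrupt):
    """Simulated listener: watches incoming mic events and raises the
    interrupt flag the moment user speech is detected (barge-in)."""
    for delay, event in mic_events:
        await asyncio.sleep(delay)
        if event == "speech":
            interrupt.set()  # user started talking; stop our own audio
            return

async def speak(chunks, interrupt, played):
    """Simulated speaker: streams synthesized audio chunk by chunk,
    checking between chunks whether the user has barged in."""
    for chunk in chunks:
        if interrupt.is_set():
            return  # yield the floor immediately
        played.append(chunk)
        await asyncio.sleep(0.01)  # pretend each chunk takes 10 ms to play

async def full_duplex_turn(chunks, mic_events):
    interrupt = asyncio.Event()
    played = []
    # Listening and speaking run at the same time -- the defining
    # property of full-duplex interaction, unlike walkie-talkie-style
    # turn taking where one side must finish before the other starts.
    await asyncio.gather(
        listen(mic_events, interrupt),
        speak(chunks, interrupt, played),
    )
    return played

# The user barges in ~25 ms into a six-chunk reply, so only the
# chunks already played before the interruption make it out.
played = asyncio.run(full_duplex_turn(
    chunks=[f"chunk{i}" for i in range(6)],
    mic_events=[(0.025, "speech")],
))
print(played)
```

In a production system the chunk-boundary check would be replaced by streaming voice-activity detection on the microphone signal, but the concurrency structure is the same.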

Why open-sourcing matters: Sesame’s decision to release key components of their model architecture reflects a strategic approach to advancing voice AI technology within the broader ecosystem.

  • Open-sourcing creates opportunities for community contributions while potentially accelerating adoption of their underlying technical approach.
  • The move suggests Sesame believes their competitive advantage lies in implementation and product experience rather than solely in proprietary model architecture.

In plain English: Sesame is building AI that can talk with people naturally in real-time, and they’re sharing some of their technical blueprints with the broader developer community rather than keeping everything proprietary.

Technical deep dives: The conversation explores advanced concepts in speech AI that explain how modern voice interfaces are evolving beyond simple command-response patterns.

  • Kumar breaks down how multimodal AI systems must integrate different types of intelligence – processing audio input, understanding language context, and generating natural-sounding speech – all while maintaining conversation flow.
  • The discussion addresses scaling laws in speech synthesis, examining how larger models affect voice quality and expressiveness compared to more optimized smaller models.
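The integration challenge Kumar describes can be illustrated with a cascaded pipeline: recognition, language understanding, and synthesis run in sequence, so per-turn latency is the sum of the stages. This is a generic sketch with stubbed components and invented timings, not Sesame's architecture; it only shows why each stage's latency budget matters for conversation flow.

```python
import time

def transcribe(audio):
    """Speech-recognition stub: audio bytes in, transcript out."""
    time.sleep(0.02)  # stand-in for ASR inference time
    return "what's the weather like"

def respond(text):
    """Language-model stub: produce a reply to the transcript."""
    time.sleep(0.05)  # stand-in for LLM generation time
    return f"You asked: {text}. Looks sunny today."

def synthesize(text):
    """Speech-synthesis stub: reply text in, audio bytes out."""
    time.sleep(0.03)  # stand-in for TTS inference time
    return text.encode("utf-8")  # pretend this is a waveform

def voice_turn(audio):
    """One request/response turn. Total latency is the sum of the
    stages, which is why real systems stream partial results between
    stages instead of waiting for each one to finish."""
    t0 = time.perf_counter()
    transcript = transcribe(audio)
    reply = respond(transcript)
    audio_out = synthesize(reply)
    latency_ms = (time.perf_counter() - t0) * 1000
    return audio_out, latency_ms

audio_out, latency_ms = voice_turn(b"\x00" * 1600)
print(f"turn latency: {latency_ms:.0f} ms")
```

With the invented stage times above, a blocking pipeline cannot respond faster than the stages combined; streaming and tighter multimodal integration are the standard ways to cut that figure toward conversational latency.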

Where voice interfaces are heading: The conversation positions natural language as potentially the most intuitive user interface, capable of redefining how humans interact with technology.

  • Voice AI’s evolution toward more contextual understanding and human-like conversational abilities could make technology more accessible to people regardless of technical literacy.
  • The discussion suggests voice interfaces may eventually become the primary way people interact with digital systems, supplementing or replacing screen-based interfaces in many contexts.
