Sesame’s CTO reveals how they’re building real-time voice AI that talks like humans

Andreessen Horowitz’s latest episode of AI + a16z features Sesame CTO Ankit Kumar discussing the technical foundations of the company’s voice technology with a16z partner Anjney Midha. The conversation covers the engineering behind real-time conversational AI and how voice interfaces might change human-computer interaction as the technology moves from research labs into everyday applications.

The big picture: Sesame’s voice technology represents a significant advancement in AI-powered conversational interfaces, with the company taking the unusual step of open-sourcing key components of their underlying models.

  • Kumar and Midha explore the technical challenges involved in creating voice AI that can maintain natural conversation flow while balancing personality expression with computational efficiency.
  • The discussion highlights how multimodal AI systems must integrate speech recognition, natural language processing, and speech synthesis in real-time to create convincing voice interactions.
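
To make the real-time integration point concrete, here is a minimal sketch of a streaming pipeline in which recognition, language processing, and synthesis are chained so that each stage starts work before the previous one has finished. This is only an illustration under stated assumptions: the article does not describe Sesame's architecture, and the function names (`recognize`, `respond`, `synthesize`, `voice_pipeline`) and their toy behavior are hypothetical stand-ins, not a real speech API.

```python
from typing import Iterable, Iterator


def recognize(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Hypothetical recognizer: pretends each audio chunk decodes to one word."""
    for chunk in audio_chunks:
        yield chunk.decode("utf-8")


def respond(words: Iterable[str]) -> Iterator[str]:
    """Hypothetical language stage: emits an upper-cased token per incoming word."""
    for word in words:
        yield word.upper()


def synthesize(tokens: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical synthesizer: wraps each token in a fake audio frame."""
    for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")


def voice_pipeline(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Chain the three stages lazily so output streams as input arrives."""
    return synthesize(respond(recognize(audio_chunks)))


frames = list(voice_pipeline([b"hello", b"there"]))
```

Because every stage is a generator, the first output frame can be produced after only the first input chunk has been consumed, which is the property a low-latency voice system depends on.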

Key technical challenges: Developing real-time voice AI requires overcoming several complex engineering hurdles that balance performance with computational constraints.

  • Full-duplex conversation modeling, which lets the AI listen and speak at the same time as humans do, is a particularly difficult problem that Sesame has addressed in its technology.
  • The team has implemented specific computational optimizations to achieve the low-latency interactions necessary for natural-feeling conversations without requiring excessive processing power.
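
The idea of listening while speaking can be sketched with two concurrent tasks: one plays the reply, the other keeps watching the microphone and interrupts playback when the user starts talking. This is a simplified illustration, not Sesame's method: true full-duplex modeling happens inside the model, whereas this sketch only shows the concurrency pattern. The names `speak`, `listen`, and `converse`, and the string events `"silence"` and `"speech"`, are assumptions made for the example.

```python
import asyncio


async def speak(frames, interrupted, played):
    """Play reply frames one at a time, stopping once the user barges in."""
    for frame in frames:
        if interrupted.is_set():
            break
        played.append(frame)
        await asyncio.sleep(0)  # yield control so the listener can run


async def listen(mic_events, interrupted):
    """Watch (hypothetical) microphone events while the reply is playing."""
    for event in mic_events:
        if event == "speech":
            interrupted.set()  # signal the speaker to stop
            return
        await asyncio.sleep(0)  # yield control so the speaker can run


async def converse(frames, mic_events):
    """Run speaking and listening concurrently; return the frames actually played."""
    interrupted = asyncio.Event()
    played = []
    await asyncio.gather(
        speak(frames, interrupted, played),
        listen(mic_events, interrupted),
    )
    return played


played = asyncio.run(
    converse(["f1", "f2", "f3", "f4"], ["silence", "silence", "speech"])
)
```

In this run the reply is cut short once the `"speech"` event arrives, so the final frame is never played; with only `"silence"` events, every frame is played.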

Why open-sourcing matters: Sesame’s decision to release key components of their model architecture reflects a strategic approach to advancing voice AI technology within the broader ecosystem.

  • Open-sourcing creates opportunities for community contributions while potentially accelerating adoption of their underlying technical approach.
  • The move suggests Sesame believes their competitive advantage lies in implementation and product experience rather than solely in proprietary model architecture.

In plain English: Sesame is building AI that can talk with people naturally in real-time, and they’re sharing some of their technical blueprints with the broader developer community rather than keeping everything proprietary.

Technical deep dives: The conversation explores advanced concepts in speech AI that explain how modern voice interfaces are evolving beyond simple command-response patterns.

  • Kumar breaks down how multimodal AI systems must integrate different types of intelligence – processing audio input, understanding language context, and generating natural-sounding speech – all while maintaining conversation flow.
  • The discussion addresses scaling laws in speech synthesis, examining how larger models affect voice quality and expressiveness compared to more optimized smaller models.

Where voice interfaces are heading: The conversation positions natural language as potentially the most intuitive user interface, capable of redefining how humans interact with technology.

  • Voice AI’s evolution toward more contextual understanding and human-like conversational abilities could make technology more accessible to people regardless of technical literacy.
  • The discussion suggests voice interfaces may eventually become the primary way people interact with digital systems, supplementing or replacing screen-based interfaces in many contexts.
Building the Next Generation of Conversational AI
