How multimodal AI models are unlocking opportunities for vertical applications

Multimodal AI is expanding vertical AI’s impact: the emergence of models capable of processing audio, video, voice, and vision data is creating new opportunities for vertical AI applications to transform a wider range of industries and workflows.

Key advancements in multimodal architecture:

  • Recent models have demonstrated improved context understanding, reduced hallucinations, and enhanced reasoning capabilities.
  • Performance in speech recognition, image processing, and voice generation is approaching, and in some cases surpassing, human capabilities.
  • New speech-native models, such as OpenAI’s Realtime API and Kyutai’s Moshi, are replacing cascading architectures, offering lower latency and better capture of vocal context (a minimal comparison is sketched below).
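
To make the architectural difference concrete, here is a minimal sketch contrasting the two approaches. The stub functions are placeholders for separate speech-to-text, text LLM, and text-to-speech services, not any vendor’s actual API; the sleeps stand in for per-hop latency.

```python
import time

# Stand-ins for three separate services in a cascading pipeline; the sleep
# calls are placeholders for network and inference latency at each hop.
def transcribe(audio: bytes) -> str:
    time.sleep(0.3)
    return "transcribed caller question"

def generate_reply(text: str) -> str:
    time.sleep(0.5)
    return f"answer to: {text}"

def synthesize_speech(text: str) -> bytes:
    time.sleep(0.3)
    return text.encode()

def cascading_turn(audio_in: bytes) -> bytes:
    """Cascade: three sequential hops, so latencies add up and the caller's
    tone, pauses, and emphasis are discarded at the transcription step."""
    return synthesize_speech(generate_reply(transcribe(audio_in)))

def speech_native_turn(audio_in: bytes, speech_model) -> bytes:
    """Speech-native: one model consumes and emits audio directly, cutting
    round trips and preserving vocal context end to end."""
    return speech_model.respond(audio_in)  # hypothetical single-model interface
```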

Voice capabilities and use cases:

  • Transcription applications are freeing up time for professionals in various fields:

    • Abridge’s medical transcription tool generates notes and identifies follow-ups from clinical conversations.
    • Rillavoice records and transcribes conversations for sales training in the home services industry.
  • End-to-end voice agents are showing promise in multiple areas (a minimal agent loop is sketched after this list):

    • Inbound sales: Fielding customer calls after hours and booking appointments.
    • Customer support: Providing more effective responses than traditional IVR systems.
    • Outbound calls: Automating initial contact for sales and recruiting teams.
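
As a rough illustration of how such an agent might be structured, the sketch below shows a bounded turn loop for an after-hours inbound call that can book an appointment. The call, speech_model, and calendar interfaces are hypothetical placeholders, not any specific product’s API.

```python
from dataclasses import dataclass, field

@dataclass
class CallState:
    transcript: list[str] = field(default_factory=list)
    booked: bool = False

def handle_inbound_call(call, speech_model, calendar) -> CallState:
    """Loop over caller utterances until the call ends or an appointment is booked."""
    state = CallState()
    call.play(speech_model.greeting())            # open with a spoken greeting
    for utterance in call.audio_turns():          # one audio chunk per caller turn
        reply = speech_model.respond(utterance, context=state.transcript)
        state.transcript.append(reply.text)
        if reply.intent == "book_appointment":    # model flags a booking request
            slot = calendar.first_open_slot()
            calendar.book(slot, caller=call.caller_id)
            state.booked = True
            call.play(speech_model.confirm(slot))
            break
        call.play(reply.audio)                    # otherwise continue the conversation
    return state
```

Keeping the loop this narrow (greet, answer, book, hang up) is part of what makes current voice agents viable: the agent never strays far from a single, well-defined outcome.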

Vision capabilities and applications:

  • Models like GPT-4V and Gemini 1.5 Pro can process raw images and video, interpret their contents, and answer questions about them.
  • Key use cases include:
    • Data extraction from unstructured documents (e.g., Raft’s platform for freight forwarding; see the sketch after this list).
    • Visual inspection augmentation (e.g., xBuild’s AI construction platform).
    • 2D and 3D design generation (e.g., Snaptrude’s 3D building design tool).
    • Video analytics for safety monitoring and object tracking.
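
For instance, document data extraction can be as simple as sending a page image to a vision-capable model and asking for structured output. The sketch below uses the OpenAI Python SDK’s image-input format; the model name, prompt, and field list are illustrative assumptions, and this is not a description of Raft’s actual pipeline.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()

def extract_shipping_fields(image_path: str) -> dict:
    """Send a scanned shipping document to a multimodal model and parse key fields."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would work here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the shipper, consignee, container number, and total "
                         "charges from this document. Reply with JSON only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```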

The rise of AI agents:

  • Constraining AI agents to narrower, well-scoped tasks has reduced errors in multi-step reasoning (a constrained investigation loop is sketched after this list).
  • Reasoning-focused foundation models like OpenAI’s o1 are showing promise in complex problem-solving.
  • Current applications include:
    • Sales and marketing: Researching prospects and crafting personalized outreach.
    • Negotiations: Automating legal and commercial term negotiations.
    • Investigations: Assisting with initial phases of cybersecurity alert investigations.
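
The sketch below illustrates the task-constraining idea for the alert-triage case: the agent may only call a small whitelist of read-only tools and is capped at a fixed number of steps before handing off to a human analyst. The tool names, the model interface, and the step cap are all illustrative assumptions.

```python
from typing import Callable

# Whitelisted, read-only tools: the agent can gather context but change nothing.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_alert": lambda alert_id: f"details for alert {alert_id}",
    "query_logs": lambda query: f"log lines matching {query!r}",
    "check_threat_intel": lambda indicator: f"reputation for {indicator}",
}

MAX_STEPS = 6  # hard cap keeps a confused agent from looping indefinitely

def investigate(alert_id: str, model) -> str:
    """Run a bounded plan-act loop, then hand a triage summary to a human."""
    history = [f"New alert: {alert_id}"]
    for _ in range(MAX_STEPS):
        action = model.next_action(history, allowed_tools=list(TOOLS))
        if action.name == "finish":
            break
        if action.name not in TOOLS:            # reject anything off the whitelist
            history.append(f"Rejected tool: {action.name}")
            continue
        history.append(TOOLS[action.name](action.argument))
    return model.summarize(history)             # summary for human review
```

Bounding both the action space and the step count is what keeps errors from compounding across a multi-step investigation.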

Broader implications for vertical AI:

  • Multimodal capabilities are expanding the potential impact of vertical AI across industries.
  • As the underlying models become commoditized, building applications on top of powerful foundation models becomes the more sustainable path for companies.
  • The integration of these new capabilities is expected to fundamentally change how we work and interact with the world.

Looking ahead: The next wave of vertical AI applications will likely focus on addressing complex workflows autonomously, leveraging advancements in reasoning-based models. As the technology continues to evolve, we can expect to see novel business models emerge, including copilots, agents, and AI-enabled services, opening up new opportunities in previously untapped industries.
