Part II: Multimodal capabilities unlock new opportunities in Vertical AI

Multimodal AI: Expanding Vertical AI’s Impact: The emergence of multimodal models capable of processing audio, video, voice, and vision data is creating new opportunities for vertical AI applications to transform a wider range of industries and workflows.
Key advancements in multimodal architecture:
- Recent models have demonstrated improved context understanding, reduced hallucinations, and enhanced reasoning capabilities.
- Performance in speech recognition, image processing, and voice generation is approaching or surpassing human capabilities in some cases.
- New speech-native models, like OpenAI’s Realtime API and Kyutai’s Moshi, are replacing cascading architectures (separate speech-to-text, language-model, and text-to-speech stages), offering lower latency and better capture of conversational context (a brief sketch of the two approaches follows this list).
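To make the architectural difference concrete, here is a minimal Python sketch contrasting the two approaches. Every function in it (`transcribe`, `complete`, `synthesize`, `respond_speech_to_speech`) is a hypothetical placeholder rather than a real vendor API; the sketch only illustrates why collapsing three sequential model calls into one reduces latency and keeps paralinguistic context available to the model.

```python
# Illustrative comparison of a cascading voice pipeline vs. a speech-native model.
# All functions below are hypothetical stand-ins, not actual vendor APIs.

def transcribe(audio: bytes) -> str:
    """Speech-to-text stage (placeholder)."""
    return "caller asks about shipment status"

def complete(prompt: str) -> str:
    """LLM reasoning stage (placeholder)."""
    return f"Response to: {prompt}"

def synthesize(text: str) -> bytes:
    """Text-to-speech stage (placeholder)."""
    return text.encode("utf-8")

def respond_speech_to_speech(audio_in: bytes) -> bytes:
    """Speech-native, audio-in/audio-out model (placeholder)."""
    return audio_in  # echo as a stand-in

def cascading_turn(audio_in: bytes) -> bytes:
    # Three sequential model calls: latency adds up at each hop, and prosody,
    # interruptions, and speaker emotion are lost once audio becomes text.
    text = transcribe(audio_in)
    reply = complete(text)
    return synthesize(reply)

def speech_native_turn(audio_in: bytes) -> bytes:
    # One speech-to-speech call: lower latency, and tone, pauses, and overlap
    # remain available to the model throughout the turn.
    return respond_speech_to_speech(audio_in)
```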
Voice capabilities and use cases:
Vision capabilities and applications:
- Models like GPT-4V and Gemini 1.5 Pro can interpret images and answer questions about them, and some can process raw video.
- Key use cases include:
  - Data extraction from unstructured documents (e.g., Raft’s platform for freight forwarding); a minimal sketch follows this list.
  - Visual inspection augmentation (e.g., xBuild’s AI construction platform).
  - 2D and 3D design generation (e.g., Snaptrude’s 3D building design tool).
  - Video analytics for safety monitoring and object tracking.
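As a concrete illustration of document data extraction, the sketch below sends a scanned document image to a vision-capable chat model and asks for structured fields. It assumes the `openai` Python SDK (v1.x) with an `OPENAI_API_KEY` in the environment; the model name, file name, and field list are illustrative assumptions, and the code is not tied to Raft’s actual product.

```python
# Hedged sketch: extract structured fields from a scanned shipping document
# using a vision-capable chat model via the openai Python SDK (v1.x).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(image_path: str) -> str:
    # Encode the document image as a base64 data URL for the API.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model (illustrative choice)
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Extract the shipper, consignee, container number, "
                            "and port of discharge from this document. "
                            "Return JSON with those keys."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(extract_fields("bill_of_lading.png"))  # hypothetical file name
```

In practice, a document-extraction workflow would validate the returned JSON against a schema before writing it into downstream systems.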
The rise of AI agents:
- Constraining AI agents to narrowly scoped tasks has reduced the compounding of errors in multi-step reasoning (see the sketch after this list).
- Reasoning-focused foundation models like OpenAI’s o1 are showing promise in complex problem-solving.
- Current applications include:
  - Sales and marketing: researching prospects and crafting personalized outreach.
  - Negotiations: automating legal and commercial term negotiations.
  - Investigations: assisting with the initial phases of cybersecurity alert investigations.
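The sketch below shows, under stated assumptions, what constraining an agent’s task can look like in code: the agent may only call a small whitelist of tools and is cut off after a fixed step budget, so a faulty intermediate step cannot cascade indefinitely. `call_model`, the tools, and the sales-outreach workflow are all hypothetical placeholders, not any particular product’s implementation.

```python
# Minimal sketch of a task-constrained agent loop with a tool whitelist
# and a hard step budget. All names are hypothetical placeholders.
from typing import Callable, Dict, Tuple

def lookup_prospect(company: str) -> str:
    """Placeholder research tool (e.g., CRM or web lookup)."""
    return f"{company}: 200 employees, logistics sector"

def draft_email(context: str) -> str:
    """Placeholder outreach-drafting tool."""
    return f"Personalized email based on: {context}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup_prospect": lookup_prospect,
    "draft_email": draft_email,
}

def call_model(state: str) -> Tuple[str, str]:
    """Placeholder for an LLM that proposes the next (tool, argument) pair."""
    if "employees" not in state:
        return "lookup_prospect", "Acme Freight"
    return "draft_email", state

def run_agent(goal: str, max_steps: int = 3) -> str:
    state = goal
    for _ in range(max_steps):          # hard step budget
        tool, arg = call_model(state)
        if tool not in TOOLS:           # reject anything outside the whitelist
            raise ValueError(f"Tool not allowed: {tool}")
        state = TOOLS[tool](arg)
        if tool == "draft_email":       # terminal action for this workflow
            return state
    return state

print(run_agent("Research Acme Freight and draft outreach"))
```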
Broader implications for vertical AI:
- Multimodal capabilities are expanding the potential impact of vertical AI across industries.
- As underlying models become commoditized, building applications on top of powerful foundation models becomes an increasingly sustainable strategy for companies.
- The integration of these new capabilities is expected to fundamentally change how we work and interact with the world.
Looking ahead: The next wave of vertical AI applications will likely focus on addressing complex workflows autonomously, leveraging advancements in reasoning-based models. As the technology continues to evolve, we can expect to see novel business models emerge, including copilots, agents, and AI-enabled services, opening up new opportunities in previously untapped industries.