Salesforce Introduces New Family of Multimodal Action Models Named TACO
Salesforce has unveiled TACO, a new family of multimodal AI models that can process multiple types of data and perform complex reasoning tasks using a step-by-step approach.
Key Innovation: TACO represents a significant advancement in multimodal AI through chains-of-thought-and-action (CoTA): the model interleaves reasoning steps with calls to external tools, allowing it to interpret images and text and to carry out numerical calculations along the way.
- The system utilizes external tools like optical character recognition (OCR), depth estimation, and calculators to process different types of information
- TACO can break down complex questions into smaller, manageable steps and execute them sequentially, as sketched in the loop below
- The model demonstrates particular strength in tasks requiring both visual understanding and mathematical reasoning
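The CoTA pattern is essentially a loop that alternates model reasoning with tool execution. The sketch below is a minimal Python illustration of that loop; the tool set, the propose_step interface, and the trace format are assumptions for illustration, not Salesforce's actual TACO API.

```python
# Minimal sketch of a chain-of-thought-and-action (CoTA) loop.
# All names and the action set below are hypothetical illustrations,
# not Salesforce's TACO implementation.

def run_ocr(image):
    """Placeholder OCR tool: a real system would return text read from the image."""
    raise NotImplementedError

def calculate(expression):
    """Tiny calculator tool: evaluates a pure-arithmetic expression string."""
    return eval(expression, {"__builtins__": {}}, {})

TOOLS = {"ocr": run_ocr, "calculate": calculate}

def answer_with_cota(model, question, image, max_steps=8):
    """Alternate between model 'thoughts' and tool 'actions' until the model
    emits a final answer. `model.propose_step` is an assumed interface that
    returns either {"thought": ..., "action": ..., "input": ...} or {"answer": ...}."""
    trace = []
    for _ in range(max_steps):
        step = model.propose_step(question, image, trace)
        if "answer" in step:                   # the model decided it is done
            return step["answer"], trace
        tool = TOOLS[step["action"]]           # e.g. "ocr" or "calculate"
        observation = tool(step["input"])      # execute the chosen tool
        trace.append({**step, "observation": observation})
    return None, trace                         # give up after max_steps
```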
Technical Implementation: Salesforce developed TACO through an extensive training process designed to enhance its problem-solving capabilities.
- The model was trained using over 1 million synthetic CoTA traces
- Training incorporated both model-based and programmatic generation methods (a programmatic sketch follows this list)
- TACO showed 30-50% better performance compared to traditional direct-answer models
- The system achieved up to 20% improvement over baseline models on the MMVet benchmark
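A programmatic generator can expand a templated question into a full thought/action/observation sequence without calling a model at all. The sketch below shows one such template for a depth-comparison question; the JSON schema and tool names are assumptions, not Salesforce's actual training-data format.

```python
# Hypothetical sketch of programmatic CoTA trace generation: a templated
# depth-comparison question is expanded into thought/action/observation
# steps ending in a final answer.
import json

def make_depth_trace(object_a, object_b, depth_a, depth_b):
    """Build one synthetic CoTA trace asking which of two objects is closer."""
    question = f"Which is closer to the camera, the {object_a} or the {object_b}?"
    closer = object_a if depth_a < depth_b else object_b
    trace = [
        {"thought": f"Estimate how far away the {object_a} is.",
         "action": "depth_estimation", "input": object_a,
         "observation": f"{depth_a} m"},
        {"thought": f"Estimate how far away the {object_b} is.",
         "action": "depth_estimation", "input": object_b,
         "observation": f"{depth_b} m"},
    ]
    return {"question": question, "trace": trace, "answer": f"the {closer}"}

if __name__ == "__main__":
    # Emit one synthetic example as a JSON line, ready to mix into a training set.
    print(json.dumps(make_depth_trace("bicycle", "bus", 2.4, 11.7)))
```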
Practical Applications: TACO’s architecture enables it to tackle real-world problems that require multiple steps and different types of reasoning.
- The model can handle practical questions like calculating gas purchases from photographed price signs (a worked example follows this list)
- Future applications could include medical question answering and web navigation tasks
- The framework is designed to be adaptable for training new models with different actions across various domains
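As a concrete illustration of the gas-purchase question, the values below are hypothetical: OCR would supply the per-gallon price from the photographed sign, and the calculator tool would multiply it by the requested volume.

```python
# Worked illustration of the gas-purchase example with hypothetical values:
# the price comes from the OCR step, the volume from the user's question.
price_per_gallon = 3.59             # read from the photographed sign (assumed)
gallons = 12                        # stated in the question (assumed)
total = price_per_gallon * gallons  # the calculator-tool step
print(f"Total cost: ${total:.2f}")  # -> Total cost: $43.08
```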
Looking Ahead: While TACO represents a significant step forward in multimodal AI capabilities, its true impact will likely depend on how effectively it can be integrated into practical applications and whether it can maintain consistent performance across diverse real-world scenarios.