SmolVLM Grows Smaller – Introducing the 256M & 500M Models!

Hugging Face has released two new additions to the SmolVLM model family. The new compact vision language models – a 256M-parameter version and a 500M-parameter version – are designed to deliver efficient multimodal AI capabilities while maintaining a small computational footprint.
Core innovations: The new SmolVLM models represent significant architectural improvements over their 2B-parameter predecessor, introducing key optimizations for real-world applications.
- The models now utilize a streamlined 93M-parameter SigLIP vision encoder, drastically reduced from the previous 400M version (a quick way to verify this locally is sketched after this list)
- Images are processed at a higher resolution, which improves visual comprehension despite the smaller encoder
- New tokenization optimizations boost performance in practical applications
- The training data mixture has been refined to better handle document understanding and image captioning tasks
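To make the encoder change concrete, here is a minimal sketch for checking the vision tower's parameter count yourself. It assumes the Idefics3-style module layout SmolVLM uses in transformers, with the vision weights under model.model.vision_model; that attribute path is an assumption and may differ across transformers versions.

```python
# Minimal sketch: count vision-encoder vs. total parameters.
# Assumes the Idefics3-style layout SmolVLM uses in transformers
# (vision weights under model.model.vision_model) - verify for your version.
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct",  # 500M variant: SmolVLM-500M-Instruct
    torch_dtype=torch.bfloat16,
)

vision_params = sum(p.numel() for p in model.model.vision_model.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"vision encoder: {vision_params / 1e6:.0f}M of {total_params / 1e6:.0f}M total")
```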
Technical specifications: Both new models come in multiple variants to support different use cases and deployment scenarios.
- Four distinct checkpoints are available: base and instruction-tuned versions for both the 256M and 500M parameter models
- The models maintain compatibility across transformers, MLX and ONNX frameworks
- ColSmolVLM variants have been released specifically for multimodal retrieval applications
- Existing SmolVLM code bases can be used for inference and fine-tuning with the new models; a minimal transformers inference example follows this list
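As a concrete example of that compatibility, the sketch below follows the standard SmolVLM transformers recipe and simply points it at one of the new checkpoints. The image path is a placeholder; substitute your own file or URL.

```python
# Minimal SmolVLM inference sketch with transformers.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

image = load_image("path/to/your/image.jpg")  # placeholder: local path or URL

# Build a chat-style prompt with one image placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```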
Deployment flexibility: The models offer multiple implementation paths to accommodate various technical requirements and use cases.
- Ready-to-use code examples are provided for both transformers and MLX implementations (a hedged MLX sketch follows this list)
- ONNX checkpoints enable broad platform compatibility
- WebGPU demonstrations showcase the models’ capabilities in browser-based applications
- Comprehensive documentation includes multimodal RAG (Retrieval-Augmented Generation) examples
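For the MLX path, here is a hedged sketch using the mlx-vlm package (Apple Silicon only). The load and generate entry points are real mlx-vlm functions, but the exact generate() keyword names have varied across versions, so treat the argument names here as assumptions and check the package's documentation for your release.

```python
# Hedged sketch of MLX inference via the mlx-vlm package (Apple Silicon).
# The generate() keyword names below are assumptions; they have changed
# between mlx-vlm versions.
from mlx_vlm import load, generate

model, processor = load("HuggingFaceTB/SmolVLM-256M-Instruct")
output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="path/to/your/image.jpg",  # placeholder path
    max_tokens=256,
)
print(output)
```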
Looking ahead: The release of these ultra-compact models marks a significant step toward making multimodal AI more accessible and deployable across a broader range of devices and applications, though their reduced size may present some performance trade-offs compared to larger models.