Meta releases quantized models to run AI efficiently on mobile devices

Quantized Llama models, a leap forward in mobile AI: Meta has released lightweight quantized versions of its Llama 3.2 1B and 3B language models, designed to run efficiently on popular mobile devices while keeping accuracy close to the full-precision originals.

Key advancements in model efficiency:

  • The quantized models achieve a 2-4x speedup compared to their original counterparts.
  • Model size has been reduced by an average of 56%.
  • Memory usage has decreased by an average of 41%.
  • These improvements enable on-device AI capabilities with enhanced privacy and speed.
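The headline numbers above are easy to sanity-check with back-of-envelope arithmetic. The figures below are illustrative, not Meta's published file sizes: a 1B-parameter model stored in 16-bit floats occupies roughly 2 GB, and the reported 56% size reduction brings it under 1 GB, within a phone's memory budget.

```python
# Back-of-envelope size check (illustrative numbers, not measured files).
params = 1_000_000_000            # Llama 3.2 1B parameter count, rounded
bf16_bytes = params * 2           # 16-bit (2-byte) weights
reduced_bytes = bf16_bytes * (1 - 0.56)  # the reported 56% average reduction

print(f"{bf16_bytes / 1e9:.1f} GB -> {reduced_bytes / 1e9:.2f} GB")
# roughly 2.0 GB -> 0.88 GB
```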

Quantization techniques employed:

  • Quantization-Aware Training with LoRA adaptors (QLoRA): This method prioritizes accuracy by simulating quantization effects during training.
  • SpinQuant: A post-training quantization technique that emphasizes portability and can be applied without access to training datasets.
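The core trick in quantization-aware training is "fake quantization": during the forward pass, weights are rounded to a low-bit grid and mapped back to floats, so the network learns to tolerate quantization error. The sketch below illustrates that one step with a per-tensor symmetric scale; it is a simplified illustration, not Meta's exact recipe.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Round w onto a symmetric `bits`-bit grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit symmetric
    scale = np.abs(w).max() / qmax or 1.0      # per-tensor scale (illustrative)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized values used in training

w = np.array([0.31, -0.7, 0.02, 0.55])
w_q = fake_quantize(w, bits=4)
# w_q now lies on a 16-level grid; a real QAT loop backpropagates through
# this rounding with a straight-through estimator.
```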

Technical implementation details:

  • The quantization scheme targets PyTorch’s ExecuTorch inference framework and Arm CPU backend.
  • Linear layers in transformer blocks use 4-bit groupwise quantization for weights and 8-bit per-token dynamic quantization for activations.
  • The classification layer employs 8-bit per-channel quantization for weights and 8-bit per-token dynamic quantization for activations.
  • Embedding uses 8-bit per-channel quantization.
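The groupwise scheme in the first bullet can be sketched in a few lines: each group of consecutive weights gets its own scale, which limits the damage a single outlier can do to the rest of the group. The group size of 32 and symmetric int4 range here are assumptions for illustration; the exact configuration follows ExecuTorch's quantization setup.

```python
import numpy as np

def quantize_groupwise_4bit(weights, group_size=32):
    """Quantize a 1-D weight array to int4 with one scale per group."""
    w = weights.reshape(-1, group_size)                    # (n_groups, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0    # int4 range is [-8, 7]
    scales[scales == 0] = 1.0                              # guard all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
q, s = quantize_groupwise_4bit(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()   # bounded by half a quantization step
```

Activations, by contrast, are quantized per token at runtime ("dynamic" quantization), since their ranges are not known until inference.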

Collaborative development and optimization:

  • Meta worked closely with industry partners including Qualcomm and MediaTek to optimize the models for specific SoCs with Arm CPUs.
  • Performance optimizations utilize Arm’s KleidiAI kernels for mobile CPUs.
  • Ongoing collaborations aim to leverage NPUs for even greater performance gains.

Implications for developers:

  • The quantized models enable developers to create unique, privacy-focused AI experiences that run entirely on-device.
  • Developers can use the quantization-aware-trained (QAT) models as a foundation and further fine-tune them with LoRA for specific use cases.
  • The SpinQuant method allows developers to quantize their own fine-tuned Llama models for various hardware targets.
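The LoRA fine-tuning mentioned above keeps the base weights frozen and trains only a low-rank update. A minimal sketch of the idea (shapes and rank are illustrative): the effective weight is W + B @ A, and because B starts at zero, the adapted model is initially identical to the base model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 16, 2

W = rng.normal(size=(d_out, d_in))         # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, zero-init

x = rng.normal(size=d_in)
y_base = W @ x
y_lora = (W + B @ A) @ x   # identical to y_base until B receives gradient updates
```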

Broader context and future outlook: The release of these quantized Llama models represents a significant step towards making advanced AI capabilities more accessible on mobile devices. This development aligns with the growing trend of edge AI, where processing occurs on-device rather than in the cloud, offering benefits in terms of privacy, latency, and offline functionality.

  • The quantized models’ ability to run efficiently on mobile CPUs opens up new possibilities for AI-powered applications in resource-constrained environments.
  • Meta’s commitment to open-sourcing these models and collaborating with industry partners suggests a push towards democratizing AI technology.
  • As work continues on optimizing these models for NPUs, we can expect even greater performance improvements in the future, potentially enabling more complex AI tasks on mobile devices.
Source: Introducing quantized Llama models with increased speed and a reduced memory footprint (Meta AI blog)
