Meta releases ‘quantized models’ to efficiently run AI on mobile devices

Quantized Llama models: A leap forward in mobile AI: Meta has released lightweight quantized versions of its Llama 3.2 1B and 3B language models, designed to run efficiently on popular mobile devices while maintaining high performance and accuracy.

Key advancements in model efficiency:

  • The quantized models achieve a 2-4x speedup compared to their original counterparts.
  • Model size has been reduced by an average of 56%.
  • Memory usage has decreased by an average of 41%.
  • These improvements enable on-device AI capabilities with enhanced privacy and speed.

Quantization techniques employed:

  • Quantization-Aware Training with LoRA adaptors (QLoRA): This method prioritizes accuracy by simulating quantization effects during training.
  • SpinQuant: A post-training quantization technique that emphasizes portability and can be applied without access to training datasets.
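The core idea behind quantization-aware training is that the forward pass "fake-quantizes" weights during training, so the model learns to tolerate rounding error before it is ever deployed. Here is a minimal pure-Python sketch of that step; the group size, symmetric rounding, and scale formula are illustrative assumptions, not Meta's exact recipe (which runs through PyTorch's tooling):

```python
def fake_quantize(weights, num_bits=4, group_size=32):
    """Simulate groupwise symmetric quantization: snap each weight to a
    low-bit integer grid, then dequantize back to float. Training against
    these snapped values is what makes the model robust to quantization."""
    qmax = 2 ** (num_bits - 1) - 1  # e.g. 7 for signed 4-bit
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        # quantize (round onto the integer grid), then dequantize
        out.extend(round(w / scale) * scale for w in group)
    return out

w = [0.31, -0.62, 0.05, 0.9]
print(fake_quantize(w, num_bits=4, group_size=4))
```

Note that every dequantized value stays within half a quantization step of the original, which is the error the QAT process teaches the model to absorb.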

Technical implementation details:

  • The quantization scheme targets PyTorch’s ExecuTorch inference framework and Arm CPU backend.
  • Linear layers in transformer blocks use 4-bit groupwise quantization for weights and 8-bit per-token dynamic quantization for activations.
  • The classification layer employs 8-bit per-channel quantization for weights and 8-bit per-token dynamic quantization for activations.
  • The embedding layer uses 8-bit per-channel quantization for weights.
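To make the "8-bit per-token dynamic" part of the scheme concrete, here is a small pure-Python sketch: each token (row of activations) gets its own scale, computed on the fly at inference time. The symmetric int8 scheme below is an assumption for illustration; the production path lives in ExecuTorch's optimized kernels:

```python
def quantize_per_token(activations):
    """8-bit per-token dynamic quantization: each row (one token's
    activations) gets its own scale from that row's max magnitude, so an
    outlier token cannot wreck the precision of every other token."""
    quantized, scales = [], []
    for row in activations:
        scale = max(abs(x) for x in row) / 127 or 1.0
        quantized.append([round(x / scale) for x in row])  # int8 codes
        scales.append(scale)
    return quantized, scales

acts = [[0.1, -2.0, 0.5],    # token 0: small dynamic range
        [10.0, 0.2, -0.3]]   # token 1: much larger range
q, s = quantize_per_token(acts)
```

Because the scales are per-row, token 1's large value of 10.0 does not force token 0's small activations into just a few quantization bins, which is the main motivation for per-token rather than per-tensor scaling.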

Collaborative development and optimization:

  • Meta worked closely with industry partners including Qualcomm and MediaTek to optimize the models for specific SoCs with Arm CPUs.
  • Performance optimizations utilize Kleidi AI kernels for mobile CPUs.
  • Ongoing collaborations aim to leverage NPUs for even greater performance gains.

Implications for developers:

  • The quantized models enable developers to create unique, privacy-focused AI experiences that run entirely on-device.
  • Developers can use the QAT models as a foundation and further fine-tune them with LoRA for specific use cases.
  • The SpinQuant method allows developers to quantize their own fine-tuned Llama models for various hardware targets.
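The LoRA fine-tuning path above can be sketched in a few lines: the frozen (possibly quantized) weight matrix W is left untouched, and training only learns two small low-rank matrices A and B whose contribution is added to W's output. This pure-Python sketch uses an illustrative rank-1 example; shapes, the `alpha` scale, and the rank are assumptions for demonstration:

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """Compute y = x*W + alpha * (x*A)*B. W (d_in x d_out) stays frozen
    and can remain quantized; only the low-rank factors A (d_in x r)
    and B (r x d_out) receive gradient updates during fine-tuning."""
    def matvec(v, M):
        return [sum(v[i] * M[i][j] for i in range(len(v)))
                for j in range(len(M[0]))]
    base = matvec(x, W)
    low_rank = matvec(matvec(x, A), B)
    return [b + alpha * l for b, l in zip(base, low_rank)]

# rank-1 example: d_in = 2, d_out = 2, r = 1
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity here)
A = [[0.5], [0.5]]             # trained down-projection (2 x 1)
B = [[0.1, -0.1]]              # trained up-projection  (1 x 2)
y = lora_forward([1.0, 2.0], W, A, B)
```

The practical payoff is that for a d_in x d_out layer, only r * (d_in + d_out) adapter parameters are trained instead of d_in * d_out, which is why LoRA pairs naturally with a large quantized base model on-device.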

Broader context and future outlook: The release of these quantized Llama models represents a significant step towards making advanced AI capabilities more accessible on mobile devices. This development aligns with the growing trend of edge AI, where processing occurs on-device rather than in the cloud, offering benefits in terms of privacy, latency, and offline functionality.

  • The quantized models’ ability to run efficiently on mobile CPUs opens up new possibilities for AI-powered applications in resource-constrained environments.
  • Meta’s commitment to open-sourcing these models and collaborating with industry partners suggests a push towards democratizing AI technology.
  • As work continues on optimizing these models for NPUs, we can expect even greater performance improvements in the future, potentially enabling more complex AI tasks on mobile devices.
