Meta releases ‘quantized models’ to efficiently run AI on mobile devices

Quantized Llama models, a leap forward in mobile AI: Meta has released lightweight quantized versions of its Llama 3.2 1B and 3B language models, designed to run efficiently on popular mobile devices while maintaining high performance and accuracy.

Key advancements in model efficiency:

  • The quantized models achieve a 2-4x speedup compared with the original BF16 models.
  • Model size has been reduced by an average of 56%.
  • Memory usage has decreased by an average of 41%.
  • These improvements enable on-device AI capabilities with enhanced privacy and speed.
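
As a rough illustration of what the average reductions above mean in practice, the sketch below applies them to a baseline. The baseline figures are hypothetical values chosen only for the arithmetic, not measurements from Meta, and `apply_reduction` is a helper name introduced here.

```python
def apply_reduction(baseline: float, percent: float) -> float:
    """Return what is left of `baseline` after cutting it by `percent` percent."""
    return baseline * (1.0 - percent / 100.0)


# Hypothetical baseline figures, assumed purely for illustration.
baseline_size_mb = 2000.0    # assumed on-disk size of an unquantized model
baseline_memory_mb = 3000.0  # assumed runtime memory use of the same model

quantized_size_mb = apply_reduction(baseline_size_mb, 56)      # average size reduction
quantized_memory_mb = apply_reduction(baseline_memory_mb, 41)  # average memory reduction

print(f"size:   {baseline_size_mb:.0f} MB -> {quantized_size_mb:.0f} MB")
print(f"memory: {baseline_memory_mb:.0f} MB -> {quantized_memory_mb:.0f} MB")
```

Because the reported figures are averages, results for a given model and device may differ.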

Quantization techniques employed:

  • Quantization-Aware Training with LoRA adaptors (QLoRA): This method prioritizes accuracy by simulating quantization effects during training.
  • SpinQuant: A post-training quantization technique that emphasizes portability and can be applied without access to training datasets.
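
"Simulating quantization effects during training" is commonly done with a quantize-then-dequantize round trip (often called fake quantization) in the forward pass, so the model learns weights that tolerate rounding. The sketch below shows only that round trip in plain Python. It is a generic illustration, not Meta's training code; `fake_quantize` and the sample weights are hypothetical, and the symmetric scaling is an assumption made for simplicity.

```python
def fake_quantize(values: list[float], bits: int) -> list[float]:
    """Round values onto a symmetric signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    result = []
    for v in values:
        q = round(v / scale)               # nearest integer code
        q = max(-qmax - 1, min(qmax, q))   # clamp to the representable range
        result.append(q * scale)           # back to float, now carrying rounding error
    return result


# Hypothetical weights, for illustration only.
weights = [0.9, -0.35, 0.1, -0.02]
for bits in (4, 8):
    approx = fake_quantize(weights, bits)
    worst = max(abs(w - a) for w, a in zip(weights, approx))
    print(f"{bits}-bit worst-case rounding error: {worst:.4f}")
```

During training, frameworks typically pass gradients through the rounding step unchanged (a straight-through estimator), which this sketch does not model.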

Technical implementation details:

  • The quantization scheme targets PyTorch’s ExecuTorch inference framework and Arm CPU backend.
  • Linear layers in transformer blocks use 4-bit groupwise quantization for weights and 8-bit per-token dynamic quantization for activations.
  • The classification layer employs 8-bit per-channel quantization for weights and 8-bit per-token dynamic quantization for activations.
  • Embedding uses 8-bit per-channel quantization.
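
The two schemes used for linear layers can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the ExecuTorch implementation: it uses symmetric scaling, a group size of 4 chosen for readability, and hypothetical function names and sample values.

```python
def quantize_groupwise_4bit(row: list[float], group_size: int):
    """4-bit weights: each group of `group_size` values gets its own scale."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def quantize_per_token_8bit(token: list[float]):
    """8-bit activations: one scale per token, computed at run time (dynamic)."""
    max_abs = max(abs(a) for a in token)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(a / scale))) for a in token], scale


def dequantize_groupwise(codes: list[int], scales: list[float], group_size: int):
    """Map integer codes back to floats using the scale of their group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Hypothetical weight row: the second group holds much larger values.
weight_row = [0.7, -0.3, 0.1, 0.0, 3.5, -2.0, 1.0, 0.5]
codes, scales = quantize_groupwise_4bit(weight_row, group_size=4)
restored = dequantize_groupwise(codes, scales, group_size=4)

# Hypothetical activations for a single token.
act_codes, act_scale = quantize_per_token_8bit([1.27, -0.64, 0.0, 0.32])

print("weight codes:", codes)
print("group scales:", scales)
print("activation codes:", act_codes)
```

Giving each group its own scale keeps the small weights in the first group from being crushed by the large values in the second, which is the usual motivation for groupwise over per-tensor quantization.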

Collaborative development and optimization:

  • Meta worked closely with industry partners including Qualcomm and MediaTek to optimize the models for specific SoCs with Arm CPUs.
  • Performance optimizations use Arm's KleidiAI kernels for mobile CPUs.
  • Ongoing collaborations aim to leverage NPUs for even greater performance gains.

Implications for developers:

  • The quantized models enable developers to create unique, privacy-focused AI experiences that run entirely on-device.
  • Developers can use the QAT models as a foundation and fine-tune them further with LoRA for specific use cases.
  • The SpinQuant method allows developers to quantize their own fine-tuned Llama models for various hardware targets.
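
For readers unfamiliar with LoRA, the sketch below shows the general idea in plain Python: the base weight matrix W stays frozen, and a low-rank update B·A, scaled by alpha/r, is added on top. It follows the standard LoRA formulation and is not Meta's fine-tuning code; the function names, matrices, and dimensions are hypothetical.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha: float) -> list[float]:
    """Compute y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    r = len(A)  # rank of the adapter
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Number of trainable values in the adapter matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical 3x3 frozen layer with a rank-1 adapter.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[0.1, 0.2, 0.3]]
B_init = [[0.0], [0.0], [0.0]]      # B starts at zero, so the adapter is a no-op
B_trained = [[1.0], [0.0], [-1.0]]  # assumed values after some fine-tuning
x = [1, 2, 3]

print(lora_forward(W, A, B_init, x, alpha=2.0))
print(lora_forward(W, A, B_trained, x, alpha=2.0))

# For an assumed 4096 x 4096 layer, a rank-16 adapter trains far fewer values.
print(lora_param_count(4096, 4096, 16), "vs", 4096 * 4096)
```

The small adapter is what makes per-use-case fine-tuning cheap relative to updating every weight in the model.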

Broader context and future outlook: The release of these quantized Llama models represents a significant step towards making advanced AI capabilities more accessible on mobile devices. This development aligns with the growing trend of edge AI, where processing occurs on-device rather than in the cloud, offering benefits in terms of privacy, latency, and offline functionality.

  • The quantized models’ ability to run efficiently on mobile CPUs opens up new possibilities for AI-powered applications in resource-constrained environments.
  • Meta’s commitment to open-sourcing these models and collaborating with industry partners suggests a push towards democratizing AI technology.
  • As work continues on optimizing these models for NPUs, we can expect even greater performance improvements in the future, potentially enabling more complex AI tasks on mobile devices.

Source: Introducing quantized Llama models with increased speed and a reduced memory footprint
