Meta releases ‘quantized models’ to efficiently run AI on mobile devices

Quantized Llama models: A leap forward in mobile AI: Meta has released lightweight quantized versions of its Llama 3.2 1B and 3B language models, designed to run efficiently on popular mobile devices while maintaining high performance and accuracy.

Key advancements in model efficiency:

  • The quantized models achieve a 2-4x inference speedup compared to the original BF16 models.
  • Model size has been reduced by an average of 56%.
  • Memory usage has decreased by an average of 41%.
  • These improvements enable on-device AI capabilities with enhanced privacy and speed.

Quantization techniques employed:

  • Quantization-Aware Training with LoRA adaptors (QLoRA): This method prioritizes accuracy by simulating quantization effects during training (a minimal sketch of the idea follows this list).
  • SpinQuant: A post-training quantization technique that emphasizes portability and can be applied without access to training datasets.
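To make the QAT idea concrete, here is a minimal sketch of how quantization effects can be simulated during training: weights are rounded to a low-bit grid in the forward pass, while a straight-through estimator lets gradients bypass the rounding. This is an illustrative toy in PyTorch, not Meta's implementation; the layer class, bit width, and per-tensor scaling are assumptions.

```python
# Toy quantization-aware training (QAT) layer: the forward pass sees
# "fake-quantized" weights so the model learns to tolerate quantization
# error, while gradients flow through unchanged (straight-through estimator).
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.qmin, self.qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    def fake_quant(self, w):
        scale = w.abs().max() / self.qmax        # symmetric per-tensor scale
        q = torch.clamp(torch.round(w / scale), self.qmin, self.qmax) * scale
        return w + (q - w).detach()              # straight-through estimator

    def forward(self, x):
        return x @ self.fake_quant(self.weight).t()

layer = FakeQuantLinear(16, 8)
out = layer(torch.randn(2, 16))
out.sum().backward()                             # gradients still reach the weights
print(out.shape, layer.weight.grad is not None)
```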

Technical implementation details:

  • The quantization scheme targets PyTorch’s ExecuTorch inference framework and Arm CPU backend.
  • Linear layers in transformer blocks use 4-bit groupwise quantization for weights and 8-bit per-token dynamic quantization for activations (illustrated in the sketch after this list).
  • The classification layer employs 8-bit per-channel quantization for weights and 8-bit per-token dynamic quantization for activations.
  • Embedding uses 8-bit per-channel quantization.
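The weight and activation scheme described above can be sketched in a few lines of PyTorch. The group size of 32 and the helper names are assumptions chosen for illustration; the actual ExecuTorch kernels operate on packed integer data rather than this float simulation.

```python
# Sketch of 4-bit groupwise weight quantization (one scale per group of
# values along the input dimension) and 8-bit per-token dynamic
# activation quantization (one scale per token, computed at runtime).
import torch

def quantize_weights_4bit_groupwise(w, group_size=32):
    # w: (out_features, in_features); split in_features into groups.
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7).clamp_min(1e-8)
    q = torch.clamp(torch.round(groups / scales), -8, 7)   # int4 range
    return q.to(torch.int8), scales                        # int4 values in int8 storage

def quantize_activations_8bit_per_token(x):
    # x: (tokens, features); scales depend on the live activations.
    scales = (x.abs().amax(dim=-1, keepdim=True) / 127).clamp_min(1e-8)
    q = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
    return q, scales

w = torch.randn(8, 64)
x = torch.randn(2, 64)
qw, w_scales = quantize_weights_4bit_groupwise(w)
qx, x_scales = quantize_activations_8bit_per_token(x)

# Dequantized matmul closely approximates the original float computation.
w_hat = (qw.float() * w_scales).reshape(8, 64)
x_hat = qx.float() * x_scales
print((x @ w.t() - x_hat @ w_hat.t()).abs().max())  # small quantization error
```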

Collaborative development and optimization:

  • Meta worked closely with industry partners including Qualcomm and MediaTek to optimize the models for specific SoCs with Arm CPUs.
  • Performance optimizations utilize Arm's KleidiAI kernels for mobile CPUs.
  • Ongoing collaborations aim to leverage NPUs for even greater performance gains.

Implications for developers:

  • The quantized models enable developers to create unique, privacy-focused AI experiences that run entirely on-device.
  • Developers can use QAT as a foundation and further fine-tune the models using LoRA for specific use cases (see the sketch after this list).
  • The SpinQuant method allows developers to quantize their own fine-tuned Llama models for various hardware targets.
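For developers following the QAT-plus-LoRA path, the core mechanics look roughly like this: freeze the backbone and train only a small low-rank update. A minimal sketch, assuming a plain linear layer stands in for a transformer block; real workflows would typically use a library such as Hugging Face PEFT rather than hand-rolled adapters.

```python
# Toy LoRA adapter: the frozen base weight stays fixed while two small
# low-rank matrices (A, B) learn a task-specific update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze the backbone
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank update.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

layer = LoRALinear(nn.Linear(64, 64))
opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)
x, target = torch.randn(4, 64), torch.randn(4, 64)
loss = nn.functional.mse_loss(layer(x), target)
loss.backward()
opt.step()                                           # only A and B are updated
```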

Broader context and future outlook: The release of these quantized Llama models represents a significant step towards making advanced AI capabilities more accessible on mobile devices. This development aligns with the growing trend of edge AI, where processing occurs on-device rather than in the cloud, offering benefits in terms of privacy, latency, and offline functionality.

  • The quantized models’ ability to run efficiently on mobile CPUs opens up new possibilities for AI-powered applications in resource-constrained environments.
  • Meta’s commitment to open-sourcing these models and collaborating with industry partners suggests a push towards democratizing AI technology.
  • As work continues on optimizing these models for NPUs, we can expect even greater performance improvements in the future, potentially enabling more complex AI tasks on mobile devices.
Source: Introducing quantized Llama models with increased speed and a reduced memory footprint (Meta AI blog)
