
Quantized Llama models: A leap forward in mobile AI

Meta has released lightweight quantized versions of its Llama 3.2 1B and 3B language models, designed to run efficiently on popular mobile devices while maintaining high performance and accuracy.

Key advancements in model efficiency:

  • The quantized models achieve a 2-4x speedup compared to their original counterparts.
  • Model size has been reduced by an average of 56%.
  • Memory usage has decreased by an average of 41%.
  • These improvements enable on-device AI capabilities with enhanced privacy and speed.

Quantization techniques employed:

  • Quantization-Aware Training with LoRA adaptors (QLoRA): This method prioritizes accuracy by simulating quantization effects during training.
  • SpinQuant: A post-training quantization technique that emphasizes portability and can be applied without access to training datasets.
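QAT earns its accuracy advantage by inserting a quantize-dequantize ("fake quant") step into the forward pass, so the weights adapt to rounding error during training. A minimal sketch of that step, using symmetric per-tensor int4 purely for illustration (the shipped scheme is groupwise, as detailed below):

```python
def fake_quant(values, bits=4):
    """Quantize-dequantize a list of floats with one symmetric scale.

    During QAT this runs in the forward pass, so gradients see the
    rounding error the weights will incur after real quantization.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for signed 4-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    # Snap each value to the integer grid, then map back to floats.
    return [round(v / scale) * scale for v in values]

weights = [0.31, -0.82, 0.05, 0.44]
simulated = fake_quant(weights)         # what the model "sees" in training
```

The LoRA adaptors are then trained on top of this fake-quantized backbone, recovering most of the accuracy lost to rounding.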

Technical implementation details:

  • The quantization scheme targets PyTorch’s ExecuTorch inference framework and Arm CPU backend.
  • Linear layers in transformer blocks use 4-bit groupwise quantization for weights and 8-bit per-token dynamic quantization for activations.
  • The classification layer employs 8-bit per-channel quantization for weights and 8-bit per-token dynamic quantization for activations.
  • The embedding layer uses 8-bit per-channel quantization for its weights.
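The two activation/weight schemes above can be sketched as follows. This is a rough illustration only: the group size and rounding rules here are assumptions for demonstration, not details from Meta's release.

```python
def quantize_weights_groupwise(weights, group_size=4, bits=4):
    """Symmetric groupwise weight quantization: each group of consecutive
    weights shares one scale, so an outlier only affects its own group.
    (group_size=4 is for demonstration; real kernels use larger groups.)"""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4
    ints, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        peak = max(abs(w) for w in group)
        scale = peak / qmax if peak else 1.0
        scales.append(scale)
        ints.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return ints, scales

def quantize_tokens_dynamic(token_rows, bits=8):
    """Per-token dynamic activation quantization: each token's activation
    vector gets its own scale, computed on the fly at inference time."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    quantized = []
    for row in token_rows:
        peak = max(abs(v) for v in row)
        scale = peak / qmax if peak else 1.0
        quantized.append(([round(v / scale) for v in row], scale))
    return quantized
```

"Dynamic" here means the activation scales are not baked in at export time; they are recomputed per token, which suits the wide activation ranges seen across different inputs.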

Collaborative development and optimization:

  • Meta worked closely with industry partners including Qualcomm and MediaTek to optimize the models for specific SoCs with Arm CPUs.
  • Performance optimizations utilize Kleidi AI kernels for mobile CPUs.
  • Ongoing collaborations aim to leverage NPUs for even greater performance gains.

Implications for developers:

  • The quantized models enable developers to create unique, privacy-focused AI experiences that run entirely on-device.
  • Developers can use QAT as a foundation and further fine-tune the models using LoRA for specific use cases.
  • The SpinQuant method allows developers to quantize their own fine-tuned Llama models for various hardware targets.
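SpinQuant's central trick is multiplying weights and activations by learned orthogonal (rotation) matrices, which spreads outlier values across channels before quantization while leaving the layer's output mathematically unchanged. A toy 2-D illustration of why rotation helps (the 45° angle here is arbitrary; SpinQuant learns its rotations):

```python
import math

def rotate(vec, theta):
    """Apply a 2-D rotation matrix to a length-2 vector."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]

activations = [10.0, 0.1]               # one outlier channel dominates
rotated = rotate(activations, math.pi / 4)

# The rotation is orthogonal, so the vector's length is preserved ...
assert abs(math.hypot(*rotated) - math.hypot(*activations)) < 1e-9
# ... but the peak magnitude shrinks, so a symmetric quantizer wastes
# less of its integer range on a single outlier channel.
assert max(abs(v) for v in rotated) < max(abs(v) for v in activations)
```

Because the rotation can be folded into adjacent weight matrices, this costs nothing at inference time, which is why the method ports to fine-tuned models without retraining.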

Broader context and future outlook: The release of these quantized Llama models represents a significant step towards making advanced AI capabilities more accessible on mobile devices. This development aligns with the growing trend of edge AI, where processing occurs on-device rather than in the cloud, offering benefits in terms of privacy, latency, and offline functionality.

  • The quantized models’ ability to run efficiently on mobile CPUs opens up new possibilities for AI-powered applications in resource-constrained environments.
  • Meta’s commitment to open-sourcing these models and collaborating with industry partners suggests a push towards democratizing AI technology.
  • As work continues on optimizing these models for NPUs, we can expect even greater performance improvements in the future, potentially enabling more complex AI tasks on mobile devices.
