Apple researchers have developed a new compression technique for large language models that could significantly accelerate AI deployment on memory-constrained devices. SeedLM reduces memory requirements while preserving model performance, potentially enabling more efficient AI systems across a range of hardware platforms. Because the technique is data-free and maintains accuracy even at high compression rates, it could help remove one of the most significant barriers to widespread LLM deployment.
The big picture: Apple researchers have introduced SeedLM, a post-training compression method that efficiently encodes model weights using seeds from a pseudo-random generator, addressing the high runtime costs of large language models.
How it works: The technique trades compute for memory by generating weight matrices on-the-fly during inference rather than storing and retrieving them from memory.
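To make the idea concrete, here is a minimal sketch in Python of seed-based weight compression. It is an illustration under simplified assumptions, not Apple's implementation: the block size, basis count, seed search range, and the use of NumPy's generator (in place of the hardware-friendly pseudo-random generators a method like this would use) are all illustrative choices. Real systems would also quantize the coefficients, which this sketch omits.

```python
import numpy as np

BLOCK = 8    # weights per block (illustrative)
BASIS = 4    # pseudo-random basis vectors per block (illustrative)
SEEDS = 256  # candidate seeds searched per block (illustrative)

def compress_block(w):
    """Find the seed whose pseudo-random basis best reconstructs block w."""
    best = None
    for seed in range(SEEDS):
        # The basis is fully determined by the seed, so it never
        # needs to be stored: the same seed always regenerates it.
        U = np.random.default_rng(seed).standard_normal((BLOCK, BASIS))
        coeffs, *_ = np.linalg.lstsq(U, w, rcond=None)  # least-squares fit
        err = np.linalg.norm(U @ coeffs - w)
        if best is None or err < best[0]:
            best = (err, seed, coeffs)
    _, seed, coeffs = best
    return seed, coeffs  # only the seed and a few coefficients are stored

def decompress_block(seed, coeffs):
    """Rebuild the block on the fly from the stored seed and coefficients."""
    U = np.random.default_rng(seed).standard_normal((BLOCK, BASIS))
    return U @ coeffs

# Round-trip one block of "weights".
w = np.random.default_rng(42).standard_normal(BLOCK)
seed, coeffs = compress_block(w)
print("reconstruction error:", np.linalg.norm(decompress_block(seed, coeffs) - w))
```

The trade-off is visible in the sketch: inference pays extra compute to regenerate each basis from its seed, but memory traffic drops because only a seed and a handful of coefficients are fetched per block instead of the full weight matrix.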
Key results: Tests with Llama 3 70B, a model that is particularly challenging to compress, demonstrate that SeedLM maintains accuracy comparable to the much larger uncompressed model while achieving significant compression.
Why this matters: SeedLM addresses one of the fundamental bottlenecks in AI deployment by focusing on the memory bandwidth limitations that often constrain inference performance.
In plain English: Apple researchers have created a clever way to shrink massive AI models without sacrificing performance by using mathematical shortcuts to generate parts of the model on-demand rather than storing everything in memory.