
Why LoRA Is More Efficient Than Full Fine-Tuning

A deep dive into the intrinsic dimensionality hypothesis and rank decomposition

February 12, 2026 · 16 min read

Abstract

Full fine-tuning adapts a pre-trained LLM by updating billions of parameters, which is expensive in compute, memory, and storage. LoRA proposes that the weight updates made during adaptation have low intrinsic rank, enabling effective fine-tuning with less than 1% of the original parameters. This article explains why this works through the lens of intrinsic dimensionality theory and gradient geometry.

#LoRA #Fine-tuning #PEFT #LLMs #Intrinsic-Dimensionality

1. The Intrinsic Dimensionality Hypothesis

Aghajanyan et al. (2020) showed that pre-trained models can be fine-tuned effectively by optimising in a much lower-dimensional subspace of the full parameter space. This 'intrinsic dimensionality' is often several orders of magnitude smaller than the number of parameters — a strong signal that weight updates are inherently low-rank.
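To make the hypothesis concrete, here is a minimal PyTorch sketch of subspace training, the measurement technique behind intrinsic-dimension estimates: the layer's weights are only allowed to move inside a fixed random d_int-dimensional subspace, and only those d_int coordinates are trained. The class and names (SubspaceLinear, d_int) are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class SubspaceLinear(nn.Module):
    """Linear layer whose weight update is confined to a fixed random
    d_int-dimensional subspace of the full (d_out * d_in)-dimensional space."""
    def __init__(self, d_in, d_out, d_int):
        super().__init__()
        # Frozen "pre-trained" weights (random here only to keep the sketch self-contained).
        self.w0 = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Fixed random projection from the small subspace into the full weight space.
        self.register_buffer("proj", torch.randn(d_out * d_in, d_int) / d_int ** 0.5)
        # The only trainable parameters: coordinates in the low-dimensional subspace.
        self.theta = nn.Parameter(torch.zeros(d_int))

    def forward(self, x):
        delta = (self.proj @ self.theta).view_as(self.w0)  # lift back to the full weight shape
        return x @ (self.w0 + delta).T
```

If training only d_int coordinates in the hundreds or low thousands matches full fine-tuning, the useful part of the update evidently lives in a tiny subspace, which is exactly the observation LoRA builds on.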

2. Rank Decomposition of Weight Updates

LoRA decomposes the weight update matrix into the product of two small matrices. Because the rank r is much smaller than the width d, the number of trainable parameters for a d×d projection drops from d² to 2dr, a reduction factor of d/(2r), which reaches the hundreds to thousands for typical transformer widths and small ranks.

LoRA Decomposition

W = W_0 + BA,
where W_0 ∈ R^(d×d), B ∈ R^(d×r), A ∈ R^(r×d), and r ≪ d.
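A minimal PyTorch sketch of this decomposition as a drop-in linear layer: the zero-initialisation of B and the alpha/r scaling follow the LoRA paper's recipe, but the class itself (LoRALinear) is an illustrative assumption, not the reference or peft implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, r, alpha=1.0):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(d, d), requires_grad=False)  # frozen W_0
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)                 # A in R^(r x d), small random init
        self.B = nn.Parameter(torch.zeros(d, r))                        # B in R^(d x r), zero init => BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # Computes x (W_0 + scale * BA)^T without ever materialising the d x d update.
        return x @ self.w0.T + self.scale * (x @ self.A.T @ self.B.T)

d, r = 4096, 8
layer = LoRALinear(d, r)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable, d * d)  # 65,536 trainable vs 16,777,216 frozen: roughly 0.4% of the full matrix
```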

3. Why Not Just Train a Smaller Model?

Smaller models sacrifice capability. LoRA keeps the full expressive capacity of the large pre-trained model: the frozen W_0 handles the bulk of the task, while the low-rank update BA captures task-specific nuance. At inference, BA can be merged into W_0 once, so the adapted model adds zero extra latency.
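The merge itself is a one-line operation, sketched here against the hypothetical LoRALinear module above (merge_lora is an illustrative helper name, not a library function).

```python
import torch

@torch.no_grad()
def merge_lora(layer):
    # Fold the low-rank update into the frozen weight: W_0 <- W_0 + (alpha/r) * BA.
    layer.w0 += layer.scale * (layer.B @ layer.A)
    # Zero out B so subsequent forward passes don't apply the update twice.
    layer.B.zero_()
    return layer
```

After merging, the layer is an ordinary dense projection again, so serving cost is identical to the base model.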

4. Empirical Trade-offs

Rank r is the key hyperparameter. Too low (r=1) and the model can underfit the target task; too high (e.g. r=64) and much of the parameter saving is given back. Empirically, r=4 or r=8 suffices for most NLP benchmarks, and r=16 is often reported to work well for instruction-following tasks. The right rank ultimately depends on how far the target distribution is from the pre-training distribution.
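For a rough sense of scale, the snippet below computes the per-projection parameter budget at the ranks discussed above; the width d = 4096 is an assumed value for a typical transformer, not a number from the sources.

```python
d = 4096  # assumed hidden width of one d x d projection
for r in (1, 4, 8, 16, 64):
    lora = 2 * d * r  # trainable parameters in B (d x r) plus A (r x d)
    print(f"r={r:<3} trainable={lora:>9,}  fraction of d^2 = {lora / d**2:.4%}")
# r=1 -> 8,192 (~0.05%), r=8 -> 65,536 (~0.39%), r=64 -> 524,288 (~3.13%)
```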

References

[1] Hu et al., 2022. LoRA: Low-Rank Adaptation of Large Language Models.

[2] Aghajanyan et al., 2020. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.