A deep dive into the intrinsic dimensionality hypothesis and rank decomposition
Abstract
Full fine-tuning updates billions of parameters to adapt a pre-trained LLM — expensive in compute, memory, and storage. LoRA proposes that the weight updates made during adaptation have a low intrinsic rank, enabling effective fine-tuning with <1% of the original parameters. This article explains why this works through the lens of intrinsic dimensionality theory and gradient geometry.
Aghajanyan et al. (2020) showed that pre-trained models can be fine-tuned effectively by optimising in a much lower-dimensional subspace of the full parameter space. This 'intrinsic dimensionality' is often several orders of magnitude smaller than the number of parameters — a strong signal that weight updates are inherently low-rank.
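As a toy picture of that kind of experiment, the sketch below reparameterises all D weights through a fixed random projection of a much smaller d-dimensional vector z, which would be the only thing optimised. The sizes and names here (D, d, P, z) are illustrative, not the authors' actual setup.

```python
import numpy as np

# Subspace fine-tuning in miniature: rather than training all D weights,
# train a d-dimensional vector z and map it into weight space through a
# fixed random projection P, i.e. theta = theta_0 + P @ z.
rng = np.random.default_rng(0)
D, d = 100_000, 200                              # full vs. intrinsic dimension (toy sizes)

theta_0 = rng.normal(size=D)                     # frozen pre-trained weights (flattened)
P = rng.normal(size=(D, d)) / np.sqrt(d)         # fixed random projection, never trained
z = np.zeros(d)                                  # the only trainable parameters

theta = theta_0 + P @ z                          # effective weights fed to the model
```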
LoRA decomposes the weight update matrix into the product of two small matrices. Since the rank r is much smaller than the width d, the number of trainable parameters per weight matrix drops from d² to 2dr — a reduction factor of d/(2r), which works out to hundreds or thousands for typical transformer widths.
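A quick back-of-the-envelope check of that reduction, using an assumed width d = 4096 and rank r = 8 (illustrative values, not taken from the text):

```python
# Trainable parameters for a single d x d weight matrix: full fine-tuning vs. LoRA.
d, r = 4096, 8

full_params = d * d                      # updating the whole matrix: 16,777,216
lora_params = 2 * d * r                  # B (d x r) plus A (r x d):      65,536
reduction = full_params / lora_params    # = d / (2 * r) = 256

print(f"full: {full_params:,}  lora: {lora_params:,}  reduction: {reduction:.0f}x")
```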
LoRA Decomposition
W = W₀ + BA,  where W₀ ∈ ℝ^(d×d), B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), and r ≪ d
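To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-adapted square linear layer. The class name is made up for this article; the zero-initialisation of B and the α/r scaling follow the conventions of the original LoRA paper, so treat this as an illustration rather than a reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted d x d linear layer: y = x W0^T + (alpha/r) * x (BA)^T."""

    def __init__(self, d: int, r: int, alpha: float = 1.0):
        super().__init__()
        self.base = nn.Linear(d, d, bias=False)           # holds W0; frozen below
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)   # A in R^(r x d), small random init
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^(d x r), zero init so BA = 0 at start
        self.scale = alpha / r                            # alpha/r scaling from the LoRA paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank path computed without materialising the full d x d product BA.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```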
Shrinking the model itself sacrifices capability. LoRA instead retains the full expressiveness of the large pre-trained model: the frozen W₀ handles the bulk of the task, while the low-rank update BA captures task-specific nuance. At inference, BA can be merged into W₀ with zero additional latency.
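Continuing the sketch above, merging is a one-time weight update. The hypothetical merge helper below folds the low-rank product into the frozen weight, after which inference runs through a plain linear layer with no extra matmul:

```python
@torch.no_grad()
def merge(layer: LoRALinear) -> nn.Linear:
    """Fold the low-rank update into the base weight: W = W0 + scale * B @ A."""
    merged = nn.Linear(layer.base.in_features, layer.base.out_features, bias=False)
    merged.weight.copy_(layer.base.weight + layer.scale * (layer.B @ layer.A))
    return merged
```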
Rank r is the key hyperparameter. Too low (e.g. r=1) and the model underfits the target task; too high (e.g. r=64) and much of the efficiency advantage is given back. Empirically, r=4 or r=8 suffices for most NLP benchmarks, while r=16 is often a better fit for instruction-following tasks. The right rank depends on how far the target distribution is from the pre-training distribution.
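For a rough sense of what each choice of r costs, the snippet below sweeps a few candidate ranks for an assumed width d = 4096. These counts are per adapted weight matrix; real setups attach LoRA to several matrices, so totals will be a multiple of these.

```python
# Parameter cost as a function of rank r for one d x d weight matrix.
d = 4096
for r in (1, 4, 8, 16, 64):
    lora = 2 * d * r
    print(f"r={r:>2}  trainable={lora:>9,}  share of d^2 = {lora / d**2:.4%}")
```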