LoRA is a parameter-efficient fine-tuning method. Instead of updating all weights in a pretrained model, it freezes the original weights entirely and injects small trainable matrices alongside specific layers.
The key insight: weight updates during fine-tuning have a low intrinsic rank. You don't need a full-dimensional update to adapt behavior. A low-rank decomposition captures most of what matters.
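The parameter savings from that insight are easy to see with arithmetic. A sketch with hypothetical dimensions (a 4096×4096 projection, rank 8):

```python
# Full-rank update vs. rank-r factorization for a single weight matrix.
d, k, r = 4096, 4096, 8           # illustrative attention-projection shape

full_update = d * k               # trainable values for a dense delta-W
lora_update = r * (d + k)         # values in B (d x r) plus A (r x k)

print(full_update)                # 16777216
print(lora_update)                # 65536
print(lora_update / full_update)  # ~0.004, under half a percent
```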
Published by Hu et al. at Microsoft in 2021. Originally for GPT-style transformers but now applied broadly across vision, audio, and multimodal models.
A 70B-parameter model has ~140GB of weights at FP16. Full fine-tuning requires storing those weights plus gradients, optimizer states (Adam keeps two extra states per parameter: momentum and variance, typically in FP32), and activations.
That's realistically 500GB+ of GPU memory per training run, and you also need a full copy of the model per task. With LoRA you store tiny adapter files, often under 100MB, that snap onto a shared base.
Catastrophic forgetting is also reduced since base weights never move.
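The memory claim above can be checked with back-of-envelope arithmetic, assuming FP16 weights and gradients, FP32 Adam states, and ignoring activations:

```python
# GPU memory budget for a 70B model, in GB (activations excluded).
P, GB = 70e9, 1e9

weights = P * 2 / GB              # 140 GB of FP16 weights
grads   = P * 2 / GB              # 140 GB of FP16 gradients
adam    = P * 4 * 2 / GB          # 560 GB of FP32 momentum + variance
full_ft = weights + grads + adam  # ~840 GB before activations

# LoRA: base is frozen (no grads, no optimizer states), ~0.1% trainable.
trainable = P * 0.001
lora_ft = weights + trainable * (2 + 2 + 8) / GB  # adapter grads + Adam states

print(round(full_ft), round(lora_ft))  # 840 141
```

Even this conservative tally lands well past 500GB for full fine-tuning, while the LoRA run barely exceeds the frozen weights themselves.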
For a pretrained weight matrix W₀ ∈ R^(d×k), LoRA constrains the update ΔW by representing it as the product of two much smaller matrices:

W = W₀ + ΔW = W₀ + BA, where B ∈ R^(d×r), A ∈ R^(r×k), and r ≪ min(d, k)

Only B and A receive gradients. The output of ΔW is scaled by α/r, where α is a hyperparameter. Typical ranks are r = 4, 8, 16, or 64, depending on task complexity.
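A minimal forward-pass sketch in NumPy (dimensions illustrative; init follows the usual convention of a small random A and a zero B, so ΔW starts at zero):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8              # output dim, input dim, rank, alpha

W0 = rng.normal(size=(d, k))               # frozen pretrained weight
A  = rng.normal(scale=0.01, size=(r, k))   # trainable, small random init
B  = np.zeros((d, r))                      # trainable, zero init

def forward(x):
    # y = W0 x + (alpha / r) * B A x : base path plus low-rank branch
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(5, k))
y = forward(x)
print(y.shape)                             # (5, 64)
print(np.allclose(y, x @ W0.T))            # True: adapter is a no-op at step 0
```

Zero-initializing B guarantees training starts exactly at the pretrained behavior and drifts from there.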
| Rank (r) | Trainable params (% of base) | Use case | Quality |
|---|---|---|---|
| r = 1 | ~0.01% | Tone/style only | low |
| r = 4 | ~0.05% | Persona, light domain | medium |
| r = 8 | ~0.1% | Task-specific behavior | good |
| r = 16 | ~0.2% | Domain knowledge | high |
| r = 64 | ~0.8% | Complex task transfer | very high |
LoRA is typically applied to the W_q and W_v attention projections. Applying it to all four projections (W_q, W_k, W_v, W_o) and the FFN layers improves quality at the cost of a larger adapter.
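The size trade-off can be made concrete. This sketch assumes LLaMA-7B-like shapes (d_model=4096, SwiGLU FFN with d_ff=11008, 32 layers); the exact numbers vary by architecture:

```python
# Adapter parameter counts for different target-module choices (assumed dims).
d_model, d_ff, n_layers, r = 4096, 11008, 32, 8

per_proj = r * (d_model + d_model)   # B and A for one d_model x d_model matrix
qv_only  = n_layers * 2 * per_proj   # W_q and W_v only
all_attn = n_layers * 4 * per_proj   # q, k, v, o
with_ffn = all_attn + n_layers * 3 * r * (d_model + d_ff)  # plus gate/up/down

print(qv_only, all_attn, with_ffn)   # 4194304 8388608 19988480
print(qv_only * 2 / 1e6)             # ~8.4 MB adapter file at FP16
```

Even the all-layers variant stays around 40MB at FP16, consistent with the "under 100MB" adapters mentioned earlier.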
QLoRA (Dettmers et al. 2023) combines LoRA with quantization of the base model weights. The frozen base runs in NF4 (Normal Float 4-bit), a data type optimized for normally distributed weights.
A 65B model that requires ~130GB at FP16 fits in ~35GB at NF4. That's 2x A100s instead of 8. The adapters themselves remain in BF16 for training stability.
Double quantization further compresses the quantization constants themselves. Paged optimizers handle memory spikes. Net result: you can fine-tune massive models on consumer hardware.
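The blockwise structure is the key idea. This is a simplified absmax sketch, not the true NF4 codebook (whose 16 levels are quantiles of a normal distribution), but it shows the same shape of the data: 4-bit codes plus one scale constant per block.

```python
import numpy as np

BLOCK, LEVELS = 64, 7   # 64-element blocks, symmetric codes in [-7, 7]

def quantize(w):
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one constant per block
    codes = np.round(blocks / scales * LEVELS).astype(np.int8)
    return codes, scales

def dequantize(codes, scales):
    return codes.astype(np.float32) / LEVELS * scales

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
codes, scales = quantize(w)
w_hat = dequantize(codes, scales).reshape(-1)

# Error is bounded by half a quantization step per block.
print(np.abs(w - w_hat).max() < np.abs(w).max() / 13)   # True
```

The per-block scales are what double quantization compresses in turn, shaving the overhead from one FP32 constant per 64 weights down further.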
The base model provides:

- World knowledge
- Language modeling
- Code / math

The adapter captures:

- BGP / CDN context
- Your writing voice
- CY_BORG / modular refs
- Preferred reasoning style
Your conversation history, annotated with good/bad signal. Technical docs you reference. Code you've written. Notes, wiki entries, runbooks. The adapter learns your patterns and domain without touching base weights.
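One plausible way to wire this up is via the HuggingFace PEFT library. The values below are assumptions, and the `target_modules` names are model-dependent (these match LLaMA-style checkpoints):

```python
from peft import LoraConfig

# Hypothetical config; adjust target_modules to your model family.
config = LoraConfig(
    r=8,                                  # rank: task-specific behavior tier
    lora_alpha=16,                        # effective scaling alpha/r = 2
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # W_q and W_v attention projections
    task_type="CAUSAL_LM",
)
```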
At inference, ΔW is either merged into the base weights (zero latency overhead) or applied as a parallel branch. You load one base model and hot-swap adapter files per context: work mode, creative mode, technical mode.
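The merge-versus-branch equivalence is easy to verify numerically; a sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r, alpha = 64, 32, 4, 8
W0 = rng.normal(size=(d, k))
A, B = rng.normal(size=(r, k)), rng.normal(size=(d, r))

# One-time merge: fold the scaled low-rank product into the base weight.
W_merged = W0 + (alpha / r) * B @ A

x = rng.normal(size=(5, k))
unmerged = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)  # parallel-branch path
merged   = x @ W_merged.T                            # single-matmul path
print(np.allclose(unmerged, merged))                 # True
```

Hot-swapping works because the merge is reversible: subtracting (α/r)·BA restores the base weights, so one resident model can cycle through adapters.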
Retraining is periodic rather than realtime: every month, new data is added to the training set and the adapter is retrained from scratch or continued from the prior checkpoint. A small r keeps training under an hour on a single GPU, even for large base models.
- 2019: Houlsby et al. introduce adapter layers for NLP transfer learning: small bottleneck modules inserted serially inside transformer layers.
- 2021: LoRA paper (Hu et al., Microsoft). Parallel low-rank matrices rather than serial bottlenecks; no added inference latency when merged.
- 2021–2023: IA³, prefix tuning, and prompt tuning proliferate. PEFT becomes a standard training paradigm; the HuggingFace PEFT library is released.
- 2023: QLoRA enables 65B fine-tuning on a single 48GB GPU. The LLaMA ecosystem explodes with community adapters. AdaLoRA, VeRA, and DoRA variants follow.
- 2024: LoRA merging research matures; multiple adapters composed arithmetically. Personal model stacks become practical on prosumer hardware.