LORA

LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
ADAPTER LAYER DEEP DIVE REF-DOC // TECHNICAL REFERENCE
01 // Concept
WHAT IS LORA

LoRA is a parameter-efficient fine-tuning method. Instead of updating all weights in a pretrained model, it freezes the original weights entirely and injects small trainable matrices alongside specific layers.

The key insight: weight updates during fine-tuning have a low intrinsic rank. You don't need a full-dimensional update to adapt behavior. A low-rank decomposition captures most of what matters.

Published by Hu et al. at Microsoft in 2021. Originally for GPT-style transformers but now applied broadly across vision, audio, and multimodal models.

02 // Problem Solved
WHY NOT FULL FINETUNE

A 70B-parameter model has ~140GB of weights at FP16. Full fine-tuning requires storing those weights plus optimizer states (Adam keeps two extra states per parameter, typically in FP32), gradients, and activations.

That's realistically 500GB+ of GPU memory per training run. You also need to store a full copy of the model per task. With LoRA you store tiny adapter files, often under 100MB, that snap onto a shared base.
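The memory claim above can be sanity-checked with back-of-envelope arithmetic. This sketch counts only weights, gradients, and Adam states; activations and any FP32 master copy of the weights (both ignored here) push the real figure higher still.

```python
# Rough memory estimate for full fine-tuning a 70B-parameter model with Adam.
params = 70e9

weights_fp16 = params * 2        # 2 bytes/param
grads_fp16   = params * 2        # gradients, same dtype as weights
adam_states  = params * 4 * 2    # two FP32 states per param (momentum, variance)

total_gb = (weights_fp16 + grads_fp16 + adam_states) / 1e9
print(f"{total_gb:.0f} GB")      # well past the 500 GB mark before activations
```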

Catastrophic forgetting is also reduced since base weights never move.

03 // The Mathematics
LOW-RANK DECOMPOSITION

For a pretrained weight matrix W₀ ∈ R^(d×k), LoRA constrains the update by representing it as the product of two smaller matrices:

W = W₀ + ΔW = W₀ + BA
Where:
B ∈ R^(d×r)
A ∈ R^(r×k)
rank r << min(d, k)
// A initialized with gaussian, B with zeros so ΔW=0 at start
ΔW (d × k, full rank) = B (d × r) × A (r × k)

The output of ΔW is scaled by α/r, where α is a hyperparameter. Typical ranks are r = 4, 8, 16, or 64 depending on task complexity.
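A minimal NumPy sketch of the decomposition above. The shapes, initialization (A Gaussian, B zero), and α/r scaling follow the section's definitions; the concrete sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 8, 16             # illustrative sizes; r << min(d, k)

W0 = rng.normal(size=(d, k))               # frozen pretrained weight
A  = rng.normal(scale=0.01, size=(r, k))   # Gaussian init
B  = np.zeros((d, r))                      # zero init, so delta_W = 0 at start

def forward(x):
    # h = W0 x + (alpha / r) * B A x  -- adapter runs as a parallel branch
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
# Before any training step, the adapter contributes nothing:
assert np.allclose(forward(x), W0 @ x)
```

Only A and B would receive gradients in training; W0 stays frozen throughout.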

04 // Architecture
LAYER STRUCTURE
INPUT EMBEDDING
// frozen
ATTENTION LAYER (W_q, W_k, W_v, W_o)
// frozen base weights
LORA ADAPTER
BA · (α/r) // trainable
FFN SUBLAYER
// frozen
LORA ADAPTER
optional // FFN projection
LAYER NORM + RESIDUAL
// frozen
05 // Hyperparameters
RANK SELECTION GUIDE
Rank (r)   Params    Use Case                 Quality
r = 1      ~0.01%    Tone / style only        low
r = 4      ~0.05%    Persona, light domain    medium
r = 8      ~0.1%     Task-specific behavior   good
r = 16     ~0.2%     Domain knowledge         high
r = 64     ~0.8%     Complex task transfer    very high

LoRA is typically applied to W_q and W_v attention projections. Applying to all four projections and FFN layers improves quality at the cost of adapter size.
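The parameter fractions in the table can be reproduced with simple counting. This sketch assumes a hypothetical 7B model with Llama-like dimensions (d_model = 4096, 32 layers) and LoRA on the W_q and W_v projections only.

```python
# Rough trainable-parameter count for LoRA on attention projections.
# Model shape is an assumption, roughly Llama-7B-like.
d_model  = 4096
n_layers = 32
r        = 8

# One LoRA pair per targeted square projection: B (d x r) + A (r x d)
per_module = d_model * r + r * d_model

# W_q and W_v only (the common default):
trainable = per_module * 2 * n_layers
total     = 7e9
print(f"{trainable:,} trainable ({trainable / total:.3%} of base)")
```

Targeting all four attention projections plus the FFN matrices multiplies this count accordingly, which is the quality/size trade-off noted above.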

06 // QLoRA Variant
QLORA // 4-BIT BASE

QLoRA (Dettmers et al. 2023) combines LoRA with quantization of the base model weights. The frozen base runs in NF4 (Normal Float 4-bit), a data type optimized for normally distributed weights.

FP32   4 bytes/param
FP16   2 bytes/param
INT8   1 byte/param
NF4    0.5 bytes/param

A 65B model that requires ~130GB at FP16 fits in ~35GB at NF4, small enough to fine-tune on a single 48GB GPU. The adapters themselves remain in BF16 for training stability.

Double quantization further compresses the quantization constants themselves. Paged optimizers handle memory spikes. Net result: you can fine-tune massive models on consumer hardware.
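Applying the bytes-per-param table to a 65B model shows where the ~35GB figure comes from (the base-weight footprint before quantization constants and adapter state):

```python
# Base-weight memory at different precisions for a 65B-parameter model.
params = 65e9
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "NF4": 0.5}

for name, b in bytes_per_param.items():
    print(f"{name:>4}: {params * b / 1e9:.1f} GB")
# NF4 lands around 32.5 GB before quantization constants,
# consistent with the ~35 GB figure above.
```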

07 // Personal Model Architecture // Your Use Case
LORA AS A PERSONAL COGNITIVE LAYER
BASE MODEL
Claude / Llama 3 / Mistral
General reasoning
World knowledge
Language modeling
Code / math
frozen shared 70B+
+
YOUR LORA
Personal Adapter // ~50-200MB
SRE / network domain
BGP / CDN context
Your writing voice
CY_BORG / modular refs
Preferred reasoning style
trainable yours swappable
TRAINING DATA

Your conversation history, annotated with good/bad signal. Technical docs you reference. Code you've written. Notes, wiki entries, runbooks. The adapter learns your patterns and domain without touching base weights.

INFERENCE

At inference, ΔW is merged into base weights (zero latency overhead) or applied as a parallel branch. You load one base model, hot-swap adapter files per context: work mode, creative mode, technical mode.
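The merge step is just folding BA into the base matrix once, after which the two inference modes are mathematically identical. A NumPy sketch with an arbitrary (pretend-trained) adapter:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 16, 16, 4, 8

W0 = rng.normal(size=(d, k))
A  = rng.normal(size=(r, k))
B  = rng.normal(size=(d, r))   # nonzero, standing in for a trained adapter
x  = rng.normal(size=k)

# Parallel-branch inference (adapter applied alongside the frozen base):
y_branch = W0 @ x + (alpha / r) * (B @ (A @ x))

# Merged inference: fold the adapter into the base weight once,
# then pay zero extra latency per token.
W_merged = W0 + (alpha / r) * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_branch, y_merged)
```

Swapping adapters just means re-deriving W_merged from the shared W0 with a different B and A, or keeping the parallel branch and replacing the adapter weights.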

REFRESH CYCLE

Not realtime but periodic. Every month, new data is added to the training set and the adapter is retrained from scratch or continued from a prior checkpoint. A small r keeps training under an hour on a single GPU even for large base models.

08 // By The Numbers
EFFICIENCY COMPARISON
0.1%
trainable params vs full finetune
3x
faster training throughput
~0
inference latency overhead (merged)
N
adapters per base model // unlimited
09 // Lineage
ADAPTER HISTORY
2019

Houlsby et al. introduce adapter layers for NLP transfer learning. Small bottleneck modules inserted serially inside transformer layers.

2021

LoRA paper (Hu et al., Microsoft). Parallel low-rank matrices rather than serial bottlenecks. No added inference latency when merged.

2022

IA³, prefix tuning, prompt tuning proliferate. PEFT becomes a standard training paradigm.

2023

QLoRA enables 65B fine-tuning on a single 48GB GPU. HuggingFace releases the PEFT library. LLaMA ecosystem explodes with community adapters. AdaLoRA and VeRA variants published.

2024+

DoRA published; LoRA merging research matures. Multiple adapters composed arithmetically. Personal model stacks become practical on prosumer hardware.