LoRA is a parameter-efficient fine-tuning method. Instead of updating all weights in a pretrained model, it freezes the original weights entirely and injects small trainable matrices alongside specific layers.
The key insight: weight updates during fine-tuning have a low intrinsic rank. You don't need a full-dimensional update to adapt behavior. A low-rank decomposition captures most of what matters.
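The parameter savings from that insight are easy to see with arithmetic. A sketch with hypothetical dimensions (a 4096×4096 projection, rank 8):

```python
# Full-rank update vs. rank-r factorization for a single weight matrix.
d, k, r = 4096, 4096, 8           # illustrative attention-projection shape

full_update = d * k               # trainable values for a dense delta-W
lora_update = r * (d + k)         # values in B (d x r) plus A (r x k)

print(full_update)                # 16777216
print(lora_update)                # 65536
print(lora_update / full_update)  # ~0.004, under half a percent
```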
Published by Hu et al. at Microsoft in 2021. Originally for GPT-style transformers but now applied broadly across vision, audio, and multimodal models.
A 70B-parameter model has ~140GB of weights at FP16. Full fine-tuning requires storing those weights plus gradients, optimizer states (Adam keeps two extra states per parameter: momentum and variance, typically in FP32), and activations.
That's realistically 500GB+ of GPU memory per training run, and you also need a full copy of the model per task. With LoRA you store tiny adapter files, often under 100MB, that snap onto a shared base.
Catastrophic forgetting is also reduced since base weights never move.
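The memory claim above can be checked with back-of-envelope arithmetic, assuming FP16 weights and gradients, FP32 Adam states, and ignoring activations:

```python
# GPU memory budget for a 70B model, in GB (activations excluded).
P, GB = 70e9, 1e9

weights = P * 2 / GB              # 140 GB of FP16 weights
grads   = P * 2 / GB              # 140 GB of FP16 gradients
adam    = P * 4 * 2 / GB          # 560 GB of FP32 momentum + variance
full_ft = weights + grads + adam  # ~840 GB before activations

# LoRA: base is frozen (no grads, no optimizer states), ~0.1% trainable.
trainable = P * 0.001
lora_ft = weights + trainable * (2 + 2 + 8) / GB  # adapter grads + Adam states

print(round(full_ft), round(lora_ft))  # 840 141
```

Even this conservative tally lands well past 500GB for full fine-tuning, while the LoRA run barely exceeds the frozen weights themselves.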
For a pretrained weight matrix W₀ ∈ R^(d×k), LoRA constrains the update ΔW by representing it as the product of two much smaller matrices:

W = W₀ + ΔW = W₀ + BA, where B ∈ R^(d×r), A ∈ R^(r×k), and r ≪ min(d, k)

Only B and A receive gradients. The output of ΔW is scaled by α/r, where α is a hyperparameter. Typical ranks are r = 4, 8, 16, or 64, depending on task complexity.
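A minimal forward-pass sketch in NumPy (dimensions illustrative; init follows the usual convention of a small random A and a zero B, so ΔW starts at zero):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8              # output dim, input dim, rank, alpha

W0 = rng.normal(size=(d, k))               # frozen pretrained weight
A  = rng.normal(scale=0.01, size=(r, k))   # trainable, small random init
B  = np.zeros((d, r))                      # trainable, zero init

def forward(x):
    # y = W0 x + (alpha / r) * B A x : base path plus low-rank branch
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(5, k))
y = forward(x)
print(y.shape)                             # (5, 64)
print(np.allclose(y, x @ W0.T))            # True: adapter is a no-op at step 0
```

Zero-initializing B guarantees training starts exactly at the pretrained behavior and drifts from there.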
| Rank (r) | Trainable params (% of base) | Use case | Quality |
|---|---|---|---|
| r = 1 | ~0.01% | Tone/style only | low |
| r = 4 | ~0.05% | Persona, light domain | medium |
| r = 8 | ~0.1% | Task-specific behavior | good |
| r = 16 | ~0.2% | Domain knowledge | high |
| r = 64 | ~0.8% | Complex task transfer | very high |
LoRA is typically applied to the W_q and W_v attention projections. Applying it to all four projections (W_q, W_k, W_v, W_o) and the FFN layers improves quality at the cost of a larger adapter.
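The size trade-off can be made concrete. This sketch assumes LLaMA-7B-like shapes (d_model=4096, SwiGLU FFN with d_ff=11008, 32 layers); the exact numbers vary by architecture:

```python
# Adapter parameter counts for different target-module choices (assumed dims).
d_model, d_ff, n_layers, r = 4096, 11008, 32, 8

per_proj = r * (d_model + d_model)   # B and A for one d_model x d_model matrix
qv_only  = n_layers * 2 * per_proj   # W_q and W_v only
all_attn = n_layers * 4 * per_proj   # q, k, v, o
with_ffn = all_attn + n_layers * 3 * r * (d_model + d_ff)  # plus gate/up/down

print(qv_only, all_attn, with_ffn)   # 4194304 8388608 19988480
print(qv_only * 2 / 1e6)             # ~8.4 MB adapter file at FP16
```

Even the all-layers variant stays around 40MB at FP16, consistent with the "under 100MB" adapters mentioned earlier.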
QLoRA (Dettmers et al. 2023) combines LoRA with quantization of the base model weights. The frozen base runs in NF4 (Normal Float 4-bit), a data type optimized for normally distributed weights.
A 65B model that requires ~130GB at FP16 fits in ~35GB at NF4. That's 2x A100s instead of 8. The adapters themselves remain in BF16 for training stability.
Double quantization further compresses the quantization constants themselves. Paged optimizers handle memory spikes. Net result: you can fine-tune massive models on consumer hardware.
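The blockwise structure is the key idea. This is a simplified absmax sketch, not the true NF4 codebook (whose 16 levels are quantiles of a normal distribution), but it shows the same shape of the data: 4-bit codes plus one scale constant per block.

```python
import numpy as np

BLOCK, LEVELS = 64, 7   # 64-element blocks, symmetric codes in [-7, 7]

def quantize(w):
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one constant per block
    codes = np.round(blocks / scales * LEVELS).astype(np.int8)
    return codes, scales

def dequantize(codes, scales):
    return codes.astype(np.float32) / LEVELS * scales

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
codes, scales = quantize(w)
w_hat = dequantize(codes, scales).reshape(-1)

# Error is bounded by half a quantization step per block.
print(np.abs(w - w_hat).max() < np.abs(w).max() / 13)   # True
```

The per-block scales are what double quantization compresses in turn, shaving the overhead from one FP32 constant per 64 weights down further.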
The base model provides:

- World knowledge
- Language modeling
- Code / math

The adapter captures:

- BGP / CDN context
- Your writing voice
- CY_BORG / modular refs
- Preferred reasoning style
Your conversation history, annotated with good/bad signal. Technical docs you reference. Code you've written. Notes, wiki entries, runbooks. The adapter learns your patterns and domain without touching base weights.
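One plausible way to wire this up is via the HuggingFace PEFT library. The values below are assumptions, and the `target_modules` names are model-dependent (these match LLaMA-style checkpoints):

```python
from peft import LoraConfig

# Hypothetical config; adjust target_modules to your model family.
config = LoraConfig(
    r=8,                                  # rank: task-specific behavior tier
    lora_alpha=16,                        # effective scaling alpha/r = 2
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # W_q and W_v attention projections
    task_type="CAUSAL_LM",
)
```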
At inference, ΔW is either merged into the base weights (zero latency overhead) or applied as a parallel branch. You load one base model and hot-swap adapter files per context: work mode, creative mode, technical mode.
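The merge-versus-branch equivalence is easy to verify numerically; a sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r, alpha = 64, 32, 4, 8
W0 = rng.normal(size=(d, k))
A, B = rng.normal(size=(r, k)), rng.normal(size=(d, r))

# One-time merge: fold the scaled low-rank product into the base weight.
W_merged = W0 + (alpha / r) * B @ A

x = rng.normal(size=(5, k))
unmerged = x @ W0.T + (alpha / r) * (x @ A.T @ B.T)  # parallel-branch path
merged   = x @ W_merged.T                            # single-matmul path
print(np.allclose(unmerged, merged))                 # True
```

Hot-swapping works because the merge is reversible: subtracting (α/r)·BA restores the base weights, so one resident model can cycle through adapters.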
Retraining is periodic rather than realtime: every month, new data is added to the training set and the adapter is retrained from scratch or continued from the prior checkpoint. A small r keeps training under an hour on a single GPU, even for large base models.
- 2019: Houlsby et al. introduce adapter layers for NLP transfer learning: small bottleneck modules inserted serially inside transformer layers.
- 2021: LoRA paper (Hu et al., Microsoft). Parallel low-rank matrices rather than serial bottlenecks; no added inference latency when merged.
- 2021–2023: IA³, prefix tuning, and prompt tuning proliferate. PEFT becomes a standard training paradigm; the HuggingFace PEFT library is released.
- 2023: QLoRA enables 65B fine-tuning on a single 48GB GPU. The LLaMA ecosystem explodes with community adapters. AdaLoRA, VeRA, and DoRA variants follow.
- 2024: LoRA merging research matures; multiple adapters composed arithmetically. Personal model stacks become practical on prosumer hardware.