EMBEDDINGS

MEANING AS GEOMETRY // WORDS AS VECTORS // SEMANTIC SPACE // COSINE SIMILARITY // WORD2VEC
HIGH-DIMENSIONAL REPRESENTATION LEARNING
01 // The Core Idea
WORDS AS POINTS IN SPACE

An embedding is a learned mapping from a discrete token (a word, a subword piece, an image patch) into a continuous high-dimensional vector. The vectors are not hand-crafted - they emerge from training on massive corpora. The model discovers that meaning can be encoded as geometry.

The key emergent property: semantically similar tokens end up close in vector space. Tokens that share syntactic roles cluster together. Relationships between concepts become directions in space. Arithmetic on vectors corresponds to arithmetic on meaning.

The famous example (king - man + woman ≈ queen) is not programmed. It emerges from the statistical structure of language alone. The "royalty" direction and the "gender" direction are discovered independently and happen to compose linearly.

v(king) - v(man) + v(woman) ≈ v(queen)
v(Paris) - v(France) + v(Italy) ≈ v(Rome)
// these are not identities. they're nearest-neighbor approximations in a 300-12288 dimensional space. the fact they hold at all is remarkable.
"cat" → embedding vector (truncated):
[ 0.312, -0.841, 0.093, 0.774, -0.201,
  0.558, -0.119, 0.930, -0.447, 0.663,
  ... 4086 more dimensions ... ]

Each number is a weight across a learned axis of meaning. No single dimension corresponds to a human-interpretable concept - it's a superposition across thousands of features.
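The analogy arithmetic can be sketched with toy, hand-picked 4-D vectors (illustrative values only, not output of a real model; a real embedding space has thousands of dimensions):

```python
import numpy as np

# Hand-made 4-D "embeddings" (hypothetical values):
# dim 0 ≈ royalty, dim 1 ≈ male, dims 2-3 ≈ filler.
vocab = {
    "king":  np.array([0.9, 0.9, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.1, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.1, 0.1]),
    "cat":   np.array([0.1, 0.2, 0.9, 0.3]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(v, exclude=()):
    """Nearest-neighbor lookup by cosine similarity."""
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], v))

v = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))  # queen
```

Note that the query words themselves are excluded from the search, as is standard when evaluating analogies - the unmodified input word is often the nearest neighbor of the result vector.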

02 // Geometry of Meaning
COSINE SIMILARITY

Distance in embedding space is typically measured with cosine similarity - the angle between two vectors, not their magnitude. This normalizes for token frequency (common words have large magnitude vectors) and focuses purely on directional similarity.

cos(θ) = (A · B) / (|A| · |B|)
= Σᵢ(Aᵢ·Bᵢ) / (√ΣAᵢ² · √ΣBᵢ²)
// range [-1, 1]. 1 = identical direction. 0 = orthogonal (unrelated). -1 = opposite. euclidean distance captures both direction and magnitude.
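The formula, as a minimal dependency-free function:

```python
import math

def cosine_similarity(a, b):
    """cos(θ) = (A · B) / (|A| · |B|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))   # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))   # ≈ 1.0 (same direction, different magnitude)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite)
```

The second call is the point: scaling a vector doesn't change its cosine similarity to anything, which is exactly the frequency-normalization property described above.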

Similarity scores for "dog" compared to:

wolf       0.82
puppy      0.79
cat        0.68
animal     0.61
house      0.28
democracy  0.06
03 // Vector Space Visualization // 2D Projection (t-SNE/UMAP collapse from ~4096D)
SEMANTIC CLUSTERING
[Figure: 2D projection of the embedding space, showing semantic clusters - ANIMALS (dog, cat, wolf, fox, bear), ROYALTY (king, queen, prince, throne, crown), GEOGRAPHY (France, Paris, Italy, Rome, Berlin), TECHNOLOGY (computer, network, router, BGP), MUSIC (synth, chord, modular, oscillator) - plus a gender direction from man to woman. 2D projection; actual space: 4096D+]
04 // Word2Vec // Origin
HOW THEY'RE LEARNED

Word2Vec (Mikolov et al. 2013) popularized the approach. Two architectures:

CBOW (Continuous Bag of Words): predict center word from surrounding context. Fast, good for frequent words.

Skip-gram: predict surrounding context from the center word. Slower, but better for rare words and fine-grained semantics - the usual choice for quality embeddings.

SKIP-GRAM OBJECTIVE:
maximize Σ log P(context | word)
Train a shallow network to predict neighbors. Throw away the network. Keep the learned weight matrix. That matrix is your embeddings.

Negative sampling: for each positive context pair, sample k random "negative" pairs. Model must discriminate real context from noise. More efficient than softmax over full vocabulary.
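The whole training loop fits in a page. A skip-gram-with-negative-sampling sketch on a toy corpus (numpy SGD; real implementations add frequency-weighted negative sampling, subsampling of frequent words, and dynamic windows):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus; a context window of 1 yields (center, context) pairs.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, K, lr = len(vocab), 8, 3, 0.05  # vocab size, embedding dim, negatives, learning rate

W_in = rng.normal(0, 0.1, (V, D))    # center-word embeddings (kept after training)
W_out = rng.normal(0, 0.1, (V, D))   # context-word embeddings (thrown away)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for pos in range(1, len(corpus) - 1):
        c = idx[corpus[pos]]
        for ctx in (idx[corpus[pos - 1]], idx[corpus[pos + 1]]):
            # one positive pair + K random "negative" pairs
            targets = [(ctx, 1.0)] + [(rng.integers(V), 0.0) for _ in range(K)]
            for t, label in targets:
                p = sigmoid(W_in[c] @ W_out[t])  # P(real context | pair)
                g = p - label                    # gradient of the logistic loss
                dW_out = g * W_in[c]
                dW_in = g * W_out[t]
                W_out[t] -= lr * dW_out
                W_in[c] -= lr * dW_in

# W_in is "the embeddings" - the network around it is discarded.
```

The discrimination objective (logistic loss over 1 positive + K negatives) replaces a softmax over all V words, which is what makes training tractable at vocabulary sizes in the hundreds of thousands.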

05 // Tokenization
BEFORE EMBEDDING

Modern LLMs don't embed whole words - they use subword tokenization. BPE (Byte Pair Encoding) or SentencePiece merge the most frequent adjacent symbol pairs iteratively until a target vocabulary size is reached (typically ~32k-100k tokens).

TOKENIZATION EXAMPLE // GPT-4
"unbelievable"        → un | believ | able
"BGP route reflector" → BGP | route | reflect | or
each block = 1 token = 1 embedding lookup

Rare words get split into subwords. Common words may be single tokens. "BGP" might be one token in a code-heavy model but split in a general one. Tokenization shapes what the model can "see" efficiently.
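The merge loop itself is tiny. A toy BPE trainer over a word-frequency dict (a sketch; production tokenizers operate on bytes and add pre-tokenization rules):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy sketch)."""
    corpus = {tuple(w): f for w, f in words.items()}  # word as symbol tuple
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, freq in corpus.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        corpus = {tuple(_apply(sym, best, merged)): f
                  for sym, f in corpus.items()}
    return merges

def _apply(sym, pair, merged):
    """Replace every occurrence of `pair` in `sym` with `merged`."""
    out, i = [], 0
    while i < len(sym):
        if i + 1 < len(sym) and (sym[i], sym[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(sym[i])
            i += 1
    return out

merges = bpe_train({"low": 5, "lower": 2, "lowest": 2, "newer": 3}, 4)
print(merges)
```

On this corpus the frequent stem "low" gets merged into a single symbol within two merges, while rarer suffixes stay split - exactly the frequent-whole, rare-in-pieces behavior described above.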

06 // Contextual Embeddings
STATIC vs DYNAMIC

Word2Vec produces static embeddings - "bank" always maps to the same vector regardless of context. This fails for polysemy.

Transformer models produce contextual embeddings - the representation of each token is computed by attending to the entire context. "bank" in "river bank" and "bank account" produce different vectors because the surrounding tokens reshape the representation through attention.

"BANK" IN CONTEXT
static:  always → [0.31, -0.44, 0.82, ...]
contextual (river):  [0.12, 0.71, -0.23, ...]
contextual (money):  [0.88, -0.15, 0.49, ...]

ELMo (2018) was the first to demonstrate this at scale; BERT (2018) made it the dominant paradigm. Every modern LLM produces contextual representations - standalone embedding models survive mainly in retrieval/RAG pipelines, where each text must be precomputed into a single vector for fast similarity search.
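The reshaping can be sketched as a single self-attention step (random stand-in vectors, not a trained model; a real transformer stacks many learned attention layers):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

# Hypothetical static embeddings (random stand-ins for illustration).
emb = {w: rng.normal(size=D) for w in ["bank", "river", "money", "the"]}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual(tokens, target):
    """One self-attention step: the target's output is a
    context-weighted mixture of all token embeddings (Q = K = V = X)."""
    X = np.stack([emb[t] for t in tokens])       # (T, D)
    i = tokens.index(target)
    weights = softmax(X[i] @ X.T / np.sqrt(D))   # attention over the sentence
    return weights @ X                           # mixed representation

a = contextual(["the", "river", "bank"], "bank")
b = contextual(["the", "money", "bank"], "bank")
print(np.allclose(a, b))  # False: same word, different vectors
```

The static vector for "bank" enters both sentences identical; the attention weights over different neighbors are what pull the two output representations apart.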

polysemy BERT ELMo RAG
07 // Superposition Hypothesis
MORE FEATURES THAN DIMENSIONS

A surprising finding from mechanistic interpretability research: transformer models appear to store far more features than they have dimensions. A 4096-dimensional model shouldn't be able to represent more than 4096 orthogonal concepts - but empirically it seems to represent millions.

The proposed explanation is superposition: features are encoded as nearly-orthogonal directions in high-dimensional space, slightly interfering with each other. Sparse activation means most features are off at any time, limiting interference.

This is related to why monosemantic neurons (neurons that respond to exactly one concept) are rare. Most neurons are polysemantic - they participate in encoding many different features depending on context. Anthropic's mechanistic interpretability research is actively trying to decompose these superposed representations using sparse autoencoders.
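The geometric intuition behind superposition is checkable: random directions in high-dimensional space are nearly orthogonal, so far more features than dimensions can coexist with only small interference (a toy sketch of the geometry, not the trained-model phenomenon itself):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 1000          # 1000 "features" packed into 64 dimensions

F = rng.normal(size=(n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)   # unit feature directions

G = F @ F.T                                     # pairwise cosine similarities
off = np.abs(G[~np.eye(n, dtype=bool)])
print(f"mean |cos| between features: {off.mean():.3f}")  # small: near-orthogonal

x = F[0]                 # sparse activation: exactly one feature on
readout = F @ x          # decode every feature by dot product
# readout[0] ≈ 1.0; the other 999 readouts are small interference terms
print(readout[0], np.abs(readout[1:]).max())
```

Sparsity is what makes this workable: if many features were active at once, the interference terms would sum up and swamp the signal.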

superposition sparse coding mech interp SAE
08 // Applications
WHAT EMBEDDINGS POWER
RAG

Embed documents and queries. Retrieve by cosine similarity. Feed relevant docs to LLM. Long-term memory without fine-tuning.

SEMANTIC SEARCH

Query "fast database" finds "high-performance PostgreSQL" without keyword overlap. Meaning, not string matching.

CLUSTERING

k-means on embedding space groups semantically related documents. Topic modeling without explicit labels.

ANOMALY DETECTION

Log lines, alerts, and events embedded and clustered. Outliers in vector space = unusual behavior. A classic SRE use case.

Vector databases (Pinecone, Weaviate, pgvector) store millions of embeddings and serve approximate nearest-neighbor queries in milliseconds. The infrastructure layer underneath most production RAG systems.
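A brute-force version of the embed-store-retrieve pipeline fits in a few lines. The embedder below is a hypothetical bag-of-words stand-in - it only captures exact-token overlap, whereas a real model's vectors supply the semantic generalization - and vector databases replace the exact dot-product scan with approximate nearest-neighbor indexes:

```python
import numpy as np

VOCAB = {}   # token → dimension index, grown on the fly

def embed(text, dim=256):
    """Toy embedder: normalized bag-of-words (stand-in for a real model)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        i = VOCAB.setdefault(tok, len(VOCAB))
        v[i] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "high-performance PostgreSQL tuning",
    "BGP route reflector configuration",
    "modular synth oscillator basics",
]
index = np.stack([embed(d) for d in docs])   # the "vector database"

def search(query, k=1):
    sims = index @ embed(query)              # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-sims)[:k]]

print(search("PostgreSQL performance"))
```

Because every stored vector is unit-length, the dot product against the index is exactly the cosine similarity from section 02, and `argsort` is the exhaustive version of what an ANN index approximates at scale.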