An embedding is a learned mapping from a discrete token (a word, a subword piece, an image patch) into a continuous high-dimensional vector. The vectors are not hand-crafted - they emerge from training on massive corpora. The model discovers that meaning can be encoded as geometry.
The key emergent property: semantically similar tokens end up close in vector space. Tokens that share syntactic roles cluster together. Relationships between concepts become directions in space. Arithmetic on vectors corresponds to arithmetic on meaning.
The famous example - "king - man + woman ≈ queen" - is not programmed. It emerges from the statistical structure of language alone. The "royalty" direction and the "gender" direction are discovered independently and happen to compose linearly.
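The analogy can be checked mechanically: do the vector arithmetic, then look for the nearest neighbor. A minimal sketch with hand-set 2-D toy vectors (the axes and numbers are invented for illustration; real embeddings have thousands of entangled dimensions):

```python
import numpy as np

# Toy 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
# Invented values for illustration only, not real model outputs.
emb = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

def nearest(vec, exclude):
    """Return the vocabulary word whose embedding has the highest
    cosine similarity to vec, skipping the query words themselves."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

# king - man + woman lands nearest to queen
target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words is standard practice in analogy evaluation, since the result vector usually stays closest to one of its own inputs.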
An embedding vector is just a long list of numbers:

[ 0.558, -0.119, 0.930, -0.447, 0.663,
  ... 4086 more dimensions ... ]
Each number is a weight across a learned axis of meaning. No single dimension corresponds to a human-interpretable concept - it's a superposition across thousands of features.
Distance in embedding space is typically measured with cosine similarity - the angle between two vectors, not their magnitude. This normalizes away token-frequency effects (common words tend to have large-magnitude vectors) and focuses purely on directional similarity.
A similarity ranking for "dog" would place related words like "puppy" near the top and unrelated words near the bottom.
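Cosine similarity is a one-liner to compute. A minimal numpy sketch with random stand-in vectors (not real model outputs) showing both the similar/unrelated contrast and the magnitude invariance:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: dot product of the two normalized vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dog = rng.normal(size=128)
puppy = dog + 0.3 * rng.normal(size=128)   # "similar": dog plus small noise
tractor = rng.normal(size=128)             # "unrelated": independent vector

print(cosine_similarity(dog, puppy))    # close to 1
print(cosine_similarity(dog, tractor))  # near 0
# Scaling a vector changes its magnitude but not its direction, so the
# score is unchanged - frequency-driven magnitude differences wash out.
print(abs(cosine_similarity(dog, 10 * puppy) - cosine_similarity(dog, puppy)) < 1e-9)  # True
```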
Word2Vec (Mikolov et al. 2013) popularized the approach. Two architectures:
CBOW (Continuous Bag of Words): predict center word from surrounding context. Fast, good for frequent words.
Skip-gram: predict surrounding context from the center word. Slower, but better for rare words and fine-grained semantics; the default choice for high-quality embeddings.
Negative sampling: for each positive context pair, sample k random "negative" pairs. Model must discriminate real context from noise. More efficient than softmax over full vocabulary.
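The discrimination objective reduces to k+1 sigmoid terms per training pair instead of a softmax over the whole vocabulary. A numpy sketch of the skip-gram negative sampling loss (random toy vectors, not trained embeddings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair.

    center, context: embedding vectors of the true pair.
    negatives: k vectors sampled from the noise distribution. Training
    pushes the real pair's dot product up and the sampled pairs' down -
    a binary discrimination costing k+1 dot products, not a softmax
    over the full vocabulary.
    """
    pos = -np.log(sigmoid(center @ context))
    neg = -np.sum(np.log(sigmoid(-(negatives @ center))))
    return pos + neg

rng = np.random.default_rng(0)
d, k = 50, 5
center = rng.normal(scale=0.1, size=d)
context = rng.normal(scale=0.1, size=d)
negatives = rng.normal(scale=0.1, size=(k, d))
print(sgns_loss(center, context, negatives))
```

Minimizing this loss aligns true pairs (large positive dot product) while pushing sampled noise pairs toward negative dot products.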
Modern LLMs don't embed whole words - they use subword tokenization. BPE (Byte Pair Encoding) and related schemes such as SentencePiece iteratively merge the most frequent adjacent symbol pairs until a target vocabulary size is reached (~32k-100k tokens).
Rare words get split into subwords. Common words may be single tokens. "BGP" might be one token in a code-heavy model but split in a general one. Tokenization shapes what the model can "see" efficiently.
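The merge loop is simple enough to sketch end to end. A toy BPE on a three-word corpus (the corpus and merge count are invented for illustration):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Greedy BPE on a toy corpus: repeatedly merge the most frequent
    adjacent symbol pair. `words` maps word -> count."""
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        vocab = {tuple(_apply(sym, best, merged)): c
                 for sym, c in vocab.items()}
    return merges, vocab

def _apply(symbols, pair, merged):
    # Rewrite one word, replacing each occurrence of `pair` with `merged`.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

corpus = {"low": 5, "lower": 2, "lowest": 3}
merges, vocab = bpe_merges(corpus, 3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The shared stem "low" gets merged into a single symbol first, which is exactly why frequent strings end up as single tokens while rare words stay split.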
Word2Vec produces static embeddings - "bank" always maps to the same vector regardless of context. This fails for polysemy.
Transformer models produce contextual embeddings - the representation of each token is computed by attending to the entire context. "bank" in "river bank" and "bank account" produce different vectors because the surrounding tokens reshape the representation through attention.
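A toy reduction of that mechanism: single-head attention with identity query/key/value projections (an assumption made here to keep the sketch short; real transformers use learned projections and many layers). The same "bank" vector comes out different depending on its neighbors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Toy single-head attention with identity projections: each
    token's output is a similarity-weighted mix of all token vectors."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ X

rng = np.random.default_rng(0)
d = 8
bank = rng.normal(size=d)       # one static input embedding for "bank"
river = rng.normal(size=d)
account = rng.normal(size=d)

ctx1 = self_attention(np.stack([river, bank]))[1]    # "river bank"
ctx2 = self_attention(np.stack([account, bank]))[1]  # "bank account"
# Same input token, different contextual representations:
print(np.allclose(ctx1, ctx2))  # False
```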
ELMo (2018) was first to demonstrate this at scale. BERT (2018) made it the dominant paradigm. Every modern LLM produces contextual representations; static embeddings persist mainly in retrieval/RAG pipelines, where cheap precomputed vectors matter more than context sensitivity.
A surprising finding from mechanistic interpretability research: transformer models appear to store far more features than they have dimensions. A 4096-dimensional model shouldn't be able to represent more than 4096 orthogonal concepts - but empirically they seem to represent millions.
The proposed explanation is superposition: features are encoded as nearly-orthogonal directions in high-dimensional space, slightly interfering with each other. Sparse activation means most features are off at any time, limiting interference.
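The geometric room for this is easy to demonstrate: random unit vectors in high dimensions are nearly orthogonal, so far more than d feature directions can coexist with only small pairwise interference. A numpy check (dimensions chosen arbitrarily for the demo):

```python
import numpy as np

# 4x more "feature directions" than dimensions, all nearly orthogonal.
rng = np.random.default_rng(0)
d, n = 512, 2048
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit vectors

dots = V @ V.T
np.fill_diagonal(dots, 0.0)                     # ignore self-similarity
# Typical |cos| between two random directions is ~1/sqrt(d) ≈ 0.044,
# so 2048 features fit in 512 dimensions with small cross-talk.
print(np.abs(dots).mean(), np.abs(dots).max())
```

With sparse activation, the small residual dot products rarely add up to meaningful interference, which is the superposition story in miniature.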
This is related to why monosemantic neurons (neurons that respond to exactly one concept) are rare. Most neurons are polysemantic - they participate in encoding many different features depending on context. Anthropic's mechanistic interpretability research is actively trying to decompose these superposed representations using sparse autoencoders.
Embed documents and queries. Retrieve by cosine similarity. Feed relevant docs to LLM. Long-term memory without fine-tuning.
Query "fast database" finds "high-performance PostgreSQL" without keyword overlap. Meaning, not string matching.
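The retrieval step is just a cosine-similarity ranking over precomputed vectors. A sketch with hand-set toy embeddings standing in for a real encoder (the document texts and all numbers are invented; a production system would call a trained embedding model):

```python
import numpy as np

# Pretend outputs of an embedding model: related texts point the
# same way. Invented 3-D vectors for illustration only.
docs = {
    "high-performance PostgreSQL tuning": np.array([0.9, 0.1, 0.0]),
    "sourdough starter maintenance":      np.array([0.0, 0.2, 0.9]),
    "low-latency key-value store":        np.array([0.8, 0.3, 0.1]),
}

def retrieve(query_vec, k=2):
    """Rank documents by cosine similarity to the query vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(docs, key=lambda d: cos(docs[d], query_vec),
                  reverse=True)[:k]

query = np.array([0.85, 0.2, 0.05])   # toy embedding of "fast database"
print(retrieve(query))                # database docs rank first
```

No keyword overlaps with "fast database", yet the database documents win because their vectors point the same way - the retrieved text is then passed to the LLM as context.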
k-means on embedding space groups semantically related documents. Topic modeling without explicit labels.
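A minimal k-means over embedding vectors, with two synthetic "topics" standing in for real document embeddings (all data is generated for the demo):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: alternate assigning each point to its nearest
    centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None], axis=-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two synthetic "topics": embeddings drawn around two separated centers.
rng = np.random.default_rng(1)
topic_a = rng.normal(loc=+2.0, scale=0.3, size=(20, 16))
topic_b = rng.normal(loc=-2.0, scale=0.3, size=(20, 16))
X = np.vstack([topic_a, topic_b])

labels, _ = kmeans(X, k=2)
# Documents from the same topic land in the same cluster.
print(labels[:20], labels[20:])
```

No labels were supplied; the grouping falls out of the geometry alone.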
Log lines, alerts, events embedded and clustered. Outliers in vector space = unusual behavior. Your SRE use case.
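The outlier test can be as simple as distance from the corpus centroid. A sketch with synthetic vectors standing in for embedded log lines (real log embeddings would come from a model; the injected anomaly and threshold rule are illustrative choices):

```python
import numpy as np

# 200 "normal" log embeddings clustered near the origin, plus one
# injected anomaly far away. All vectors are synthetic stand-ins.
rng = np.random.default_rng(2)
normal_logs = rng.normal(loc=0.0, scale=0.5, size=(200, 32))
anomaly = np.full((1, 32), 4.0)
X = np.vstack([normal_logs, anomaly])

centroid = X.mean(axis=0)
dist = np.linalg.norm(X - centroid, axis=1)
# Flag points more than 3 standard deviations beyond the mean distance.
threshold = dist.mean() + 3 * dist.std()
outliers = np.where(dist > threshold)[0]
print(outliers)  # [200] - only the injected anomaly is flagged
```

In practice the same pattern runs against a rolling window of embedded events, with the threshold tuned to the alert budget.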