An embedding is a learned mapping from a discrete token (a word, a subword piece, an image patch) into a continuous high-dimensional vector. The vectors are not hand-crafted - they emerge from training on massive corpora. The model discovers that meaning can be encoded as geometry.
The key emergent property: semantically similar tokens end up close in vector space. Tokens that share syntactic roles cluster together. Relationships between concepts become directions in space. Arithmetic on vectors corresponds to arithmetic on meaning.
The famous example - "king - man + woman ≈ queen" - is not programmed. It emerges from the statistical structure of language alone. The "royalty" direction and the "gender" direction are discovered independently and happen to compose linearly.
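The analogy can be checked mechanically: do the vector arithmetic, then look for the nearest neighbor. A minimal sketch with hand-set 2-D toy vectors (the axes and numbers are invented for illustration; real embeddings have thousands of entangled dimensions):

```python
import numpy as np

# Toy 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
# Invented values for illustration only, not real model outputs.
emb = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

def nearest(vec, exclude):
    """Return the vocabulary word whose embedding has the highest
    cosine similarity to vec, skipping the query words themselves."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

# king - man + woman lands nearest to queen
target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words is standard practice in analogy evaluation, since the result vector usually stays closest to one of its own inputs.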
An embedding vector is just a long list of numbers:

[ 0.558, -0.119, 0.930, -0.447, 0.663,
  ... 4086 more dimensions ... ]
Each number is a weight across a learned axis of meaning. No single dimension corresponds to a human-interpretable concept - it's a superposition across thousands of features.
Distance in embedding space is typically measured with cosine similarity - the angle between two vectors, not their magnitude. This normalizes away token-frequency effects (common words tend to have large-magnitude vectors) and focuses purely on directional similarity.
A similarity ranking for "dog" would place related words like "puppy" near the top and unrelated words near the bottom.
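Cosine similarity is a one-liner to compute. A minimal numpy sketch with random stand-in vectors (not real model outputs) showing both the similar/unrelated contrast and the magnitude invariance:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: dot product of the two normalized vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dog = rng.normal(size=128)
puppy = dog + 0.3 * rng.normal(size=128)   # "similar": dog plus small noise
tractor = rng.normal(size=128)             # "unrelated": independent vector

print(cosine_similarity(dog, puppy))    # close to 1
print(cosine_similarity(dog, tractor))  # near 0
# Scaling a vector changes its magnitude but not its direction, so the
# score is unchanged - frequency-driven magnitude differences wash out.
print(abs(cosine_similarity(dog, 10 * puppy) - cosine_similarity(dog, puppy)) < 1e-9)  # True
```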
Word2Vec (Mikolov et al. 2013) popularized the approach. Two architectures:
CBOW (Continuous Bag of Words): predict center word from surrounding context. Fast, good for frequent words.
Skip-gram: predict surrounding context from the center word. Slower, but better for rare words and fine-grained semantics; the default choice for high-quality embeddings.
Negative sampling: for each positive context pair, sample k random "negative" pairs. Model must discriminate real context from noise. More efficient than softmax over full vocabulary.
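The discrimination objective reduces to k+1 sigmoid terms per training pair instead of a softmax over the whole vocabulary. A numpy sketch of the skip-gram negative sampling loss (random toy vectors, not trained embeddings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair.

    center, context: embedding vectors of the true pair.
    negatives: k vectors sampled from the noise distribution. Training
    pushes the real pair's dot product up and the sampled pairs' down -
    a binary discrimination costing k+1 dot products, not a softmax
    over the full vocabulary.
    """
    pos = -np.log(sigmoid(center @ context))
    neg = -np.sum(np.log(sigmoid(-(negatives @ center))))
    return pos + neg

rng = np.random.default_rng(0)
d, k = 50, 5
center = rng.normal(scale=0.1, size=d)
context = rng.normal(scale=0.1, size=d)
negatives = rng.normal(scale=0.1, size=(k, d))
print(sgns_loss(center, context, negatives))
```

Minimizing this loss aligns true pairs (large positive dot product) while pushing sampled noise pairs toward negative dot products.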
Modern LLMs don't embed whole words - they use subword tokenization. BPE (Byte Pair Encoding) and related schemes such as SentencePiece iteratively merge the most frequent adjacent symbol pairs until a target vocabulary size is reached (~32k-100k tokens).
Rare words get split into subwords. Common words may be single tokens. "BGP" might be one token in a code-heavy model but split in a general one. Tokenization shapes what the model can "see" efficiently.
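The merge loop is simple enough to sketch end to end. A toy BPE on a three-word corpus (the corpus and merge count are invented for illustration):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Greedy BPE on a toy corpus: repeatedly merge the most frequent
    adjacent symbol pair. `words` maps word -> count."""
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        vocab = {tuple(_apply(sym, best, merged)): c
                 for sym, c in vocab.items()}
    return merges, vocab

def _apply(symbols, pair, merged):
    # Rewrite one word, replacing each occurrence of `pair` with `merged`.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

corpus = {"low": 5, "lower": 2, "lowest": 3}
merges, vocab = bpe_merges(corpus, 3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The shared stem "low" gets merged into a single symbol first, which is exactly why frequent strings end up as single tokens while rare words stay split.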
Word2Vec produces static embeddings - "bank" always maps to the same vector regardless of context. This fails for polysemy.
Transformer models produce contextual embeddings - the representation of each token is computed by attending to the entire context. "bank" in "river bank" and "bank account" produce different vectors because the surrounding tokens reshape the representation through attention.
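A toy reduction of that mechanism: single-head attention with identity query/key/value projections (an assumption made here to keep the sketch short; real transformers use learned projections and many layers). The same "bank" vector comes out different depending on its neighbors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Toy single-head attention with identity projections: each
    token's output is a similarity-weighted mix of all token vectors."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ X

rng = np.random.default_rng(0)
d = 8
bank = rng.normal(size=d)       # one static input embedding for "bank"
river = rng.normal(size=d)
account = rng.normal(size=d)

ctx1 = self_attention(np.stack([river, bank]))[1]    # "river bank"
ctx2 = self_attention(np.stack([account, bank]))[1]  # "bank account"
# Same input token, different contextual representations:
print(np.allclose(ctx1, ctx2))  # False
```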
ELMo (2018) was first to demonstrate this at scale. BERT (2018) made it the dominant paradigm. Every modern LLM produces contextual representations; static embeddings persist mainly in retrieval/RAG pipelines, where cheap precomputed vectors matter more than context sensitivity.
A surprising finding from mechanistic interpretability research: transformer models appear to store far more features than they have dimensions. A 4096-dimensional model shouldn't be able to represent more than 4096 orthogonal concepts - but empirically they seem to represent millions.
The proposed explanation is superposition: features are encoded as nearly-orthogonal directions in high-dimensional space, slightly interfering with each other. Sparse activation means most features are off at any time, limiting interference.
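The geometric room for this is easy to demonstrate: random unit vectors in high dimensions are nearly orthogonal, so far more than d feature directions can coexist with only small pairwise interference. A numpy check (dimensions chosen arbitrarily for the demo):

```python
import numpy as np

# 4x more "feature directions" than dimensions, all nearly orthogonal.
rng = np.random.default_rng(0)
d, n = 512, 2048
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit vectors

dots = V @ V.T
np.fill_diagonal(dots, 0.0)                     # ignore self-similarity
# Typical |cos| between two random directions is ~1/sqrt(d) ≈ 0.044,
# so 2048 features fit in 512 dimensions with small cross-talk.
print(np.abs(dots).mean(), np.abs(dots).max())
```

With sparse activation, the small residual dot products rarely add up to meaningful interference, which is the superposition story in miniature.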
This is related to why monosemantic neurons (neurons that respond to exactly one concept) are rare. Most neurons are polysemantic - they participate in encoding many different features depending on context. Anthropic's mechanistic interpretability research is actively trying to decompose these superposed representations using sparse autoencoders.
Embed documents and queries. Retrieve by cosine similarity. Feed relevant docs to LLM. Long-term memory without fine-tuning.
Query "fast database" finds "high-performance PostgreSQL" without keyword overlap. Meaning, not string matching.
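The retrieval step is just a cosine-similarity ranking over precomputed vectors. A sketch with hand-set toy embeddings standing in for a real encoder (the document texts and all numbers are invented; a production system would call a trained embedding model):

```python
import numpy as np

# Pretend outputs of an embedding model: related texts point the
# same way. Invented 3-D vectors for illustration only.
docs = {
    "high-performance PostgreSQL tuning": np.array([0.9, 0.1, 0.0]),
    "sourdough starter maintenance":      np.array([0.0, 0.2, 0.9]),
    "low-latency key-value store":        np.array([0.8, 0.3, 0.1]),
}

def retrieve(query_vec, k=2):
    """Rank documents by cosine similarity to the query vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(docs, key=lambda d: cos(docs[d], query_vec),
                  reverse=True)[:k]

query = np.array([0.85, 0.2, 0.05])   # toy embedding of "fast database"
print(retrieve(query))                # database docs rank first
```

No keyword overlaps with "fast database", yet the database documents win because their vectors point the same way - the retrieved text is then passed to the LLM as context.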
k-means on embedding space groups semantically related documents. Topic modeling without explicit labels.
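A minimal k-means over embedding vectors, with two synthetic "topics" standing in for real document embeddings (all data is generated for the demo):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: alternate assigning each point to its nearest
    centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None], axis=-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two synthetic "topics": embeddings drawn around two separated centers.
rng = np.random.default_rng(1)
topic_a = rng.normal(loc=+2.0, scale=0.3, size=(20, 16))
topic_b = rng.normal(loc=-2.0, scale=0.3, size=(20, 16))
X = np.vstack([topic_a, topic_b])

labels, _ = kmeans(X, k=2)
# Documents from the same topic land in the same cluster.
print(labels[:20], labels[20:])
```

No labels were supplied; the grouping falls out of the geometry alone.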
Log lines, alerts, events embedded and clustered. Outliers in vector space = unusual behavior. Your SRE use case.
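The outlier test can be as simple as distance from the corpus centroid. A sketch with synthetic vectors standing in for embedded log lines (real log embeddings would come from a model; the injected anomaly and threshold rule are illustrative choices):

```python
import numpy as np

# 200 "normal" log embeddings clustered near the origin, plus one
# injected anomaly far away. All vectors are synthetic stand-ins.
rng = np.random.default_rng(2)
normal_logs = rng.normal(loc=0.0, scale=0.5, size=(200, 32))
anomaly = np.full((1, 32), 4.0)
X = np.vstack([normal_logs, anomaly])

centroid = X.mean(axis=0)
dist = np.linalg.norm(X - centroid, axis=1)
# Flag points more than 3 standard deviations beyond the mean distance.
threshold = dist.mean() + 3 * dist.std()
outliers = np.where(dist > threshold)[0]
print(outliers)  # [200] - only the injected anomaly is flagged
```

In practice the same pattern runs against a rolling window of embedded events, with the threshold tuned to the alert budget.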