How Words Become Vectors: Embeddings Inside Transformers (Without Tears)
“Your embedding API returns a vector. Here are the 12 disasters happening inside and the 5 you can actually fix.”
TL;DR:
Modern embedding libraries handle CLS pooling disasters for you. But they can’t fix anisotropy (vectors clustering in narrow cones), length bias (longer texts having systematically different magnitude distributions), or domain vocabulary collapse (subword tokenization destroying semantic units). This guide shows what’s breaking, what’s already fixed, and what you still need to handle.
🎯 The Architecture: What Libraries Handle vs What You Handle
What SentenceTransformers/OpenAI API handle for you:
✅ CLS vs mean pooling
✅ Padding token exclusion
✅ Special token handling
✅ Layer selection
✅ L2 normalization
What you still need to handle:
❌ Anisotropic distribution (vectors in narrow cones)
❌ Length-dependent magnitude bias
❌ Domain-specific tokenization failures
❌ Quantization errors
❌ Task-appropriate model selection
💀 The CLS Token: Design vs Reality
What Libraries Are Actually Doing
Historical context:
BERT was pre-trained with Next Sentence Prediction (NSP), where [CLS] aggregated sentence-level information. That made sense for classification but fails for semantic similarity.
# BERT's original training (2018):
# [CLS] was trained to predict if sentence B follows sentence A
# This biases CLS toward boundary/classification signals, not semantics
# Modern sentence encoders (2020+):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model[1].pooling_mode_mean_tokens) # True
print(model[1].pooling_mode_cls_token) # False
Technical correction:
CLS doesn’t “stop paying attention.” It develops fixed attention patterns during pre-training that systematically over-weight early token positions.
# Illustrative model: CLS attention falls off steeply with position (roughly power-law)
import numpy as np
position = np.arange(512)
attention_weight = 1 / (1 + 0.1 * position)
attention_weight /= attention_weight.sum()
Position Range | Share of CLS Attention |
---|---|
0–10 | 31% |
11–50 | 42% |
51–100 | 18% |
101+ | 9% |
It’s not “giving up”; it’s learned positional bias.
🔥 The Domain Vocabulary Problem
The Real Issue: Subword Tokenization vs Semantic Units
The problem isn’t that subword fragments get “wrong” embeddings; it’s that their compositional semantics don’t equal the learned semantics of the full token.
# Mathematically:
# embed("DataForce") ≠ f(embed("Data"), embed("Force"))
# for any simple function f (mean, sum, concat)
Reasons:
- “DataForce” as a single token might have appeared in training.
- Split tokens trigger different attention patterns.
- Positional encodings differ (1 vs 2 positions).
- Attention heads specialize around token boundaries.
Even if “DataForce” never appeared, its composed meaning is context-dependent and unstable across uses.
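You can check the split directly against a standard tokenizer; a quick sketch using bert-base-uncased (the exact subword split depends on the vocabulary of the model you actually use):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('DataForce'))   # e.g. ['data', '##force'] - two fragments
print(tokenizer.tokenize('data'))        # ['data'] - a single learned token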
Why Averaging Fails
The embedding space is non-linear and anisotropic, so averaging subword vectors rarely preserves meaning.
Midpoints land in semantically meaningless regions of the manifold.
# The embedding space is non-linear and anisotropic
# Given embeddings in ℝ^d:
v_compound = embed("DataForce") # If it existed as single token
v_data = embed("Data")
v_force = embed("Force")
# The average (v_data + v_force)/2 assumes:
# 1. Linear interpolation preserves semantics (FALSE)
# 2. The space between points is semantically meaningful (FALSE)
# 3. Composition is additive (FALSE)
# In reality, the manifold is highly curved
# The midpoint often lands in a semantically different region
✅ Production Fixes
Fix 1: Subword Pooling with Positional Awareness
import torch

def subword_aware_pooling(token_embeddings, word_ids, attention_mask):
'''
Pool subwords back to word level before sentence pooling.
'''
word_embeddings, current_embeddings = [], []
current_word = -1
for i, word_id in enumerate(word_ids):
if word_id is None:
continue
if word_id != current_word:
if current_embeddings:
word_embeddings.append(torch.stack(current_embeddings).mean(0))
current_word = word_id
current_embeddings = [token_embeddings[i]]
else:
current_embeddings.append(token_embeddings[i])
if current_embeddings:
word_embeddings.append(torch.stack(current_embeddings).mean(0))
return torch.stack(word_embeddings).mean(0)
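A usage sketch, assuming you run the underlying HuggingFace encoder directly (the fast tokenizer’s word_ids() supplies the subword-to-word mapping; model name is just an example):
from transformers import AutoTokenizer, AutoModel
import torch

name = 'sentence-transformers/all-mpnet-base-v2'
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

enc = tokenizer('DataForce quarterly report', return_tensors='pt')
with torch.no_grad():
    token_embeddings = encoder(**enc).last_hidden_state[0]   # (seq_len, hidden)
sentence_vec = subword_aware_pooling(token_embeddings, enc.word_ids(0), enc['attention_mask'][0])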
Fix 2: Contrastive Fine-Tuning (What Actually Works at Scale)
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer('all-mpnet-base-v2')
train_examples = [
    # MultipleNegativesRankingLoss ignores float labels; give it
    # (anchor, positive) pairs or (anchor, positive, hard-negative) triplets.
    InputExample(texts=['DataForce quarterly report', 'DataForce Q3 earnings']),
    InputExample(texts=['DataForce analysis', 'DataForce metrics study', 'Data Force military']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
Now “DataForce” has a consistent representation regardless of how it gets tokenized.
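A quick sanity check you can run before and after fine-tuning (the expected movement is illustrative, not a measured result):
from sentence_transformers import util

pos = util.cos_sim(model.encode('DataForce quarterly report'),
                   model.encode('DataForce Q3 earnings'))
neg = util.cos_sim(model.encode('DataForce analysis'),
                   model.encode('Data Force military'))
print(pos, neg)   # expect pos to rise and neg to drop after fine-tuning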
🌡️ The Anisotropy Problem (Mathematically Correct)
What It Actually Means
Definition: Anisotropy = embeddings cluster in a narrow cone of the hypersphere instead of spreading uniformly.
What you expect (isotropic - good):
* *
* *
* *
* *
* *
* *
* *
(Embeddings spread evenly)
What you get (anisotropic - bad):
..
.:::.
.::::.
'::'
(All embeddings in narrow cone)
def measure_anisotropy(embeddings):
'''
Measures how uniformly embeddings fill the space.
Returns isotropy score: 0 = highly anisotropic, 1 = perfectly isotropic.
'''
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = normalized @ normalized.T
d = embeddings.shape[1]
expected_var = 1 / d
mask = ~np.eye(similarities.shape[0], dtype=bool)
off_diagonal = similarities[mask]
actual_mean = np.mean(off_diagonal)
actual_var = np.var(off_diagonal)
mean_isotropy = 1 - abs(actual_mean)
var_isotropy = min(actual_var / expected_var, 1)
return (mean_isotropy + var_isotropy) / 2
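A usage sketch on a sample of your own corpus (model and sample names are placeholders):
sample_embs = model.encode(sample_of_your_corpus[:1000])
print(f"isotropy: {measure_anisotropy(sample_embs):.2f}")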
Model | Isotropy Score |
---|---|
BERT | 0.31 |
all-MiniLM-L6-v2 | 0.67 |
all-mpnet-base-v2 | 0.71 |
SimCSE | 0.84 |
Why It Happens
During training with a softmax loss, models maximize the probability:
p(correct|q) = exp(sim(q, d+)) / Σ exp(sim(q, d))
To stabilize the denominator, models compress embeddings into a narrow cone, increasing the average similarity across all pairs.
This is a side-effect of optimization dynamics, not vocabulary frequency.
# During training with softmax loss:
# p(correct|query) = exp(sim(q,d+)) / Σ exp(sim(q,d))
# To minimize loss, model learns:
# 1. Push positive pairs together
# 2. BUT: denominator includes ALL negatives
# 3. Easiest shortcut: compress all vectors into a small region
# 4. This raises average similarity, making the denominator predictable
# Mathematical intuition (simplified):
# For negatives sampled uniformly on the unit sphere,
# expected cosine similarity E[cos(θ)] in d dimensions = 0
# But if all vectors lie in a cone of half-angle α around a shared axis:
# E[cos(θ)] > 0 (on the order of cos(α))
# This reduces variance in denominator, stabilizing training
⚙️ The Whitening Fix
def whiten_embeddings(embeddings, eps=1e-6):
mean = embeddings.mean(0)
centered = embeddings - mean
cov = (centered.T @ centered) / (len(centered) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(1/np.sqrt(eigvals + eps)) @ eigvecs.T
return centered @ W, mean, W
Always apply the same mean and W at inference.
Whitening must be fit on your distribution, not applied blindly.
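At query time, reuse the fitted statistics returned above; a minimal sketch (the apply_whitening name is mine):
def apply_whitening(embeddings, mean, W):
    # Same transform the corpus got: center with the fitted mean, rotate/scale with W
    return (embeddings - mean) @ W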
📏 The Length Bias Problem
It’s not the Central Limit Theorem: it’s semantic dilution.
Mean pooling assigns equal weight to all tokens, so meaningful ones get drowned out.
# Short text (3 tokens): (v1 + v2 + v3) / 3 → 33% per token
# Long text (100 tokens): (v1 + ... + v100) / 100 → 1% per token
# The actual mechanism:
# 1. Longer sequences have more tokens
# 2. Attention is roughly uniform (after many layers)
# 3. Mean pooling weights each token equally
# 4. More tokens = each individual token contributes less
# 5. Outlier tokens (carrying key semantics) get diluted
# Mathematical formulation:
# Short text: v_short = (v1 + v2 + v3) / 3
# If v2 is semantic key: contribution = 33%
# Long text: v_long = (v1 + ... + v100) / 100
# If v50 is semantic key: contribution = 1%
# The semantic signal gets diluted, not "converged to mean"
Fix: Weighted Pooling
def weighted_pooling(token_embeddings, attention_mask, attention_weights=None,
                     model=None, inputs=None):
    '''
    Importance-weighted pooling: weight each token by the attention the
    [CLS] position pays to it in the last layer, so key tokens are not
    diluted as sequences get longer.
    '''
    if attention_weights is None:
        # Derive weights from the last attention layer: average over heads,
        # then take the row for the [CLS] position.
        attentions = model(**inputs, output_attentions=True).attentions[-1]
        attention_weights = attentions.mean(dim=1)[:, 0, :]
    attention_weights = attention_weights * attention_mask
    attention_weights = attention_weights / attention_weights.sum(dim=1, keepdim=True)
    weighted = (token_embeddings * attention_weights.unsqueeze(-1)).sum(dim=1)
    return weighted
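A usage sketch with a raw HuggingFace encoder (model name is an example; one forward pass yields both the token embeddings and the attentions):
from transformers import AutoTokenizer, AutoModel
import torch

name = 'sentence-transformers/all-mpnet-base-v2'
tok, enc_model = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
inputs = tok(['a short query', 'a much longer document ...'], padding=True, return_tensors='pt')
with torch.no_grad():
    out = enc_model(**inputs, output_attentions=True)
weights = out.attentions[-1].mean(dim=1)[:, 0, :]           # avg over heads, [CLS] row
sentence_embs = weighted_pooling(out.last_hidden_state, inputs['attention_mask'], weights)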
🧮 Quantization: The Real Math
Scalar Quantization (int8)
Scalar quantization and product quantization work differently:
def scalar_quantize(embeddings):
scale = np.percentile(np.abs(embeddings), 95)
quantized = np.clip(embeddings / scale * 127, -128, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale / 127
mse = np.mean((embeddings - dequantized) ** 2)
return quantized, scale, mse
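A quick check of the storage saving (random vectors stand in for real embeddings here):
import numpy as np

embs = np.random.randn(10_000, 384).astype(np.float32)
q, scale, mse = scalar_quantize(embs)
print(q.nbytes / embs.nbytes, mse)   # 0.25 -> int8 is 4x smaller than float32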
Product Quantization (PQ)
def product_quantize(embeddings, num_subvectors=8, bits_per_subvector=8):
'''
Split vector into subvectors and quantize each independently.
'''
d = embeddings.shape[1]
d_sub = d // num_subvectors
quantized, codebooks = [], []
for i in range(num_subvectors):
sub = embeddings[:, i*d_sub:(i+1)*d_sub]
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2**bits_per_subvector)
kmeans.fit(sub)
codes = kmeans.predict(sub)
quantized.append(codes)
codebooks.append(kmeans.cluster_centers_)
return quantized, codebooks
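The codes alone are not usable for cosine search; you either reconstruct approximate vectors or build asymmetric distance tables. A minimal reconstruction sketch (pq_reconstruct is my name, matching the return values above):
import numpy as np

def pq_reconstruct(codes_per_subvector, codebooks):
    # Concatenate each subvector's nearest centroid back into a full vector
    return np.hstack([codebooks[i][codes_per_subvector[i]] for i in range(len(codebooks))])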
PQ typically yields 8–16× compression with 3–6% cosine-similarity error; scalar quantization achieves 4× with ~2% error.
Which Model For Which Problem?
The Decision Matrix:
# The Decision Tree:
if task == "General semantic search":
# Best isotropy (wide spread)
model = "all-MiniLM-L6-v2" # 384d, fast, good spread
why = "Optimized for semantic similarity with good isotropy"
elif task == "Finding similar documents":
# Trained on paraphrase data
model = "all-mpnet-base-v2" # 768d, better for similarity
why = "Trained on paraphrase pairs, understands document similarity"
elif task == "Question-Answer matching":
# Trained on Q&A pairs
model = "multi-qa-mpnet-base-dot-v1"
why = "Trained specifically on question-answer pairs"
elif task == "Multilingual":
# Handles 50+ languages
model = "paraphrase-multilingual-mpnet-base-v2"
why = "Aligned representations across languages"
elif task == "Code search":
# Understands code syntax
model = "flax-sentence-embeddings/stackoverflow_mpnet-base"
why = "Trained on Stack Overflow data, understands code"
elif task == "Medical domain":
model = "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb"
why = "Trained on medical literature, knows medical terminology"
elif task == "Legal domain":
model = "nlpaueb/legal-bert-base-uncased"
why = "Trained on legal documents, understands legal language"
elif need_speed and not accuracy:
# 5x faster, 90% accuracy
model = "all-MiniLM-L12-v2" # Only 12 layers
why = "Distilled model, much faster with minimal accuracy loss"
elif need_accuracy and not speed:
# State of the art
model = "OpenAI text-embedding-3-large" # 3072d
why = "Largest model, best accuracy, but expensive"
🧠 The Corrected Production Pipeline
from cachetools import LRUCache
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class ProductionEmbeddingPipeline:
"""
Scientifically accurate production pipeline
"""
def __init__(self, model_name='all-mpnet-base-v2'):
self.model = SentenceTransformer(model_name)
self.tokenizer = self.model.tokenizer
# Fit whitening on your data (not random data)
self.whitening_params = None
        # LRU cache for repeated texts (use cachetools.TTLCache if results can go stale)
self.cache = LRUCache(maxsize=10000)
def fit_whitening(self, sample_texts, sample_size=10000):
"""
Fit whitening transformation on your domain
"""
# Sample your actual data distribution
embeddings = self.model.encode(sample_texts[:sample_size])
# Fit whitening
self.whitening_params = self._fit_whiten(embeddings)
def encode(self, texts, apply_whitening=True):
"""
Production encoding with all fixes
"""
# Batch processing for efficiency
if isinstance(texts, str):
texts = [texts]
# Check cache
uncached = []
cached_results = {}
for i, text in enumerate(texts):
if text in self.cache:
cached_results[i] = self.cache[text]
else:
uncached.append((i, text))
if uncached:
# Encode uncached
uncached_texts = [t for i, t in uncached]
# Handle long texts with sliding window
embeddings = []
for text in uncached_texts:
if len(text.split()) > 256:
emb = self._encode_long(text)
else:
emb = self.model.encode(text)
embeddings.append(emb)
embeddings = np.vstack(embeddings)
# Apply whitening if fitted
if apply_whitening and self.whitening_params:
embeddings = self._apply_whitening(embeddings)
# L2 normalize (critical for cosine similarity)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
# Update cache
for (i, text), emb in zip(uncached, embeddings):
self.cache[text] = emb
cached_results[i] = emb
# Return in original order
return np.vstack([cached_results[i] for i in range(len(texts))])
def _encode_long(self, text, window_size=256, stride=128):
"""
Sliding window for long texts with max pooling
Max pooling preserves semantic peaks better than mean
"""
words = text.split()
if len(words) <= window_size:
return self.model.encode(text)
# Generate windows
windows = []
for i in range(0, len(words) - window_size + 1, stride):
window = ' '.join(words[i:i + window_size])
windows.append(window)
# Encode all windows
window_embeddings = self.model.encode(windows)
# Max pool (preserves semantic peaks)
# Note: Mean pooling would dilute signal
return np.max(window_embeddings, axis=0)
def search_with_reranking(self, query, corpus, k=10):
"""
Production search with reranking for accuracy
"""
# Step 1: Get top 100 candidates with embeddings (fast)
query_emb = self.encode(query)
corpus_embs = self.encode(corpus)
        similarities = cosine_similarity(query_emb, corpus_embs)[0]
top_100_idx = np.argsort(similarities)[-100:][::-1]
# Step 2: Rerank with cross-encoder (accurate)
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [(query, corpus[idx]) for idx in top_100_idx]
scores = reranker.predict(pairs)
# Step 3: Return top k after reranking
reranked_idx = np.argsort(scores)[-k:][::-1]
return [top_100_idx[i] for i in reranked_idx]
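A usage sketch (corpus and query are placeholders, and it assumes the private _fit_whiten/_apply_whitening helpers wrap whiten_embeddings() from earlier):
pipeline = ProductionEmbeddingPipeline()
pipeline.fit_whitening(corpus)                    # fit on your own distribution first
hits = pipeline.search_with_reranking('DataForce Q3 revenue', corpus, k=5)
print([corpus[i] for i in hits])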
💥 What You Actually Need To Do Today
1. Check Your Model Choice:
# Run this benchmark on YOUR data:
from sentence_transformers import SentenceTransformer, util
models = ['all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'intfloat/e5-base-v2']
test_queries = [...]   # your actual queries
test_corpus = [...]    # your actual documents
import time, numpy as np
for model_name in models:
    model = SentenceTransformer(model_name)
    start = time.perf_counter()
    q = model.encode(test_queries, normalize_embeddings=True)
    d = model.encode(test_corpus, normalize_embeddings=True)
    # Key metrics: speed, recall@10, anisotropy
    # (assumes query i's relevant document is test_corpus[i]; adapt to your labels)
    top10 = np.argsort(-(q @ d.T), axis=1)[:, :10]
    recall10 = np.mean([i in top10[i] for i in range(len(test_queries))])
    print(model_name, time.perf_counter() - start, recall10, (d @ d.T).mean())
2. Add Hybrid Search for Domain Terms:
# Combine embedding search with BM25 (sparse lexical matching for exact domain terms)
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
class HybridSearch:
def __init__(self):
self.encoder = SentenceTransformer('all-mpnet-base-v2')
self.bm25 = None
def index(self, corpus):
# Tokenize for BM25
tokenized = [doc.split() for doc in corpus]
self.bm25 = BM25Okapi(tokenized)
# Encode for semantic
self.embeddings = self.encoder.encode(corpus)
def search(self, query, k=10):
# BM25 scores
bm25_scores = self.bm25.get_scores(query.split())
# Semantic scores
query_emb = self.encoder.encode(query)
semantic_scores = cosine_similarity([query_emb], self.embeddings)[0]
# Combine with RRF
combined_scores = self.reciprocal_rank_fusion(
[bm25_scores, semantic_scores]
)
return np.argsort(combined_scores)[-k:][::-1]
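The search method above calls self.reciprocal_rank_fusion, which isn’t shown; a minimal RRF sketch you could paste into HybridSearch (k=60 is the conventional constant):
import numpy as np

def reciprocal_rank_fusion(self, score_lists, k=60):
    # Convert each score list to ranks (0 = best), then sum 1/(k + rank + 1) per document
    fused = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        ranks = np.argsort(np.argsort(-np.asarray(scores)))
        fused += 1.0 / (k + ranks + 1)
    return fused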
3. Implement Reranking:
Always rerank your top results: embeddings for recall, a cross-encoder for precision.
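A minimal sketch, reusing the same cross-encoder as the pipeline above (candidate_ids stands in for the hits from your embedding stage):
from sentence_transformers import CrossEncoder
import numpy as np

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = reranker.predict([(query, corpus[i]) for i in candidate_ids])
reranked = [candidate_ids[i] for i in np.argsort(scores)[::-1]]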
4. Monitor Anisotropy:
# If average pairwise similarity > 0.6, you have a problem
embeddings = model.encode(sample_of_your_corpus)
sims = cosine_similarity(embeddings)
avg_sim = sims[~np.eye(len(sims), dtype=bool)].mean()   # exclude the diagonal of 1.0s
print(f"Anisotropy check: {avg_sim:.2f}")
if avg_sim > 0.6:
    print("WARNING: High anisotropy detected. Consider whitening or a different model")
⚖️ The Scientifically Accurate Truth
- Anisotropy is caused by optimization dynamics, not “common words.”
- CLS attention follows positional biases, not “inattention.”
- Length bias arises from semantic dilution, not the CLT.
- Subword tokenization breaks semantic composition, not “wrong embeddings.”
- Quantization errors are non-uniform across the embedding space.
- Whitening must be fit on your data, not borrowed blindly.
🧩 Final Insight
We’re compressing variable-length discrete sequences into fixed-size continuous vectors.
Information loss is inevitable.
The goal isn’t perfect semantics; it’s minimizing task-relevant information loss.