
Embedding Models Deep Dive: Training, Fine-Tuning, and Optimization for Retrieval

You will learn what embedding models are, how they represent meaning as vectors, and why embedding quality directly determines what your retrieval system can and cannot find.

You will understand how contrastive learning trains embeddings by pushing similar texts closer together and dissimilar texts further apart in vector space.

You will learn when to fine-tune a domain-specific model and how hard negative mining makes fine-tuning dramatically more effective.

You will see when to use a bi-encoder versus a cross-encoder, and why the production standard is a two-stage retrieval pipeline combining both.

Key takeaway: start with a strong pre-trained model, evaluate on your task with held-out data and real retrieval metrics, then fine-tune only when evaluation shows the model falls short.

Introduction

Imagine you have 10,000 support articles and a user asks: "How do I cancel my subscription?" You need to quickly find the most relevant articles, not by matching keywords, but by understanding meaning. The user might have typed "end my plan" or "stop being charged" and still meant exactly the same thing.

Embeddings are the technology that makes this possible. An embedding model converts text into a list of numbers, a vector, that captures its meaning. Texts with similar meanings produce similar vectors. You can then find relevant documents by finding the vectors closest to the query vector, regardless of the exact words used.

Embedding quality is one of the most important factors in a RAG (retrieval-augmented generation) system. On the same corpus, the gap between a mediocre and an excellent embedding model can be as large as, for illustration, 60% versus 95% retrieval accuracy. This article explains how embedding models work internally, how contrastive training teaches them to capture meaning, how to fine-tune them for a specific domain, and how to optimize them for production.

Problem Statement

Keyword-based search fails when users phrase queries differently from how documents are written. A user searching for "cancel subscription" will not find an article titled "Terminating your plan" if the system relies on exact word matching. This is the core problem that semantic search addresses.

But generic pre-trained embedding models also have limits. A general-purpose model trained on web text may not understand that "ACL" means "anterior cruciate ligament" in a medical context and "access control list" in a security context. Domain-specific vocabulary, abbreviations, and relationships require domain-specific training signal to be represented correctly in the embedding space.

Core Concepts and Terminology

Term	Definition
Embedding	A learned numerical representation of text as a vector in a continuous, high-dimensional space.
Vector space	The mathematical space in which embeddings live. Semantic similarity corresponds to geometric proximity.
Cosine similarity	The cosine of the angle between two vectors, ranging from -1 (opposite directions) to 1 (same direction). The standard metric for comparing embeddings.
BERT	Bidirectional Encoder Representations from Transformers, the encoder-only transformer architecture that most embedding models are built on.
Sentence Transformer	A BERT-based model fine-tuned specifically for producing high-quality sentence-level embeddings using contrastive learning.
Contrastive learning	A training objective that pushes embeddings of similar texts closer together and dissimilar texts further apart.
Hard negatives	Training examples that are superficially similar but semantically different, which force the model to learn fine-grained distinctions.
Bi-encoder	Architecture that encodes query and document separately and compares their embeddings. Fast, pre-computable, used for initial retrieval.
Cross-encoder	Architecture that encodes query and document together, producing a single relevance score. More accurate but cannot pre-compute document representations.
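MTEB	Massive Text Embedding Benchmark, the standard leaderboard for comparing embedding model performance. The original English benchmark averages over 56 datasets grouped into task types such as retrieval, clustering, classification, and reranking.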
Mean pooling	Averaging the output token embeddings from the transformer to produce a single sentence-level vector. The most common and effective pooling strategy.
Matryoshka Representation Learning (MRL)	A training technique that allows embedding vectors to be truncated to a shorter prefix without retraining, enabling flexible quality-cost trade-offs.

How It Works

Think of an embedding model as a sophisticated librarian who has read every book and has developed an intuition for which texts are about the same idea. When you hand the librarian a sentence, they place it on a giant map where related sentences are clustered together. Your query about "cancelling a subscription" lands near documents about "terminating service", "stopping a plan", and "ending an account", not because the words match, but because the librarian understands they are all about the same action.

**Figure:** The two core Word2vec training architectures. CBOW (left) predicts a target word from its surrounding context tokens; Skip-gram (right) does the reverse. Both objectives force the model to encode semantic relationships into vector positions, a principle closely related to the contrastive training that modern sentence embedding models like Sentence-BERT build upon. Source: Aelu013 / Wikimedia Commons (CC BY-SA 4.0)

Tokenization: The input text is split into tokens, sub-word units, and each is assigned a numerical ID. A sentence like "cancel subscription" becomes a sequence of token IDs.
Transformer encoding: The sequence of token IDs is processed through multiple transformer layers. Each layer uses self-attention: every token looks at every other token in the sentence to build a context-aware representation. The word "bank" in "river bank" develops a different internal representation from "bank" in "savings bank" because the surrounding tokens differ.
Pooling: The transformer produces one vector per token. To get a single vector for the whole sentence, mean pooling averages all token vectors, weighted to exclude padding tokens. This produces the final embedding for the sentence.
L2 normalization: The embedding vector is scaled to unit length. After normalization, the cosine similarity between two embeddings equals their dot product, a much cheaper computation that enables fast similarity search at scale.
Contrastive training: The model learns what "similar" means by seeing labeled pairs. Positive pairs (semantically equivalent sentences) are pushed close together in vector space; negative pairs (semantically different sentences) are pushed apart. The training objective is to make the model's similarity scores match human judgments of semantic similarity.
Retrieval: At query time, the query text is encoded into a vector using the same model. A vector similarity search (using tools like FAISS or a vector database) finds the corpus documents with the highest cosine similarity to the query vector. These are the retrieved candidates.
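
The pooling, normalization, and retrieval steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the token vectors, the attention mask, and the document IDs are made-up toy values, and a production system would use a tensor library and an approximate nearest-neighbor index instead of a full scan.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / max(count, 1) for value in total]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def search(query_vec, doc_vecs, k=2):
    """Brute-force top-k search using dot products of unit vectors."""
    q = l2_normalize(query_vec)
    scored = [(dot(q, l2_normalize(v)), doc_id) for doc_id, v in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

# Toy transformer output: three token vectors, the last one is padding.
query_tokens = [[1.0, 0.0, 1.0], [3.0, 2.0, 1.0], [9.0, 9.0, 9.0]]
query_mask = [1, 1, 0]
query_vec = mean_pool(query_tokens, query_mask)  # padding row is ignored

# Hypothetical pre-computed document embeddings.
docs = {
    "terminate-plan": [2.0, 1.0, 0.9],
    "reset-password": [0.1, 2.0, -1.0],
    "billing-faq": [1.0, 1.0, 1.0],
}
top = search(query_vec, docs, k=2)
print(top)
```

After normalization the dot product and the cosine similarity agree, which is why vector indexes can use the cheaper dot product on unit vectors.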

Practical Example

A healthcare company builds an internal knowledge base search for clinical staff. Clinicians ask questions about drug interactions, dosage guidelines, and patient care protocols. Generic embedding models trained on web text perform poorly: they do not know that "QD" means "once daily", that "contraindicated in" has a specific clinical meaning, or that "adverse event" and "side effect" are related but distinct concepts in this context.

The team collects training data from their existing search logs: query-document pairs where a clinician clicked on a result, indicating relevance. They use an LLM to generate additional synthetic pairs: for each key document, the LLM generates five plausible clinical questions that the document would answer. They mine hard negatives by finding documents that the current model ranks highly for each query but that are not actually relevant, such as documents about a different drug with a similar name.
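
A hard negative mining step like the one described could look roughly like the sketch below. It assumes document and query embeddings from the current model are already available as unit-length vectors; the function name, the document IDs, and the vectors are hypothetical placeholders, not the team's actual data.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negatives(query_vec, doc_vecs, relevant_ids, top_k=3, max_negatives=2):
    """Return the highest-ranked documents that are NOT labeled relevant.

    Documents the current model scores highly but that lack a relevance
    label are treated as hard negatives for this query.
    """
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]), reverse=True)
    candidates = ranked[:top_k]
    return [d for d in candidates if d not in relevant_ids][:max_negatives]

# Hypothetical embeddings from the current (pre-fine-tuning) model.
doc_vecs = {
    "drugA-dosage": [0.9, 0.1],
    "drugAx-dosage": [0.8, 0.3],    # similar name, different drug
    "drugA-storage": [0.7, 0.2],
    "visitor-parking": [0.0, 1.0],  # easy negative, never ranked highly
}
query_vec = [1.0, 0.0]
hard_negatives = mine_hard_negatives(query_vec, doc_vecs, relevant_ids={"drugA-dosage"})
print(hard_negatives)
```

One caveat with click-derived labels: a highly ranked document without a click may still be relevant, so mined negatives are usually worth spot-checking before training on them.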

Fine-tuning on this data with the Multiple Negatives Ranking Loss improves their Recall@5 metric from 61% to 87% on a held-out evaluation set. The improvement is concentrated on queries using clinical abbreviations and domain-specific terminology, exactly where the generic model was weakest.
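
To make the loss concrete, here is a minimal sketch of the idea behind Multiple Negatives Ranking Loss: within a batch, document i is the positive for query i, and every other document in the batch acts as a negative. The code is a plain-Python illustration on toy unit vectors, not a training loop; the `scale` value of 20 is an assumption chosen for illustration, and real implementations operate on model outputs with gradients.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mnr_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy over in-batch similarities.

    For query i, doc_vecs[i] is the positive; all other docs are negatives.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * dot(q, d) for d in doc_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]     # each positive matches its query
misaligned_docs = [[0.0, 1.0], [1.0, 0.0]]  # each positive looks like a negative

low = mnr_loss(queries, aligned_docs)
high = mnr_loss(queries, misaligned_docs)
print(low, high)
```

The loss is near zero when each query is closest to its own document and grows when another document in the batch scores higher, which is why adding mined hard negatives to the batch makes the objective harder and more informative.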

Advantages

Vocabulary-independent matching: Embedding-based retrieval finds relevant documents even when query words do not appear in the document text. A search for "chest pain" retrieves documents about "angina" and "myocardial infarction" because they occupy nearby regions of the semantic space.
Multilingual capability: Multilingual embedding models place texts from different languages in the same vector space. A query in English can retrieve relevant documents written in French or Spanish without any translation step.
Adaptability through fine-tuning: A pre-trained base model can be adapted to a new domain with relatively little labeled data: hundreds to thousands of pairs rather than millions. This makes high-quality domain-specific search achievable without building a model from scratch.
Scalable retrieval: Document embeddings are pre-computed once and stored in a vector index. At query time, only the query needs to be encoded, a single forward pass through the model, followed by an efficient approximate nearest-neighbor search. This scales to millions of documents with sub-second latency.

Limitations and Trade-offs

Context length limits: Most BERT-based models process a maximum of 512 tokens. Text beyond this limit is silently truncated, losing information from the end of the document. Long documents must be chunked into overlapping segments and embedded separately.
Embedding quality is task-specific: A model that scores highly on general MTEB benchmarks may perform poorly on your specific domain. MTEB scores are averages across 56 datasets; what matters for your use case is performance on a held-out evaluation set built from your actual query distribution.
Dimensionality-storage trade-off: Higher-dimensional embeddings can capture more semantic nuance but cost more in storage and slow down similarity search. A 3072-dimensional embedding takes 12 kilobytes per document at 32-bit float precision (3072 values at 4 bytes each), which is significant at millions of documents.
Bi-encoders sacrifice accuracy for speed: Because query and document are encoded separately in a bi-encoder, the model cannot compare specific words across query and document during encoding. Cross-encoders are significantly more accurate but cannot pre-compute document representations, making them too slow for initial retrieval over large corpora.
Fine-tuning requires labeled data: Fine-tuning is only beneficial when you have enough domain-specific query-document pairs to train on without overfitting. With very small datasets (under a few hundred pairs), a generic pre-trained model often generalizes better than an overfitted fine-tuned one.
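
The storage figure in the dimensionality trade-off is simple arithmetic, sketched below. The one-million-document corpus size is an assumed example, and the numbers cover raw vectors only, not index overhead or metadata.

```python
def index_size_bytes(num_docs, dims, bytes_per_value=4):
    """Raw vector storage: one value per dimension per document."""
    return num_docs * dims * bytes_per_value

def to_gib(num_bytes):
    return num_bytes / (1024 ** 3)

per_doc = index_size_bytes(1, 3072)                                     # float32
full = index_size_bytes(1_000_000, 3072)                                # float32
small = index_size_bytes(1_000_000, 384)                                # float32, smaller model
quantized = index_size_bytes(1_000_000, 3072, bytes_per_value=1)        # int8

print(per_doc, "bytes per document")
print(round(to_gib(full), 2), "GiB at 3072 dims, float32")
print(round(to_gib(small), 2), "GiB at 384 dims, float32")
print(round(to_gib(quantized), 2), "GiB at 3072 dims, int8")
```

Moving from 32-bit floats to 8-bit integers cuts raw storage by a factor of four, and dropping from 3072 to 384 dimensions cuts it by a factor of eight; whether the quality loss is acceptable has to be measured on your own evaluation set.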

Common Mistakes

Using only easy negatives during fine-tuning. Training with random or obviously different negatives produces embeddings that cannot distinguish between similar-looking but different texts. Hard negatives, texts that are superficially similar but semantically different, are the most valuable training signal for improving retrieval precision.
Skipping evaluation on your actual task. Teams that select a model based on MTEB leaderboard scores without evaluating on their own domain and query distribution often discover that the top-ranked general model is outperformed by a smaller, task-specific model on their use case.
Forgetting to normalize embeddings. Many vector databases normalize automatically, but if you compute similarity manually with a dot product, skipping L2 normalization gives incorrect results. The dot product equals cosine similarity only when both vectors have unit length.
Using older-generation models. OpenAI's text-embedding-ada-002 is an older-generation model that has been superseded by text-embedding-3-small (cost efficiency) and text-embedding-3-large (highest quality). Staying on a superseded model means missing quality and cost improvements, and it raises the risk of a forced migration if the provider eventually retires it.
Embedding whole documents without chunking. Documents longer than 512 tokens will be silently truncated by most models. For long documents, split into overlapping chunks (for example, 400 tokens with 50-token overlap) and embed each chunk. At retrieval time, retrieve the most relevant chunks, not documents.
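
A minimal sketch of overlapping chunking with the 400-token window and 50-token overlap mentioned above. It works on any list of tokens; the generated placeholder tokens here are an assumption for illustration, and in practice you would count tokens with the same tokenizer your embedding model uses.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into windows that share `overlap` tokens with their neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Placeholder document of 1000 tokens.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk is then embedded separately, and the shared tokens at the boundaries reduce the chance that a sentence relevant to a query is split across two chunks with neither containing enough context.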

Best Practices

Start with a well-benchmarked general-purpose model such as all-mpnet-base-v2 or text-embedding-3-large. Evaluate it on a held-out sample of your actual query-document pairs before investing in fine-tuning.
Build an evaluation set of 200 to 500 query-relevant document pairs from your domain before fine-tuning. You cannot measure improvement without a baseline to compare against.
Use retrieval metrics (Precision@K, Recall@K, NDCG), not similarity scores, to evaluate embedding quality. A high average cosine similarity score does not mean the model retrieves the right documents.
When fine-tuning, use the Multiple Negatives Ranking Loss with hard negative mining. Hard negatives, similar but non-relevant documents, provide the training signal that most improves retrieval precision on real queries.
Implement a two-stage retrieval pipeline for production: a bi-encoder retrieves the top 100 to 500 candidates quickly, then a cross-encoder re-ranks them to produce the final top 10. This combines the speed of bi-encoders with the accuracy of cross-encoders.
Cache embeddings for frequently accessed documents. Embedding generation is computationally expensive, and recomputing embeddings for the same text on every request wastes resources.
Process documents in batches rather than one at a time. GPU inference is optimized for parallel operations; batching can provide 10 to 50 times the throughput of sequential encoding.
For long documents, use overlapping chunks rather than hard splits to avoid cutting a sentence mid-thought at a chunk boundary.
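
The retrieval metrics recommended above are straightforward to compute once you have, for each evaluation query, the ranked list your system returned and the set of documents labeled relevant. The sketch below uses binary relevance and made-up document IDs; graded relevance would need a different gain term in NDCG.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: rewards relevant documents ranked near the top."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, d in enumerate(retrieved[:k]) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# One hypothetical evaluation query.
retrieved = ["d3", "d1", "d7", "d2", "d9"]  # system ranking, best first
relevant = {"d1", "d2", "d4"}               # labeled relevant documents

p5 = precision_at_k(retrieved, relevant, 5)
r5 = recall_at_k(retrieved, relevant, 5)
n5 = ndcg_at_k(retrieved, relevant, 5)
print(p5, r5, n5)
```

Averaging these values over the whole evaluation set gives the baseline to compare against after any model change or fine-tuning run.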

Comparison: Popular Embedding Models

Model	Dimensions	MTEB Score	Speed	Best For
all-MiniLM-L6-v2	384	58.8	Very Fast	High-volume, latency-critical applications
all-mpnet-base-v2	768	63.3	Medium	Balanced quality and speed, recommended starting point
e5-large-v2	1024	65.0	Slow	High-accuracy needs with self-hosted infrastructure
bge-large-en-v1.5	1024	64.2	Medium	General retrieval, strong open-source option
instructor-xl	768	66.8	Slow	Multi-task use cases where task-specific instructions improve quality
text-embedding-3-small	1536	62.3	API latency	Cost-efficient OpenAI API integration with MRL support
text-embedding-3-large	3072	64.6	API latency	Best OpenAI quality, MRL allows dimension truncation without retraining

FAQ

When should I fine-tune versus use an off-the-shelf model?

Start by evaluating a strong off-the-shelf model on your specific task using held-out data and real retrieval metrics. If Recall@5 or NDCG on your evaluation set is above 80 to 85%, the generic model is likely sufficient. Fine-tune when your domain has specialized vocabulary that generic models underperform on, when you have labeled query-document pairs (at least several hundred), and when evaluation shows the generic model misses relevant results on representative queries. The most common case that justifies fine-tuning is a corpus with technical terminology (medical, legal, financial, or code) that is underrepresented in general web training data.

What is hard negative mining and why does it matter?

A hard negative is a document that is superficially similar to the query but semantically different and not actually relevant. For example, in legal search, if the query is about "force majeure in commercial leases", a hard negative might be an article about "force majeure in residential leases": the same legal concept in a different domain. Easy negatives (cooking recipes vs. contract law) are so obviously different that the model learns little from them. Hard negatives force the model to develop fine-grained distinctions between similar-looking texts, which is exactly the capability needed for precise retrieval in a specialized domain.

Why is a two-stage retrieval pipeline better than using a cross-encoder directly?

Cross-encoders are significantly more accurate than bi-encoders because they can compare every word in the query against every word in the document during encoding, via direct attention interactions. The problem is that this requires re-encoding every query-document pair at query time, because you cannot pre-compute document representations. For a corpus of one million documents, running a cross-encoder on every query-document pair would take minutes to hours per query, depending on hardware. A bi-encoder pre-computes all document embeddings once, reducing query time to a single forward pass plus a fast vector search. The two-stage pipeline uses the bi-encoder's speed for initial retrieval of 100 to 500 candidates, then uses the cross-encoder's accuracy for final re-ranking of that small candidate set. You get most of the accuracy at a fraction of the cost.
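
The control flow of the two-stage pipeline can be sketched as follows. The first stage ranks pre-computed document vectors by dot product; the second stage calls a pairwise scoring function on the shortlist only. Note that `toy_pair_score` is a hypothetical word-overlap stand-in used so the example runs without a model; in a real system it would be a cross-encoder, and all texts and vectors here are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def toy_pair_score(query_text, doc_text):
    """Stand-in for a cross-encoder: Jaccard word overlap of the two texts."""
    q, d = set(query_text.lower().split()), set(doc_text.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def two_stage_search(query_text, query_vec, doc_vecs, doc_texts,
                     rerank_score, candidates=100, final_k=10):
    # Stage 1: cheap vector similarity over the whole corpus.
    shortlist = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]),
                       reverse=True)[:candidates]
    # Stage 2: expensive pairwise scoring over the shortlist only.
    reranked = sorted(shortlist, key=lambda d: rerank_score(query_text, doc_texts[d]),
                      reverse=True)
    return reranked[:final_k]

doc_texts = {
    "a": "cancel your subscription in account settings",
    "b": "subscription pricing tiers",
    "c": "office parking rules",
}
doc_vecs = {"a": [0.8, 0.2], "b": [0.9, 0.1], "c": [0.0, 1.0]}

results = two_stage_search("cancel subscription", [1.0, 0.0], doc_vecs, doc_texts,
                           toy_pair_score, candidates=2, final_k=2)
print(results)
```

In this toy run the first stage ranks "b" above "a" and drops "c", and the second stage promotes "a" to the top. The pairwise scorer runs only `candidates` times per query, regardless of corpus size.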

What does an MTEB score actually measure and should I rely on it for model selection?

MTEB is an average score across 56 datasets spanning tasks that include sentence similarity, retrieval, clustering, classification, and reranking. It is a useful proxy for general capability, but an average can hide important task-specific weaknesses. A model that scores 64 on MTEB might score 85 on retrieval tasks but 45 on clustering tasks, or it might perform well on English Wikipedia-style text but poorly on medical abbreviations. Always treat MTEB as a starting-point filter for model candidates, then evaluate the shortlisted models on a held-out sample of your own query-document pairs before making a final selection.

How does Matryoshka Representation Learning help in practice?

MRL trains the model such that the first N dimensions of the embedding are themselves a useful representation, not just a prefix of a larger one. This means you can truncate a 3072-dimensional embedding to 256 or 512 dimensions at inference time and still get a meaningful similarity score, without retraining the model. In practice this gives you a quality-cost dial: use full dimensions for the final re-ranking stage where accuracy matters most, and use truncated dimensions for the initial candidate retrieval stage where you are searching over millions of documents and storage cost is significant. OpenAI's text-embedding-3-large supports MRL natively via the dimensions parameter in the API call.
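
Client-side truncation of an MRL embedding amounts to keeping a prefix and re-normalizing it, as in the sketch below. The four-dimensional vector is a toy value, and the approach only gives meaningful results for models actually trained with MRL; truncating an ordinary embedding this way generally degrades quality much more.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the prefix to unit length."""
    prefix = list(vec[:dims])
    norm = math.sqrt(sum(v * v for v in prefix))
    return [v / norm for v in prefix] if norm > 0 else prefix

full = [0.6, 0.0, 0.8, 0.0]          # toy unit-length embedding
short = truncate_embedding(full, 2)  # prefix [0.6, 0.0], rescaled
print(short)
```

Re-normalizing matters because a prefix of a unit vector is generally shorter than unit length, so dot products on raw prefixes would no longer equal cosine similarity.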

References

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
Muennighoff, N., et al. (2022). MTEB: Massive Text Embedding Benchmark. arXiv:2210.07316.
Gao, L., et al. (2021). SimCSE: Simple Contrastive Learning of Sentence Embeddings. EMNLP 2021.
Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021 Datasets Track.
Sentence Transformers Documentation

Key Takeaways

Embedding quality directly determines what a retrieval system can find. A difference of even 10 NDCG points between two models translates to meaningfully more relevant results for end users, so invest time in model selection and evaluation before optimizing other parts of the retrieval pipeline.
Contrastive learning trains embeddings by pushing similar texts closer and dissimilar texts further apart in vector space. Hard negatives, superficially similar but semantically different texts, are the most valuable training signal for domain adaptation.
Always evaluate on your own task with real retrieval metrics (Precision@K, Recall@K, NDCG) using held-out query-document pairs. MTEB scores are useful starting filters but do not substitute for task-specific evaluation.
Use bi-encoders for initial candidate retrieval (pre-computable, fast at scale) and cross-encoders for final re-ranking (more accurate, not pre-computable). The two-stage pipeline is the production standard for combining both.
For production optimization, combine batch processing (10 to 50 times throughput improvement), embedding caching for frequently accessed documents, and int8 quantization (4x storage reduction relative to 32-bit floats) to make high-quality retrieval economically viable at scale.
Fine-tune only when evaluation shows the generic model falls short on your domain. Start with a strong pre-trained model, measure first, and invest in fine-tuning only when the gap is large enough to justify the effort.