How to Generate Better Embeddings for Vector Search

A practical guide to improving retrieval quality with chunking, cleaning, metadata, and embedding strategies

Posted by Perivitta on February 12, 2026 · 21 mins read


Introduction

Most people assume vector search quality depends on which vector database they choose. Pinecone, Weaviate, Qdrant, Milvus, Chroma, FAISS — the tools change, but the same retrieval problems still happen.

In real production systems, bad vector search is rarely caused by the database itself. The most common issue is that embeddings were generated from low-quality chunks or poorly structured text.

This is why two teams can use the same embedding model and still get completely different results. The difference usually comes down to preprocessing, chunking strategy, metadata, and evaluation.

In this post, I’ll go through the practical ways to generate better embeddings so your vector search results become more accurate, consistent, and useful in real-world RAG pipelines.


What Does “Better Embeddings” Actually Mean?

When people say “better embeddings”, they usually mean retrieval that works better in real usage, not just in a demo. In practice, that looks like:

  • Retrieval returns the correct document more often.
  • Retrieval results are more consistent across similar queries.
  • Top results contain fewer irrelevant matches.
  • The system performs better with vague natural language questions.
  • Similarity ranking becomes more predictable and logical.

A better embedding pipeline is not about producing “nicer vectors”. It is about improving what your system retrieves.


Common Symptoms of Bad Embeddings

Before improving embeddings, it helps to recognize the usual failure patterns.

  • Retrieval returns documents that share keywords but are semantically unrelated.
  • Minor wording changes cause completely different retrieval results.
  • Short queries like “pricing” return random chunks.
  • Retrieval works in testing but fails with real user questions.
  • The system retrieves chunks that are incomplete or lack context.

If you are seeing these issues, your embedding model might not be the problem. The text going into the embedding model is often the real cause.


Step 1: Clean the Text Before Embedding

If you embed messy input, you get messy vectors. Many teams embed raw scraped HTML, PDF text, or markdown without cleaning, which introduces a lot of noise.

A solid preprocessing pipeline should remove:

  • Navigation menus.
  • Repeated headers and footers.
  • Cookie banners and privacy popups.
  • Ads and irrelevant sidebar content.
  • Page numbers, watermark text, and PDF artifacts.
  • Extra whitespace and broken formatting.

A simple rule works well: if a human would ignore the text, do not embed it.
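
As a minimal sketch of that rule, here is one way to strip the obvious noise from scraped HTML before chunking. It assumes BeautifulSoup is available, and the list of tags to drop is an assumption about typical page structure, not a complete filter:

import re
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop elements a human reader would skip (the tag list is an assumption; extend it per site).
    for tag in soup(["nav", "header", "footer", "aside", "form", "script", "style"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    # Collapse extraction artifacts: repeated spaces and runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

PDF extraction needs its own pass (page numbers, watermarks, broken hyphenation), but the principle is the same: remove the noise before it ever reaches the embedding model.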


Step 2: Chunking Matters More Than Model Choice

Chunking is one of the biggest factors in retrieval quality. Even the best embedding model cannot fix poor chunking.

Chunking defines what information is compressed into a single vector. If chunks are too large, the vector becomes vague and matches too many unrelated queries. If chunks are too small, they lose the surrounding context needed to make sense on their own.

Chunking mistakes that destroy retrieval quality

  • Splitting text every N tokens without respecting structure.
  • Cutting chunks in the middle of sentences.
  • Embedding entire documents as one chunk.
  • Embedding only headings with no content.
  • Embedding bullet lists without the paragraph explaining them.

A good chunk should feel like something a human could read and understand by itself.


Chunk Size: What Works in Practice?

There is no universal chunk size, but most production RAG systems fall into predictable ranges.

Chunk Size      | Best For                    | Typical Result
150–300 tokens  | FAQs, short Q&A datasets    | High precision, weaker context
300–700 tokens  | General RAG pipelines       | Strong balance of context and relevance
700–1200 tokens | Technical docs and manuals  | More context, but risk of being too broad

For most chatbot and documentation retrieval systems, 400–800 tokens is a strong baseline.

A good way to validate chunk size is to ask:

  • Can this chunk answer a question on its own?
  • Does it contain enough context to make sense?
  • Is it too broad to match specific queries?

Step 3: Use Overlap, But Do Not Overdo It

Chunk overlap prevents context loss at chunk boundaries. Without overlap, a sentence or idea that spans a boundary gets split across two chunks, and neither chunk carries the complete thought.

Common overlap values:

  • 10% overlap for general text.
  • 15–20% overlap for technical documentation.
  • 30% overlap only when chunks are extremely small.

Overlap improves recall, but too much overlap causes duplicated results and wastes vector storage.

A practical baseline is 100–200 tokens of overlap.
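
Here is a minimal sketch of token-window chunking with overlap, assuming the text has already been tokenized into a list of tokens (the 500/150 defaults reflect the baseline above, not fixed values):

def chunk_with_overlap(tokens: list[str], chunk_size: int = 500, overlap: int = 150) -> list[list[str]]:
    # Slide a window over the tokens; each step advances by chunk_size - overlap,
    # so consecutive chunks share `overlap` tokens at their boundary.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks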


Step 4: Chunk by Meaning, Not by Token Count

One of the best improvements you can make is semantic chunking. Instead of splitting every 500 tokens, split by document structure.

  • Headings and subheadings.
  • Paragraph boundaries.
  • Code blocks and explanation blocks.
  • Lists and bullet groups.
  • Table sections.

This produces chunks that represent real concepts, not random fragments.
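
A simple version of semantic chunking, assuming the source is markdown-style text with # headings (the heading pattern is the assumption to adapt for your format):

import re

def chunk_by_headings(markdown_text: str) -> list[str]:
    # Split at heading boundaries so each chunk covers one section or subsection.
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown_text)
    return [part.strip() for part in parts if part.strip()]

Sections that come out larger than your target chunk size can then be split further at paragraph boundaries instead of at an arbitrary token count.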


Step 5: Add Context Headers Inside Each Chunk

This is one of the most underrated tricks in embedding pipelines.

When you chunk a document, each chunk loses context such as the page title and section name. That context matters for embeddings.

Instead of embedding the chunk alone, prepend a short header.

Example

Instead of embedding:

Rate limits apply to all API calls...

Embed:

Document: API Documentation
Section: Rate Limiting

Rate limits apply to all API calls...

This simple change often improves retrieval quality dramatically.
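
In code it is a tiny transformation applied just before embedding; the label names are simply the convention used in the example above:

def with_context_header(chunk_text: str, doc_title: str, section: str) -> str:
    # Prepend document and section context so the vector carries it too.
    return f"Document: {doc_title}\nSection: {section}\n\n{chunk_text}"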


Step 6: Handle Tables and Code Blocks Correctly

Tables and code blocks are common sources of embedding failure.

If you embed raw tables from HTML or PDF extraction, the content often becomes unreadable and confusing.

For example, this is what bad extracted table text looks like:

Plan Price Limit Basic 10 1000 Pro 25 10000

Instead, represent it in a structured format or natural language.

Plan       | Price  | Request Limit
Basic      | $10    | 1000 requests
Pro        | $25    | 10,000 requests
Enterprise | Custom | Unlimited

For code blocks, embedding works well only if the chunk includes explanation text. Raw code by itself is harder to retrieve meaningfully.

  • Keep code blocks together with the paragraph explaining them.
  • Avoid embedding minified or generated code.
  • Do not separate code examples from their headings.
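
A small sketch of the table fix, assuming the table has already been parsed into a header row and data rows; turning each row into a labeled sentence keeps the column-to-value mapping inside the embedded text:

def table_to_sentences(header: list[str], rows: list[list[str]]) -> str:
    # Each row becomes one sentence pairing column names with values.
    lines = []
    for row in rows:
        pairs = [f"{col}: {val}" for col, val in zip(header, row)]
        lines.append("; ".join(pairs) + ".")
    return "\n".join(lines)

# table_to_sentences(["Plan", "Price", "Request Limit"], [["Basic", "$10", "1000 requests"]])
# -> "Plan: Basic; Price: $10; Request Limit: 1000 requests."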

Step 7: Choose the Right Embedding Model for Your Domain

Once chunking and preprocessing are done properly, embedding model choice becomes more important.

When evaluating models, look for:

  • Strong performance on your content domain.
  • Stable similarity behavior across query variations.
  • Reasonable vector dimensions for your scale.
  • Cost efficiency for large ingestion workloads.
  • Low latency for real-time query embedding.

Benchmarks are useful, but the real benchmark is your own dataset and your own queries.


Step 8: Normalize Queries and Documents

A common mistake is embedding documents and user queries in completely different formats.

Documents often contain structured text, but user queries are short and vague. This gap can hurt retrieval.

You can improve results by lightly normalizing queries:

  • Expanding abbreviations like “db” to “database”.
  • Removing noisy punctuation and repeated whitespace.
  • Converting vague follow-ups into complete questions.

This becomes even more important in chatbots, where users ask follow-up questions like “what about pricing?” without context.
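
A minimal normalization pass might look like this; the abbreviation map is a domain-specific assumption you would build from your own query logs:

import re

ABBREVIATIONS = {"db": "database", "docs": "documentation", "config": "configuration"}

def normalize_query(query: str) -> str:
    # Lowercase, expand known abbreviations, and collapse noisy punctuation and whitespace.
    tokens = re.findall(r"\w+", query.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)

Converting vague follow-ups into complete questions needs conversation context, which is where query rewriting (the next step) comes in.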


Step 9: Use Query Rewriting for Better Recall

Query rewriting is one of the strongest production techniques for improving retrieval.

Instead of embedding the raw query, you run a lightweight LLM prompt that rewrites the query into a more descriptive form.

Example

User query: "how do i do this?"
Rewritten query: "How do I configure vector database indexing for better similarity search performance?"

Now the embedding represents meaning instead of vague phrasing, which improves retrieval consistency.
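
A sketch of the rewriting step; call_llm is a placeholder for whatever LLM client you use, taking a prompt string and returning the model's text:

REWRITE_PROMPT = """Rewrite the user's query as a single, fully self-contained question.
Use the conversation context to resolve vague references.

Context: {context}
Query: {query}

Rewritten query:"""

def rewrite_query(query: str, context: str, call_llm) -> str:
    # call_llm is a placeholder: prompt string in, model text out.
    prompt = REWRITE_PROMPT.format(context=context, query=query)
    return call_llm(prompt).strip()

The rewritten query is what you embed; the original query can still be logged for evaluation.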


Step 10: Use Metadata Filters to Reduce Noise

Better embeddings are not only about vector similarity. Metadata is equally important in real systems.

Store metadata such as:

  • Document type (blog, docs, FAQ).
  • Category (pricing, troubleshooting, setup).
  • Language.
  • Source URL or source system.
  • Timestamp or version.
  • Product name or project name.

Then filter retrieval based on context. This reduces irrelevant matches and makes similarity search more stable.
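
The exact filter syntax depends on your vector database, but the shape of the stored record is roughly the same everywhere. A sketch, with field names and values that are illustrative assumptions rather than any specific database's schema:

chunk_record = {
    "id": "docs-rate-limiting-003",
    "text": "Document: API Documentation\nSection: Rate Limiting\n\nRate limits apply to all API calls...",
    "metadata": {
        "doc_type": "docs",
        "category": "troubleshooting",
        "language": "en",
        "source_url": "https://example.com/docs/rate-limits",
        "version": "2025-11",
    },
}

def matches_filter(record: dict, doc_type: str, language: str) -> bool:
    # Real vector databases apply this filter server-side; this is the in-memory equivalent.
    md = record["metadata"]
    return md["doc_type"] == doc_type and md["language"] == language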


Step 11: Use Hybrid Search (Dense + Keyword)

Dense embeddings are strong for semantic similarity, but they often fail for:

  • Exact model names.
  • API endpoints and file paths.
  • Error codes.
  • Version numbers.

That is why many production systems combine dense vector search with keyword search (BM25). This approach is usually called hybrid search.

If you work with technical documentation, hybrid search almost always improves retrieval quality.
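
One common way to combine the two signals is a weighted sum of normalized scores. A sketch, where both inputs map chunk IDs to scores and the weight alpha is a tunable assumption:

def hybrid_scores(dense: dict[str, float], keyword: dict[str, float], alpha: float = 0.7) -> dict[str, float]:
    # Min-max normalize each score set, then blend: alpha * dense + (1 - alpha) * keyword.
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    d, k = normalize(dense), normalize(keyword)
    ids = set(d) | set(k)
    return {i: alpha * d.get(i, 0.0) + (1 - alpha) * k.get(i, 0.0) for i in ids}

Many vector databases also offer hybrid search natively (often using rank fusion rather than score blending), in which case you configure it instead of implementing it.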


Step 12: Evaluate Embedding Quality Properly

One of the biggest mistakes teams make is relying on random manual testing. Embedding quality should be evaluated systematically.

A good evaluation dataset contains:

  • Real user queries from logs.
  • Expected relevant chunks for each query.
  • Expected irrelevant chunks for comparison.

Then you can measure:

  • Recall@K.
  • Precision@K.
  • MRR (Mean Reciprocal Rank).
  • nDCG (Normalized Discounted Cumulative Gain).

This is how you make improvements based on data instead of guessing.
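
Recall@K and MRR are simple enough to compute directly. A per-query sketch, assuming each query comes with a set of IDs for its known-relevant chunks:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant chunks that appear in the top-k results.
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant chunk; 0 if none is retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

Average these over your full query set and re-run them after every chunking or preprocessing change.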


Step 13: Remove Duplicate and Near-Duplicate Chunks

Duplicate chunks cause your vector search to return the same content repeatedly. This wastes your context window and reduces retrieval diversity.

You should deduplicate chunks before inserting them into your vector database.

  • Use hashing to remove exact duplicates.
  • Use similarity matching to remove near duplicates.
  • Remove boilerplate repeated across many pages.
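
Exact deduplication is cheap to do at ingestion time. A sketch using content hashing (near-duplicate removal would add an embedding-similarity threshold on top of this):

import hashlib

def dedup_exact(chunks: list[str]) -> list[str]:
    # Drop chunks whose normalized text has already been seen.
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique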

Step 14: Add a Reranker for Better Precision

Even with good embeddings, similarity search is not perfect. Many teams improve final retrieval quality by using reranking.

A common pipeline looks like this:

  1. Vector search retrieves top 20 candidate chunks.
  2. A reranker model scores candidates against the query.
  3. The system selects the top 5 most relevant chunks.

This improves precision and reduces irrelevant content passed into the LLM.

If your system is used in production, reranking is often worth the extra latency.
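
The retrieve-then-rerank pipeline fits in a few lines. In this sketch, vector_search and rerank_score are placeholders for your vector database client and your cross-encoder (or reranking API) respectively:

def retrieve_and_rerank(query: str, vector_search, rerank_score, k_candidates: int = 20, k_final: int = 5) -> list[str]:
    # vector_search(query, k) -> candidate chunks; rerank_score(query, chunk) -> relevance score.
    candidates = vector_search(query, k_candidates)
    reranked = sorted(candidates, key=lambda chunk: rerank_score(query, chunk), reverse=True)
    return reranked[:k_final]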


Step 15: Cache Embeddings to Reduce Cost

Embedding calls can become expensive at scale, especially when you are embedding queries in real time.

Caching helps reduce cost and improves performance:

  • Cache query embeddings for repeated user queries.
  • Cache document embeddings for stable content.
  • Avoid re-embedding documents that have not changed.

Caching does not improve retrieval accuracy directly, but it allows you to experiment and scale without unnecessary cost.
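
A minimal in-memory cache keyed on a hash of the exact text; embed_fn is a placeholder for your embedding client, and a production setup would typically back this with Redis or a database instead of a dict:

import hashlib

_embedding_cache: dict[str, list[float]] = {}

def embed_with_cache(text: str, embed_fn) -> list[float]:
    # Keying on the text hash means unchanged content is never re-embedded.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]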


A Recommended Embedding Pipeline (Good Baseline Setup)

If you want a strong production baseline, this is a reliable pipeline:

  1. Clean input text and remove boilerplate noise.
  2. Chunk by headings and paragraph boundaries.
  3. Apply overlap (100–200 tokens).
  4. Prepend each chunk with document title and section name.
  5. Generate embeddings using a reliable embedding model.
  6. Store metadata fields for filtering and ranking.
  7. Deduplicate chunks before insertion.
  8. Evaluate retrieval using Recall@K and MRR.

Most RAG systems become significantly better just by following these steps consistently.


Common Mistakes That Kill Retrieval Quality

  • Embedding raw HTML and PDF artifacts.
  • Using random chunk splitting without semantic structure.
  • Storing chunks without document or section context.
  • Keeping duplicated chunks in your vector database.
  • Retrieving too many chunks per query, drowning the relevant ones in noise.
  • Skipping evaluation and relying on manual testing only.
  • Ignoring metadata filters.

Conclusion

Most embedding problems are not caused by the embedding model itself. They are caused by bad chunking, messy preprocessing, and lack of evaluation.

If you want your vector search system to work reliably, focus on making chunks clean, meaningful, and context-rich. Once your embedding pipeline is strong, your retrieval becomes dramatically more accurate even without changing your vector database.

In production, the teams that win are not the ones with the fanciest models. They are the ones who treat embeddings as an engineering pipeline instead of a single API call.


Key Takeaways

  • Chunking and preprocessing have more impact than model choice.
  • Chunk by meaning, not by token count.
  • Prepend chunks with context headers for better retrieval.
  • Use overlap carefully to preserve context at chunk boundaries.
  • Metadata filtering improves retrieval accuracy significantly.
  • Hybrid search is useful for technical content.
  • Reranking improves precision when similarity search is not enough.
  • Always evaluate embeddings with real queries and metrics.
