RAG Chunking Strategies: How to Split Documents for Better Retrieval
How you split documents quietly decides your RAG quality. The strategies, the tradeoffs, and how to choose.
Chunking is how you split documents into the small passages you embed and store for retrieval, and it is the single most underrated lever on RAG quality. Chunk too large and relevance gets diluted; chunk too small and you lose the context an answer needs. This guide covers the main strategies — fixed-size, recursive, sentence and semantic, structure-aware (Markdown and code), and parent-document retrieval with overlap — plus how chunking interacts with your embedding model’s context window, and a decision table to pick a starting point.
This page is a deeper dive that sits under the Self-Hosted RAG complete guide. If you have not built a pipeline yet, read that first; this page assumes you know what embeddings and retrieval are.
Why chunking matters more than model choice
Most teams spend their tuning budget swapping embedding models and LLMs. In practice, bad chunking breaks more RAG systems than model choice does. The reason is mechanical: you retrieve chunks, not documents. Whatever you put inside a chunk boundary is the unit the embedding model has to represent as a single vector, and the unit the retriever can return whole or not at all.
Two failure modes dominate:
- Dilution. A 3,000-token chunk that mentions your target fact in one sentence produces a vector that averages across everything else in the chunk. The relevant signal gets washed out, and the chunk ranks lower than it should for the query that needs it.
- Fragmentation. A 100-token chunk that cuts a procedure in half retrieves cleanly but answers incompletely. The model never sees step 4 because it lived in the next chunk that did not get retrieved.
Good chunking keeps each chunk coherent (one idea, one topic, readable on its own) and right-sized for both the embedding model and the answer. Everything below is in service of those two properties.
Fixed-size chunking
The simplest strategy: split text every N characters or tokens, optionally with overlap. Cut at 800 tokens, slide a window, repeat.
It is fast, predictable, and trivially parallel, which is why it is the default in many tutorials. The problem is that it ignores meaning entirely — it will happily slice through the middle of a sentence, a table row, or a code block. For clean prose with no strong structure it is a reasonable baseline. For anything with headings, lists, or code, you can do better for almost no extra effort.
Use fixed-size as your floor, not your finish line.
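As a rough illustration of the sliding window described above, here is a minimal sketch. It assumes whitespace-separated words are an acceptable stand-in for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer), and the function name and defaults are hypothetical.

```python
def fixed_size_chunks(text: str, size: int = 800, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` whitespace tokens, sliding by size - overlap."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")
    tokens = text.split()  # assumption: words approximate tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Note that the window boundaries fall wherever the count lands, which is exactly the weakness the section describes: nothing here knows about sentences, tables, or code.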
Recursive character/token splitting
This is the sensible default for most text in 2026, and what RecursiveCharacterTextSplitter (LangChain) and node parsers in LlamaIndex do out of the box. You give it an ordered list of separators — typically paragraph breaks, then line breaks, then sentences, then spaces — and a target chunk size. It tries to split on the largest separator first and only falls back to smaller ones when a piece is still over the limit.
The effect is that it respects natural boundaries when it can and only cuts mid-sentence as a last resort. You get chunks that mostly start and end at paragraph or sentence edges while staying under your size budget. For a heterogeneous pile of docs where you do not want to write custom parsers, recursive splitting with a sensible size and overlap is hard to beat as a starting point.
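To make the fallback behavior concrete, here is a simplified sketch of the idea. It is not LangChain's or LlamaIndex's implementation: it measures size in characters, omits overlap, and uses a hypothetical function name and separator list.

```python
def recursive_split(text: str, max_chars: int = 1000,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= max_chars:
        return [text.strip()] if text.strip() else []
    if not separators:
        # Last resort: hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Keep each separator attached to the piece before it so no text is lost.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= max_chars:
            current += piece  # keep merging small pieces up to the budget
            continue
        if current.strip():
            chunks.append(current.strip())
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too big: retry it with the finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

With a 40-character budget, a short paragraph stays whole while a longer one is split at its sentence boundary rather than mid-word.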
Sentence and semantic chunking
Two more meaning-aware approaches sit above recursive splitting.
Sentence-based splitting groups whole sentences up to a size budget, never cutting one in half. It is cheap, and it removes the most jarring boundary artifacts.
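A minimal sketch of grouping whole sentences under a budget might look like the following. The regex sentence splitter is a naive assumption (it will stumble on abbreviations such as "e.g."), and the budget is counted in words rather than model tokens.

```python
import re

def sentence_chunks(text: str, max_words: int = 200) -> list[str]:
    """Pack whole sentences into chunks without exceeding max_words where possible."""
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than the budget still becomes its own chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```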
Semantic chunking goes further: it embeds sentences (or small spans), then places a boundary wherever the embedding similarity between adjacent spans drops below a threshold — i.e., wherever the topic shifts. The result is chunks that track topic changes rather than arbitrary length. It produces noticeably more coherent chunks on mixed-topic documents.
The catch is cost and complexity. Semantic chunking runs the embedding model during ingestion just to decide boundaries, which adds time and moving parts. It shines on long, rambling documents (meeting transcripts, research papers, sprawling wikis) where topic drift inside a fixed window is the real problem. For tidy, well-structured docs the payoff is smaller — structure-aware chunking often gets you there more cheaply.
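The boundary rule can be sketched in a few lines. To keep the example runnable without a model, `toy_embed` below is a bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an arbitrary illustrative value; both are assumptions you would replace and tune.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2,
                    embed=toy_embed) -> list[str]:
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks: list[str] = []
    current = [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Swapping `toy_embed` for a call into your actual embedding model is where the ingestion cost mentioned above comes from: every sentence is embedded once just to place boundaries.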
Structure-aware chunking (Markdown, HTML, code)
If your documents have explicit structure, use it. This is frequently the highest-leverage move available.
Markdown and HTML. Split on the heading hierarchy. A Markdown-aware splitter keeps each section under its heading together and can attach the heading path (# Setup > ## Database > ### Migrations) as metadata on every chunk. That metadata is gold: it gives the model context about where a chunk lives, and it gives you a filterable field. Splitting an FAQ on its ## boundaries so each question-and-answer is one chunk is often the entire optimization.
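As an illustration of heading-based splitting with a heading path attached, here is a small sketch. It handles only ATX headings (`#` to `######`) and, as a known simplification, does not skip `#` lines inside fenced code blocks; the function name and the dictionary keys are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def markdown_sections(markdown: str) -> list[dict]:
    """Return one chunk per section, each tagged with its heading path."""
    chunks: list[dict] = []
    stack: list[tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(title for _, title in stack)
            chunks.append({"heading_path": path, "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own path
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # leave sibling and deeper headings
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

The `heading_path` value can be stored as metadata next to the vector and optionally prepended to the chunk text before embedding.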
Code. Naive character splitting destroys code — it severs functions and orphans braces. Use a syntax-aware splitter (LangChain and LlamaIndex ship language-aware parsers; tree-sitter underpins the better ones) that splits on function and class boundaries so each chunk is a complete, parseable unit. Keep the file path and symbol name in metadata.
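For Python source specifically, the standard library's `ast` module is enough to sketch the idea. This is a swapped-in, single-language stand-in for the tree-sitter based splitters mentioned above, not how those libraries work internally. It keeps only top-level functions and classes, and decorators and other module-level statements are left out of the chunks.

```python
import ast

def python_symbol_chunks(source: str, file_path: str) -> list[dict]:
    """One chunk per top-level function or class, with file path and symbol metadata."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file_path": file_path,   # hypothetical metadata field names
                "symbol": node.name,
                # Exact source text of the definition, from `def`/`class` to its last line.
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Because every chunk is a complete definition, each one can be parsed on its own, which is the property naive character splitting destroys.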
Tables and PDFs. Tables and multi-column PDF layouts are the perennial pain. Keep a table as a single chunk where you can, and consider a layout-aware extractor (tools like Docling or Unstructured) before chunking rather than after — once a PDF is flattened to broken text, no splitter can recover the rows.
Structure-aware chunking pairs naturally with the structured metadata you store alongside vectors, which you can then filter on at query time.
Overlap and parent-document retrieval
Two techniques fight the fragmentation problem directly.
Overlap repeats a slice of the previous chunk at the start of the next one — commonly 10–15% of chunk size. A sentence that straddles a boundary then appears in both chunks, so retrieving either one preserves the thought. It is cheap insurance against context loss at seams. The cost is mild redundancy in your index and occasional duplicate text in retrieved context; keep overlap modest rather than cranking it to 50%.
Parent-document (small-to-big) retrieval decouples the unit you search from the unit you return. You embed small, precise child chunks (a sentence or short paragraph) so retrieval is sharp, but you store a pointer from each child to its larger parent (the full section or page). At query time you match on the small child for precision, then hand the LLM the big parent for context. This is one of the most effective patterns in production RAG: it sidesteps the size tradeoff by refusing to make it. LlamaIndex calls this auto-merging / hierarchical retrieval; LangChain ships a ParentDocumentRetriever. Both implement the same idea.
If you only adopt one advanced technique from this guide, make it parent-document retrieval.
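The small-to-big pattern can be sketched without any framework. In the toy version below, children are single sentences, the "search" is a word-overlap score standing in for vector similarity, and all names are hypothetical; it is not the LlamaIndex or LangChain implementation.

```python
import re

def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(parents: dict[str, str]) -> list[tuple[str, str]]:
    """Split each parent into sentence-sized children that point back to their parent id."""
    index = []
    for parent_id, text in parents.items():
        for child in re.split(r"(?<=[.!?])\s+", text.strip()):
            if child:
                index.append((parent_id, child))
    return index

def retrieve_parents(query: str, index: list[tuple[str, str]],
                     parents: dict[str, str], k: int = 2) -> list[str]:
    """Match on small children, then return their de-duplicated parents."""
    q = _words(query)
    # Toy relevance score: shared words (a real system would compare embeddings).
    scored = sorted(((len(q & _words(child)), pid) for pid, child in index),
                    key=lambda item: item[0], reverse=True)
    seen: set[str] = set()
    results: list[str] = []
    for score, parent_id in scored[:k]:
        if score > 0 and parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results
```

The de-duplication step matters in practice: several children of the same parent often match, and the LLM should receive that parent only once.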
Chunk size and how it meets the embedding context window
Chunk size is a tradeoff, not a setting with a correct value. Smaller chunks give sharper retrieval and waste less of the LLM’s context window, but risk fragmentation. Larger chunks preserve context but dilute relevance and cost more tokens per retrieved chunk.
Two hard constraints bound your choice:
- The embedding model’s context window. Every embedding model has a maximum input length. Many strong local models in the bge and e5 families handle roughly 512 tokens of input; longer-context models such as nomic-embed-text stretch to about 8k. Chunks longer than the model’s window are typically truncated without warning, so the tail of an oversized chunk never gets embedded at all. Know your model’s limit before you set chunk size. See best local embedding models for the per-model numbers.
- The generation LLM’s context budget. You retrieve k chunks and paste them into the prompt. k × chunk_size plus the question and instructions must fit comfortably inside the LLM’s window with room for the answer. Big chunks and a large k exhaust this budget quickly.
A practical default that respects both: 256–512 tokens per chunk with ~10–15% overlap, retrieving k = 4–8. Tune from there against a real evaluation set — never by eyeballing the demo.
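The two constraints reduce to simple arithmetic, sketched below. The allowances for the prompt (300 tokens) and the answer (500 tokens), and the 512-token embedding window, are illustrative assumptions; substitute the numbers for your own models.

```python
def context_budget(chunk_tokens: int, k: int, llm_window: int,
                   embed_window: int = 512,
                   prompt_tokens: int = 300,
                   answer_tokens: int = 500) -> dict:
    """Check a chunk size and k against the embedding window and the LLM window."""
    retrieved = k * chunk_tokens  # tokens pasted into the prompt
    total = retrieved + prompt_tokens + answer_tokens
    return {
        "fits_embedding_window": chunk_tokens <= embed_window,
        "prompt_total_tokens": total,
        "fits_llm_window": total <= llm_window,
    }
```

For example, 512-token chunks with k = 8 need 8 × 512 = 4,096 tokens of retrieved text, or 4,896 with these allowances, which fits an 8,192-token window. Doubling the chunk size to 1,024 pushes the total to 8,992 and no longer fits.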
Decision table
A starting point by document type. These are defaults to measure against, not laws.
| Document type | Strategy | Chunk size (approx) | Overlap | Notes |
|---|---|---|---|---|
| Clean prose (articles, books) | Recursive | 400–600 tokens | 10–15% | Sentence-aware boundaries; safe default. |
| Long / rambling (transcripts, papers) | Semantic or parent-document | 200–400 tokens (child chunks) | low | Topic drift is the enemy; small-to-big helps most. |
| Markdown / docs sites | Structure-aware (headings) | per section | none–low | Split on ##; attach heading path as metadata. |
| FAQ / Q&A | Structure-aware | one Q&A per chunk | none | Each answer is self-contained already. |
| Source code | Syntax-aware | per function/class | none | Keep symbols whole; store file path + symbol. |
| Tables / spreadsheets | Keep table whole | one table/row group | none | Layout-aware extraction before chunking. |
| Mixed corpus (no time to tune) | Recursive | 512 tokens | 12% | The reliable baseline for everything. |
How to actually choose
Do not pick a strategy from a blog post and ship it. Pick a sensible default from the table, then measure:
- Build a small evaluation set — 20–50 real questions, each tagged with the document that should answer it. The how to evaluate RAG guide walks through this end to end.
- Run the set and record retrieval recall — how often the right source lands in the top k.
- Change one thing (chunk size, overlap, or strategy), re-run, and keep the change only if recall went up.
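The measurement loop above can be sketched as a recall@k function. It assumes your pipeline exposes a `retrieve(question, k)` callable that returns a ranked list of source document ids; that callable and the evaluation-set shape are hypothetical.

```python
from typing import Callable

def recall_at_k(eval_set: list[tuple[str, str]],
                retrieve: Callable[[str, int], list[str]],
                k: int = 5) -> float:
    """Fraction of questions whose expected document appears in the top k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for question, expected_doc in eval_set:
        if expected_doc in retrieve(question, k)[:k]:
            hits += 1
    return hits / len(eval_set)
```

Run it once per chunking configuration against the same evaluation set, and compare the numbers rather than individual answers.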
Chunking is iterative by nature. The teams with good RAG are not the ones who guessed the perfect chunk size; they are the ones who measured three options and kept the winner. Once chunking is solid, the next lever is usually hybrid search and reranking.
FAQ
What is the best chunk size for RAG? There is no universal best. A solid default is 256–512 tokens with ~10–15% overlap, but the right value depends on your documents and your embedding model’s context window. Measure two or three sizes against an evaluation set and keep the one with the highest retrieval recall.
Should chunks overlap, and by how much? Usually yes. Around 10–15% of chunk size is a good default — enough to preserve thoughts that straddle a boundary without bloating your index. Heavy overlap (40%+) mostly adds redundancy. Parent-document retrieval is a cleaner alternative to large overlap.
What is parent-document (small-to-big) retrieval? You embed small child chunks for sharp retrieval but return their larger parent section to the LLM for context. It gives you precision and context at once, sidestepping the usual chunk-size tradeoff. LlamaIndex (auto-merging) and LangChain (ParentDocumentRetriever) both implement it.
Does chunk size need to fit the embedding model’s window? Yes. Any text past the model’s maximum input length is silently truncated and never embedded. Many local embedding models have a roughly 512-token effective window, so keep chunks under that limit unless you are using a long-context model.
Is semantic chunking worth the extra cost? On long, topically drifting documents, often yes — it produces more coherent chunks than fixed windows. On tidy, well-structured docs, structure-aware splitting (on headings) usually gets you the same benefit more cheaply. Start with recursive or structure-aware and reach for semantic only if evaluation says you need it.
Aquila is the independent guide to private, self-hosted AI search — built on the belief that you should own your index, not rent it. Chunking is where a lot of RAG quality is won or lost, so it is worth getting right on your own infrastructure. Explore more guides or subscribe to the newsletter for honest, vendor-neutral writeups. Own your search.