RAG Reranking: How a Two-Stage Retrieve-Then-Rerank Pipeline Beats Raw Top-K

A reranker is a second-stage model that re-scores the chunks your vector search returned, so the most relevant ones rise to the top of the prompt. The pattern is retrieve-then-rerank: cast a wide net for candidates with fast vector search (say the top 50), then have a slower, more accurate model re-order them and keep the best 5. This two-stage pipeline reliably beats raw vector top-k because the model that scores relevance at the end is far more precise than the one that scored it during the first sweep. This guide explains why that works, the difference between bi-encoders and cross-encoders, which rerankers you can self-host, the latency you pay, and exactly how to bolt one onto an existing pipeline.

This page is a deeper dive that sits under the Self-Hosted RAG complete guide. If you have not built a pipeline yet, read that first; this page assumes you already retrieve chunks and stuff them into a prompt.

Why raw vector top-k leaves quality on the table

Your vector store finds the k nearest chunks by comparing a single embedding of the query against a single embedding of each chunk. That comparison is fast — it is a dot product over pre-computed vectors — but it is also lossy. The entire meaning of a chunk has been crushed into one fixed-length vector, and so has the query. Subtle relevance signals (does this passage actually answer the question, or just mention the same nouns?) get blurred away in that compression.
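A minimal sketch of that first-stage comparison, in plain Python with toy 3-dimensional vectors. The chunk ids, the vector values, and the `top_k` helper are made up for illustration; a real vector store would typically use an approximate nearest-neighbour index rather than the full scan shown here.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Brute-force top-k: score every stored chunk vector against the query vector."""
    scored = [(chunk_id, dot(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical chunk embeddings, computed once at index time and stored.
index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # hypothetical query embedding

results = top_k(query_vec, index, k=2)
print(results)
```

Note that the only thing the scoring step ever sees is two vectors. Whatever nuance did not survive the embedding is invisible to it, which is the gap the second stage is meant to close.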

The result is a recall/precision gap. Vector search is good at recall — the right chunk is usually somewhere in the top 20 or 50. It is mediocre at precision — the right chunk is often not in the top 3, which is all you can afford to put in the prompt. Reranking exploits exactly this: cast a wide net to get recall from cheap vector search, then spend a little more compute to fix the ordering and get precision.

A concrete way to see it: if your relevant chunk lands at rank 11 but you only pass the top 5 to the LLM, your generation step never had a chance — the context simply was not there. A reranker that promotes that chunk from rank 11 to rank 2 turns a wrong answer into a right one without touching your embedding model, your chunking, or your LLM. That is why reranking is usually the highest return-on-effort change you can make to a working-but-mediocre RAG system.

Bi-encoders vs cross-encoders: the core idea

The whole technique comes down to when the query and the document meet.

A bi-encoder (your embedding model) encodes the query and each document separately and in advance. The document vectors are computed once at index time and stored. At query time you only encode the query, then do cheap vector math against millions of pre-stored vectors. This is what makes vector search scale — but query and document never actually “see” each other, so fine-grained interactions are lost.

A cross-encoder (the classic reranker) feeds the query and one candidate document together into a transformer and outputs a single relevance score. Because the model attends across both texts at once, it captures interactions a bi-encoder cannot — negation, qualifiers, “this passage is about X but in the wrong context.” The catch: you cannot pre-compute anything. Every (query, document) pair is a fresh forward pass, so you can only afford to run it on a small candidate set, not your whole corpus. Hence the two stages.

Late-interaction models (ColBERT and its descendants) sit between the two. They store a vector per token rather than one per document, and compute a fine-grained token-level similarity at query time. This recovers much of the cross-encoder’s accuracy while staying closer to bi-encoder speed — at the cost of a larger index, because you are storing many vectors per chunk.
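The token-level similarity can be sketched with the MaxSim operator that ColBERT-style models use: each query token takes its best match among the document tokens, and those maxima are summed. This is a toy version in plain Python with made-up 2-dimensional token vectors; real implementations work on normalized embeddings with batched tensor operations, and the function name here is an assumption for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token vector, take the highest
    similarity to any document token vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical per-token embeddings (one vector per token, not one per document).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_on_topic = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]   # covers both query tokens
doc_off_topic = [[0.9, 0.1], [0.8, 0.2]]              # covers only the first one

score_on = maxsim_score(query_tokens, doc_on_topic)
score_off = maxsim_score(query_tokens, doc_off_topic)
print(score_on, score_off)
```

The document that matches both query tokens scores higher than the one that matches only the first. The index cost is also visible here: every document carries a list of vectors instead of a single one.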

Approach	When query meets doc	Speed	Accuracy	Role in pipeline
Bi-encoder (embeddings)	Never (separate)	Fastest	Good recall	First-stage retrieval over the whole corpus
Late interaction (ColBERT)	Per-token, at query time	Fast	High	First-stage or reranker; bigger index
Cross-encoder (reranker)	Jointly, full attention	Slow	Highest	Second-stage rerank of a small candidate set

Self-hostable rerankers worth knowing

You do not need a managed reranking API to do this. Several strong rerankers ship with permissive licenses and run on the same box as the rest of your private stack — which matters, because shipping every candidate chunk to a third-party reranker would undo the privacy rationale for self-hosting in the first place.

Model	Type	License	Notes
bge-reranker-v2-m3	Cross-encoder	Apache 2.0	Built on BAAI’s bge-m3; lightweight, strongly multilingual, fast inference, easy to deploy. The default starting point for most self-hosted RAG.
bge-reranker-v2 (gemma / minicpm variants)	Cross-encoder / LLM-based	Apache 2.0	Larger, higher-accuracy members of the bge-reranker-v2 family for when m3 is not enough and you have the compute.
mxbai-rerank-v2 (base / large)	Cross-encoder	Apache 2.0	Mixedbread’s v2 family; the large variant is ~1.5B params (Qwen2.5-based), trained with a reinforcement-learning recipe. Positioned by Mixedbread as open-weight state of the art. (As of June 2026.)
ColBERT / ColBERTv2	Late interaction	Permissive (research-friendly)	Token-level late interaction; great when you want reranking-grade accuracy on a tight latency budget. Larger index footprint.

License details verified against the model cards: bge-reranker-v2-m3 and mxbai-rerank-large-v2 (both Apache 2.0, as of June 2026). All four run locally through the Hugging Face transformers stack, sentence-transformers, FlagEmbedding, or a unified wrapper like the open-source rerankers library — no external API call required.

If you only take one recommendation: start with bge-reranker-v2-m3. It is small, fast on CPU for modest candidate sets, multilingual, Apache-licensed, and good enough that most teams never need to move off it.

The latency / quality tradeoff

Reranking is not free, and the cost is real but bounded. A cross-encoder forward pass per candidate is orders of magnitude more expensive than a vector dot product. The honest framing: you are trading a few tens to low hundreds of milliseconds for a meaningful jump in answer quality.

The lever you control is the candidate count — how many chunks the first stage hands to the reranker. Rerank the top 50 and you pay for 50 forward passes; rerank the top 100 and you pay for 100. Because cross-encoder cost scales linearly with candidates, this is your throttle:

Top 20–30 candidates — cheap, fast, recovers most of the win. A sane default.
Top 50 — the common production sweet spot; diminishing returns start here.
Top 100+ — only if your first-stage recall is genuinely poor and you cannot fix it upstream.
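The throttle can be made concrete with a back-of-the-envelope linear cost model. This is only a sketch: the function names are hypothetical, and the per-pair and overhead timings below are invented placeholders, so measure your own reranker on your own hardware before trusting any number. Batching also means real latency is not perfectly linear.

```python
def estimated_latency_ms(n_candidates: int, per_pair_ms: float, overhead_ms: float = 0.0) -> float:
    """Linear cost model: fixed overhead plus one scoring cost per candidate."""
    return overhead_ms + n_candidates * per_pair_ms


def max_candidates_for_budget(budget_ms: float, per_pair_ms: float, overhead_ms: float = 0.0) -> int:
    """Largest candidate count whose estimated latency fits inside the budget."""
    if per_pair_ms <= 0:
        raise ValueError("per_pair_ms must be positive")
    return max(0, int((budget_ms - overhead_ms) // per_pair_ms))


# Hypothetical measurements; replace with timings from your own hardware.
PER_PAIR_MS = 4.0
OVERHEAD_MS = 20.0

for n in (20, 50, 100):
    print(n, estimated_latency_ms(n, PER_PAIR_MS, OVERHEAD_MS))

print(max_candidates_for_budget(200.0, PER_PAIR_MS, OVERHEAD_MS))
```

Working backwards from the latency budget to a candidate cap, rather than picking the cap first, keeps the reranker from quietly eating the whole response-time budget.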

Practical knobs to keep latency in budget on self-hosted hardware:

Pick a small reranker. bge-reranker-v2-m3 reranks a few dozen candidates in well under a second on CPU; a 2B-param reranker wants a GPU for snappy interactive latency.
Cap the candidate set. Most of the relevance gain is in the first 30–50 candidates. Reranking 200 rarely beats reranking 50 by enough to justify the latency.
Truncate candidate text. Rerankers have their own input limits; feed the chunk, not a whole parent document, unless you are deliberately using late interaction.
Batch the forward passes. Score all candidates in one batched call rather than a loop of single calls.
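Three of these knobs (cap, truncate, batch) are plain list handling and can be sketched without any model at all. The helper names and default values below are assumptions for illustration. Character truncation is only a crude stand-in for the reranker's real limit, which is measured in tokens and is normally enforced by the tokenizer.

```python
from typing import Iterator, List, Tuple

Pair = Tuple[str, str]


def build_pairs(query: str, chunks: List[str], max_candidates: int = 50, max_chars: int = 2000) -> List[Pair]:
    """Cap the candidate count and truncate each chunk before scoring."""
    return [(query, chunk[:max_chars]) for chunk in chunks[:max_candidates]]


def batched(pairs: List[Pair], batch_size: int = 16) -> Iterator[List[Pair]]:
    """Yield fixed-size batches so the scorer is called per batch, not per pair."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


# Hypothetical first-stage output: 120 oversized chunks, already in retrieval order.
chunks = [f"chunk {i} " + "x" * 5000 for i in range(120)]

pairs = build_pairs("how do I rotate API keys?", chunks, max_candidates=50, max_chars=2000)
batches = list(batched(pairs, batch_size=16))
print(len(pairs), [len(b) for b in batches])
```

Each batch would then go to the reranker's scoring call in one shot. Libraries such as sentence-transformers already batch internally, so an explicit helper like `batched` mainly matters when you call a model server yourself.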

If your latency budget is brutal and a cross-encoder is too slow even at top-20, a late-interaction reranker like ColBERT is the escape hatch: most of the accuracy, closer to bi-encoder speed.

How to add a reranker to an existing pipeline

The change is small and surgical. You insert one step between retrieve and generate, and you widen the first stage to feed it.

1. Widen first-stage retrieval. If you were fetching top_k=5 (similarity_top_k in LlamaIndex), bump it to 50. You are now retrieving for recall, not final precision, so let vector search be generous.
2. Rerank the candidates. Run the (query, chunk) pairs through your reranker. It returns a relevance score per chunk.
3. Keep the top N. Sort by the reranker’s score and keep the best 4–6 chunks. These go into the prompt.
4. Generate as before. Your prompt template, LLM, and citation logic do not change at all.

In LlamaIndex this is a node postprocessor; in LangChain it is a ContextualCompressionRetriever wrapping a cross-encoder; with the rerankers library it is a few lines around any of the models above. Conceptually:

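from sentence_transformers import CrossEncoder

# 1. Retrieve wide (recall): bump k from 5 to 50
# (vector_store, query_embedding, and user_query come from your existing pipeline)
candidates = vector_store.query(query_embedding, top_k=50)

# 2. Rerank the candidates with a cross-encoder (precision)
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # load once at startup, not per query
pairs = [(user_query, c.text) for c in candidates]
scores = reranker.predict(pairs)               # scored in batches, not one call per pair

# 3. Keep the best few for the prompt
# Sort on the score only, so tied scores never fall back to comparing chunk objects
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_chunks = [c for _, c in ranked[:5]]

# 4. Generate exactly as before, now with better context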

This pairs naturally with hybrid search (vector + BM25 keyword): hybrid widens and diversifies the candidate pool, the reranker cleans up the ordering. Many of the strongest 2026 self-hosted systems do all three — hybrid retrieval, fusion, then a reranker. SurfSense, for example, runs hybrid search with Reciprocal Rank Fusion and supports rerankers on top (as documented in our notes on open-source AI search).
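For readers who have not met Reciprocal Rank Fusion, here is a minimal sketch of the fusion step in plain Python. Each document earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the fused order. The constant k = 60 is the commonly used default rather than anything this guide or SurfSense specifies, and the document ids are invented.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical first-stage results from the two retrievers.
vector_hits = ["d1", "d2", "d3", "d4"]
bm25_hits = ["d3", "d1", "d5"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

The fused list is what you would hand to the reranker as its candidate set. RRF only needs ranks, not scores, which is why it works across retrievers whose scores are not comparable.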

Prove it actually helped

Do not add a reranker on faith. It is a textbook case for the evaluation loop: it should move your retrieval metrics, and you can verify that it does. Run your golden set with and without the reranker and compare recall@k, MRR, and nDCG — reranking specifically should lift MRR and nDCG (it improves ordering), and lift end-to-end faithfulness because the LLM finally gets the right chunk near the top. If the numbers do not move, you do not need it. See How to Evaluate a RAG System for the full method, and fold the reranked-vs-not comparison into your CI gate as described in Production RAG.
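A minimal sketch of that with-and-without comparison, using plain-Python versions of the three metrics with binary relevance. The toy ranking mirrors the earlier scenario of a relevant chunk moving from rank 11 to rank 2. The chunk ids and the single-query "golden set" are invented; a real evaluation averages these values over many labeled queries (MRR is the mean of the reciprocal rank across queries).

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary relevance: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, chunk_id in enumerate(ranked[:k], start=1)
              if chunk_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


relevant = {"c11"}
baseline = [f"c{i}" for i in range(1, 13)]   # relevant chunk sits at rank 11
reranked = ["c1", "c11"] + [c for c in baseline if c not in ("c1", "c11")]  # now rank 2

for name, ranking in (("baseline", baseline), ("reranked", reranked)):
    print(name,
          recall_at_k(ranking, relevant, 5),
          reciprocal_rank(ranking, relevant),
          ndcg_at_k(ranking, relevant, 5))
```

If the reranked row does not beat the baseline row on your real golden set, that is the signal to leave the reranker out.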

FAQ

Do I actually need a reranker, or should I just fix chunking and embeddings first?

Fix the cheap things first. Good chunking and a strong embedding model raise first-stage recall, which is the foundation. But once recall is decent and precision is still weak (the right chunk is in the top 30 but not the top 3), a reranker is the highest-leverage next step. The two are complementary, not either/or.

Can I run a reranker without a GPU?

Yes, for small candidate sets. bge-reranker-v2-m3 reranks a few dozen chunks per query in well under a second on CPU. Larger LLM-based rerankers (the ~2B-param class) want a GPU for interactive latency. Cap your candidate count to stay in budget.

What is the difference between a reranker and hybrid search?

Hybrid search improves the candidate pool by combining vector and keyword retrieval before ranking. A reranker improves the ordering of whatever candidates you retrieved, using a more accurate model. They stack: hybrid retrieve, then rerank.

Is ColBERT a reranker or a retriever?

Both, depending on how you deploy it. Late interaction can serve as a first-stage retriever (storing per-token vectors) or as a reranker over candidates. Its appeal is cross-encoder-grade accuracy at near-bi-encoder speed, at the cost of a larger index.

Will a reranker fix hallucinations?

Indirectly. It cannot make the LLM honest, but by putting the genuinely relevant chunk near the top of the context, it gives the model the right facts to ground on, which measurably improves faithfulness. If the relevant chunk was never retrieved at all, no reranker can help; that is a first-stage recall problem.

Aquila is the independent guide to private, self-hosted AI search, built on the belief that you should own your index, not rent it. A reranker is one of the few RAG upgrades that is cheap, self-hostable, and reliably worth it.