How to Evaluate a RAG System: Metrics, Golden Sets, and Regression Testing
If you can't measure your RAG system, you can't improve it. The metrics that matter and how to run them yourself.
Evaluating a RAG system means measuring two things separately: how well retrieval finds the right context, and how well generation uses it. The reliable way to do this is to build a golden evaluation set (questions paired with expected sources and reference answers), score retrieval with metrics like recall@k, MRR, and nDCG, score generation for faithfulness and answer relevance (using RAGAS-style or LLM-as-judge methods), and re-run the whole thing as a regression test whenever you change anything. This guide walks through each piece and how to run it on your own infrastructure.
This guide is a deeper dive within the Self-Hosted RAG complete guide. “The demo looked good” is not evaluation; it is the thing evaluation exists to replace.
Why you must split retrieval from generation
A RAG answer can be wrong for two completely different reasons, and they have different fixes:
- Retrieval failed. The right chunk was never fetched, so the LLM had no chance. Fix: chunking, embedding model, k, hybrid search, reranking.
- Generation failed. The right chunk was retrieved, but the LLM ignored it, contradicted it, or padded the answer with invented detail. Fix: prompt, model, instructions.
If you only measure the final answer, you cannot tell which half is broken, and you will tune the wrong thing for a week. Always score retrieval and generation separately first, then look at end-to-end answer quality. This single discipline is what separates teams who improve their RAG from teams who guess.
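The split above can be sketched as a small triage function. This is a minimal illustration, not a prescribed method: the `CaseResult` fields and the three labels are hypothetical names chosen for this example, and `answer_ok` is assumed to come from a human reviewer or a judge model.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    expected_ids: set[str]    # chunks that should have been retrieved (empty if unanswerable)
    retrieved_ids: list[str]  # chunks the retriever actually returned
    answer_ok: bool           # verdict on the final answer (human or judge)


def triage(result: CaseResult) -> str:
    """Attribute a failed case to the retrieval half or the generation half."""
    if result.answer_ok:
        return "pass"
    # If there was a chunk to find and none of it reached the prompt,
    # the LLM had no chance: blame retrieval.
    if result.expected_ids and not (result.expected_ids & set(result.retrieved_ids)):
        return "retrieval_failure"
    # Otherwise the right context was present (or nothing was needed)
    # and the answer was still wrong: blame generation.
    return "generation_failure"
```

Counting the labels across the whole evaluation set shows which half deserves the next week of tuning.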
Building a golden / evaluation set
Everything downstream depends on this. A golden set is a fixed collection of test cases you score against repeatedly. Build it before you touch metrics.
Each case needs:
- A question a real user would actually ask, in their real phrasing (including the messy, abbreviated, typo-ridden ones).
- The expected source(s) — which document or chunk should be retrieved to answer it. This powers your retrieval metrics.
- A reference answer — the correct answer, ideally with the supporting facts. This powers generation metrics.
Aim for 30–100 cases to start. Bias toward coverage over volume: include easy lookups, multi-document questions, questions whose answer is genuinely not in your corpus (the system should say “I don’t know”, so test that it does), and known edge cases. Pull real questions from support logs, user interviews, or your own usage rather than inventing tidy ones. A golden set built from real queries is worth far more than one built from imagined ones, because it contains the phrasing and the gaps your users actually produce.
Keep it in version control as plain JSON or CSV. It is a living asset — add a case every time you find a failure in production.
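As an illustration, a golden set stored as JSON might be loaded like this. The field names (`id`, `question`, `expected_sources`, `reference_answer`) and the two sample cases are invented for this sketch; use whatever schema fits your corpus.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    id: str
    question: str
    # Empty list means the answer is deliberately absent from the corpus.
    expected_sources: list[str] = field(default_factory=list)
    reference_answer: str = ""


def load_golden_set(text: str) -> list[GoldenCase]:
    """Parse a JSON array of cases and reject duplicate ids."""
    cases = [GoldenCase(**row) for row in json.loads(text)]
    ids = [c.id for c in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case ids in golden set")
    return cases


# Invented sample data: one messy real-style question, one unanswerable one.
EXAMPLE = """[
  {"id": "rot-001", "question": "how often do api keys rotate??",
   "expected_sources": ["security.md#key-rotation"],
   "reference_answer": "API keys rotate every 90 days."},
  {"id": "neg-001", "question": "what is the office wifi password",
   "expected_sources": [],
   "reference_answer": "I don't know."}
]"""

cases = load_golden_set(EXAMPLE)
```

Stable ids make it possible to diff per-case results between runs, not just the averages.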
Retrieval metrics
These score the retriever alone: given a query, did the right chunk come back, and how high? They need only your expected-source labels, no LLM, and they are cheap and deterministic — run them constantly.
- Recall@k: of all the relevant chunks, what fraction appear in the top k results? The headline retrieval number. If recall@k is low, generation cannot save you; the context simply is not there.
- Precision@k: of the top k returned, what fraction are actually relevant? Low precision means you are flooding the prompt with noise that can confuse the LLM and waste context budget.
- MRR (Mean Reciprocal Rank): the average, over all queries, of 1/(rank of the first relevant result). Rewards putting the right chunk near the top. Great when there is one clearly correct source per question.
- nDCG (normalized Discounted Cumulative Gain): credits each result by its graded relevance, discounts that credit the lower the result ranks, and normalizes against the ideal ordering so that a perfect ranking scores 1. The most complete retrieval metric when relevance is graded rather than binary.
- Hit rate: a yes/no check of whether any relevant chunk made the top k. A useful, blunt sanity check.
Start with recall@k and MRR. They are easy to compute and catch most retrieval problems. Reach for nDCG when you have graded relevance and care about ordering.
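The retrieval metrics above are short enough to write by hand. The sketch below assumes each retrieved list contains unique chunk ids in rank order, that precision@k divides by k even when fewer than k results come back, and that nDCG uses linear gain (some implementations use 2^rel - 1 instead). The function names are this example's own.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant chunks."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant chunk made the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """gains maps chunk id -> graded relevance; missing ids count as 0."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

For a query that returns `["d3", "d1", "d7", "d2"]` when `d1` and `d2` are relevant, recall@3 is 0.5 (one of two relevant chunks found), and the reciprocal rank is 0.5 (first relevant chunk at rank 2).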
Generation metrics
These score the answer given the retrieved context. They are fuzzier — there is rarely one exact correct string — which is why this is where LLM-as-judge methods earn their place.
- Faithfulness / groundedness — is every claim in the answer actually supported by the retrieved context? This is the anti-hallucination metric and arguably the most important one in RAG. An answer can be relevant and fluent and still confidently invent a fact the context never stated.
- Answer relevance — does the answer address the question that was actually asked, without padding or evasion?
- Context precision / context recall — RAGAS-style metrics that judge whether the retrieved context was both relevant (precision) and sufficient (recall) to answer, bridging the two halves.
- Answer correctness — does the answer match your reference answer? Needs the golden set’s reference answers.
Faithfulness and answer relevance are the two to start with. Together they catch the two scariest failure modes: making things up, and not answering the question.
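One common way to turn faithfulness into a number is claim-level: split the answer into individual claims, decide for each whether the retrieved context supports it, and report the supported fraction. The sketch below shows only that arithmetic. Extracting and checking the claims is assumed to happen elsewhere (a judge model or a human), and scoring an answer with no claims as 1.0 is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # did the reviewer find support in the retrieved context?


def faithfulness(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of claims in the answer that the context supports."""
    if not verdicts:
        # No claims (e.g. "I don't know"): nothing was invented.
        return 1.0
    return sum(v.supported for v in verdicts) / len(verdicts)


def unsupported_claims(verdicts: list[ClaimVerdict]) -> list[str]:
    """The claims to show a reviewer when the score is below 1."""
    return [v.claim for v in verdicts if not v.supported]
```

Keeping the list of unsupported claims next to the score makes a low number actionable instead of just alarming.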
Metrics at a glance
| Metric | Measures | Needs | Start here? |
|---|---|---|---|
| Recall@k | Did relevant chunks reach top k | Expected-source labels | Yes |
| Precision@k | How much of top k is relevant | Expected-source labels | Optional |
| MRR | Rank of first relevant result | Expected-source labels | Yes |
| nDCG | Graded relevance + ordering | Graded labels | When ranking matters |
| Faithfulness | Answer grounded in context | Context + answer (+ judge) | Yes |
| Answer relevance | Answer addresses the question | Question + answer (+ judge) | Yes |
| Context precision/recall | Was context relevant & sufficient | Question + context + reference (+ judge) | Optional |
| Answer correctness | Answer matches reference | Reference answer (+ judge) | When you have references |
RAGAS-style and LLM-as-judge evaluation
You cannot score faithfulness with string matching — “the API key rotates every 90 days” and “keys are rotated quarterly” are the same fact in different words. The modern answer is to use an LLM as the judge.
LLM-as-judge hands a capable model the question, the retrieved context, and the answer, with a rubric: “Is every claim in the answer supported by the context? Score 0–1 and cite the unsupported claims.” The judge model evaluates semantic correctness the way a human reviewer would, at machine scale. It is the standard technique behind faithfulness and answer-relevance scoring in 2026.
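A minimal sketch of that pattern follows. The rubric wording is taken from the paragraph above; the JSON reply format, the function names, and the `call_model` callable (any function that takes a prompt string and returns the model's text) are assumptions of this example, not part of any framework.

```python
import json
from typing import Callable

RUBRIC = (
    "Is every claim in the answer supported by the context? "
    "Score 0-1 and cite the unsupported claims. "
    'Reply with JSON only: {"score": <number>, "unsupported_claims": [<strings>]}'
)


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Assemble the question, retrieved context, answer, and rubric."""
    return (
        f"Question:\n{question}\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        f"{RUBRIC}"
    )


def parse_verdict(raw: str) -> tuple[float, list[str]]:
    """Extract the JSON object from the reply, tolerating surrounding chatter."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge reply contained no JSON object")
    data = json.loads(raw[start:end + 1])
    score = min(1.0, max(0.0, float(data["score"])))  # clamp to [0, 1]
    return score, list(data.get("unsupported_claims", []))


def judge_faithfulness(question: str, context: str, answer: str,
                       call_model: Callable[[str], str]) -> tuple[float, list[str]]:
    """Run one judging call through whatever model function is supplied."""
    return parse_verdict(call_model(build_judge_prompt(question, context, answer)))
```

Passing the model in as a plain function keeps the rubric and the parser testable with a stub, and lets you swap judge models without touching the scoring code.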
RAGAS is the best-known open-source framework that packages this up. It provides ready-made faithfulness, answer-relevance, and context precision/recall metrics built on the LLM-as-judge pattern, so you do not write the judging prompts yourself. Other frameworks (DeepEval, TruLens, promptfoo) cover similar ground; the concepts transfer.
Two cautions on LLM-as-judge:
- The judge has biases — it can favor longer answers, its own phrasing style, or whatever is listed first. Validate the judge against a sample of human-labeled cases before you trust its scores blindly.
- The judge costs tokens — and if you send your context and answers to a cloud judge, you have reintroduced the exact data-egress problem self-hosting was meant to solve.
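Validating the judge can be as simple as comparing its pass/fail verdicts with human labels on the same cases. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The binary labels (1 = faithful, 0 = not) and the sample data are invented for illustration.

```python
def agreement(human: list[int], judge: list[int]) -> float:
    """Fraction of cases where the judge and the human gave the same label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Chance-corrected agreement for binary (0/1) labels."""
    observed = agreement(human, judge)
    p_human = sum(human) / len(human)  # how often the human says 1
    p_judge = sum(judge) / len(judge)  # how often the judge says 1
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # both raters always give the same single label
    return (observed - expected) / (1 - expected)


# Invented labels for eight cases.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1]
print(agreement(human_labels, judge_labels))     # 0.75
print(cohens_kappa(human_labels, judge_labels))  # about 0.47
```

Raw agreement can look high when almost every answer is faithful; kappa is the more honest number in that situation.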
Running evaluation self-hosted
If the point of your stack is that data never leaves your infrastructure, your evaluation has to honor that too. Shipping every chunk and answer to a cloud judge in order to evaluate a system you built for privacy would be self-defeating.
The good news: LLM-as-judge does not require a frontier model. A strong local model run through Ollama — a capable 8B–14B instruct model, or larger if you have the VRAM — is a perfectly serviceable judge for faithfulness and relevance scoring, especially with a tight rubric and few-shot examples. RAGAS and DeepEval both let you point their judge at a local Ollama or vLLM endpoint instead of OpenAI. Your retrieval metrics (recall@k, MRR, nDCG) need no LLM at all and run anywhere.
A fully self-hosted eval loop:
- Local embedding model (the same one your RAG uses) and vector store for the retrieval scoring.
- A local judge model via Ollama or vLLM for the generation metrics.
- RAGAS or DeepEval configured to call those local endpoints.
Nothing leaves the box. This pairs directly with the build-a-private-RAG-on-a-VPS stack — point the evaluator at the same Ollama instance.
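As a rough sketch of the local judge call, the function below posts a prompt to Ollama's `/api/generate` endpoint using only the standard library. It assumes Ollama is listening on its default port on the same machine; the model tag is a placeholder for whichever instruct model you have pulled, and temperature 0 is a choice made here to keep scores as repeatable as possible.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build the POST request; the model tag is a placeholder."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # ask for one JSON reply, not a token stream
        "options": {"temperature": 0},   # reduce run-to-run variation
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_local_judge(prompt: str, model: str = "llama3.1:8b",
                     timeout: float = 120.0) -> str:
    """Send the prompt to the local model and return its text reply."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

A function with this shape (prompt in, text out) can stand in for the judge model anywhere a judging rubric needs one, and the request never leaves the host.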
Regression testing: the real payoff
Evaluation is not a one-time report card. Its real value is as a regression test: a number you re-run every time you change something, so improvements are proven and regressions are caught before users find them.
Wire it into your workflow:
- On every meaningful change — new chunking strategy, different embedding model, prompt tweak, model upgrade, reranker added — re-run the full golden set and compare against the last baseline. A change to fix one query routinely breaks five others; only the eval set catches that.
- In CI, ideally — fail the build if recall@k or faithfulness drops below a threshold you set. This turns “we think it’s better” into a gate.
- Track the trend — keep a log of each run’s scores. Watching recall climb from 0.6 to 0.9 across ten experiments is how you know the work is real and not vibes.
This closes the loop with the levers in the rest of this cluster: change your chunking strategy or embedding model, re-run the set, keep what wins. Without the eval set, every change is a coin flip. With it, every change is a measurement.
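A regression gate along these lines can be a few lines of code. In this sketch the metric names, the floors, the 0.02 tolerance, and the scores are all invented; the idea is only that a run fails if a metric falls below its absolute floor or drops noticeably against the last baseline.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     floors: dict[str, float], tolerance: float = 0.02) -> list[str]:
    """Return a human-readable problem for every gated metric that got worse."""
    problems = []
    for metric, floor in floors.items():
        score = current.get(metric)
        if score is None:
            problems.append(f"{metric}: missing from current run")
            continue
        if score < floor:
            problems.append(f"{metric}: {score:.2f} is below the floor {floor:.2f}")
        previous = baseline.get(metric)
        if previous is not None and score < previous - tolerance:
            problems.append(f"{metric}: dropped from {previous:.2f} to {score:.2f}")
    return problems


# Invented scores: recall slipped after a change, faithfulness improved.
baseline = {"recall@5": 0.82, "faithfulness": 0.91}
current = {"recall@5": 0.78, "faithfulness": 0.93}
problems = find_regressions(baseline, current,
                            floors={"recall@5": 0.75, "faithfulness": 0.85})
for line in problems:
    print(line)
# In CI, exit non-zero when problems is non-empty so the build fails.
```

The tolerance exists because judge-based scores vary slightly between runs; set it from the spread you observe when re-running an unchanged system.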
FAQ
What is the difference between retrieval and generation metrics? Retrieval metrics (recall@k, MRR, nDCG) score whether the right context was fetched, using your expected-source labels and no LLM. Generation metrics (faithfulness, answer relevance) score whether the answer correctly used that context, usually via an LLM judge. Measure them separately so you know which half to fix.
How big should my RAG evaluation set be? Start with 30–100 cases and grow it. Coverage matters more than raw size — include easy lookups, multi-document questions, and questions whose answer is deliberately absent so you can test that the system says “I don’t know.” Add a case every time you hit a real failure.
What is RAGAS? RAGAS is a popular open-source RAG evaluation framework that provides ready-made metrics — faithfulness, answer relevance, context precision and recall — built on the LLM-as-judge pattern. You can point its judge model at a local Ollama or vLLM endpoint to keep evaluation fully self-hosted.
Can I evaluate RAG without sending data to OpenAI? Yes. Retrieval metrics need no LLM. For generation metrics, use a strong local model through Ollama or vLLM as the judge and configure RAGAS or DeepEval to call it. Nothing leaves your infrastructure, which matters if privacy was your reason for self-hosting.
Is LLM-as-judge reliable? It is good enough to drive iteration and far better than string matching for fuzzy, semantic correctness — but judges have biases (favoring longer or first-listed answers). Validate the judge against a sample of human-labeled cases before trusting it, and keep the rubric tight.
Aquila is the independent guide to private, self-hosted AI search, built on the belief that you should own your index, not rent it. A RAG system you cannot measure is a RAG system you cannot trust, and you can measure it without handing your data to anyone.