Production RAG: Taking Self-Hosted Retrieval From Demo to Reliable Service

Taking self-hosted RAG to production means hardening seven things the tutorial skipped: caching so you stop recomputing the same work, observability so you can see what retrieval and generation actually did, latency and cost control so the service stays fast and affordable under load, access control so users only see what they are allowed to, data freshness so the index does not rot, evaluation in CI so changes cannot silently regress quality, and a scaling plan for the vector store as the corpus grows. A demo proves the idea works once; production keeps it working for everyone, every day, without leaking data or your budget. This guide covers each, with a checklist at the end.

This is the production capstone of the RAG cluster. It assumes you have a working pipeline from the Self-Hosted RAG complete guide and have already stood one up on a box, for example by following the guide to building a private RAG on a VPS.

Caching: stop paying for the same work twice

Caching is the highest-leverage production optimization because RAG repeats itself constantly. There are three layers, and you want all three.

Embedding cache. Embedding the same query text twice is pure waste. Cache query embeddings keyed by the normalized query string. At index time, cache document embeddings keyed by a content hash so unchanged chunks are never re-embedded on a re-index. This alone slashes re-indexing time.
Retrieval cache. For a repeated query, the set of retrieved chunk IDs is deterministic given a fixed index. Cache it (with a short TTL or an index-version key so it invalidates when you re-index). Popular questions hit this constantly.
Generation / semantic cache. The biggest saver. Cache full answers keyed by the query — and go further with a semantic cache: embed the incoming query and, if it is near-identical to a previously answered one, return the cached answer instead of calling the LLM at all. “How do I reset my password?” and “password reset steps” should not both pay for generation.

A simple in-process or Redis-backed cache covers the first two; the semantic cache reuses your existing embedding model. Always key caches by index version so a re-index invalidates stale entries — a cache that serves answers from last week’s deleted document is a correctness bug, not an optimization.
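The semantic layer and the index-version rule can be sketched in a few lines. This is a minimal in-process illustration, not a definitive implementation: `SemanticCache`, `normalize_query`, and the 0.95 similarity threshold are hypothetical names and values chosen for the example, and the query vectors are assumed to come from whatever embedding model you already run.

```python
import math


def normalize_query(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings share a key."""
    return " ".join(text.lower().split())


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Answer cache that matches on embedding similarity (hypothetical helper)."""

    def __init__(self, index_version: str, threshold: float = 0.95):
        self.index_version = index_version
        self.threshold = threshold  # assumed value; tune it against your eval set
        self._entries = []  # list of (query_vector, answer)

    def _sync_version(self, index_version: str) -> None:
        # A re-index bumps the version, so every cached answer is dropped.
        if index_version != self.index_version:
            self._entries.clear()
            self.index_version = index_version

    def get(self, query_vec, index_version: str):
        """Return a cached answer for a near-identical query, else None."""
        self._sync_version(index_version)
        best_answer, best_score = None, -1.0
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self.threshold else None

    def put(self, query_vec, answer: str, index_version: str) -> None:
        self._sync_version(index_version)
        self._entries.append((list(query_vec), answer))
```

The linear scan is only reasonable for a small number of cached queries. At larger sizes the cached query vectors could live in the same vector store as the corpus, in a separate collection. An exact-match cache keyed on `normalize_query(text)` plus the index version is a cheaper first check before the similarity lookup.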

Observability: you cannot fix what you cannot see

A RAG answer passes through retrieval and generation, and either can fail. In a demo you eyeball it; in production you need to log it. For every request, capture:

The raw query, the retrieved chunk IDs and their scores, and (sampled) the chunk text.
The reranker's ordering, if a reranker ran, and the final chunks sent to the LLM.
The prompt token count, the answer, and the answer’s cited sources.
Latency broken down by stage — embed, retrieve, rerank, generate — and total.
Errors and timeouts at each stage.

This stage-level tracing is what lets you answer “why was this answer bad?” without guessing. If retrieval returned the wrong chunks, that is a recall/chunking problem; if the right chunk was retrieved but the answer ignored it, that is a generation/prompt problem. The split mirrors the evaluation discipline: you must be able to attribute a failure to the right half. Tools like OpenTelemetry traces, or RAG-specific tracing in frameworks and open-source tools (Langfuse, Phoenix), give you this; the key is that you can reconstruct any single bad answer end to end.
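As a rough illustration of the per-request record described above, the sketch below times each stage and emits one JSON line per request using only the standard library. `RequestTrace` and its field names are hypothetical; a real deployment would more likely send the same data to OpenTelemetry or one of the tracing tools mentioned.

```python
import json
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects one structured log record per RAG request (hypothetical helper)."""

    def __init__(self, query: str):
        self.record = {"query": query, "stages_ms": {}, "errors": {}}

    @contextmanager
    def stage(self, name: str):
        """Time a pipeline stage and record any exception raised inside it."""
        start = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.record["errors"][name] = repr(exc)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.record["stages_ms"][name] = round(elapsed_ms, 2)

    def to_json(self) -> str:
        """Serialize the record, adding the total across all timed stages."""
        self.record["total_ms"] = round(sum(self.record["stages_ms"].values()), 2)
        return json.dumps(self.record)


# Usage: wrap each stage, attach what it produced, then log one line.
trace = RequestTrace("how do I reset my password?")
with trace.stage("retrieve"):
    trace.record["chunk_ids"] = ["doc1#0", "doc7#2"]  # placeholder IDs
    trace.record["scores"] = [0.83, 0.79]             # placeholder scores
with trace.stage("generate"):
    trace.record["answer"] = "placeholder answer"
print(trace.to_json())
```

Because retrieval output and generation output sit in the same record, a single bad answer can be attributed to one half of the pipeline without re-running it.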

Also log a feedback signal — a thumbs up/down, or whether the user rephrased and re-asked (a strong implicit “that was wrong”). Real failures captured this way become new cases in your golden set.

Latency and cost control under load

A pipeline that is fine for one user can fall over at a hundred. Control both:

Lever	Effect	Notes
Semantic answer cache	Skips generation entirely on repeats	Biggest single win on both cost and latency
Cap retrieval `k` and candidate count	Less reranking and prompt-token cost	Tune against your eval set, do not guess
Stream the answer	Lower perceived latency	First token fast even if full answer is slow
Batch embeddings	Higher throughput at index time	Re-indexing finishes faster
Right-size the generation model	Major cost lever	A smaller local model is often enough; reserve a bigger one for hard queries
Concurrency limits / queue	Protects the box from overload	A bounded queue beats a crashed GPU

Two production-specific cost notes for self-hosting. First, your “token cost” is GPU and CPU time — keeping prompts short (good retrieval, not stuffing; see RAG vs long context) directly keeps the box responsive and your bill flat. Second, the generation LLM dominates compute, so a tiered approach — answer easy queries with a small local model, escalate only hard ones — usually beats running one large model for everything.
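The "concurrency limits / queue" lever can be sketched with `asyncio.Semaphore`. This is one possible shape, assuming an async service. `GenerationGate`, `Overloaded`, and the limits are hypothetical names and values, and the gate should be created inside the running event loop.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the wait queue is full; a caller might map it to HTTP 429 or 503."""


class GenerationGate:
    """Caps concurrent generations and bounds how many requests may wait."""

    def __init__(self, max_concurrent: int, max_waiting: int):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._max_waiting = max_waiting
        self._waiting = 0

    async def run(self, coro_fn, *args):
        # Reject immediately if every slot is busy and the queue is full.
        if self._sem.locked() and self._waiting >= self._max_waiting:
            raise Overloaded("generation queue is full")
        self._waiting += 1
        try:
            await self._sem.acquire()
        finally:
            self._waiting -= 1
        try:
            return await coro_fn(*args)
        finally:
            self._sem.release()
```

Rejecting early keeps latency predictable for the requests that are admitted. The alternative, an unbounded queue, lets every request in and makes all of them slow.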

Security and access control

This is where self-hosting earns its keep, and also where teams cut corners. Two distinct concerns:

Infrastructure security. Standard service hygiene, but easy to forget on an internal tool: put the API behind authentication, do not expose the vector store or Ollama/vLLM ports to the public internet, use TLS, rotate keys, and rate-limit per client. A self-hosted RAG box often holds your most sensitive documents in one place — treat it accordingly.

Per-user access control (the hard one). If different users may see different documents, you cannot retrieve from one shared index and trust the LLM to omit forbidden content — that is a leak waiting to happen. Enforce permissions at retrieval time:

Tag every chunk with access metadata at ingestion (owner, team, classification, allowed roles).
Apply a metadata filter on every query so the vector search only ever returns chunks the requesting user is authorized to see.
Never rely on prompt instructions (“do not reveal X”) for access control — a filtered retrieval is enforcement; a prompt is a suggestion.

This is also a strong argument for retrieval over stuffing a giant context: with RAG, the user’s permissions shape what is even retrievable, so unauthorized content never reaches the model. Most self-hostable vector stores (pgvector, Qdrant, Weaviate) support metadata filtering inside the similarity query — see what is a vector database. And because the whole stack is on your infrastructure, queries and documents never cross a network boundary you do not control.
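To make the enforcement point concrete, here is a toy in-memory retriever in which the permission filter runs before ranking. It is a sketch under assumptions: `Chunk`, `allowed_roles`, and `retrieve` are illustrative names, and a real vector store would apply the equivalent filter inside its own similarity query rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    vector: tuple[float, ...]
    allowed_roles: frozenset[str]  # access metadata attached at ingestion


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, chunks, user_roles: set[str], k: int = 3):
    """Return the top-k chunks the user may see, ranked by dot-product similarity."""
    # Filter first: unauthorized chunks never enter the candidate set,
    # so they cannot reach the reranker or the prompt.
    visible = [c for c in chunks if c.allowed_roles & user_roles]
    ranked = sorted(visible, key=lambda c: dot(query_vec, c.vector), reverse=True)
    return ranked[:k]


# Usage: the same query returns different chunks for different users.
chunks = [
    Chunk("hr#0", "salary bands", (1.0, 0.0), frozenset({"hr"})),
    Chunk("wiki#0", "vacation policy", (0.9, 0.1), frozenset({"hr", "staff"})),
]
print([c.chunk_id for c in retrieve((1.0, 0.0), chunks, {"staff"})])
```

A user with no matching role gets an empty candidate set, which is the safe failure mode: the model has nothing forbidden to reveal.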

Data freshness and re-indexing

A knowledge base that does not update becomes a confidently-wrong knowledge base. Plan for change from day one:

Incremental updates, not full rebuilds. When a document changes, re-chunk and re-embed only that document, and upsert by stable chunk ID. Content-hash your chunks so unchanged ones are skipped — pairs perfectly with the embedding cache above.
Handle deletes. When a source document is removed or access is revoked, delete its vectors. A “deleted” doc that still answers queries is both a staleness bug and a security incident.
Schedule it. A cron or event-driven job (on file change, on a webhook, nightly) keeps the index current. Track a “last indexed” timestamp per source so you can spot stale corners.
Version the index. Tag the index with a version; bump it on re-index so caches invalidate and you can roll back if a bad ingestion corrupts retrieval.
Re-embed on model change. If you ever change embedding models, you must re-embed the entire corpus — vectors from different models are not comparable. Treat an embedding-model swap as a full rebuild, and validate it against your eval set before cutting over.
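The incremental-update and delete rules above reduce to a small diff over content hashes. The sketch below plans the work for one changed document. `plan_sync`, the `doc_id#position` ID scheme, and the `stored` mapping are assumptions made for illustration, not part of any particular vector store.

```python
import hashlib


def chunk_id(doc_id: str, position: int) -> str:
    """Stable ID: the same document and position always map to the same ID."""
    return f"{doc_id}#{position}"


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(doc_id: str, new_chunks: list[str], stored: dict[str, str]):
    """Decide what to re-embed and what to delete for one document.

    `stored` maps chunk_id -> content hash for this document's chunks
    currently in the index. Returns (to_upsert, to_delete).
    """
    to_upsert = {}
    seen = set()
    for position, text in enumerate(new_chunks):
        cid = chunk_id(doc_id, position)
        seen.add(cid)
        # Only new or changed chunks need embedding; unchanged ones are skipped.
        if stored.get(cid) != content_hash(text):
            to_upsert[cid] = text
    # Chunks that no longer exist in the source must be removed from the index.
    to_delete = sorted(set(stored) - seen)
    return to_upsert, to_delete
```

Passing an empty `new_chunks` list covers the removed-document case, since every stored chunk lands in `to_delete`. Position-based IDs have a known weakness: inserting a chunk in the middle of a document shifts every later position, so those chunks are re-embedded even if their text is unchanged.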

Evaluation in CI: the guardrail

The difference between a RAG system that improves and one that silently rots is whether quality is gated, not just measured. Wire your golden set into CI:

Keep a golden set (question, expected source, reference answer) in version control — see how to evaluate RAG.
On every change to chunking, embeddings, the reranker, the prompt, or the model, re-run the set in CI.
Fail the build if recall@k, MRR, or faithfulness drops below a threshold you set. A change that fixes one query routinely breaks five others; only the gate catches that before users do.
Keep evaluation self-hosted too: use a local embedding model for retrieval metrics and a local judge model (via Ollama or vLLM) for faithfulness, so that scoring does not itself leak private data.

This turns “we think the new chunker is better” into a number that either cleared the bar or did not. It is the single practice that most separates a production RAG system from a fragile one.
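A gate of this kind needs very little code. The sketch below computes recall@k and MRR over a golden set with one expected source per question, then lists every metric that fell below its threshold. The function names, the `question` / `expected_source` keys, and the thresholds are illustrative assumptions. Faithfulness scoring is left out because it depends on the judge model you choose.

```python
def recall_at_k(retrieved: list[str], expected: str, k: int) -> float:
    """1.0 if the expected source appears in the top k results, else 0.0."""
    return 1.0 if expected in retrieved[:k] else 0.0


def reciprocal_rank(retrieved: list[str], expected: str) -> float:
    """1/rank of the expected source, or 0.0 if it was not retrieved."""
    for rank, source in enumerate(retrieved, start=1):
        if source == expected:
            return 1.0 / rank
    return 0.0


def evaluate(golden: list[dict], retrieve_fn, k: int = 5) -> dict:
    """Run every golden question through retrieve_fn and average the metrics."""
    if not golden:
        raise ValueError("golden set is empty")
    recalls, ranks = [], []
    for case in golden:
        retrieved = retrieve_fn(case["question"])  # ranked list of source IDs
        recalls.append(recall_at_k(retrieved, case["expected_source"], k))
        ranks.append(reciprocal_rank(retrieved, case["expected_source"]))
    n = len(golden)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(ranks) / n}


def gate(metrics: dict, thresholds: dict) -> list[str]:
    """Return one message per metric below its threshold; empty means pass."""
    return [
        f"{name}={metrics[name]:.3f} is below {minimum}"
        for name, minimum in thresholds.items()
        if metrics[name] < minimum
    ]
```

In CI, a test can call `evaluate` against the real retriever and assert that `gate(...)` returns an empty list, so any regression fails the build with the offending metric named in the output.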

Scaling the vector store

The vector store is usually the component that forces an architecture decision as you grow. Stages, roughly:

Tens of thousands of chunks. Almost anything works on a single box — pgvector, Chroma, Qdrant. Memory, not disk, is typically the first constraint for in-memory HNSW indexes. A 768-dim float32 vector is ~3 KB raw; budget a few KB per chunk plus your text and you will not be surprised.
Millions of chunks. Move to an engine built for it — Qdrant or Weaviate for a dedicated store, or pgvector if you want vectors to live beside relational data with one backup story. Tune the HNSW parameters (build-time m/ef_construction, query-time ef) to trade recall against latency. Add quantization to cut the memory footprint.
Tens of millions and up. Now you care about sharding, replication for availability, and distributed deployment. Qdrant and Weaviate support clustering; Milvus is purpose-built for this scale. This is also where filtered-search performance (fast metadata filtering inside the ANN search) matters most — relevant to the access-control story above.
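The memory arithmetic behind these stages is simple enough to keep in a helper. This sketch counts raw vector bytes only. `overhead_factor` is a placeholder assumption for index structures such as HNSW graph links, and should be set from what you measure on your own engine rather than taken from here.

```python
def raw_vector_bytes(dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of one vector; float32 uses 4 bytes per dimension."""
    return dims * bytes_per_value


def estimate_vector_memory_gib(
    num_chunks: int,
    dims: int = 768,
    bytes_per_value: int = 4,
    overhead_factor: float = 1.0,  # assumption: measure your index and adjust
) -> float:
    """Estimated memory for the vectors alone, in GiB (chunk text not included)."""
    total_bytes = num_chunks * raw_vector_bytes(dims, bytes_per_value) * overhead_factor
    return total_bytes / 1024**3


# A 768-dim float32 vector is 768 * 4 = 3072 bytes, the ~3 KB quoted above.
print(raw_vector_bytes(768))
# One million such vectors, before any index overhead:
print(round(estimate_vector_memory_gib(1_000_000), 2), "GiB")
# The same corpus with 1-byte quantized values:
print(round(estimate_vector_memory_gib(1_000_000, bytes_per_value=1), 2), "GiB")
```

Running the numbers for your projected corpus size, not today's, shows whether quantization or a move to a larger box needs to be planned before the index outgrows RAM.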

Match the engine to your real corpus size and growth, not to a benchmark headline. The best self-hosted vector databases comparison walks through the tradeoffs; for the conceptual grounding see what is a vector database. And budget storage realistically — re-read the self-hosted cost breakdown before you size hardware.

The production checklist

You do not need all of it on day one. Caching, observability, access control, and eval-in-CI are the four that matter first — they are what turn a working demo into a service people can trust.

FAQ

What is the single most important thing to add when moving RAG to production? Evaluation in CI. A golden set that gates every change is what stops silent regressions — the failure mode that quietly kills RAG systems. Caching and per-request observability are close seconds.

How do I keep RAG fast under load? A semantic answer cache (skip generation on repeated/near-identical queries) is the biggest win. Then stream answers for low perceived latency, cap retrieval k and reranker candidates, right-size the generation model, and put a bounded concurrency queue in front of the box.

How do I enforce per-user permissions in RAG? Tag every chunk with access metadata at ingestion, then apply a metadata filter on every retrieval so the vector search only returns chunks the requesting user may see. Never use prompt instructions for access control — filtered retrieval is enforcement; a prompt is not.

Do I have to re-index everything when documents change? No. Re-chunk and re-embed only the changed documents and upsert by stable chunk ID; content-hash chunks so unchanged ones are skipped. The exception is changing your embedding model, which requires a full re-embed because vectors from different models are not comparable.

When do I need to move off pgvector or Chroma? Around the millions-of-chunks mark, or when memory pressure and latency degrade. Move to Qdrant or Weaviate for a dedicated store (or stay on pgvector if you want vectors beside relational data), and consider Milvus at tens of millions. Let your real corpus size and eval-measured latency drive the call, not a benchmark headline.

Aquila is the independent guide to private, self-hosted AI search, built on the belief that you should own your index, not rent it. Production RAG is mostly unglamorous discipline: cache, observe, gate, and keep the data fresh and access-controlled, all of it on infrastructure you control.