Self-Hosted RAG: The Complete Guide to Private AI Knowledge Bases
Run retrieval-augmented generation on your own infrastructure — your data, your embeddings, your index.
Self-hosted RAG means running the entire retrieval-augmented generation pipeline — ingestion, embeddings, vector storage, retrieval, and generation — on infrastructure you control, instead of sending your documents and queries to a managed API. You get the same “chat with your knowledge base” capability as a SaaS product, but your data never leaves your servers. This guide walks through why developers self-host RAG, the reference stack that works in 2026, and the tradeoffs nobody on a vendor’s pricing page will tell you.
It’s a long read because RAG has a lot of moving parts. Use the headings to jump to what you need.
▶ Run it yourself: aquila-starter is the one-command, fully self-hosted version of this guide (Ollama + Qdrant + FastAPI via docker compose up). Fork it and make it your own.
What is RAG, in one paragraph
Retrieval-Augmented Generation gives a large language model (LLM) access to your specific documents at query time. Instead of relying only on what the model memorized during training, you retrieve the most relevant chunks of your content and paste them into the prompt as context. The LLM then answers grounded in those chunks. RAG is how you get an LLM to answer questions about your internal wiki, your codebase, your support tickets, or last quarter’s contracts — without fine-tuning and without the model hallucinating facts it never saw.
If you want the deeper conceptual primer on the retrieval half, see What Is Semantic Search.
Why self-host RAG instead of using an API
There are three honest reasons. Be clear about which one is driving you, because it changes your stack.
Privacy and data residency
This is the strongest reason. With a managed RAG API, every document you ingest and every question your users ask is transmitted to a third party. For regulated data (health, legal, financial), customer PII, or proprietary source code, that’s often a non-starter — or at least a procurement and compliance headache. Self-hosting keeps embeddings and raw text inside your own VPC or on-prem box. Nothing crosses your network boundary unless you decide it does.
Cost at steady-state volume
Managed embeddings + a hosted vector database are cheap to start and expensive to scale. A fixed-cost VPS running open-source components has a flat monthly bill regardless of query volume, which usually wins once you’re past a trickle of traffic. We break the math down in detail in Self-Hosted RAG vs OpenAI + Pinecone: A Real Cost Breakdown.
Control and no lock-in
You choose the embedding model, the chunking strategy, the vector index, and the LLM. You can swap any layer without a migration project, pin versions so a vendor’s silent model update doesn’t change your retrieval quality overnight, and run fully offline (air-gapped) if you have to.
The honest counterpoint: self-hosting means you are now on call for the database, the GPU drivers, the upgrades, and the 2 a.m. page. We cover when not to self-host at the end.
The reference stack
A self-hosted RAG system is five components glued together. Here’s a stack that’s boring, proven, and entirely open-source:
| Layer | Role | Solid open-source options |
|---|---|---|
| LLM runtime | Generates the answer locally | Ollama, llama.cpp, vLLM |
| Embedding model | Turns text into vectors | nomic-embed-text, mxbai-embed-large (via Ollama), or bge/e5 models |
| Vector store | Stores + searches embeddings | pgvector, Qdrant, Chroma, Weaviate |
| Orchestration | Chunking, retrieval, prompt assembly | LangChain or LlamaIndex |
| API / app layer | Exposes the pipeline to your app | FastAPI (Python) |
You do not need all of these to be heavyweight. A perfectly capable starter stack is Ollama + Chroma + LlamaIndex + FastAPI running on a single VPS, with no GPU, using a small local embedding model and either a local 7-8B chat model or a cloud LLM for the final generation step only.
How the pipeline works, step by step
RAG splits cleanly into an indexing phase (done once, or whenever your documents change) and a query phase (done on every request).
1. Ingest
Pull in your source documents — PDFs, Markdown, HTML, Notion exports, database rows, transcripts. Loaders in LlamaIndex and LangChain handle most formats. The output is plain text plus metadata (source URL, author, date, section) that you’ll want to filter on later.
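As a rough sketch of what a loader hands to the rest of the pipeline, the snippet below reads Markdown and plain-text files from a folder and attaches metadata to each one. The `Document` class and the `load_text_files` helper are hypothetical names used only for illustration; the real loaders in LlamaIndex and LangChain cover far more formats.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class Document:
    """Plain text plus the metadata you will want to filter on later."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_text_files(root: str, suffixes=(".md", ".txt")) -> list[Document]:
    """Hypothetical loader: read every matching file under `root`."""
    docs = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue  # skip directories and unsupported formats
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        docs.append(
            Document(
                text=path.read_text(encoding="utf-8", errors="replace"),
                metadata={
                    "source": str(path.relative_to(root)),
                    "modified": modified.isoformat(),
                },
            )
        )
    return docs
```

Keeping `source` and `modified` on every document is what later makes filters such as "only the last 12 months" possible.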
2. Chunk
LLMs and embedding models have a context limit, and you get better retrieval by embedding small, coherent passages rather than whole documents. Split text into chunks — a common starting point is 500–1,000 tokens with ~10–15% overlap so a sentence that straddles a boundary isn’t lost. Chunking is the single most underrated lever on RAG quality. Respect document structure (split on headings/paragraphs, not mid-sentence) and keep the metadata attached to each chunk.
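To make the overlap idea concrete, here is a minimal sliding-window chunker. It is a sketch under two stated assumptions: whitespace-separated words stand in for tokens, and the hypothetical `chunk_words` helper ignores headings and paragraphs, which a production splitter should respect.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 30) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    Words stand in for tokens here; a real pipeline would count tokens with
    the embedding model's tokenizer and prefer heading/paragraph boundaries.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk repeats the final word of the previous one, so a sentence that crosses a boundary appears in both chunks.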
3. Embed
Run each chunk through an embedding model to get a vector — a list of floating-point numbers that captures the chunk’s meaning. nomic-embed-text produces 768-dimensional vectors and, per Nomic’s benchmarks, matches or beats OpenAI’s older embedding models on many tasks; mxbai-embed-large produces 1024-dimensional vectors and is a strong larger option. Both run locally through Ollama. The rule that matters: you must use the exact same embedding model for indexing and for querying — vectors from different models aren’t comparable.
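One way to enforce the same-model rule is to record which model built the index and check every query against that record before searching. The sketch below assumes you store a small spec next to your vectors; `EmbeddingSpec` and `check_query_embedding` are hypothetical names, not part of Ollama or any vector store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """What was used to build the index (stored next to the vectors)."""
    model: str
    dimensions: int


def check_query_embedding(index_spec: EmbeddingSpec, model: str, vector: list[float]) -> None:
    """Raise before searching if the query vector cannot match the index."""
    if model != index_spec.model:
        raise ValueError(
            f"index was built with {index_spec.model!r}, query used {model!r}"
        )
    if len(vector) != index_spec.dimensions:
        raise ValueError(
            f"expected {index_spec.dimensions} dimensions, got {len(vector)}"
        )
```

The dimension check catches the obvious mismatch (768 versus 1024), while the name check catches the subtler case of two different models that happen to share a dimension.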
4. Store
Write the vectors plus their metadata and source text into a vector store. This builds an index (typically HNSW) that makes nearest-neighbor search fast.
5. Retrieve
At query time, embed the user’s question with the same model, then ask the vector store for the k most similar chunks (k=4 to 8 is a sane default). Add metadata filters here if you need them (“only docs from this project,” “only the last 12 months”). For better precision, many production setups add hybrid search (combine semantic similarity with keyword/BM25 matching) and a reranker that re-scores the top candidates.
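Conceptually, retrieval is "filter by metadata, score by similarity, keep the best k". The brute-force sketch below shows that logic over a hypothetical in-memory list of chunks. A real vector store does the same job with an approximate index such as HNSW rather than scanning every vector.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=5, where=None):
    """Return (score, chunk) pairs for the k chunks most similar to query_vec.

    `chunks` is a list of dicts with "vector", "text" and "metadata" keys
    (a hypothetical in-memory layout). `where` is an optional dict of
    metadata values that a chunk must match exactly.
    """
    where = where or {}
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in where.items())
    ]
    scored = [(cosine(query_vec, c["vector"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return scored[:k]
```

Filtering before scoring means a metadata filter can never push an irrelevant project into the top k; it only narrows the candidate set.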
6. Generate
Stuff the retrieved chunks into a prompt template along with the user’s question and instructions (“answer only from the context; cite sources; say you don’t know if it’s not there”), then send it to the LLM. Return the answer with citations back to the source chunks so users — and you — can verify it.
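A prompt template for this step can be very small. The sketch below numbers each retrieved chunk so the model has something concrete to cite; the `build_prompt` helper and the "text"/"source" keys are assumptions for illustration, and the exact instruction wording is something to tune against your own model.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt with numbered, citable context blocks.

    Each chunk is a dict with "text" and "source" keys (hypothetical layout).
    """
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "Cite the bracketed numbers of the passages you used, e.g. [1].\n"
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because the bracketed numbers map back to the `source` of each chunk, the application can turn a cited "[2]" in the answer into a link to the original document.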
Choosing an embedding model: local vs cloud
This is the first real fork in the road.
Local embeddings (recommended for self-hosting). Models like nomic-embed-text and mxbai-embed-large run on CPU for indexing modest corpora and are free per token. Your text never leaves the box. The download is small (hundreds of MB), and they’re more than good enough for most knowledge-base use cases. Pick local if privacy is your reason for self-hosting at all — it would be odd to keep your vector DB private but ship every chunk to a third party to embed it.
Cloud embeddings (e.g. OpenAI text-embedding-3-small). At $0.02 per million tokens (OpenAI list price as of June 2026; batch pricing is cheaper, and text-embedding-3-large is ~$0.13 per million), embeddings are nearly free to call, and the quality is excellent with zero ops. The catch: every chunk and every query is transmitted to OpenAI, which defeats the privacy rationale, and you’ve reintroduced a vendor dependency.
A pragmatic middle path: local embeddings + a cloud LLM only for generation. Your documents stay private during indexing; only the small set of retrieved chunks for a given query (not your whole corpus) is sent to the LLM at answer time. Decide based on how sensitive the retrieved text is.
Choosing a vector store
There’s no universally correct answer; match it to your situation.
| Store | Best when | Notes |
|---|---|---|
| pgvector | You already run PostgreSQL | A Postgres extension; vector + relational data in one place, one backup story, SQL filters. Lowest new ops burden. |
| Chroma | Prototyping, small/medium corpora | Lightweight, dead-simple Python API; great for getting started fast. |
| Qdrant | Production RAG, larger datasets | Rust, fast, native hybrid (dense + sparse) search, strong filtered-search performance, easy Docker deploy. |
| Weaviate | You want a feature-rich engine | Built-in hybrid search, modules, multi-tenancy, GraphQL; larger surface area. |
If you’re already on Postgres, start with pgvector — it collapses your stack and you back up your vectors the same way you back up everything else. If you’re standing something up fresh and expect to grow, Qdrant is a safe production pick. Use Chroma to prototype, then graduate if you outgrow it. (Khoj, the open-source AI assistant, uses pgvector under the hood — a reasonable validation of the choice for document Q&A.)
Hardware and VPS sizing
You can run a self-hosted RAG system without a GPU. The question is what you run locally.
- Embeddings + vector store + cloud LLM: A 2 vCPU / 4 GB RAM VPS (~$20–30/mo) comfortably handles a knowledge base of tens of thousands of chunks. Indexing is the heaviest step and is one-time/incremental. This is the sweet spot for most teams.
- Fully local, including a 7-8B chat model on CPU: Workable but slow; expect multi-second responses. Bump to 8–16 GB RAM.
- Fully local with a GPU: A consumer GPU (e.g. 12–24 GB VRAM) makes local generation genuinely fast. Now you’re into dedicated hardware or a GPU cloud instance, and the cost math changes — re-read the cost breakdown before committing.
Sizing rule of thumb for the vector store: a 768-dim float32 vector is ~3 KB raw; an HNSW index adds overhead, so budget several KB per chunk plus your raw text. 100k chunks is well under a gigabyte — RAM, not disk, is usually the first constraint for in-memory engines.
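The rule of thumb is easy to check with a few lines of arithmetic. In the sketch below, the raw vector size follows directly from the dimension count, while `index_overhead` and `avg_text_bytes` are illustrative guesses rather than measurements from any particular engine.

```python
def raw_vector_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 = 4 bytes per value)."""
    return num_chunks * dimensions * bytes_per_value


def estimate_bytes(num_chunks, dimensions, index_overhead=1.5, avg_text_bytes=3000):
    """Rough total: vectors scaled by an assumed index overhead, plus raw text.

    index_overhead and avg_text_bytes are illustrative guesses, not measurements.
    """
    vectors = raw_vector_bytes(num_chunks, dimensions)
    return int(vectors * index_overhead) + num_chunks * avg_text_bytes


per_vector = raw_vector_bytes(1, 768)        # 3072 bytes, i.e. 3 KiB
raw_100k = raw_vector_bytes(100_000, 768)    # 307,200,000 bytes (about 293 MiB)
total_100k = estimate_bytes(100_000, 768)    # 760,800,000 bytes with the guesses above
```

Swap in 1024 dimensions for mxbai-embed-large, or your own measured overhead, to see how the budget moves.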
A concrete starter walkthrough
To make the stack feel real, here’s the shape of a minimal pipeline using Ollama, Chroma, and LlamaIndex. This is illustrative pseudocode to show the flow — check each library’s current docs for exact APIs, which move between releases.
First, pull the models locally:
ollama pull nomic-embed-text # embedding model
ollama pull llama3.1:8b # local chat model (optional)
Then the indexing phase — load documents, chunk them, embed, and store:
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# 1. Ingest
docs = SimpleDirectoryReader("./knowledge").load_data()
# 2. Embedding model (local, via Ollama)
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# 3+4. Chunk, embed, and store in Chroma
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("kb")  # safe to re-run
vector_store = ChromaVectorStore(chroma_collection=collection)
# The vector store is attached through a StorageContext; without it the
# index falls back to LlamaIndex's default in-memory store
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    docs, embed_model=embed_model, storage_context=storage_context
)
Then the query phase — retrieve relevant chunks and generate an answer:
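from llama_index.llms.ollama import Ollama

# 5+6. Retrieve top-k chunks and generate with the local model
# (pass an LLM explicitly; otherwise LlamaIndex defaults to OpenAI)
llm = Ollama(model="llama3.1:8b", request_timeout=120.0)
query_engine = index.as_query_engine(llm=llm, similarity_top_k=5)
response = query_engine.query("How do I rotate the API keys?")
print(response)               # the grounded answer
print(response.source_nodes)  # the retrieved chunks used as context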
Wrap that query call in a FastAPI endpoint and you have a private RAG service. The point isn’t the exact lines — it’s that the six conceptual steps map directly onto a handful of library calls. You can swap Chroma for Qdrant or pgvector, or swap the local LLM for a cloud one, without changing the overall shape.
Evaluating retrieval quality
Most teams skip this and pay for it later. Before tuning chunk sizes or swapping models, build a small evaluation set:
- Write 20–50 representative questions a real user would ask.
- For each, note which document(s) should be retrieved.
- Run them through your pipeline and measure how often the right source appears in the top k results (hit rate, often reported as recall@k) and how high it ranks (for example, mean reciprocal rank).
This turns “the demo looked good” into a number you can improve. When you change chunk size, switch embedding models, or add a reranker, re-run the set and confirm the metric actually went up. Retrieval quality — not the LLM — is usually the bottleneck for self-hosted RAG, and it’s the part you have the most control over.
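The measurement itself needs very little code. The sketch below computes two common retrieval metrics over such an evaluation set: hit rate at k (did any expected source appear in the top k) and mean reciprocal rank (how high the first correct source ranked). The `evaluate` function and its input layout are assumptions for illustration; feed it whatever source IDs your pipeline returns.

```python
def evaluate(results: dict[str, list[str]], expected: dict[str, set[str]], k: int = 5):
    """Compute hit rate@k and mean reciprocal rank (MRR) over an eval set.

    results:  question -> ranked list of retrieved source IDs (best first)
    expected: question -> set of source IDs that count as correct
    """
    hits = 0
    reciprocal_ranks = []
    for question, relevant in expected.items():
        ranked = results.get(question, [])[:k]
        # 1-based rank of the first correct source, or None if it is missing
        rank = next((i for i, src in enumerate(ranked, start=1) if src in relevant), None)
        if rank is not None:
            hits += 1
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(expected)
    return {
        "hit_rate": hits / n if n else 0.0,
        "mrr": sum(reciprocal_ranks) / n if n else 0.0,
    }
```

Run it before and after each change to chunking, embedding model, or reranking, and keep the change only if the numbers move in the right direction.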
Common pitfalls
- Bad chunking. Chunks too large dilute relevance; too small lose context. This breaks more RAG systems than model choice does. Iterate on chunk size and overlap with real queries.
- Mismatched embedding models. Indexing with one model and querying with another silently returns garbage. Pin the model and version.
- No evaluation loop. “It looked good in the demo” is not evaluation. Build a small set of question/expected-source pairs and measure retrieval before you tune anything.
- Retrieving too much or too little. Too few chunks and the answer is incomplete; too many and you blow the context window and confuse the model. Tune k.
- No source citations. Always return where each answer came from. It’s the difference between a tool people trust and a hallucination machine.
- Ignoring re-indexing. Documents change. Have a plan to update or delete stale vectors, or your “knowledge base” rots.
- Skipping hybrid search. Pure semantic search misses exact matches (error codes, SKUs, names). Adding keyword search recovers them.
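For the hybrid-search pitfall, one common way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). This guide does not prescribe a fusion method, so treat the sketch below as one option among several; the `reciprocal_rank_fusion` helper is a hypothetical name, and engines such as Qdrant and Weaviate ship their own hybrid search.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant commonly used with this method.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)
```

Because RRF uses only ranks, it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A chunk that appears in both lists rises above one that tops only a single list.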
When NOT to self-host
Self-hosting is a real operational commitment. Use a managed service if:
- You’re at proof-of-concept stage. Ship on a managed API first, validate that RAG even helps your use case, then move in-house once volume and requirements are clear.
- You have no one to own the infrastructure. Someone has to patch the box, watch the disk, and handle upgrades. If that’s nobody, managed is cheaper than an outage.
- Your volume is genuinely tiny. A handful of queries a day on a managed free/starter tier may cost less than a VPS and a sliver of your time.
- You need elastic, spiky scale immediately. Serverless vector DBs and hosted LLM APIs absorb bursts without capacity planning.
Self-hosting wins decisively when privacy is mandatory, when you’re at steady non-trivial volume, or when lock-in and silent model changes are unacceptable.
FAQ
Do I need a GPU to self-host RAG? No. You can run embeddings and the vector store on a cheap CPU VPS and either use a small local LLM (slower) or call a cloud LLM only for the final generation step. A GPU mainly speeds up local answer generation.
Can I do RAG without sending any data to OpenAI? Yes. Use a local embedding model (e.g. nomic-embed-text via Ollama), a self-hosted vector store (pgvector/Qdrant/Chroma), and a local LLM through Ollama. The entire pipeline runs on your hardware with nothing leaving your network.
What’s the simplest stack to start with? Ollama + Chroma + LlamaIndex + FastAPI on a single small VPS. It’s enough to ingest documents, embed them, retrieve, and answer — then swap components as you learn what you need.
How is RAG different from fine-tuning? Fine-tuning bakes knowledge into the model’s weights and is costly to update. RAG keeps knowledge in an external store you can edit instantly, returns source citations, and works with any model. For most “answer from my documents” tasks, RAG is the right tool.
Which vector database should I pick? If you already run PostgreSQL, use pgvector. If you’re building fresh for production and expect growth, use Qdrant. For quick prototypes, Chroma. There’s no single “best” — match it to your existing stack and scale.
Aquila is the independent guide to private, self-hosted AI search — built on the belief that you should own your index, not rent it. If this was useful, explore more guides or subscribe to the newsletter for honest, vendor-neutral writeups on RAG, vector databases, and semantic search. Own your search.