Build a Private RAG System on a VPS: A Step-by-Step Tutorial
Provision a VPS, run Ollama and a vector store, and ship a working private RAG API — every command included.
This is a hands-on tutorial: by the end you’ll have a private RAG system running on a single VPS, answering questions about your own documents, with nothing leaving your server. We provision the box, install Ollama and pull a model, stand up a vector store, build the ingestion → chunk → embed → store pipeline, wire retrieval and generation with LlamaIndex, and expose the whole thing behind a FastAPI endpoint. If you want the conceptual background first — what RAG is and why you’d self-host it — read the Self-Hosted RAG complete guide. This page is the how-to.
Every command below is real and runnable. The only things you’ll change are your domain, your documents, and your model picks.
▶ Run it yourself: aquila-starter is the one-command, fully self-hosted version of this guide (Ollama + Qdrant + FastAPI via docker compose up). Fork it and make it your own.
What you’ll build
A request hits your FastAPI endpoint with a question. The API embeds the question locally, asks the vector store for the most relevant chunks of your documents, stuffs those chunks into a prompt, and sends it to a local LLM for the answer — returned with citations. No managed embedding API, no hosted vector database, no third party in the loop.
[Your docs] -> ingest -> chunk -> embed (Ollama) -> store (Qdrant)
                                                     |
HTTP request -> FastAPI -> embed query -> retrieve --+--> LLM (Ollama) -> answer + sources
The stack: Ollama (embeddings + generation), Qdrant (vector store), LlamaIndex (orchestration), FastAPI (the API). All open-source, all on one box.
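To make the "stuffs those chunks into a prompt" step concrete, here is a minimal sketch of what an orchestration layer does before it calls the LLM. LlamaIndex handles this for you with its own templates; the function name build_prompt, the chunk dictionaries, and the prompt wording below are illustrative assumptions, not LlamaIndex's actual template.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Assemble a grounded prompt from retrieved chunks (illustrative only)."""
    # Number each chunk so the model can cite it as [1], [2], ...
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{i}] ({chunk['source']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks; in the real pipeline these come from Qdrant.
example_chunks = [
    {"source": "handbook.md", "text": "API keys are rotated from the admin console."},
    {"source": "security.pdf", "text": "Rotate keys every 90 days."},
]
prompt = build_prompt("How do I rotate the API keys?", example_chunks)
print(prompt)
```

Numbering the chunks is one simple way to let the answer point back at its sources; the tutorial's own code returns file names from the retrieved nodes instead.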
Step 0 — Size and provision the VPS
You do not need a GPU to run this. What you run locally decides the spec.
| What you run locally | Recommended VPS | Notes |
|---|---|---|
| Embeddings + vector store, cloud LLM for generation only | 2 vCPU / 4 GB RAM (~$20–30/mo) | The sweet spot. Tens of thousands of chunks, snappy retrieval. |
| Everything local incl. a 7–8B chat model on CPU | 4 vCPU / 16 GB RAM | Works, but generation is multi-second. Fine for internal tools. |
| Everything local with a consumer GPU | GPU instance (12–24 GB VRAM) | Fast local generation; different cost tier — see the cost breakdown. |
A 2 vCPU / 4 GB box runs about $24/mo on a DigitalOcean Basic Droplet, or roughly €4–8/mo on Hetzner (CX23/CPX22 class) as of June 2026. For this tutorial we’ll assume 4 vCPU / 16 GB RAM so you can optionally run a local chat model too; drop to 4 GB if you’ll use a cloud LLM only for generation.
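A quick back-of-envelope check helps confirm a spec before you pay for it. The sketch below estimates RAM for the vectors alone, assuming 768-dimensional float32 embeddings (the output size of nomic-embed-text) and a 1.5x overhead factor for index structures, which is a rough rule of thumb rather than a measured figure. The function name is illustrative.

```python
def estimate_vector_ram_mib(num_chunks: int, dim: int = 768, overhead: float = 1.5) -> float:
    """Rough RAM needed for float32 vectors, in MiB.

    dim=768 matches nomic-embed-text; overhead=1.5 is an assumed
    rule-of-thumb multiplier for index structures, not a measurement.
    """
    bytes_per_float32 = 4
    total_bytes = num_chunks * dim * bytes_per_float32 * overhead
    return total_bytes / (1024 * 1024)


for n in (10_000, 50_000, 500_000):
    print(f"{n:>7} chunks -> ~{estimate_vector_ram_mib(n):.0f} MiB")
```

Under these assumptions, tens of thousands of chunks need a few hundred MiB at most, which fits the 4 GB tier. Stored text, the OS, and any local model need headroom on top of that.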
Spin up an Ubuntu 24.04 LTS instance, then SSH in and do the basics:
ssh root@YOUR_SERVER_IP
# Create a non-root user and harden a little
adduser rag && usermod -aG sudo rag
ufw allow OpenSSH && ufw enable
# Base packages
apt update && apt upgrade -y
apt install -y python3-venv python3-pip curl docker.io
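systemctl enable --now docker
# Let the rag user run docker without sudo (applies at next login; docker group access is root-equivalent)
usermod -aG docker rag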
Log back in as rag for the rest.
Step 1 — Install Ollama and pull models
Ollama runs your embedding model and (optionally) your chat model locally. One command installs it:
curl -fsSL https://ollama.com/install.sh | sh
Pull an embedding model and, optionally, a small chat model:
ollama pull nomic-embed-text # embeddings — small, fast, runs on CPU
ollama pull llama3.1:8b # local chat model (optional; skip if using a cloud LLM)
nomic-embed-text is a strong default: a few hundred MB, runs comfortably on CPU, and is more than good enough for most knowledge bases. If you want to weigh alternatives, see the best local embedding models. Confirm Ollama is serving:
curl http://localhost:11434/api/tags # should list the models you pulled
Ollama listens on localhost:11434 by default — keep it that way so it’s never exposed to the internet.
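If you prefer to verify the models from Python (for example in a startup check), the sketch below parses the JSON that /api/tags returns. It assumes the response has a top-level "models" list whose entries carry a "name" field, and that untagged names are listed with a ":latest" suffix. The helper names are illustrative.

```python
import json
import urllib.request
from typing import Iterable, List


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a local Ollama instance and decode the JSON."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(tags: dict, required: Iterable[str]) -> List[str]:
    """Return the required models that do not appear in the /api/tags payload."""
    available = {m["name"] for m in tags.get("models", [])}
    missing = []
    for name in required:
        # An untagged name such as "nomic-embed-text" is listed as "nomic-embed-text:latest".
        candidates = {name} if ":" in name else {name, f"{name}:latest"}
        if not candidates & available:
            missing.append(name)
    return missing


# Hypothetical payload shaped like an /api/tags response, used here so the
# example runs without a live server.
sample = {"models": [{"name": "nomic-embed-text:latest"}]}
print(missing_models(sample, ["nomic-embed-text", "llama3.1:8b"]))
```

On the server, missing_models(fetch_tags(), [...]) runs the same check against the live instance.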
Step 2 — Stand up the vector store
We’ll use Qdrant — a fast, Rust-based vector database with native hybrid search and a clean Docker deploy. It’s a safe production pick. (If you already run PostgreSQL, pgvector is a great alternative that collapses your stack — see pgvector vs Qdrant. For a broader survey, the best self-hosted vector databases.)
Run Qdrant as a container, persisting data to disk:
docker run -d --name qdrant \
  -p 127.0.0.1:6333:6333 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant
Note the 127.0.0.1: bind — Qdrant is reachable only from the box, not the public internet. Check it’s up:
curl http://localhost:6333/healthz # "healthz check passed"
That’s your entire data layer. New to the concept? What is a vector database explains why this matters for retrieval.
Step 3 — Set up the Python project
Create a virtualenv and install the orchestration libraries:
mkdir ~/private-rag && cd ~/private-rag
python3 -m venv .venv && source .venv/bin/activate
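pip install \
  llama-index-core \
  llama-index-readers-file \
  llama-index-embeddings-ollama \
  llama-index-llms-ollama \
  llama-index-vector-stores-qdrant \
  qdrant-client fastapi "uvicorn[standard]"
Drop your source documents (PDFs, Markdown, text, HTML) into a ./knowledge folder. The llama-index-readers-file package installed above supplies the parsers SimpleDirectoryReader uses for PDFs and other non-plain-text formats: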
mkdir knowledge
# scp/rsync your docs in, e.g.:
# rsync -av ~/Documents/handbook/ rag@YOUR_SERVER_IP:~/private-rag/knowledge/
Step 4 — Build the ingestion pipeline (ingest → chunk → embed → store)
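This script loads your documents, splits them into chunks, embeds each chunk locally via Ollama, and writes the vectors into Qdrant. Run it once now. Re-running it as written adds a fresh set of chunks alongside the old ones rather than replacing them, so when your documents change, drop the kb collection first or move to the incremental approach noted under Going to production.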
Create ingest.py:
import qdrant_client
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
COLLECTION = "kb"
# 1. Ingest — load every file under ./knowledge (PDF, md, txt, html, ...)
docs = SimpleDirectoryReader("./knowledge").load_data()
# 2. Chunk — ~800 tokens with overlap so boundaries aren't lost
splitter = SentenceSplitter(chunk_size=800, chunk_overlap=100)
# 3. Embed — local model via Ollama (same model MUST be used at query time)
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# 4. Store — write vectors + text + metadata into Qdrant
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION)
storage = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage,
    embed_model=embed_model,
    transformations=[splitter],
)
print(f"Indexed {len(docs)} documents into Qdrant collection '{COLLECTION}'.")
Run it:
python ingest.py
Indexing is the heaviest step and it’s one-time or incremental — it’s fine for it to take a few minutes on CPU. The chunk size and overlap here are the single most important quality lever in RAG: too large and relevance gets diluted, too small and chunks lose context. Start at 800/100 and tune against real questions.
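To see what chunk size and overlap actually do, here is a simplified sliding-window chunker. It splits on whitespace and counts words, whereas SentenceSplitter counts tokens and respects sentence boundaries, so treat this as an illustration of the windowing idea only. The function name is hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with small numbers so the overlap is visible.
demo = " ".join(f"w{i}" for i in range(10))
for c in chunk_words(demo, chunk_size=4, overlap=1):
    print(c)
```

Each window repeats the tail of the previous one, which is how a sentence that straddles a boundary still appears whole in at least one chunk.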
Step 5 — Wire retrieval + generation
Now the query side: embed the question with the same model, retrieve the top chunks from Qdrant, and generate a grounded answer with a local LLM. Create query.py:
import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore
COLLECTION = "kb"
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION)
embed_model = OllamaEmbedding(model_name="nomic-embed-text")
llm = Ollama(model="llama3.1:8b", request_timeout=120.0) # or a cloud LLM
# Rebuild the index handle from the existing Qdrant collection
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
query_engine = index.as_query_engine(llm=llm, similarity_top_k=5)
def answer(question: str):
    resp = query_engine.query(question)
    sources = [n.metadata.get("file_name", "unknown") for n in resp.source_nodes]
    return str(resp), sources


if __name__ == "__main__":
    text, sources = answer("How do I rotate the API keys?")
    print(text)
    print("Sources:", sources)
similarity_top_k=5 asks for the five most relevant chunks — a sane default. Too few and answers are incomplete; too many and you blow the context window and confuse the model. Test it from the command line:
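Conceptually, similarity_top_k ranks stored vectors by similarity to the query vector and keeps the best k. The sketch below does this by brute force with cosine similarity over toy 3-dimensional vectors. A vector database such as Qdrant uses an index rather than a full scan, and real embeddings have hundreds of dimensions. All names and vectors here are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for chunk embeddings.
toy_store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
```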
python query.py
If you’d rather use a cloud LLM only for the final generation step (faster, while keeping your documents private during indexing), swap the Ollama(...) line for that provider’s LlamaIndex LLM — only the handful of retrieved chunks for a given query travel off-box, never your whole corpus. The semantic-matching half of this is explained in what is semantic search, and the vectors themselves in what are embeddings.
Step 6 — Expose it via FastAPI
Wrap the query function in an HTTP endpoint. Create app.py:
from fastapi import FastAPI
from pydantic import BaseModel
from query import answer
app = FastAPI(title="Private RAG")
class Query(BaseModel):
    question: str


@app.get("/healthz")
def healthz():
    return {"status": "ok"}


@app.post("/ask")
def ask(q: Query):
    text, sources = answer(q.question)
    return {"answer": text, "sources": sources}
Run it:
uvicorn app:app --host 127.0.0.1 --port 8000
Test from another shell:
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{"question": "How do I rotate the API keys?"}'
You now have a private RAG API. The whole pipeline — embeddings, vector search, generation — runs on your VPS. Bind it to 127.0.0.1 and put a reverse proxy in front (next section) rather than exposing uvicorn directly.
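The same call can be made from Python using only the standard library. This is a sketch that assumes the /ask endpoint and JSON shape defined in app.py above; the helper names and the generous timeout (to allow for CPU generation) are illustrative choices.

```python
import json
import urllib.request


def build_ask_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = "http://localhost:8000", timeout: float = 180.0) -> dict:
    """Send the question and return the decoded {"answer": ..., "sources": ...} body."""
    req = build_ask_request(question, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Building the request does not touch the network; ask() does.
req = build_ask_request("How do I rotate the API keys?")
print(req.get_method(), req.full_url)
```

With the server running, ask("How do I rotate the API keys?")["answer"] returns the generated text.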
Going to production
The tutorial above is a working system, not yet a hardened one. Before real users touch it:
- Reverse proxy + TLS. Put Caddy or Nginx in front of FastAPI, terminate HTTPS (Caddy gets you a free Let’s Encrypt cert in two lines), and only expose ports 80/443. Keep Ollama (11434), Qdrant (6333), and uvicorn (8000) bound to 127.0.0.1.
- Run uvicorn under a process manager. Use a systemd unit (or gunicorn with uvicorn workers) so the API restarts on crash and survives reboots. Same for the Qdrant container (--restart unless-stopped).
- Add an auth layer. An API key header or your existing auth in front of /ask. Never ship an open RAG endpoint.
- Incremental re-indexing. Documents change. Run ingest.py on a schedule, or upsert/delete by document ID so stale chunks don’t rot your answers.
- Hybrid search + reranking. Pure semantic search misses exact matches: error codes, SKUs, names. Qdrant supports dense + sparse vectors natively; add a keyword/BM25 component and a reranker to lift precision.
- Backups. Snapshot the Qdrant storage volume (and your raw knowledge folder) on a schedule. Your index is data; treat it like a database.
- An evaluation loop. Write 20–50 representative questions, note which document should answer each, and measure how often the right source lands in the top-k. Re-run it every time you change chunk size, swap the embedding model, or add a reranker; otherwise “the demo looked good” is your only metric.
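For the incremental re-indexing item, one simple approach is to keep a manifest of content hashes and only re-ingest files whose hash changed. The sketch below covers the change-detection half only. The manifest format and function names are assumptions, and wiring the result into Qdrant upserts and deletes is left to your ingest script.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def hash_files(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    hashes = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            hashes[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def diff_manifest(old: Dict[str, str], new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (files to re-embed, files whose chunks should be deleted)."""
    changed = sorted(p for p, h in new.items() if old.get(p) != h)
    removed = sorted(p for p in old if p not in new)
    return changed, removed


# Hypothetical manifests: a.md was edited, b.md is unchanged,
# c.md was deleted, d.md is new.
previous = {"a.md": "hash1", "b.md": "hash2", "c.md": "hash3"}
current = {"a.md": "hash1-edited", "b.md": "hash2", "d.md": "hash4"}
print(diff_manifest(previous, current))
```

In use, you would save hash_files(Path("./knowledge")) to disk after each run and diff it against the next run's result.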
Cost reality
Once it’s running, your bill is the VPS — a flat ~$20–30/mo on a DigitalOcean-class box (cheaper on Hetzner), regardless of query volume. There’s no per-token embedding charge and no per-vector storage charge, because both run on hardware you already pay for. The full comparison against a managed OpenAI + Pinecone setup — including where the managed bill actually goes — is in Self-Hosted RAG vs OpenAI + Pinecone: a real cost breakdown.
FAQ
Do I really need a GPU for this?
No. Embeddings and vector search run fine on CPU. A GPU only speeds up local answer generation. The cheapest viable setup runs embeddings + Qdrant on a 2 vCPU / 4 GB VPS and calls a cloud LLM for the generation step only.
Can I swap Qdrant for Chroma or pgvector?
Yes — LlamaIndex abstracts the store. Use Chroma to prototype, pgvector if you already run PostgreSQL, Qdrant when you want a fast, production-ready engine. The pipeline shape doesn’t change; only the *VectorStore line does. Compare them in the best self-hosted vector databases.
Is anything sent to a third party in this setup?
With a local LLM (llama3.1:8b via Ollama), nothing leaves the box — embeddings, retrieval, and generation are all local. If you opt for a cloud LLM at the generation step, only the small set of retrieved chunks for that one query is sent, never your whole corpus.
LlamaIndex or LangChain?
Either works. LlamaIndex is slightly more retrieval-focused and the snippets here use it; LangChain is fine if your app already depends on it. The same six steps (ingest, chunk, embed, store, retrieve, generate) map onto both.
My answers are wrong or vague. What do I fix first?
Retrieval, not the LLM, is almost always the bottleneck. Build the small evaluation set above, then tune chunk size/overlap and similarity_top_k, and add hybrid search before you reach for a bigger model.
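The evaluation set can be scored with a few lines. This sketch computes hit rate at k: the fraction of questions whose expected source file appears among the top-k retrieved sources. The data structures and file names are assumptions. In practice the retrieved lists would come from the source_nodes returned by the query engine.

```python
from typing import Dict, List


def hit_rate_at_k(expected: Dict[str, str], retrieved: Dict[str, List[str]], k: int = 5) -> float:
    """Fraction of questions whose expected source is in the top-k retrieved sources."""
    if not expected:
        return 0.0
    hits = 0
    for question, source in expected.items():
        if source in retrieved.get(question, [])[:k]:
            hits += 1
    return hits / len(expected)


# Hypothetical evaluation data: question -> file that should answer it,
# and question -> ranked source files the retriever returned.
expected = {
    "How do I rotate the API keys?": "security.md",
    "What is the refund policy?": "billing.md",
}
retrieved = {
    "How do I rotate the API keys?": ["handbook.md", "security.md"],
    "What is the refund policy?": ["handbook.md", "faq.md"],
}
print(hit_rate_at_k(expected, retrieved, k=5))
```

Tracking this one number before and after each change to chunk size, overlap, or similarity_top_k shows whether retrieval actually improved.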
Aquila is the independent guide to private, self-hosted AI search — built on the belief that you should own your index, not rent it. If this got you to a working RAG API, explore more guides or subscribe for honest, vendor-neutral writeups on RAG, vector databases, and embeddings. Own your search.