Chat With Your Documents, Self-Hosted: Build a Private PDF Q&A Assistant

Drop in your PDFs and docs, ask questions, get cited answers — all running on your own box, nothing sent to anyone.

By Aquila Team | Updated June 19, 2026

A self-hosted “chat with your documents” assistant lets you drop in PDFs, Word files, and Markdown, then ask plain-English questions and get answers with citations — entirely on your own infrastructure, with nothing sent to a third party. Under the hood it is a RAG pipeline (ingest → embed → store → retrieve → answer) wired to a local LLM and a simple chat UI. This guide builds one end to end with real commands and a concrete stack, and explains where this document-Q&A layer differs from a general RAG server.

This is the application layer sitting on top of the Self-Hosted RAG complete guide and the build-a-private-RAG-on-a-VPS tutorial. Where those cover the engine, this covers the product: a private assistant your team actually opens and uses.

▶ Run it yourself: aquila-starter is the one-command, fully self-hosted version of this guide: Ollama + Qdrant + FastAPI via docker compose up. Fork it and make it your own.

Why self-host document chat

The “chat with your PDF” category is crowded with SaaS tools, and they all share one problem: you upload your documents to someone else’s server. For a marketing one-pager, fine. For contracts, patient records, board decks, unreleased source, financials, or anything under NDA, uploading it to a third-party “chat with your docs” service is exactly the data leak your security team worries about.

Self-hosting removes that risk by construction. Your documents are embedded locally, stored locally, and answered by a model running locally. The privacy is not a policy promise on a vendor’s page — it is a property of where the bytes physically live. That is the private/self-hosted wedge this whole site is built on: own your index, do not rent it.

The honest tradeoff: a SaaS tool is a five-minute signup, and self-hosting is an afternoon of setup plus ownership of the box. If your documents are not sensitive and you just want answers, use the SaaS tool. If they are sensitive, read on.

How document chat differs from the VPS pillar

The VPS pillar builds a general RAG service — an API you call from your own apps. This guide builds a document-Q&A application on top of that idea, and the difference shows up in three places:

  • Ingestion is the hard part, not an afterthought. A general RAG demo loads clean text. A real document assistant has to extract text from messy PDFs — scanned pages, two-column layouts, tables, headers and footers — which is where most “it doesn’t work on my files” pain comes from. Document parsing carries more weight here than anywhere else in the pipeline.
  • It is conversational, not one-shot. Users ask follow-ups. “What about clause 4?” only makes sense after “Summarize the termination terms.” That means you carry chat history and rewrite follow-up questions into standalone queries before retrieval.
  • It ships with a UI. A general RAG API has no front end. A document assistant is something a non-developer opens in a browser, drags a file into, and chats with. The UI is part of the product, not optional polish.

Keep those three in mind and the build below is straightforward.

The stack

Everything here is open-source and runs on a single machine. A workstation or a VPS with 8–16 GB RAM is enough; a GPU makes local generation snappy but is not required for modest use.

| Layer | Choice | Why |
|---|---|---|
| LLM runtime | Ollama | One command to run local chat and embedding models. |
| Embedding model | nomic-embed-text | Small, fast, runs on CPU; 768-dim. See best local embedding models. |
| Chat model | llama3.1:8b (or larger) | Capable local answerer; swap up if you have VRAM. |
| Document parsing | PyMuPDF / Unstructured / Docling | Robust text extraction from PDFs, DOCX, etc. |
| Vector store | Chroma (persistent) | Zero-config local store; graduate to Qdrant or pgvector later. |
| Orchestration | LlamaIndex | Ties chunking, retrieval, and chat together. |
| UI | Open WebUI, or a small Streamlit/Gradio app | A browser chat front end with file upload. |

If you would rather not assemble this yourself, mature self-hostable apps already package the whole thing: AnythingLLM, Open WebUI with its document feature, and Khoj (an AGPL “AI second brain” that answers from your PDFs, Markdown, and Word files and runs on pgvector, as of June 2026) all give you document chat out of the box. Build from parts when you need control over chunking and retrieval; install a packaged app when you just want it working today. The rest of this guide builds from parts so you understand every layer.

Step 1 — Run the models locally

Install Ollama, then pull an embedding model and a chat model:

# install Ollama (Linux); see ollama.com for macOS/Windows
curl -fsSL https://ollama.com/install.sh | sh

ollama pull nomic-embed-text   # embeddings
ollama pull llama3.1:8b        # the answerer

Confirm it is serving on localhost:11434:

ollama list

Both models now run entirely on your machine. No API keys, no egress.
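If you want a scripted check instead of eyeballing `ollama list`, the sketch below queries Ollama's `/api/tags` endpoint (which lists locally installed models) and reports which of the two models are still missing. The helper names and the five-second timeout are illustrative choices, not part of any official client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address used in this guide


def installed_models(tags_payload: dict) -> set[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return {m["name"] for m in tags_payload.get("models", [])}


def missing_models(tags_payload: dict, required: list[str]) -> list[str]:
    """Return the required models that are not installed yet."""
    have = installed_models(tags_payload)
    missing = []
    for name in required:
        # Ollama reports names with a tag, e.g. "nomic-embed-text:latest"
        wanted = name if ":" in name else f"{name}:latest"
        if wanted not in have:
            missing.append(name)
    return missing


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Ask the local Ollama server which models it has (raises if it is not running)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `missing_models(fetch_tags(), ["nomic-embed-text", "llama3.1:8b"])` should come back empty once both pulls have finished.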

Step 2 — Ingest and parse your documents

Drop your files into a folder and extract clean text. This is the step that makes or breaks a document assistant, so use a real parser rather than naive text extraction. A minimal Python ingestion using LlamaIndex, whose SimpleDirectoryReader picks a reader per file type (its default PDF reader is pypdf-based; PyMuPDF or Unstructured readers can be plugged in through the file_extractor argument):

pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama \
            llama-index-vector-stores-chroma chromadb pymupdf docx2txt

from llama_index.core import SimpleDirectoryReader

# Reads PDFs, DOCX, Markdown, TXT, HTML from the folder
docs = SimpleDirectoryReader("./documents").load_data()
print(f"Loaded {len(docs)} document(s)")

For scanned PDFs (images, not text) you need OCR — point a layout-aware extractor like Unstructured or Docling at them before this step, or no amount of chunking will recover text that was never extracted.
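One way to find out which files need OCR is to extract text page by page and flag pages that come back nearly empty. The sketch below assumes PyMuPDF (installed above as `pymupdf`) for extraction; the function names and the 20-character threshold are illustrative assumptions to tune on your own files.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 0-based indices of pages whose extracted text is nearly empty."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def extract_page_texts(pdf_path: str) -> list[str]:
    """Extract plain text from each page of a PDF with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the heuristic above works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

Running `pages_needing_ocr(extract_page_texts("scan.pdf"))` on a scanned file will typically flag every page, which is the signal to route that file through an OCR-capable extractor first.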

Step 3 — Chunk, embed, and store

Split the documents, embed each chunk with the local model, and persist them to Chroma. How you chunk matters a lot for answer quality — this is the single biggest quality lever, covered in depth in RAG chunking strategies. A sensible default for mixed documents is ~512-token chunks with light overlap.

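from llama_index.core import VectorStoreIndex, StorageContext, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Embed with the local Ollama model; split into ~512-token chunks with 64 tokens of overlap
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# Persistent Chroma collection stored on disk in ./chroma_db
client = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(
    chroma_collection=client.get_or_create_collection("documents")
)

# Pass the vector store through a StorageContext; without it LlamaIndex
# falls back to its default in-memory store and nothing reaches Chroma
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)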

Your documents are now embedded and stored on disk in ./chroma_db. The raw text and the vectors never left the machine.
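To make the "512 with light overlap" setting concrete, here is a simplified sliding-window chunker. It is only an illustration of the arithmetic: it treats whitespace-separated words as a stand-in for tokens and ignores sentence boundaries, which the SentenceSplitter used above does respect.

```python
def chunk_words(words: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, each window advances by 448 words, so a 1,000-word document yields three chunks starting at words 0, 448, and 896, and each chunk repeats the last 64 words of the one before it.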

Step 4 — Retrieve and answer (with chat history)

For a document chat, use a chat engine rather than a one-shot query engine so follow-up questions work. LlamaIndex’s CondenseQuestionChatEngine rewrites a follow-up plus the conversation into a standalone query before retrieving:

from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="llama3.1:8b", request_timeout=120.0)

chat_engine = index.as_chat_engine(
    chat_mode="condense_question",
    similarity_top_k=5,
)

resp = chat_engine.chat("What are the termination terms in this contract?")
print(resp)                 # grounded answer
print(resp.source_nodes)    # the chunks it cited

resp = chat_engine.chat("And the notice period for that?")  # follow-up works
print(resp)

Two things make answers trustworthy: a prompt instruction to answer only from the retrieved context and say “I don’t know” otherwise, and returning citations (source_nodes) so users can verify against the source document. Never ship a document assistant that answers without showing where the answer came from.
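The two safeguards above can be sketched without any framework. In the example below, `Chunk` is a hypothetical stand-in for whatever your retriever returns, and the prompt wording is one reasonable phrasing, not a template taken from LlamaIndex.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical retrieved passage with enough metadata to cite it."""
    text: str
    source: str  # e.g. the file name
    page: int


GROUNDED_TEMPLATE = (
    "Answer the question using only the numbered context below.\n"
    "Cite the passages you used as [1], [2], and so on.\n"
    "If the context does not contain the answer, reply exactly: I don't know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_grounded_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each retrieved chunk and embed it in the grounding prompt."""
    context = "\n".join(
        f"[{i}] ({c.source}, p.{c.page}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    return GROUNDED_TEMPLATE.format(context=context, question=question)


def format_citations(chunks: list[Chunk]) -> list[str]:
    """Human-readable source list to show next to the answer."""
    return [f"[{i}] {c.source}, page {c.page}" for i, c in enumerate(chunks, start=1)]
```

Numbering the context lets the bracketed markers in the answer map directly onto the source list you display, so a reader can open the cited page and check the claim.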

Step 5 — Add a UI

A browser front end turns this from a script into something your team uses. Two quick paths:

Packaged UI (fastest). Run Open WebUI in Docker and point it at your local Ollama; its document feature gives you drag-and-drop upload and chat with no code:

docker run -d --name open-webui -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000, connect it to Ollama, and upload documents.

Custom UI (more control). Wrap your own pipeline in a small Streamlit or Gradio app — a file uploader, a chat box, and a panel that shows the cited source chunks. A few dozen lines, and you control chunking and retrieval exactly. Choose custom when retrieval quality matters more than setup speed.
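As a starting point for the custom route, here is a minimal Gradio sketch (assuming `pip install gradio`). It covers only the chat box and the source list; file upload is left out for brevity. `answer_question` is a hypothetical placeholder where the chat engine from Step 4 would be called.

```python
def answer_question(message: str) -> tuple[str, list[str]]:
    """Hypothetical stub: return (answer, cited sources) from your own pipeline."""
    # In a real app this would call the chat engine built in Step 4.
    return "I don't know.", []


def format_reply(answer: str, sources: list[str]) -> str:
    """Append the cited sources under the answer so users can verify it."""
    if not sources:
        return answer
    return answer + "\n\nSources:\n" + "\n".join(f"- {s}" for s in sources)


def respond(message: str, history) -> str:
    """Chat callback in the (message, history) shape that gr.ChatInterface expects."""
    answer, sources = answer_question(message)
    return format_reply(answer, sources)


def build_app():
    """Create the chat UI; call .launch() on the result to serve it."""
    import gradio as gr  # imported lazily so the helpers above work without Gradio

    return gr.ChatInterface(fn=respond, title="Private document chat")
```

Calling `build_app().launch()` serves the UI locally. Keeping `format_reply` separate from the callback makes the citation layout easy to test without starting a server.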

Hardening it for real use

A working demo is not a deployed tool. Before colleagues rely on it:

  • Re-indexing. Documents change and get added. Have a job that re-embeds new or edited files and deletes stale vectors, or your assistant slowly answers from outdated text.
  • Access control. If more than one person uses it, put it behind authentication (a reverse proxy with auth, or the UI’s built-in users) so the wrong people cannot query sensitive documents.
  • Evaluation. Build a small set of question/expected-source pairs and measure retrieval before and after every change — the full method is in how to evaluate RAG. “It answered my test question” is not the same as “it works.”
  • Hybrid search. Pure semantic search misses exact strings — clause numbers, case IDs, error codes. Adding keyword/BM25 matching recovers them; see what is semantic search for why.
  • Backups. Your vector store and source documents are now a knowledge asset. Back them up like one.
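For the re-indexing item above, one simple approach is to keep a manifest of file hashes and compare it against the folder on each run. This is a sketch under that assumption; the function names and manifest shape are illustrative, and the actual re-embedding and vector deletion are left to your pipeline.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes; changes whenever the content changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def scan_folder(folder: str) -> dict[str, str]:
    """Map each file's relative path to its current content hash."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifest(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify files as added, changed, or removed since the last indexing run."""
    return {
        "added": sorted(k for k in current if k not in previous),
        "changed": sorted(k for k in current if k in previous and previous[k] != current[k]),
        "removed": sorted(k for k in previous if k not in current),
    }
```

Files under "added" and "changed" need embedding; vectors belonging to "changed" and "removed" files need deleting. Save the new manifest (for example as JSON) only after the index update succeeds.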
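The evaluation item above can start as small as a hit-rate check: for each test question, does the expected source appear among the top results? The sketch below assumes your retriever can be wrapped as a function returning a ranked list of source identifiers; that wrapper and the names here are hypothetical.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = sum(
        1 for question, expected in eval_set if expected in retrieve(question)[:k]
    )
    return hits / len(eval_set)
```

Record this number before and after changing chunk size, embedding model, or top-k. A drop tells you the change hurt retrieval even if your favorite test question still answers correctly.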
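For the hybrid search item above, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). This guide does not prescribe RRF; it is shown here as an illustration. The chunk IDs are made up, and the constant 60 is the conventional default for this technique.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each appearance adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores, key=lambda d: (-scores[d], d))


semantic = ["c1", "c2", "c3"]  # hypothetical vector-search ranking
keyword = ["c9", "c1"]         # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([semantic, keyword])
```

Here `c1` ranks first because both retrievers returned it, and `c9`, the exact-string match that semantic search missed, moves ahead of `c2` and `c3`.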

FAQ

Is “chat with your documents” the same as RAG? Yes — it is a RAG pipeline wrapped in a document-upload UI and conversational chat. The retrieval-augmented-generation engine is identical; document chat just adds robust file parsing, chat history, and a front end on top.

Can I chat with my PDFs completely offline? Yes. With Ollama running a local embedding model and a local chat model, plus a local vector store, the entire pipeline runs on your machine with no internet required after the initial model downloads. Nothing about your documents leaves the box.

What about scanned PDFs and images? Scanned PDFs contain images, not text, so you need OCR first. Use a layout-aware extractor like Unstructured or Docling (which can OCR) before chunking. If text was never extracted, retrieval has nothing to find.

Do I need a GPU? No. Embeddings and the vector store run fine on CPU, and an 8B chat model answers on CPU too, just slower: expect anywhere from several seconds to tens of seconds per response, depending on the hardware and the length of the answer. A GPU mainly makes local answer generation fast. For a handful of users, CPU is workable.

Is it faster to just install a packaged app? Often, yes. AnythingLLM, Open WebUI’s document feature, and Khoj give you self-hosted document chat with minimal setup. Build from parts (as above) when you need control over chunking, retrieval, and evaluation; install a packaged app when you want it running today.


Aquila is the independent guide to private, self-hosted AI search — built on the belief that you should own your index, not rent it. A document assistant you host yourself means your contracts, decks, and notes are answerable without ever being uploaded to anyone. Explore more guides or subscribe to the newsletter for honest, vendor-neutral writeups. Own your search.
