The Best Local Embedding Models for RAG (2026)
Open, self-hostable embedding models that keep your text on your own hardware — compared, ranked, and matched to your use case.
The best local embedding model for most self-hosted RAG systems is nomic-embed-text — small, fast on CPU, permissively licensed, and good enough that you’ll rarely need more. Step up to mxbai-embed-large or a bge model when retrieval quality matters more than speed, and reach for bge-m3 when you need long documents or multiple languages. This guide compares the open, self-hostable models worth running, shows how to pick by use case, and explains the quality-vs-cost tradeoff against OpenAI’s hosted embeddings.
An embedding model turns a chunk of text into a vector — a list of numbers that captures meaning — so similar passages land near each other in vector space. That’s the retrieval engine underneath RAG and semantic search. If you want the concept from scratch, read what are embeddings; for how those vectors get searched, what is a vector database. This page is about which model to run when you’re keeping that whole pipeline on your own hardware.
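To make "similar passages land near each other" concrete, here is a minimal sketch of cosine similarity, the measure most vector search uses. The three-number vectors are made up for illustration (real models emit 768 or 1024 numbers), and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; a real model would produce these from text.
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
gpu_drivers = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_policy, money_back))   # related passages: close to 1
print(cosine_similarity(refund_policy, gpu_drivers))  # unrelated passages: close to 0
```

A vector database does the same comparison at scale, usually with an approximate index instead of a full scan.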
Why run embeddings locally at all
If you’re self-hosting RAG for privacy, embedding locally is the part you can’t skip. It would be odd to keep your vector database private but ship every chunk of every document to a third party just to vectorize it — that’s the exact text you were trying to protect. Local embedding models close that gap:
- Your text never leaves the box. Indexing and querying both run on your hardware.
- Zero per-token cost. No metered API; the model runs on CPU (or GPU) you already pay for.
- No vendor lock-in or silent model swaps. You pin a version; retrieval quality doesn’t change under you overnight.
- Offline / air-gapped capable. The model is a file you download once.
The catch is that you pick the model, and the picks differ in size, speed, dimension count, and quality. That’s what the rest of this guide is for.
The models, compared
These are the open, self-hostable models worth your attention for RAG in 2026. Dimensions and licenses are the durable facts; treat “quality” as relative positioning, not a leaderboard score — benchmarks (MTEB) drift and your corpus is what actually matters.
| Model | Dims | Approx size | Max context | License | Best for |
|---|---|---|---|---|---|
| nomic-embed-text | 768 | ~270 MB | long (8K-class) | Apache-2.0 | The default. Fast on CPU, long-context, great all-rounder. |
| mxbai-embed-large | 1024 | ~670 MB | 512 | Apache-2.0 | Higher-quality English retrieval when you can spare the compute. |
| bge-large-en-v1.5 | 1024 | ~1.3 GB (335M params) | 512 | MIT | Strong English quality, well-proven, widely supported. |
| bge-base-en-v1.5 | 768 | ~440 MB (109M params) | 512 | MIT | The quality/speed compromise of the BGE family. |
| bge-m3 | 1024 | ~2.3 GB | 8192 | MIT | Long documents + 100+ languages + dense/sparse/multi-vector. |
| e5-large-v2 | 1024 | ~1.3 GB (335M params) | 512 | MIT | Robust, mature, well-documented English retrieval; for multilingual corpora use the multilingual-e5 variants. |
| gte-large | 1024 | ~670 MB | 512 | Apache-2.0 | Competitive English quality, compact for its dimension count. |
A few notes so the numbers don’t mislead:
- Sizes are approximate and depend on quantization (Ollama ships quantized GGUF builds, so the on-disk footprint is often smaller than the raw Hugging Face checkpoint). Use them for relative comparison, not capacity planning.
- nomic-embed-text is 768-dim here to match the figures used across the Aquila guides; it’s the lightest credible default. Some sources quote a 1024-dim variant; pin whichever build you deploy and stay consistent across indexing and querying.
- More dimensions is not strictly better. Higher-dimension vectors can capture more nuance but cost more storage and RAM and make search slightly slower. 768 dims is plenty for most knowledge bases.
- Context length matters for chunking. A 512-token model means your chunks must be ~512 tokens or smaller; an 8K-class model (nomic, bge-m3) lets you embed larger, more coherent passages.
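The chunking constraint in the last note can be sketched as a word-based splitter. This is a rough approximation, not a real tokenizer: the `tokens_per_word=1.3` ratio is an assumed rule of thumb for English text, and `chunk_words` is a hypothetical helper. For production, count tokens with the tokenizer that ships with your model.

```python
def chunk_words(text, max_tokens=512, tokens_per_word=1.3, overlap_words=0):
    """Split text into word-based chunks that stay under a rough token budget.

    tokens_per_word is an assumed average; real tokenizers vary by model and language.
    """
    words = text.split()
    max_words = max(1, int(max_tokens / tokens_per_word))
    if overlap_words >= max_words:
        raise ValueError("overlap must be smaller than the chunk size")
    step = max_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current chunk already reached the end of the text
    return chunks


document = " ".join(f"word{i}" for i in range(1000))  # stand-in for a real document
print(len(chunk_words(document, max_tokens=512)))   # 512-token model: several chunks
print(len(chunk_words(document, max_tokens=8192)))  # 8K-class model: one chunk
```

Leaving some headroom below the model limit is a sensible design choice, since the word-to-token ratio is only an estimate and instruction prefixes also consume tokens.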
How to pick by use case
There’s no single best model — match it to your situation.
“I just want a solid default” → nomic-embed-text
Small download, runs on a CPU-only VPS, handles long context, Apache-2.0 license, and quality that holds its own against OpenAI’s older embeddings on many tasks. For the large majority of self-hosted RAG systems, start here and only move if your evaluation set tells you to.
“Quality matters more than speed” → mxbai-embed-large or bge-large-en-v1.5
When retrieval precision is the bottleneck and you have CPU headroom (or a GPU), a 1024-dim large model usually lifts recall on harder English corpora. mxbai-embed-large runs cleanly through Ollama; bge-large-en-v1.5 is the long-proven choice with the widest ecosystem support. Always confirm the gain on your questions — the difference is often smaller than the benchmark gap suggests.
“Long documents or many languages” → bge-m3
bge-m3 handles 8192-token inputs and 100+ languages, and it natively produces dense, sparse, and multi-vector representations — handy if you want hybrid search without bolting on a separate keyword index. It’s the heaviest model here; budget the RAM.
“Smallest possible footprint” → bge-base-en-v1.5 or nomic-embed-text
For a tiny VPS or edge box, a 768-dim base model keeps both the model and the vector index small. Embedding storage scales with dimensions: a 768-dim float32 vector is ~3 KB raw, a 1024-dim one ~4 KB, before index overhead — multiply by your chunk count.
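The storage arithmetic is easy to run for your own corpus. A minimal sketch, assuming float32 vectors (4 bytes per value) and a hypothetical corpus of 100,000 chunks; index overhead is not included.

```python
def raw_vector_bytes(num_chunks, dims, bytes_per_value=4):
    """Raw storage for the vectors alone: chunks x dimensions x bytes per value."""
    return num_chunks * dims * bytes_per_value


print(raw_vector_bytes(1, 768))   # 3072 bytes, the ~3 KB figure
print(raw_vector_bytes(1, 1024))  # 4096 bytes, the ~4 KB figure

for dims in (768, 1024):
    total = raw_vector_bytes(100_000, dims)
    print(f"{dims} dims: {total / 1024**2:.0f} MiB raw for 100,000 chunks")
# 768 dims: 293 MiB raw for 100,000 chunks
# 1024 dims: 391 MiB raw for 100,000 chunks
```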
“Multilingual is non-negotiable” → bge-m3 or the multilingual E5 models
bge-m3 and the multilingual E5 checkpoints (the multilingual-e5 models, not the English-only e5-large-v2) were trained with multilingual retrieval in mind. If your documents and queries span languages, start there rather than with an English-first model.
How to run them
Two paths, depending on your stack. Both keep the text on your hardware.
Via Ollama (simplest)
If you’re already running Ollama for generation (see building a private RAG on a VPS), embeddings are one pull away:
ollama pull nomic-embed-text
ollama pull mxbai-embed-large
ollama pull bge-m3
Then call the embeddings endpoint directly:
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Own your search."
}'
Or via LlamaIndex / LangChain, which both have a one-line Ollama embedding wrapper that plugs straight into your indexing pipeline.
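If you would rather skip the frameworks, the same call works from Python's standard library. This is a minimal sketch that mirrors the curl request above; `build_embedding_request` and `embed` are hypothetical helper names, and it assumes the response carries the vector under an `"embedding"` key, as this endpoint returns at the time of writing.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # same endpoint as the curl call


def build_embedding_request(text, model="nomic-embed-text", url=OLLAMA_URL):
    """Build the POST request without sending it (easy to inspect and test)."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text, model="nomic-embed-text", url=OLLAMA_URL, timeout=30):
    """Send the request to a running Ollama server and return the vector."""
    request = build_embedding_request(text, model, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["embedding"]  # assumed response shape: {"embedding": [...]}
```

With Ollama running and the model pulled, `len(embed("Own your search."))` should report the model's dimension count.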
Via sentence-transformers (Python, full control)
For models on Hugging Face (the full BGE / E5 / GTE catalog) and direct control over batching and pooling:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vectors = model.encode(
    ["Own your search.", "Self-hosted RAG keeps your data yours."],
    normalize_embeddings=True,  # cosine similarity then = dot product
)
print(vectors.shape) # (2, 1024)
One rule that overrides all model choice: use the exact same model and version for indexing and for querying. Vectors from different models live in different spaces and aren’t comparable — mixing them silently returns garbage. Some models (E5, BGE) also expect instruction prefixes like "query: " / "passage: "; follow each model card, and apply the prefix consistently on both sides.
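One way to enforce that rule is to store the model name next to the index and refuse queries embedded with anything else. The sketch below is illustrative: `PinnedIndex` and `with_prefix` are hypothetical names, and the prefix table only covers two models as their model cards describe them at the time of writing. Check the card for whichever model you deploy.

```python
from dataclasses import dataclass, field

# Illustrative prefix table; treat the entries as assumptions to verify per model card.
PREFIXES = {
    "intfloat/e5-large-v2": {"query": "query: ", "passage": "passage: "},
    "BAAI/bge-large-en-v1.5": {
        "query": "Represent this sentence for searching relevant passages: ",
    },
}


def with_prefix(model_name, text, role):
    """Prepend the role-specific prefix for a model, if the table defines one."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return PREFIXES.get(model_name, {}).get(role, "") + text


@dataclass
class PinnedIndex:
    """A toy index that remembers which model produced its vectors."""
    model_name: str
    dims: int
    vectors: list = field(default_factory=list)

    def check_query(self, model_name, query_vector):
        """Raise instead of silently comparing vectors from different spaces."""
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, query embedded with {model_name!r}"
            )
        if len(query_vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(query_vector)}")


print(with_prefix("intfloat/e5-large-v2", "refund policy", "query"))
```

Failing loudly here is the point: a mismatched model raises no error in the vector math itself, so the check has to be explicit.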
Quality vs cost vs OpenAI
The honest comparison most pricing pages won’t make.
| | Local (e.g. nomic-embed-text) | OpenAI text-embedding-3-small |
|---|---|---|
| Per-token cost | $0 (runs on your hardware) | ~$0.02 / 1M tokens |
| Privacy | Text never leaves your box | Every chunk + query sent to OpenAI |
| Quality | Good → very good (model-dependent) | Excellent, zero tuning |
| Ops burden | You run the model | None |
| Lock-in | None; pin the version | Vendor dependency + silent updates |
Here’s the part people miss: embeddings are the cheap line item either way. At ~$0.02 per million tokens (text-embedding-3-small; text-embedding-3-large is ~$0.13), calling OpenAI to embed is nearly free in dollars — the generation LLM dominates a managed RAG bill, not the embeddings. So you don’t self-host embeddings to save money on the embedding API. You self-host them for privacy and control: with a local model, your documents are never transmitted to anyone, full stop. If privacy is why you’re self-hosting RAG at all, local embeddings aren’t optional — they’re the whole point. The full dollar comparison lives in Self-Hosted RAG vs OpenAI + Pinecone.
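To see how small the embedding line item is, run the arithmetic on your own corpus. A minimal sketch using the per-million-token prices quoted above; the 50-million-token corpus is a hypothetical figure, and prices may have changed since this was written.

```python
def embedding_cost_usd(total_tokens, usd_per_million_tokens):
    """One-time cost to embed a corpus through a metered API."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


corpus_tokens = 50_000_000  # hypothetical corpus size

small = embedding_cost_usd(corpus_tokens, 0.02)  # text-embedding-3-small
large = embedding_cost_usd(corpus_tokens, 0.13)  # text-embedding-3-large
print(f"small: ${small:.2f}, large: ${large:.2f}")
# small: $1.00, large: $6.50
```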
Quality-wise, the gap has largely closed for everyday knowledge-base retrieval. A good local model like nomic-embed-text or bge-large will not be the reason your RAG system underperforms — chunking and retrieval tuning will be. Spend your effort there.
Don’t pick on benchmarks alone
MTEB scores are a useful starting filter, not a verdict. The model that tops a public leaderboard on academic datasets may not win on your support tickets, contracts, or code. The reliable method:
- Pick 2–3 candidates from the table above that fit your size/license/language constraints.
- Build a small evaluation set — 20–50 real questions, each tagged with the document that should answer it.
- Index your corpus with each candidate and measure how often the right source lands in the top-k results.
- Pick the winner on your data, then move on — the marginal gains from chasing a better model are usually smaller than the gains from better chunking.
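The measurement step in that method can be as small as the sketch below. The questions, document paths, and candidate names are made up, and `hit_rate_at_k` is a hypothetical helper. In practice, the ranked lists would come from querying each candidate's index.

```python
def hit_rate_at_k(results, expected, k=5):
    """Fraction of questions whose expected document appears in the top-k results."""
    if not expected:
        raise ValueError("evaluation set is empty")
    hits = 0
    for question, doc_id in expected.items():
        if doc_id in results.get(question, [])[:k]:
            hits += 1
    return hits / len(expected)


# Hypothetical evaluation set: question -> document that should answer it.
expected = {
    "How do I reset my password?": "docs/account.md",
    "What is the refund window?": "docs/billing.md",
    "Which ports does the server use?": "docs/network.md",
}

# Hypothetical ranked retrieval results per candidate model.
results_by_model = {
    "candidate-a": {
        "How do I reset my password?": ["docs/account.md", "docs/faq.md"],
        "What is the refund window?": ["docs/faq.md", "docs/billing.md"],
        "Which ports does the server use?": ["docs/faq.md", "docs/install.md"],
    },
    "candidate-b": {
        "How do I reset my password?": ["docs/account.md", "docs/faq.md"],
        "What is the refund window?": ["docs/billing.md", "docs/faq.md"],
        "Which ports does the server use?": ["docs/install.md", "docs/network.md"],
    },
}

for name, results in results_by_model.items():
    print(f"{name}: hit rate@2 = {hit_rate_at_k(results, expected, k=2):.2f}")
# candidate-a: hit rate@2 = 0.67
# candidate-b: hit rate@2 = 1.00
```

With only 20 to 50 questions, small differences between candidates are noise; look for a clear gap before switching models.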
FAQ
Which local embedding model should I use for RAG?
Start with nomic-embed-text — it’s small, fast on CPU, long-context, Apache-2.0, and good enough for most knowledge bases. Move to mxbai-embed-large or bge-large-en-v1.5 if your evaluation set shows retrieval quality is the bottleneck, or bge-m3 for long or multilingual documents.
Are local embedding models as good as OpenAI’s?
For typical knowledge-base retrieval, close enough that it rarely matters. OpenAI’s models are excellent with zero tuning, but a good local model won’t be what holds your RAG system back; chunking and retrieval strategy will. And local keeps your text private, which is the real reason to self-host.
Does a higher dimension count mean better retrieval?
Not reliably. More dimensions can capture more nuance but cost more storage, RAM, and search time. 768-dim models like nomic-embed-text are plenty for most use cases; only go to 1024+ if you measure a real gain on your data.
Can I change embedding models later?
Yes, but you must re-embed your entire corpus. Vectors from different models aren’t comparable, so you can’t mix old and new. Treat a model switch as a full re-index, and always use the same model for indexing and querying.
Do I need a GPU to run embeddings?
No. Models like nomic-embed-text and the BGE base models embed comfortably on CPU for modest corpora. A GPU speeds up bulk indexing of very large document sets but isn’t required to get started.
Aquila is the independent guide to private, self-hosted AI search — built on the belief that you should own your index, not rent it. Ready to put one of these models to work? Follow build a private RAG on a VPS, or explore more guides. Own your search.