What Are Embeddings? A Plain-English Guide for Developers

How text and images become vectors a computer can compare by meaning — the foundation of semantic search and RAG.

By Aquila Team. Updated June 19, 2026

An embedding is a list of numbers — a vector — that captures the meaning of a piece of content. An embedding model reads some text (or an image, or audio) and outputs a fixed-length array of numbers, say 768 of them. The crucial property: things that mean similar things get vectors that sit close together, and unrelated things get vectors that sit far apart. That single idea — turning meaning into geometry — is what powers semantic search, recommendations, clustering, and RAG. This guide explains embeddings from the ground up, no math degree required.

This is a conceptual primer. When you’re ready to actually pick a model, see Best Local Embedding Models; for the search technique that uses embeddings, see What Is Semantic Search.

The core idea: meaning as coordinates

Imagine a map where every word, sentence, or document is a point. On this map, “dog” and “puppy” are neighbors. “Dog” and “cat” are a bit further apart but still in the same animal neighborhood. “Dog” and “quarterly tax filing” are on opposite sides of town.

An embedding is just the coordinates of a point on that map — except the map isn’t 2D, it’s hundreds of dimensions. You can’t picture 768 dimensions, but the math works the same way it does in 2D: points close together are similar, points far apart are different. The embedding model’s entire job is to place text on this map so that distance equals meaning.

That’s the whole trick. Once meaning is coordinates, a computer can do something it could never do with raw words: measure how similar two things are by measuring the distance between their points.
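To make the map picture concrete, here is a minimal sketch using made-up 2D coordinates. The numbers are invented purely for illustration; real embeddings come from a model and have hundreds of dimensions, but the distance calculation works the same way.

```python
import math

# Hypothetical 2D "map" coordinates, invented for illustration only.
points = {
    "dog": (1.0, 1.0),
    "puppy": (1.1, 1.2),
    "cat": (2.0, 1.5),
    "quarterly tax filing": (9.0, 8.0),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two named points."""
    return math.dist(points[a], points[b])

# Smaller distance means "closer in meaning" on this toy map.
for other in ("puppy", "cat", "quarterly tax filing"):
    print(f"dog -> {other}: {distance('dog', other):.2f}")
```

With these invented coordinates, "puppy" lands closest to "dog", "cat" a little further out, and "quarterly tax filing" far away.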

How text becomes a vector

You don’t compute embeddings by hand — a trained neural network does it. The model has read enormous amounts of text and learned, from context, which words and phrases tend to mean similar things. (It learns that “the bank of the river” and “I deposited money at the bank” use “bank” differently, and places them accordingly.)

The flow is simple from the outside:

  1. You give the model some text — a word, a sentence, or a whole paragraph (a “chunk”).
  2. The model processes it through its layers.
  3. It outputs one fixed-length vector of numbers, regardless of how long the input was (within the model’s limit).

That output vector is the embedding. The same model always produces the same-length vector — that fixed length is the model’s dimension count.
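The "any length in, fixed length out" shape can be sketched with a toy stand-in. The function below is not a real embedding model and does not capture meaning: it just hashes words into buckets so you can see the output length stay constant. The name `toy_embed` and the dimension of 8 are assumptions made for this example.

```python
import hashlib

DIM = 8  # hypothetical dimension count; real models use sizes like 384 or 768

def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Toy stand-in for an embedding model: counts words per hash bucket.

    It does NOT capture meaning; it only shows the fixed-length output shape.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # first hash byte picks the bucket
    return vec

short = toy_embed("dog")
long = toy_embed("a much longer paragraph about quarterly tax filing deadlines")
print(len(short), len(long))  # both equal DIM, whatever the input length
```

A real model replaces the hashing with a neural network, but the contract you program against is the same: a string goes in, a list of `DIM` numbers comes out.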

Images and other data, too

Embeddings aren’t text-only. Image models (like CLIP-style architectures) embed pictures into the same kind of vector space, so you can search images by meaning — or even search images with text (“find photos that look like ‘a red bicycle in the rain’”) when text and images are embedded into a shared space. Audio, code, and other data types work the same way. The principle never changes: complex content in, a meaning-vector out.

Dimensions: what those hundreds of numbers mean

When someone says an embedding model is “768-dimensional” or “1536-dimensional,” they mean each vector is a list of that many numbers. Common sizes you’ll meet:

Dimensions     Typical of
384            Small, fast models (great on CPU)
768            A very common middle ground
1024           Larger open models
1536 / 3072    Some cloud API models

You might assume each dimension is a human-readable feature — “dimension 12 = how much this is about sports.” It isn’t. The dimensions are learned and mostly not individually interpretable. They’re the model’s own internal coordinate system. What matters is the relationships between vectors, not any single number.

More dimensions aren’t automatically better. Higher-dimensional embeddings can capture more nuance, but they cost more to store and compare, and the gains shrink as the size grows. A well-trained 768-dim model often beats a mediocre 1536-dim one. Dimension count is one input to a model choice, not a quality score; the model-selection tradeoffs live in Best Local Embedding Models.
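The storage side of that tradeoff is simple arithmetic. Here is a rough sketch, assuming each number is stored as a 32-bit float (4 bytes) and ignoring any index overhead a vector store adds on top; the function name is made up for this example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def storage_mb(num_vectors: int, dims: int) -> float:
    """Raw vector storage in megabytes (10**6 bytes), ignoring index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1_000_000

for dims in (384, 768, 1536, 3072):
    print(f"{dims:>4} dims: {storage_mb(1_000_000, dims):,.0f} MB per million vectors")
```

Under these assumptions, a million 768-dim vectors take about 3 GB of raw storage, and doubling the dimension count doubles that figure.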

Similarity: how we compare two vectors

Once you have two embeddings, you need a way to ask “how similar are these?” The standard answer is cosine similarity.

Cosine similarity measures the angle between two vectors, not the distance. Two vectors pointing in nearly the same direction (small angle) are very similar; vectors pointing in opposite directions are opposites. The score runs from 1.0 (identical direction / maximum similarity) through 0 (unrelated) to −1.0 (opposite). In practice with text embeddings you mostly see values between roughly 0 and 1.
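If you are curious what the calculation looks like, here is a minimal pure-Python sketch of cosine similarity: the dot product of the two vectors divided by the product of their lengths. It assumes both vectors are non-zero and the same length. In practice a library or vector database does this for you.

```python
import math

def cosine_similarity(a, b):
    """cos(angle) = (a . b) / (|a| * |b|); assumes non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction: about 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # perpendicular: 0.0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: about -1.0
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` score as maximally similar even though one is twice as long: only the direction counts.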

Why the angle and not the straight-line distance? Because direction captures meaning more robustly than magnitude does: it focuses on what the content is about rather than how long or “loud” the text is. You’ll also see dot product and Euclidean distance used; for embeddings normalized to unit length, all three produce the same ranking, because the dot product then equals the cosine and the Euclidean distance shrinks exactly as the cosine grows. Cosine similarity is the default you’ll meet first.
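The relationship between these metrics is easy to check for unit-length vectors, where the squared Euclidean distance equals 2 minus twice the dot product. The sketch below uses made-up three-number vectors (not real model output) and ranks the same documents by dot product and by Euclidean distance.

```python
import math

def normalize(v):
    """Scale a non-zero vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors for illustration; real ones would come from an embedding model.
query = normalize([0.9, 0.1, 0.3])
docs = {
    "a": normalize([1.0, 0.2, 0.2]),
    "b": normalize([0.1, 1.0, 0.0]),
    "c": normalize([0.5, 0.5, 0.5]),
}

# Highest dot product first, versus smallest Euclidean distance first.
by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
by_euclid = sorted(docs, key=lambda k: math.dist(query, docs[k]))
print(by_dot, by_euclid)
```

Both orderings come out the same here, which is why the choice of metric matters much less once vectors are normalized.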

The takeaway: comparing meaning becomes a tiny arithmetic operation between two arrays of numbers. That’s fast — and it’s what makes searching by meaning at scale possible.

Put the pieces together and you get semantic search — finding results by meaning instead of by matching keywords. The recipe:

  1. Ahead of time: embed all your documents (usually split into chunks) and store the vectors.
  2. At query time: embed the user’s query with the same model.
  3. Compare: find the stored vectors with the highest cosine similarity to the query vector.
  4. Return those nearest neighbors as the results.

Because the comparison is on meaning-vectors, a search for “how do I reset my password” can surface a doc titled “recovering account access” — no shared keywords required. (For the full keyword-vs-semantic-vs-hybrid breakdown, see What Is Semantic Search — we won’t repeat it here.)
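The four-step recipe can be sketched in a few lines. To keep the example self-contained, the vectors below are made up by hand and stand in for what a real embedding model would return; the document titles and the `search` function are likewise illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Step 1 (ahead of time): stored chunks with their vectors.
# These 3-number vectors are invented; a real model would produce hundreds of numbers.
index = [
    ("Recovering account access", [0.9, 0.1, 0.1]),
    ("Quarterly tax filing checklist", [0.1, 0.9, 0.2]),
    ("Office dog policy", [0.1, 0.2, 0.9]),
]

# Step 2 (query time): pretend this came from the same model
# for the query "how do I reset my password".
query_vector = [0.8, 0.2, 0.1]

def search(query_vec, index, top_k=2):
    """Steps 3 and 4: score every stored vector, return the top_k nearest."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in index]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]

results = search(query_vector, index)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

With these invented vectors, "Recovering account access" ranks first even though it shares no keywords with the query, because its vector points in nearly the same direction.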

One rule is non-negotiable: embed your documents and your queries with the same model. Vectors from two different models live in different, incompatible coordinate systems. Mixing them produces nonsense similarity scores.
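One way to guard against accidental mixing is to store the model name next to every vector and refuse to compare across models. This is only a sketch of that idea; the `Embedding` class, the `check_comparable` helper, and the model names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Embedding:
    model: str     # name of the model that produced the vector (hypothetical names below)
    values: tuple  # the vector itself

def check_comparable(a: Embedding, b: Embedding) -> None:
    """Raise ValueError if two embeddings should not be compared."""
    if a.model != b.model:
        raise ValueError(f"model mismatch: {a.model!r} vs {b.model!r}")
    if len(a.values) != len(b.values):
        raise ValueError(f"dimension mismatch: {len(a.values)} vs {len(b.values)}")

doc = Embedding("model-a", (0.1, 0.7, 0.2))
query_same = Embedding("model-a", (0.2, 0.6, 0.1))
query_other = Embedding("model-b", (0.2, 0.6, 0.1))

check_comparable(doc, query_same)  # same model: passes silently
try:
    check_comparable(doc, query_other)
except ValueError as err:
    print("refused:", err)
```

Matching dimensions alone are not a sufficient check: two different models can both output 768 numbers and still use incompatible coordinate systems, which is why the model name is compared first.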

How embeddings power RAG

Retrieval-Augmented Generation (RAG) is the technique behind “chat with your documents.” Embeddings are its retrieval engine:

  • You embed and store your knowledge base (chunks of your wiki, PDFs, support tickets — whatever).
  • When a user asks a question, you embed the question and pull the most similar chunks.
  • You paste those chunks into the prompt as context, and the LLM writes an answer grounded in your content.
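The last step, pasting retrieved chunks into the prompt, is plain string assembly. Here is a minimal sketch; the prompt wording, the `build_prompt` name, and the sample chunks are assumptions for illustration, and the retrieval and the LLM call itself are left out.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that asks the LLM to answer from the retrieved chunks."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Pretend these chunks came back from the similarity search (sample text, made up).
retrieved = [
    "To recover account access, open Settings and choose 'Reset password'.",
    "Password reset links expire after 30 minutes.",
]

prompt = build_prompt("How do I reset my password?", retrieved)
print(prompt)
```

Numbering the chunks is a design choice, not a requirement; it makes it easier to ask the model to cite which chunk an answer came from.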

Without embeddings, an LLM can only answer from what it memorized in training. Embeddings are what let it find and use the right slice of your private knowledge at the moment it’s needed.

Where embeddings live: the vector database

Comparing one query vector against a handful of documents is trivial. Comparing it against millions, in milliseconds, is not — checking every vector one by one gets slow fast. That’s the job of a vector database: it stores embeddings and uses clever index structures (like HNSW) to find the nearest neighbors approximately but very quickly. For a small project you might keep vectors in a file or a simple library; past a certain scale you want a real vector store. The embedding is the data; the vector database is where it lives and gets searched.
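For the small-project end of that spectrum, "vectors in a file" can be as simple as a JSON document. The sketch below is one possible layout, not a standard format; the file name, field names, and model name are made up. It stores the model name alongside the vectors so a later reader knows which model produced them.

```python
import json
import os
import tempfile

def save_vectors(path, model_name, records):
    """Write the model name and all text/vector records to one JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"model": model_name, "records": records}, f)

def load_vectors(path):
    """Read the file back into a dict with 'model' and 'records' keys."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Made-up records; real vectors would come from an embedding model.
records = [
    {"text": "Recovering account access", "vector": [0.9, 0.1, 0.1]},
    {"text": "Quarterly tax filing checklist", "vector": [0.1, 0.9, 0.2]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.json")
    save_vectors(path, "example-model", records)
    loaded = load_vectors(path)

print(loaded["model"], len(loaded["records"]))
```

Searching a file like this means loading everything and scoring every vector in turn, which is exactly the linear scan that stops scaling and that an index such as HNSW is built to avoid.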

Local vs API embedding models

You can generate embeddings two ways, and the choice is mostly about privacy and cost:

Local / self-hosted models run on your own hardware (via Ollama, sentence-transformers, and similar). The text never leaves your machine, there’s no per-token charge, and you can run fully offline. The tradeoff is you provision the compute — though many strong embedding models run fine on a CPU. This is the on-brand path for anyone who wants to own their search stack.

API / cloud models (such as OpenAI’s text-embedding-3-small) are a quick HTTP call: no infrastructure, consistently good quality, very cheap per token. The catch is that every chunk you embed is sent to a third party — a real consideration for private or regulated data.

A common, pragmatic split: keep embeddings and document ingestion local (where your sensitive content is), and reserve any cloud calls for the final answer-generation step if you want a frontier model there. Which specific local model to use — and the size, speed, and quality tradeoffs — is exactly what Best Local Embedding Models is for.

FAQ

Are embeddings the same as vectors? Effectively, in this context. An embedding is a vector — a list of numbers. “Embedding” emphasizes that the vector was produced by a model to represent meaning. “Vector” is the generic term for the numeric array.

Do I need to understand the math to use embeddings? No. The mental model — “meaning becomes coordinates, and close coordinates mean similar things” — is enough to build semantic search and RAG. Libraries and vector databases handle the actual cosine-similarity math for you.

How many dimensions should my embeddings have? It’s determined by the model you choose, not something you set freely. Common values are 384, 768, and 1024 for open models, up to 1536 or more for some cloud models. More dimensions aren’t automatically better; model quality matters more than raw size.

Can I compare embeddings from two different models? No. Each model has its own coordinate system, so vectors from different models aren’t comparable. Always embed your documents and queries with the same model, and re-embed everything if you switch models.

Do embeddings need a GPU? For generating embeddings on a modest corpus, a CPU is fine — especially with smaller models. A GPU mainly speeds up embedding very large datasets or running large models.


Embeddings are the quiet foundation of the entire AI-search stack. Next, see where they live in a vector database, how they drive semantic search, or how to pick a model in Best Local Embedding Models. Aquila is the independent home for AI search you own. Own your search.
