Back to AI LabRAG Explorer

Learn Retrieval-Augmented Generation (Vector RAG), Step by Step

Paste your own content, ask a question, and watch a complete RAG pipeline run end to end: chunking, embeddings, vector search, retrieval, prompt construction, and a grounded answer. Each step explains what it does and why it exists.

Input content

Paste documentation, a blog article, an architecture note, or any technical text. A sample is loaded for you.

1473 words9003 characters

Your question

Ask something answerable from the content above.

Try examples
800 charsDefault

Smaller chunks make retrieval more precise; larger chunks keep more context per chunk. Try different sizes and re-run to compare.

Embeddings run locally with an open-source model. The answer is generated with Groq.

Pipeline overview
  1. Content
  2. Chunking
  3. Embeddings
  4. Vector Search
  5. Retrieved Chunks
  6. Prompt
  7. Answer

Ask a question to run the full pipeline.

You'll see chunking, embeddings, vector search, retrieval, prompt construction, and the grounded answer — each explained.

Learn the concepts

Retrieval-Augmented Generation, explained

A plain-English guide to the ideas behind the tool above — RAG, embeddings, vector search, and how modern AI assistants stay accurate.

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is an AI architecture that combines a search system with a large language model. Instead of answering only from what a model memorized during training, a RAG system first retrieves relevant passages from an external knowledge source and supplies them to the model as context. The model then generates an answer grounded in that evidence.

RAG exists to solve two problems that language models have on their own. First, a model's knowledge is frozen at training time, so it cannot know about recent events or your private documents. Second, models hallucinate — they produce confident but incorrect answers when they don't actually know. By retrieving trustworthy text and asking the model to answer from it, RAG keeps responses current, accurate, and traceable to a source.

This is why RAG has become the default pattern for documentation assistants, customer support bots, internal knowledge tools, and AI search. It turns a general-purpose model into a domain expert on your data — without expensive retraining.

How Embeddings Work

An embedding converts a piece of text into a list of numbers — a vector — that represents its meaning. Embedding models are trained so that texts with similar meanings get vectors pointing in similar directions, while unrelated texts point in different directions. The words "car" and "automobile" share no letters, but a good model places their vectors very close together.

Embeddings exist because keyword search is brittle. Searching for the exact word "refund" misses a document that says "money back guarantee," even though they mean the same thing. Because embeddings capture meaning rather than spelling, semantic search finds the right passage even when the wording is completely different.

The tool above uses the open-source model BAAI/bge-large-en-v1.5, which produces 1024-dimensional vectors. Every chunk of your document and your question become one of these vectors, so they can be compared with simple math.

What Is Vector Search?

Vector search, also called semantic search, finds the stored vectors that are closest to a query vector. "Closest" is usually measured with cosine similarity, which compares the angle between two vectors: a score near 1.0 means a very strong match, while a score near 0 means they are unrelated.

When you ask a question, the system embeds it with the same model used for the document chunks. Now the question and every chunk live in the same vector space, so a vector database can rank all chunks by similarity and return the top matches — typically the top three to six. Many systems also apply a minimum similarity threshold to discard weak, off-topic matches.

Vector databases such as pgvector, Pinecone, Weaviate, and Qdrant use approximate nearest neighbor algorithms (like HNSW) to do this search in milliseconds, even across millions of vectors. That speed is what makes real-time RAG possible.

Why Chunking and Overlap Matter

Documents are often too long to embed or feed to a model in one piece, so RAG splits them into smaller passages called chunks. Good chunking keeps a single idea or section together. If chunks are too large, retrieval becomes vague; if they are too small, they lose the context that makes them meaningful.

Chunk overlap solves a boundary problem: if a key sentence is split between two chunks, neither holds the full thought. By repeating a little text from the previous chunk, overlap ensures ideas that span a boundary still appear, in full, in at least one chunk — which improves retrieval recall.

How Modern AI Assistants Use RAG

A production AI assistant runs the pipeline you see in the tool above every time you ask a question. Ahead of time, it ingests documents: chunking them, embedding each chunk, and storing the vectors in a vector database. At query time, it embeds your question, retrieves the most similar chunks, and constructs a grounded prompt.

That prompt combines a system instruction ("answer only from the context below"), the retrieved passages, and your question. The language model — here, Groq running llama-3.3-70b-versatile — then writes an answer using only that evidence, often citing which chunk it used.

The result is an assistant that is accurate, up to date, and verifiable. Understanding each step — chunking, embeddings, vector search, retrieval, grounding, and generation — is the foundation for building reliable AI systems, and the RAG Explorer above lets you watch all of it happen on your own text.

Frequently asked questions

What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is an AI architecture that combines a search system with a large language model. Instead of answering only from what it memorized during training, a RAG system first retrieves relevant passages from an external knowledge source and supplies them to the model as context. The model then generates an answer grounded in that evidence, which keeps responses current, accurate, and traceable to a source.
How do embeddings work?
An embedding converts a piece of text into a vector — a list of numbers — that represents its meaning. Embedding models are trained so that texts with similar meanings get vectors pointing in similar directions. This lets semantic search find the right passage even when the wording is completely different, because embeddings capture meaning rather than exact spelling.
What is vector search?
Vector search, also called semantic search, finds the stored vectors closest to a query vector. Closeness is usually measured with cosine similarity, where a score near 1.0 means a strong match and a score near 0 means unrelated. The question and document chunks are embedded with the same model, so a vector database can rank all chunks by similarity and return the top matches in milliseconds.
Why do chunking and overlap matter in RAG?
Documents are split into smaller passages called chunks because they are often too long to embed or feed to a model at once. Good chunking keeps a single idea together: chunks that are too large make retrieval vague, while chunks that are too small lose context. Overlap repeats a little text between adjacent chunks so an idea that spans a boundary still appears in full in at least one chunk, improving retrieval recall.
Does RAG reduce hallucinations?
Yes. Language models hallucinate when they answer from memory they don't actually have. By retrieving trustworthy text and instructing the model to answer only from that context, RAG grounds responses in real evidence. This reduces hallucination and lets the model cite which retrieved passage each part of its answer came from.