AEO Primer · 4 min read · May 2026
What is RAG? Retrieval-Augmented Generation, Defined
By Thinklytics Partners, Practitioner Notes
RAG (Retrieval-Augmented Generation) is a pattern where an LLM retrieves relevant context from an external knowledge store at inference time and grounds its response on that context, reducing hallucination and giving the model access to knowledge newer than its training data.
Topics covered
- RAG
- retrieval-augmented generation
- vector search
- LLM grounding
- embeddings
- vector database
- semantic search
Frequently asked questions
What is RAG in one sentence?
RAG (Retrieval-Augmented Generation) is a pattern where an LLM retrieves relevant context from an external knowledge store (typically a vector database) at inference time and grounds its response on that retrieved context, reducing hallucination and enabling the model to use knowledge that was not in its training data.
How does RAG work?
Three steps. (1) At ingestion time: split documents into chunks, generate an embedding for each chunk (using a model like OpenAI Ada or Cohere Embed), and store the chunks and their embeddings in a vector database. (2) At query time: embed the user query with the same embedding model and find the top-k most similar chunks, typically by cosine similarity. (3) At generation time: inject the retrieved chunks into the LLM prompt as context and generate the response.
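The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the names `chunk`, `ingest`, `retrieve`, and `build_prompt` are hypothetical helpers introduced only for this example. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a document into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """ASSUMPTION: toy stand-in for an embedding model (sparse word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(documents: list[str]) -> list[tuple[str, Counter]]:
    """Step 1: chunk each document and store (chunk, embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Step 2: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Step 3: inject the retrieved chunks into the prompt as numbered sources."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    docs = [
        "The refund window is 30 days from delivery.",
        "Support hours are 9am to 5pm on weekdays.",
    ]
    index = ingest(docs)
    question = "How long is the refund window?"
    print(build_prompt(question, retrieve(index, question, k=1)))
```

Numbering the sources in the prompt is one simple way to let the model cite the chunk it used; swapping in a real embedding model only changes `embed`, as long as the same model is used for chunks and queries.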
What problem does RAG solve?
Three problems: hallucination on facts the model does not know reliably, knowledge staleness (the model's training data has a cutoff), and the inability to cite specific source documents in a response. RAG addresses all three by grounding the response on retrieved source content.
Is RAG the same as fine-tuning?
No. Fine-tuning adjusts the model's weights to internalize new knowledge or behavior. RAG keeps the model frozen and injects knowledge at inference time. Fine-tuning is better for stylistic or behavioral changes; RAG is better for factual knowledge that changes frequently. Many production systems use both.
What is a vector database?
A database optimized for k-nearest-neighbor search, usually approximate, over high-dimensional vectors (embeddings). Examples include Pinecone, Weaviate, Qdrant, Chroma, pgvector (Postgres extension), and the vector features inside Snowflake Cortex Search, Databricks Vector Search, OpenSearch, and Elastic.
What is the difference between RAG and semantic search?
Semantic search is the retrieval mechanism most RAG systems use (embedding-based similarity). RAG is the broader pattern that combines retrieval with LLM generation. You can have semantic search without RAG (just retrieval), but RAG without semantic search or some equivalent retrieval step, such as keyword search, is just an LLM.
What are RAG's limitations?
Retrieval quality is the gating constraint: if the relevant chunk is not retrieved, the LLM cannot use it. Chunk boundaries can split a passage away from the context that makes it meaningful. Long documents may not fit in the model's context window even after retrieval. Multi-hop reasoning (combining facts from multiple chunks) is harder than answering from a single chunk. Production systems address these limits with hybrid search, reranking, query rewriting, and structured retrieval.
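As one illustration of hybrid search, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF), a widely used fusion method. Each chunk scores the sum of 1 / (k + rank) across the lists it appears in. The chunk ids and the two input rankings are made up for the example, and k = 60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list, best first.

    A chunk that ranks well in more than one list accumulates a higher score
    than a chunk that ranks well in only one.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


if __name__ == "__main__":
    keyword_hits = ["c1", "c2", "c3"]  # hypothetical keyword (e.g. BM25) ranking
    vector_hits = ["c3", "c1", "c4"]   # hypothetical embedding-similarity ranking
    print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Because RRF uses only ranks, it avoids having to put keyword scores and similarity scores on a common scale. A reranking model can then reorder the fused shortlist before it goes into the prompt.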
How does Thinklytics work on RAG?
We scope RAG engagements as data-architecture-first, not vector-database-first. Retrieval quality follows from data hygiene, chunking strategy, and metadata design. See [LLM grounding data architecture](/insights/llm-grounding-data-architecture).