Retrieval Augmented Generation
Definition
Retrieval-Augmented Generation (RAG) is an architecture pattern for generative AI in which the system retrieves relevant source documents before the model generates a response, and then conditions the generation on the retrieved content. The retrieval step uses semantic search over a vector index of source content. The generation step is the language model writing the answer. The pattern was introduced in a 2020 Facebook AI paper and has become the default approach for production GenAI applications that need answers grounded in authoritative data rather than training memory.
In Salesforce, RAG is the underlying architecture behind dynamic grounding, knowledge-based answers in Service Cloud Einstein, and file-grounded responses in Agentforce. The Salesforce-flavored name is grounding, but the mechanics are identical: chunk the source content, embed each chunk into a vector, store the vectors in an index, retrieve the top-k matches at runtime, inject the matched chunks into the prompt, and let the model generate. The advantage over fine-tuning is that source data can change without retraining the model. The disadvantage is that retrieval quality directly bounds answer quality.
How RAG works and where it sits in the Salesforce GenAI stack
The two-step architecture
RAG splits a single user request into two steps, each handled by a different model. First, the retrieval step takes the user query, embeds it into a vector with the embedding model, and finds the most similar chunks in the source index. Second, the generation step passes the original query plus the retrieved chunks to the language model as context, and the model writes the response. The model never sees the entire corpus, only the slices selected as relevant. This is what makes RAG scale: a million-article knowledge base contributes only the top few chunks to any single prompt, so the model stays inside its context window while still drawing on a huge underlying corpus.
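The two steps can be sketched in a few lines of Python. This is a minimal illustration, not how Salesforce implements grounding: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a hypothetical placeholder that only assembles the prompt a real model would receive, and the sample chunks are invented.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 1: rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    """Step 2 (hypothetical): build the prompt a language model would receive."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query

# Invented sample corpus.
CHUNKS = [
    "Refunds are issued within 14 days of a return request.",
    "Password resets require access to the registered email.",
    "Shipping to Canada takes 5 to 7 business days.",
]

top = retrieve("How long do refunds take?", CHUNKS, k=1)
print(generate("How long do refunds take?", top))
```

Only the selected chunk reaches the prompt; the other two never enter the model's context, which is the scaling property described above.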
Embedding and the vector index
An embedding is a numerical representation of text in a high-dimensional vector space, where semantically similar text sits close together. The embedding model (a separate model from the generation model) reads a chunk of text and outputs a fixed-size vector, typically 768 or 1536 dimensions. The vector index stores all chunk vectors and supports fast nearest-neighbor search. In Salesforce, the Search Index in Data Cloud is the native store for embedded content. External vector databases like Pinecone or Weaviate are also viable when teams bring their own retrieval pipeline.
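To make the idea of a vector index concrete, here is a small brute-force sketch. It assumes the vectors already exist (the three-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions), and the `VectorIndex` class is hypothetical, not the Data Cloud Search Index API. Production indexes use approximate nearest-neighbor structures rather than a full scan.

```python
import math

class VectorIndex:
    """Brute-force nearest-neighbor index over unit-normalized vectors."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    @staticmethod
    def _normalize(v: list[float]) -> list[float]:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        return [x / norm for x in v]

    def add(self, chunk_id: str, vector: list[float]) -> None:
        """Store a chunk vector, normalized so dot product equals cosine."""
        self._items.append((chunk_id, self._normalize(vector)))

    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return the k chunk IDs with the highest cosine similarity."""
        q = self._normalize(query)
        scored = [(cid, sum(a * b for a, b in zip(q, v))) for cid, v in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

index = VectorIndex()
index.add("refund-policy", [0.9, 0.1, 0.0])
index.add("shipping-times", [0.1, 0.9, 0.1])
index.add("password-reset", [0.0, 0.1, 0.9])

results = index.search([0.8, 0.2, 0.0], k=2)
print(results)
```

Normalizing at insert time means each search only needs dot products, which is the usual reason cosine-based indexes store unit vectors.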
Retrieval: top-k, ranking, hybrid search
The retrieval step returns the top-k matching chunks ranked by similarity score. A typical k is 3 to 8. Higher k brings more context but dilutes attention and inflates token cost. Pure vector search retrieves on semantic similarity, which catches paraphrases but can miss exact keyword matches. Hybrid search combines vector similarity with BM25 keyword scoring to handle queries that need both. Production RAG systems often add a re-ranker, a second smaller model that re-scores the top 20 candidates and returns the best 5, trading a small latency hit for better relevance.
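A rough sketch of hybrid search follows. The text above does not say how the two signals are merged, so this example assumes reciprocal rank fusion, one common choice. The BM25 function is a compact Okapi BM25 implementation; the vector ranking is supplied as a given list rather than computed, and the article IDs and texts are invented.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, docs: dict[str, str],
                k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Okapi BM25 score of each document for the query."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores[doc_id] = score
    return scores

def rrf(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Reciprocal rank fusion: sum 1 / (c + rank) over every ranking."""
    fused: Counter = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (c + rank)
    return sorted(fused, key=fused.get, reverse=True)

DOCS = {
    "kb-1": "Error code E4021 means the payment gateway timed out.",
    "kb-2": "Troubleshooting failed payments and declined cards.",
    "kb-3": "How to update billing contact details.",
}
query = "payment failed with E4021"

# Assumed output of a vector search that missed the exact error code.
vector_ranking = ["kb-2", "kb-3", "kb-1"]

scores = bm25_scores(query, DOCS)
keyword_ranking = [d for d in sorted(scores, key=scores.get, reverse=True) if scores[d] > 0]

print(rrf([vector_ranking, keyword_ranking]))
```

In this toy case the keyword pass ranks the article containing the literal code `E4021` first, and fusion lifts it from last place in the vector ranking to second place overall.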
The injection step
Once the chunks are retrieved, they are formatted into the prompt. A common template includes a system instruction (answer only from the provided context), the retrieved chunks tagged with source IDs, the user question, and a structured output schema. The order of chunks matters. Models pay more attention to content at the start and end of the context than the middle. Putting the most relevant chunk first often improves the response. Including the source ID alongside each chunk lets the template require citations in the output.
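The template described above might be assembled like this. It is a sketch under stated assumptions: the `Chunk` type, the wording of the system instruction, the JSON output shape, and the sample article IDs are all illustrative, and none of this is Prompt Builder merge syntax.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source_id: str
    text: str
    score: float

# Illustrative system instruction; the exact wording is an assumption.
SYSTEM = (
    "Answer only from the provided context. "
    "Cite the source ID in square brackets after each claim. "
    'If the context does not contain the answer, reply "I could not find anything relevant."'
)

def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Format retrieved chunks into a prompt, most relevant chunk first."""
    ordered = sorted(chunks, key=lambda c: c.score, reverse=True)
    context = "\n".join(f"[{c.source_id}] {c.text}" for c in ordered)
    return (
        f"{SYSTEM}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        'Respond as JSON: {"answer": string, "citations": [source IDs]}'
    )

prompt = build_prompt(
    "How long do refunds take?",
    [
        Chunk("KA-007", "Store credit is issued immediately.", 0.61),
        Chunk("KA-102", "Refunds are issued within 14 days.", 0.88),
    ],
)
print(prompt)
```

Tagging each chunk with its source ID is what allows the output schema to demand citations that can be checked against the retrieved set.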
RAG versus fine-tuning
Fine-tuning bakes knowledge into the model weights by training on examples. RAG keeps the knowledge external and feeds it in per request. Fine-tuning wins for style, format, and behavior. RAG wins for facts, policies, and any content that changes. The two are not mutually exclusive. Production systems often fine-tune for tone and use RAG for content. For Salesforce, fine-tuning is rare because the data changes daily. RAG via dynamic grounding is the default path.
Native Salesforce RAG versus Bring Your Own Retriever
Native Salesforce RAG covers knowledge articles, Data Cloud entities, files, and record fields. The platform manages chunking, embedding, indexing, and retrieval. For most teams this is enough. Bring Your Own Retriever (via Data Cloud and Apex callables) lets teams plug in custom retrieval logic when the corpus lives outside Salesforce, uses a non-standard chunking scheme, or needs a domain-specific embedding model. The trade-off is full control in exchange for full ownership of the pipeline, including its failures.
Failure modes unique to RAG
RAG has its own failure modes beyond hallucination. Retrieval can miss the right chunk because the embedding model never saw the right vocabulary, returning irrelevant context that the generation model then confidently uses. Retrieval can return a stale version of the right chunk because the index was not refreshed after the source changed. The generation model can ignore the retrieved context and answer from training memory anyway, especially if the question seems general. Each of these requires its own monitoring. RAG is not set-and-forget.
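As one example of such monitoring, the stale-index failure can be detected by comparing the live source against fingerprints recorded when each document was indexed. This is a sketch of the idea only: the `find_stale` helper and the idea of storing a SHA-256 hash per document are assumptions, not a Salesforce feature.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Content hash recorded at index time and recomputed at check time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare live source text against fingerprints stored with the index."""
    changed = [doc_id for doc_id, text in source.items()
               if doc_id in indexed and indexed[doc_id] != fingerprint(text)]
    missing = [doc_id for doc_id in source if doc_id not in indexed]
    orphaned = [doc_id for doc_id in indexed if doc_id not in source]
    return {"changed": changed, "missing": missing, "orphaned": orphaned}

# Invented example: one article edited, one added, one deleted since indexing.
source = {
    "ka-1": "Refunds take 14 days.",
    "ka-2": "Shipping takes 5 days.",
    "ka-3": "New article.",
}
indexed = {
    "ka-1": fingerprint("Refunds take 30 days."),
    "ka-2": fingerprint("Shipping takes 5 days."),
    "ka-9": fingerprint("Deleted article."),
}

report = find_stale(source, indexed)
print(report)
```

Anything in `changed` or `missing` needs reindexing, and anything in `orphaned` should be removed from the index so deleted content cannot be retrieved.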
How to set up RAG for a Salesforce GenAI feature
Setting up RAG in Salesforce usually means configuring grounding in Prompt Builder against a native source like Knowledge or Data Cloud. The steps below cover the native path; the bring-your-own path swaps the retrieval source for a custom Apex callable but the rest is identical.
- Pick the source corpus
Decide what the model should be able to cite from: published Knowledge articles, files in a Content library, records of a specific object, Data Cloud entities, or a mix. Narrower corpora give better retrieval than mixing everything together.
- Define the chunking strategy
For native sources, Salesforce handles chunking by default. For custom content, decide whether to split by heading, paragraph, or fixed token count. Semantic boundaries beat character counts. Test chunk sizes between 200 and 800 tokens for most use cases.
- Ensure the source is indexed
For Knowledge, the article must be published in the right channel and language. For Data Cloud, the Search Index must be configured on the entity. For files, the Content asset must be indexed. Indexing is asynchronous, so check status before testing.
- Wire up the prompt template
In Prompt Builder, add the source as a Resource. Configure the retrieval query and top-k. Reference the retrieved content in the prompt body using merge syntax. Add a citation field to the output schema.
- Preview, test, then gate behind a permission set
Preview with several record IDs covering rich, sparse, and edge-case contexts. Activate the template and grant Use Prompt Template via a permission set, starting with a pilot group.
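For the chunking step on custom content, a paragraph-based splitter with a token budget might look like the sketch below. It approximates tokens by whitespace-separated words, which is an assumption; a real pipeline would count tokens with the tokenizer of its embedding model. The function name and the default budget are illustrative.

```python
def chunk_by_paragraph(text: str, max_tokens: int = 400) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words.

    A paragraph longer than max_tokens is kept whole as its own chunk
    rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # word count as a rough token estimate
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "one two three\n\nfour five six seven\n\neight nine"
chunks = chunk_by_paragraph(sample, max_tokens=6)
print(chunks)
```

Splitting on paragraph boundaries keeps each chunk semantically coherent, which is the point of preferring semantic boundaries over fixed character counts.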
- Knowledge: Retrieval over published Knowledge articles. The native path for Service Cloud Einstein features like Reply Recommendations and Case Summary.
- Data Cloud: Retrieval over any Data Cloud entity, including unified profiles, ingested external data, and calculated insights. The path for cross-system grounding.
- Bring Your Own Retriever: Custom Apex invocable that returns text chunks from any source. Used when retrieval needs custom logic, external vector stores, or domain-specific embeddings.
- Hybrid search: Combines vector similarity with keyword (BM25) scoring. Better than pure vector for queries that mention specific names, codes, or numbers.
- Re-ranker: An optional second pass that re-scores the top candidates using a more accurate but slower model. Costs latency, improves precision.
- Retrieval quality bounds answer quality. A perfect model cannot fix the wrong document being retrieved. Test retrieval separately from generation.
- The embedding model and the source content language must match. An English embedding model on Japanese content returns nonsense rankings.
- Stale indices ground confident wrong answers. Reindex on source change, not on a fixed schedule.
- Top-k too high dilutes context. Past 8 chunks the model often ignores the lower-ranked ones anyway. Spend the token budget elsewhere.
- RAG hides retrieval failures behind fluent answers. The model rarely says "I could not find anything relevant" on its own. The template must force that behavior.
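Testing retrieval separately from generation, as the first point above recommends, can start with a simple recall@k check over a hand-labeled set of queries. The sketch below assumes each test case records the IDs the retriever returned and the IDs a human marked as relevant; the case format and the data are invented.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall(cases: list[dict], k: int = 5) -> float:
    """Average recall@k across all labeled test queries."""
    return sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / len(cases)

# Invented labeled cases: retriever output plus human-judged relevant IDs.
cases = [
    {"retrieved": ["a", "b", "c"], "relevant": {"a"}},
    {"retrieved": ["x", "y", "z"], "relevant": {"z", "q"}},
]

print(mean_recall(cases, k=2))
```

A low score here points at chunking, embeddings, or indexing rather than at the prompt or the generation model, so it narrows where to look before any answer quality review.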
Trust & references
Cross-checked against the following references.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, arXiv (Lewis et al.)
- Ground Prompt Templates, Salesforce Help
Straight from the source - Salesforce's reference material on Retrieval Augmented Generation.
- Prompt Builder overview, Salesforce Help
- Data Cloud Search Index, Salesforce Help
About the Author
Dipojjal Chakrabarti is a B2C Solution Architect with 29 Salesforce certifications and over 13 years in the Salesforce ecosystem. He runs salesforcedictionary.com to help admins, developers, architects, and cert/interview candidates sharpen their fundamentals.
Test your knowledge
Q1. What is Retrieval Augmented Generation?
Q2. What problem does RAG solve?
Q3. What data sources can RAG use?