LLMs in Salesforce are not a feature you toggle on. They are a layer behind every Agentforce, Einstein GPT, and Prompt Builder capability. The work that matters is choosing the right features for your use cases, configuring the Trust Layer correctly, and putting evaluation and monitoring in place before broad rollout.
- Inventory the LLM-powered features you plan to use
List the Agentforce agents, Einstein GPT features, and Prompt Builder templates the team will turn on. Each one is a separate evaluation question.
- Configure Trust Layer masking and residency rules
In Setup, open Einstein Trust Layer. Confirm that PII masking is on for the data types your org handles and that data residency matches your region requirements. The defaults are usually correct but worth verifying explicitly.
- Decide between managed and Bring Your Own LLM per feature
For most features, stay with managed. For features with extreme volume or specific compliance needs, evaluate BYOLLM. The decision is per feature, not org-wide.
- Ground every prompt with explicit context
Custom Prompt Builder templates should always include record data, Data Library chunks, or other grounding context. Ungrounded prompts produce hallucinations at a much higher rate.
- Build Testing Center test sets that assert structural properties
Assert what must appear (specific values, citations), what must not appear (forbidden phrases, competitor names), and what tone the response must hit. Treat tone as a soft expectation and content as a hard one.
- Pilot LLM features for two to four weeks before broad rollout
Pilot data is the only honest evaluation. Vendor benchmarks rarely match real org performance. Even two weeks of pilot data shows you what your users actually experience.
- Schedule weekly review of a random output sample
Pull 50 random LLM outputs per feature per week. Review with the feature owner. Catch drift, hallucination, and tone issues before users complain. This work never ends.
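
The test-set and weekly-review steps above can be sketched together in a few lines of Python. This is a minimal illustration that assumes the outputs have already been exported as plain strings; `HardExpectations`, `check_output`, `weekly_sample`, and `review_batch` are hypothetical names, not Testing Center or Salesforce APIs. It covers only the hard content expectations (required and forbidden text); tone stays a soft expectation for the human reviewer.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HardExpectations:
    """Content rules every output must satisfy (hypothetical structure)."""
    must_contain: list = field(default_factory=list)
    must_not_contain: list = field(default_factory=list)


def check_output(text: str, rules: HardExpectations) -> list:
    """Return human-readable violations; an empty list means the output passes."""
    lowered = text.lower()
    violations = []
    for phrase in rules.must_contain:
        if phrase.lower() not in lowered:
            violations.append(f"missing required text: {phrase!r}")
    for phrase in rules.must_not_contain:
        if phrase.lower() in lowered:
            violations.append(f"contains forbidden text: {phrase!r}")
    return violations


def weekly_sample(outputs: list, size: int = 50, seed: Optional[int] = None) -> list:
    """Draw up to `size` outputs uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(size, len(outputs)))


def review_batch(outputs: list, rules: HardExpectations,
                 size: int = 50, seed: Optional[int] = None) -> list:
    """Sample outputs and return (text, violations) pairs for the failures."""
    flagged = []
    for text in weekly_sample(outputs, size, seed):
        violations = check_output(text, rules)
        if violations:
            flagged.append((text, violations))
    return flagged


if __name__ == "__main__":
    # Illustrative values only; real rules come from the feature owner.
    rules = HardExpectations(
        must_contain=["Case 00123"],
        must_not_contain=["Acme Rival"],
    )
    week = ["Update on Case 00123: resolved.", "Acme Rival is cheaper."]
    for text, problems in review_batch(week, rules):
        print(text, "->", problems)
```

The flagged list is a starting point for the review meeting, not a replacement for it: outputs that pass the hard checks still need a human read for drift and tone.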
Whether the feature uses a Salesforce-managed model selection or a customer-specified vendor model. The trade-off is control versus operational burden.
Which PII categories the Trust Layer masks before sending prompts to the model. Defaults handle common categories; org-specific patterns can be added.
Which geographic region processes LLM calls. Critical for GDPR and regional data sovereignty requirements.
Which record data, Data Libraries, or Knowledge articles are injected as context into LLM prompts.
The set of Testing Center expectations and sampling cadence that ensures ongoing output quality.
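
One way to keep these five decisions visible per feature is a small record that can be checked before rollout. The sketch below is an assumption about how a team might track them outside Salesforce; `FeatureConfig`, `ModelSource`, and `rollout_gaps` are hypothetical names and do not correspond to any Salesforce metadata type or API.

```python
from dataclasses import dataclass
from enum import Enum


class ModelSource(Enum):
    MANAGED = "managed"   # Salesforce-managed model selection
    BYOLLM = "byollm"     # customer-specified vendor model


@dataclass
class FeatureConfig:
    """Per-feature record of the settings described above (hypothetical)."""
    name: str
    model_source: ModelSource
    masked_pii_categories: list
    processing_region: str
    grounding_sources: list
    test_set_name: str = ""
    weekly_sample_size: int = 0


def rollout_gaps(cfg: FeatureConfig, allowed_regions: set) -> list:
    """List the decisions that are still missing or out of policy."""
    gaps = []
    if not cfg.masked_pii_categories:
        gaps.append("no PII categories masked")
    if cfg.processing_region not in allowed_regions:
        gaps.append(f"region {cfg.processing_region!r} not in allowed set")
    if not cfg.grounding_sources:
        gaps.append("no grounding sources")
    if not cfg.test_set_name or cfg.weekly_sample_size <= 0:
        gaps.append("no evaluation plan")
    return gaps


if __name__ == "__main__":
    # Illustrative values only.
    cfg = FeatureConfig(
        name="Case summary template",
        model_source=ModelSource.MANAGED,
        masked_pii_categories=["email", "phone"],
        processing_region="EU",
        grounding_sources=[],
    )
    print(rollout_gaps(cfg, allowed_regions={"EU"}))
```

Because the model-source decision is per feature rather than org-wide, one record per feature keeps a BYOLLM exception from silently becoming the default.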
- Vendor LLMs accessed directly do not have the no-training-on-customer-data guarantee that the Einstein Trust Layer provides. Going around the Trust Layer is a compliance issue, not a shortcut.
- Ungrounded prompts produce hallucinations at high rates. Every custom prompt should include explicit grounding context.
- Vendor benchmarks rarely match real org performance. Pilot with your actual data and your actual users before committing to a feature broadly.
- LLMs are bad at precise numerical reasoning. Do not use them for financial calculations; route to deterministic Apex or Flow logic for those.
- Unused LLM features still consume Trust Layer capacity and add to the compliance review surface. Retire features that no one uses rather than letting them accumulate.
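
The numerical-reasoning point in the list above follows a simple pattern: compute the figure in deterministic code, then hand the finished number to the model as grounding so it only has to phrase it. In a Salesforce org that calculation would live in Apex or Flow; the Python below is a stand-in to show the shape, and `invoice_total` and `build_prompt` are hypothetical helpers.

```python
from decimal import Decimal, ROUND_HALF_UP


def invoice_total(unit_price: str, quantity: int, discount_pct: str) -> Decimal:
    """Compute a discounted total with exact decimal arithmetic."""
    gross = Decimal(unit_price) * quantity
    net = gross * (Decimal("1") - Decimal(discount_pct) / Decimal("100"))
    # Round to cents once, at the end, with an explicit rounding mode.
    return net.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def build_prompt(customer: str, total: Decimal) -> str:
    """Insert the precomputed total into the prompt as grounding context."""
    return (
        f"Write a two-sentence renewal summary for {customer}. "
        f"State the total exactly as {total} and do not recompute it."
    )


if __name__ == "__main__":
    total = invoice_total("19.99", 3, "15")
    print(build_prompt("Example Corp", total))
```

The model never sees the unit price, quantity, or discount as inputs to a calculation, so there is no arithmetic for it to get wrong.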