Agentforce Testing Center
Definition
Agentforce Testing Center is the regression testing tool for Agentforce agents. It lets admins capture conversations as test cases, save them into reusable test sets, and run those sets against the agent on demand or after every change. Each test case records the expected topic classification, the expected actions fired, and the expected response style. A run then compares the actual results against those expectations to flag drift before the change reaches users.
Why an agent without a test set drifts in production
Testing Center is the only way to know that fixing one topic did not break two others. Editing a topic's classification description, a Data Library scope, or a Prompt Template can change agent behavior on conversations you did not have in mind. Without a saved test set, the team only learns about regressions from customers or supervisors. Testing Center turns that feedback loop into seconds inside Setup rather than days in production.
Where Testing Center lives in Setup
Open Setup, search for Agentforce Testing Center, and you land on the test sets list. Each test set is scoped to one agent and contains one or more test cases. A test case has an input message (or a multi-turn conversation), an expected topic, an expected action chain, and an expected response signature. Test sets are available through the Metadata API, so they ship between sandbox and production through the standard deploy path. The Testing Center is also accessible from a button inside Agent Builder, which is the more common entry point during active development.
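The shape of a test case described above can be pictured as a small data model. The sketch below is only an illustration: the class and field names (`AgentTestCase`, `AgentTestSet`, `response_must_include`) are assumptions made for this example and are not the Testing Center metadata schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentTestCase:
    """Hypothetical shape of one test case; field names are illustrative only."""
    name: str
    turns: tuple[str, ...]            # one message, or several for a multi-turn case
    expected_topic: str
    expected_actions: tuple[str, ...]
    response_must_include: tuple[str, ...] = ()

    @property
    def is_multi_turn(self) -> bool:
        return len(self.turns) > 1


@dataclass
class AgentTestSet:
    """A test set is scoped to exactly one agent and holds one or more cases."""
    agent: str
    cases: list[AgentTestCase] = field(default_factory=list)

    def add(self, case: AgentTestCase) -> None:
        self.cases.append(case)


order_status = AgentTestCase(
    name="order-status-happy-path",
    turns=("Where is order 48213?",),
    expected_topic="Order Status",
    expected_actions=("Get Order",),
    response_must_include=("48213",),
)
handoff = AgentTestCase(
    name="order-to-return-handoff",
    turns=("Where is order 48213?", "I want to return it."),
    expected_topic="Returns",
    expected_actions=("Get Order", "Start Return"),
)

test_set = AgentTestSet(agent="Service Agent")
test_set.add(order_status)
test_set.add(handoff)
```

Keeping the case immutable (`frozen=True`) mirrors the idea that a saved expectation is a contract; changing it should be a deliberate edit rather than a side effect of a run.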
How test cases are captured
The fastest path to a test case is the Conversation Preview inside Agent Builder. After sending a test message and observing the Plan Trace, an admin can click Save as Test Case. The capture records the input message, the topic the engine picked, the actions it invoked, and the final response. The admin then sets expectations: this topic should always be picked, these actions must fire, and the response must mention this phrase. Test cases also support negative expectations (this action must not fire, this topic must not be picked) which catch regressions in restraint rather than capability.
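To make the difference between positive and negative expectations concrete, here is a minimal sketch of how such a check could be evaluated. The function `structural_failures` and its parameter names are hypothetical and written for this example; Testing Center performs the real comparison internally.

```python
from __future__ import annotations


def structural_failures(
    picked_topic: str,
    fired_actions: list[str],
    *,
    must_pick: str | None = None,
    must_not_pick: frozenset[str] = frozenset(),
    must_fire: frozenset[str] = frozenset(),
    must_not_fire: frozenset[str] = frozenset(),
) -> list[str]:
    """Compare one observed run against positive and negative expectations."""
    failures: list[str] = []
    if must_pick is not None and picked_topic != must_pick:
        failures.append(f"topic: expected {must_pick!r}, got {picked_topic!r}")
    if picked_topic in must_not_pick:
        failures.append(f"topic: {picked_topic!r} must not be picked")
    fired = set(fired_actions)
    for action in sorted(must_fire - fired):
        failures.append(f"action: {action!r} did not fire")
    for action in sorted(must_not_fire & fired):
        failures.append(f"action: {action!r} fired but must not")
    return failures


# A near-miss message that should NOT trigger a refund (illustrative values).
overreach = structural_failures(
    picked_topic="Refunds",
    fired_actions=["Get Order", "Issue Refund"],
    must_not_pick=frozenset({"Refunds"}),
    must_not_fire=frozenset({"Issue Refund"}),
)
```

A positive-only check would have nothing to say about this run; the two negative expectations are what surface the overreach.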
Single-turn vs multi-turn test cases
A single-turn test case sends one message and evaluates the response. A multi-turn test case scripts a conversation of two to ten turns and evaluates the final response or specific intermediate states. Multi-turn cases are essential for testing topic handoff (the user asks about an order, then about returning it) and state carryover (the order number captured in turn one is still known in turn four). Most teams maintain one single-turn case per topic for fast happy-path checks and a smaller set of multi-turn cases for the conversations that matter most in production.
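Topic handoff and state carryover can be illustrated with a small scripted runner. Everything here is an assumption made for the sketch: `stub_agent` stands in for the real agent, and `run_multi_turn` is a hypothetical helper, not a Testing Center function.

```python
from __future__ import annotations

import re


def stub_agent(message: str, state: dict) -> tuple:
    """Toy stand-in for an agent: returns (topic, new_state) for one turn."""
    match = re.search(r"\b(\d{5})\b", message)
    if match:
        state = {**state, "order_number": match.group(1)}
    topic = "Returns" if "return" in message.lower() else "Order Status"
    return topic, state


def run_multi_turn(agent, turns: list, checkpoints: dict) -> tuple:
    """Play the turns in order; checkpoints maps a 1-based turn to its expected topic."""
    state: dict = {}
    failures: list = []
    for index, message in enumerate(turns, start=1):
        topic, state = agent(message, state)
        expected = checkpoints.get(index)
        if expected is not None and topic != expected:
            failures.append(f"turn {index}: expected {expected!r}, got {topic!r}")
    return failures, state


turns = ["Where is order 48213?", "Actually, I want to return it."]
failures, final_state = run_multi_turn(
    stub_agent, turns, checkpoints={1: "Order Status", 2: "Returns"}
)
```

The checkpoint on turn two exercises the handoff, and inspecting `final_state` confirms that the order number captured in turn one is still available afterwards.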
Running test sets and reading results
Test sets run on demand or are wired to fire after specific change types in Agent Builder. A run produces a results table with one row per test case: pass, fail, or warn. Pass means every expectation matched. Fail means a hard expectation (topic, action) missed. Warn means a soft expectation (response phrase, tone) drifted but the structural expectations held. Each row drills into a Plan Trace showing exactly which decisions the engine made on this run. The diff view shows the prior run's output beside the current run, which is the fastest way to spot what changed between two saves of the same topic.
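The pass, fail, and warn rule described above reduces to a short decision. The sketch below assumes each row has already been reduced to a count of missed hard and soft expectations; the function names are illustrative.

```python
from __future__ import annotations

from collections import Counter


def classify(hard_misses: int, soft_misses: int) -> str:
    """Hard misses fail the case; soft misses alone only warn."""
    if hard_misses > 0:
        return "fail"
    if soft_misses > 0:
        return "warn"
    return "pass"


def summarize(rows: dict) -> Counter:
    """rows maps a case name to (hard_misses, soft_misses)."""
    return Counter(classify(hard, soft) for hard, soft in rows.values())


rows = {
    "order-status": (0, 0),
    "returns-handoff": (1, 0),   # wrong topic picked
    "tone-check": (0, 2),        # phrasing drifted, structure held
}
summary = summarize(rows)
```

Note that a hard miss always wins: a case with one hard miss and several soft misses is reported as a fail, not a warn.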
Response evaluation and the soft-expectation model
Response text is non-deterministic in an LLM-driven agent, so Testing Center evaluates response content with a soft model. A response expectation can require specific words ("must include the order number"), forbid specific words ("must not mention competitor names"), or match a tone (formal, casual, apologetic). The tone check runs an evaluator model against the response. The soft model produces warns rather than fails, so a tone drift surfaces without breaking the run. This avoids the trap of test sets that constantly fail because the model phrased something slightly differently than the last run.
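The include and forbid checks can be approximated with plain substring matching, as in this minimal sketch. The function name `soft_warnings` is hypothetical, and the tone check is deliberately left out because it depends on an evaluator model that cannot be reproduced in a few lines.

```python
from __future__ import annotations


def soft_warnings(
    response: str,
    must_include: tuple = (),
    must_not_include: tuple = (),
) -> list:
    """Return warnings (not failures) for phrase-level response expectations."""
    text = response.casefold()
    warnings = []
    for phrase in must_include:
        if phrase.casefold() not in text:
            warnings.append(f"missing phrase: {phrase!r}")
    for phrase in must_not_include:
        if phrase.casefold() in text:
            warnings.append(f"forbidden phrase present: {phrase!r}")
    return warnings


clean = soft_warnings(
    "Your order 48213 ships on Tuesday.",
    must_include=("48213",),
    must_not_include=("Acme Rival",),   # illustrative competitor name
)
drifted = soft_warnings(
    "Your order ships on Tuesday.",
    must_include=("48213",),
)
```

Because the check looks for a phrase rather than an exact sentence, two differently worded responses that both contain the order number produce no warning.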
Integration with the Agent Builder change flow
Agent Builder shows the most recent Testing Center result inline. Saving a topic, action, or prompt template can be configured to auto-trigger the test set. A run that fails any hard expectation blocks the agent from being promoted to a production version. The metadata for which tests guard which agent versions ships in the agent's manifest, so the production agent always has a known-passing test set associated with its current version. Reverting an agent to a prior version reverts the bound test set too, so the regression contract is preserved through rollbacks.
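The promotion rule in this section can be summarized as a single predicate. This is a sketch of the logic as described here, with a hypothetical function name and a `gate_enabled` flag standing in for the blocking setting.

```python
from __future__ import annotations


def can_promote(statuses: list, gate_enabled: bool = True) -> bool:
    """Block promotion when any case failed a hard expectation; warns do not block."""
    if not gate_enabled:
        return True
    return all(status != "fail" for status in statuses)


latest_run = ["pass", "warn", "pass"]
blocked_run = ["pass", "fail", "warn"]
```

With the gate enabled, `latest_run` is promotable despite the warn, while `blocked_run` is not.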
How big a test set should be and how to keep it healthy
Most production agents settle on 30 to 80 test cases. Below 30 the coverage feels thin and changes still surprise the team. Above 80 the cases start to overlap and the maintenance overhead becomes a tax. Run the full set once a week even when no changes are pending, because the underlying LLM ships updates that can shift behavior without code changes on your side. Retire cases that have passed unchanged for six months unless they cover regulated behavior; rotating the set keeps it useful. Track flake rate per case; anything that warns on more than 10 percent of runs without code changes is too brittle and needs rewriting.
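The 10 percent flake threshold is simple arithmetic once run history is exported. The sketch below assumes each case has a list of statuses from runs made without agent changes; the helper names and the export format are assumptions for this example.

```python
from __future__ import annotations


def flake_rate(history: list) -> float:
    """Share of runs that warned; history holds 'pass', 'warn', or 'fail' strings."""
    if not history:
        return 0.0
    return history.count("warn") / len(history)


def brittle_cases(histories: dict, threshold: float = 0.10) -> list:
    """Names of cases that warn on more than `threshold` of their runs."""
    return sorted(
        name for name, history in histories.items()
        if flake_rate(history) > threshold
    )


histories = {
    "order-status": ["pass"] * 9 + ["warn"],        # exactly 10 percent
    "apology-tone": ["warn", "warn"] + ["pass"] * 8,  # 20 percent
}
needs_rewrite = brittle_cases(histories)
```

The comparison is strict, so a case that warns on exactly 10 percent of runs is not flagged; only `apology-tone` ends up in `needs_rewrite`.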
How to build a Testing Center test set that catches real regressions
The first 10 test cases pay back the entire setup effort. Capture them from real conversations rather than inventing prompts, and write expectations that focus on behavior the team actually cares about (correct topic, no forbidden actions) rather than exact response wording.
- Decide which behaviors must never regress
List the five to ten behaviors that would cause customer complaints if they broke: each topic firing on its canonical message, escalation triggering when the agent does not know the answer, no hallucinated pricing, no Knowledge citation without a real source.
- Capture the first 10 test cases from real conversations
Pull conversations from the Service Agent or SDR Agent logs. Pick ten that represent the most common patterns. Replay each in the Conversation Preview, save it as a test case, and set its expectations.
- Add explicit negative expectations
For each topic, capture a near-miss message that should not pick the topic. Save with a negative expectation. Negative cases catch overreach regressions that positive-only sets miss.
- Write a few multi-turn cases for the critical paths
For topic handoff (order to return to refund) and state carryover (extract order, use in next turn), build a multi-turn case. These take longer to write and catch the highest-impact failures.
- Wire the test set to run on agent save
Agent Builder lets you require a test pass before promoting a version. Enable this for the production agent. Topic or action edits that fail the set are blocked from promotion.
- Schedule a full weekly run
Test sets pick up LLM drift even when nothing changed on your side. A scheduled weekly run with results posted to a Chatter group catches that drift before users feel it.
- Review and prune the set quarterly
Retire cases that have not failed in six months and do not cover regulated behavior. Replace them with cases from recent production conversations. A test set that does not evolve goes stale.
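The quarterly pruning rule in the last step can be expressed as a filter over case records. This is a sketch under stated assumptions: the record keys (`created`, `last_failed`, `regulated`) are hypothetical, and six months is approximated as 182 days.

```python
from __future__ import annotations

from datetime import date, timedelta


def retirement_candidates(
    cases: list,
    today: date,
    quiet_period: timedelta = timedelta(days=182),  # roughly six months
) -> list:
    """Names of non-regulated cases with no failure inside the quiet period."""
    cutoff = today - quiet_period
    candidates = []
    for case in cases:
        # A case that never failed counts as quiet since it was created.
        quiet_since = case["last_failed"] or case["created"]
        if not case["regulated"] and quiet_since <= cutoff:
            candidates.append(case["name"])
    return candidates


cases = [
    {"name": "order-status", "created": date(2024, 1, 10),
     "last_failed": None, "regulated": False},
    {"name": "no-hallucinated-pricing", "created": date(2024, 1, 10),
     "last_failed": None, "regulated": True},
    {"name": "returns-handoff", "created": date(2024, 1, 10),
     "last_failed": date(2025, 5, 1), "regulated": False},
]
to_retire = retirement_candidates(cases, today=date(2025, 7, 1))
```

Only `order-status` qualifies: the pricing case is protected because it covers regulated behavior, and the handoff case failed too recently.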
Test cases are either single-turn or multi-turn. Most sets are mostly single-turn with a small multi-turn core for critical paths.
Expectations are either hard (topic, action firing, action restraint) or soft (response phrase inclusion, tone, length). Hard expectations fail the run; soft ones warn.
Negative expectations are per-case assertions that a specific topic must not be picked or a specific action must not fire. They catch overreach.
The set can auto-run on save of any topic, on save of a specific topic, on a scheduled cadence, or be left to manual runs only.
A failed run can optionally block promoting the agent to a production version. This setting is off by default for new agents and on for mature ones.
- Test sets without negative expectations only catch missing capability, not overreach. An agent that picks the wrong topic confidently passes a positive-only set every time.
- Response-text expectations that require exact wording fail constantly because LLM output is non-deterministic. Use soft expectations (must include phrase, tone match) instead.
- Multi-turn cases are sensitive to small classification changes upstream. A topic-tightening edit can fail a multi-turn case at turn three even though turns one and two work. Read the trace before assuming the edit is wrong.
- LLM-driven responses drift over weeks even when the agent has not changed. Scheduled weekly runs catch this; on-save-only runs do not.
- Test sets shipped from sandbox to production do not automatically re-record. Verify the production agent's versions match what the set was built against, or expectations may pass on stale assumptions.
About the Author
Dipojjal Chakrabarti is a B2C Solution Architect with 29 Salesforce certifications and over 13 years in the Salesforce ecosystem. He runs salesforcedictionary.com to help admins, developers, architects, and cert/interview candidates sharpen their fundamentals.