The first 10 test cases pay back the entire setup effort. Capture them from real conversations rather than inventing prompts, and write expectations that focus on behavior the team actually cares about (correct topic, no forbidden actions) rather than exact response wording.
- Decide which behaviors must never regress
List the five to ten behaviors that would cause customer complaints if they broke: each topic firing on its canonical message, escalation triggering when the agent does not know the answer, no hallucinated pricing, no Knowledge citation without a real source.
- Capture the first 10 test cases from real conversations
Pull conversations from the Service Agent or SDR Agent logs. Pick ten that represent the most common patterns. Replay each one in the Conversation Preview, save it as a test case, and set its expectations.
- Add explicit negative expectations
For each topic, capture a near-miss message that should not trigger the topic, and save it with a negative expectation. Negative cases catch overreach regressions that positive-only sets miss.
- Write a few multi-turn cases for the critical paths
For topic handoff (order to return to refund) and state carryover (extract order, use in next turn), build a multi-turn case. These take longer to write and catch the highest-impact failures.
- Wire the test set to run on agent save
Agent Builder lets you require a test pass before promoting a version. Enable this for the production agent. Topic or action edits that fail the set are blocked from promotion.
- Schedule a full weekly run
Test sets pick up LLM drift even when nothing changed on your side. A scheduled weekly run with results posted to a Chatter group catches that drift before users feel it.
- Review and prune the set quarterly
Retire cases that have not failed in six months and do not cover regulated behavior. Replace them with cases from recent production conversations. A test set that does not evolve goes stale.
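The quarterly pruning rule above can be sketched as a small filter. This is a minimal illustration and not an Agentforce API: `CaseRecord`, its fields, and `should_retire` are hypothetical names. The 182-day window stands in for "six months", and measuring a never-failed case from its creation date is an added assumption, made so that freshly captured cases are not retired immediately.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class CaseRecord:
    """Hypothetical bookkeeping record for one test case (not a platform object)."""
    name: str
    created: date
    last_failed: Optional[date]        # None means the case has never failed
    covers_regulated_behavior: bool


def should_retire(case: CaseRecord, today: date, window_days: int = 182) -> bool:
    """Return True when the case is a candidate for retirement.

    A case is kept if it covers regulated behavior, or if it failed
    (or was created, when it has never failed) inside the window.
    """
    if case.covers_regulated_behavior:
        return False
    # Assumption: a case that never failed is measured from its creation date.
    reference = case.last_failed or case.created
    return (today - reference) > timedelta(days=window_days)
```

Running this over an exported list of cases gives a candidate list for review. The retirement decision itself still belongs to whoever owns the test set.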
Single-turn or multi-turn. Most sets are mostly single-turn with a small multi-turn core for critical paths.
Hard (topic, action firing, action restraint) or soft (response phrase inclusion, tone, length). Hard expectations fail the run; soft ones warn.
Per-case assertions that a specific topic must not be picked or a specific action must not fire. Catches overreach.
When to auto-run the set: on save of any topic, on save of a specific topic, on scheduled cadence, or manual only.
Whether a failed run blocks promoting the agent to a production version. Default off for new agents, on for mature ones.
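The hard, soft, and negative expectations described above could be modeled roughly as follows. This is a sketch under stated assumptions, not the product's schema: `Expectations`, `Observed`, `evaluate`, and every field name are hypothetical. The phrase check is a simple case-insensitive substring match, and tone and length checks are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Expectations:
    """Hypothetical per-case expectations."""
    expected_topic: Optional[str] = None                         # hard
    forbidden_topics: List[str] = field(default_factory=list)    # hard, negative
    required_actions: List[str] = field(default_factory=list)    # hard
    forbidden_actions: List[str] = field(default_factory=list)   # hard, negative
    must_include: List[str] = field(default_factory=list)        # soft


@dataclass
class Observed:
    """What the agent actually did on one turn (hypothetical shape)."""
    topic: str
    actions: List[str]
    response: str


def evaluate(exp: Expectations, obs: Observed) -> Tuple[str, List[str]]:
    """Return ("pass" | "warn" | "fail", messages). Hard misses fail, soft misses warn."""
    failures: List[str] = []
    warnings: List[str] = []

    # Hard expectations: topic choice, action firing, action restraint.
    if exp.expected_topic is not None and obs.topic != exp.expected_topic:
        failures.append(f"expected topic {exp.expected_topic!r}, got {obs.topic!r}")
    if obs.topic in exp.forbidden_topics:
        failures.append(f"forbidden topic {obs.topic!r} was picked")
    for action in exp.required_actions:
        if action not in obs.actions:
            failures.append(f"required action {action!r} did not fire")
    for action in exp.forbidden_actions:
        if action in obs.actions:
            failures.append(f"forbidden action {action!r} fired")

    # Soft expectation: phrase inclusion, compared case-insensitively.
    lowered = obs.response.lower()
    for phrase in exp.must_include:
        if phrase.lower() not in lowered:
            warnings.append(f"response is missing phrase {phrase!r}")

    status = "fail" if failures else "warn" if warnings else "pass"
    return status, failures + warnings
```

A near-miss case would set only `forbidden_topics` (and perhaps `forbidden_actions`), so it fails the run when the agent overreaches and passes otherwise, whatever the response wording.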
- Test sets without negative expectations only catch missing capability, not overreach. An agent that picks the wrong topic confidently passes a positive-only set every time.
- Response-text expectations that require exact wording fail constantly because LLM output is non-deterministic. Use soft expectations (must include phrase, tone match) instead.
- Multi-turn cases are sensitive to small classification changes upstream. A topic-tightening edit can fail a multi-turn case at turn three even though turns one and two work. Read the trace before assuming the edit is wrong.
- LLM-driven responses drift over weeks even with no agent changes. Scheduled weekly runs catch this; on-save-only runs do not.
- Test sets shipped from sandbox to production do not automatically re-record. Verify that the production agent's versions match what the set was built against; otherwise expectations may pass on stale assumptions.
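For the multi-turn gotcha above, a small helper can make "read the trace" concrete by locating the first turn where topic classification diverged. This is only a sketch: it assumes the trace can be reduced to one topic name per turn, and `first_divergent_turn` is a hypothetical helper, not a platform function.

```python
from typing import List, Optional


def first_divergent_turn(expected_topics: List[str],
                         observed_topics: List[str]) -> Optional[int]:
    """Return the 1-indexed turn where the topics first differ, or None if they match."""
    for turn, (want, got) in enumerate(zip(expected_topics, observed_topics), start=1):
        if want != got:
            return turn
    # If one trace is shorter, the divergence is the first missing turn.
    if len(expected_topics) != len(observed_topics):
        return min(len(expected_topics), len(observed_topics)) + 1
    return None
```

Knowing that turns one and two still classify correctly narrows the review to the edit's effect on the third message, without assuming the whole edit is wrong.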