Salesforce Dictionary - Free Salesforce GlossarySalesforce Dictionary
Salesforce QA / Tester
hard

How do you test AI / Agentforce features?

AI testing differs from deterministic testing. Same input may produce different outputs; need different approach.

Challenges:

  • Non-deterministic — LLM responses vary.
  • Quality is subjective — "good answer" hard to define exactly.
  • Cost — every test call costs money.
  • Latency — LLM calls slow.
  • Bias — needs detection.
  • Hallucination — AI can confidently wrong.

Test approaches:

1. Functional correctness.

  • For deterministic outputs (classification, score), traditional tests apply.
  • "Does this case classify as 'Billing' correctly?"

2. Quality assessment.

  • Sample AI outputs.
  • Rate against criteria (accuracy, helpfulness, tone).
  • Either human-graded or auto-graded against rubric.

3. Regression on tone / style.

  • AI output may shift over model updates.
  • Compare new outputs vs baseline.

4. Bias and fairness.

  • Test outputs across demographics.
  • Detect disparate treatment.

5. Hallucination detection.

  • Compare AI output against ground truth.
  • Flag inaccuracies.

6. Edge cases.

  • Adversarial inputs.
  • Edge of training distribution.
  • Out-of-scope queries.

7. Cost monitoring.

  • LLM tokens consumed per test.
  • Total test cost.

8. Performance.

  • Latency under load.
  • Concurrency.

Tools:

  • Salesforce Trust Layer — built-in checks for PII, toxicity.
  • External AI testing tools — emerging market (Galileo, Patronus, etc.).
  • Custom evaluation frameworks — task-specific.

Challenges in production:

  • Evolving model — Salesforce updates LLM; behavior shifts.
  • Prompt regressions — prompt change degrades quality.
  • Training data drift — input distribution changes over time.

Architecture for testability:

  • Versioned prompts — track changes.
  • A/B test new prompts before full rollout.
  • Automated regression suite comparing outputs.
  • Quality monitoring in production.

Common pitfalls:

  • Treating AI as deterministic — tests fail randomly.
  • No regression — quality drift unnoticed.
  • No cost tracking — surprise bills.
  • No bias auditing — disparate impact unmeasured.

Senior QA insight: AI testing is a new discipline. Existing patterns don't directly apply.

The senior framing: for AI, tests verify behaviour ranges, not exact outputs. Assess statistical quality, not deterministic correctness.

Important: Salesforce's Trust Layer handles much basic safety; QA focuses on use-case-specific quality.

Why this answer works

Modern senior. The quality-vs-correctness distinction and "new discipline" framing are mature.

Follow-ups to expect

Related dictionary terms