Salesforce DictionarySalesforce Dictionary

Salesforce QA / Tester

hard

How do you test AI / Agentforce features?

AI testing differs from deterministic testing. Same input may produce different outputs; need different approach.

Challenges:

Non-deterministic — LLM responses vary.
Quality is subjective — "good answer" hard to define exactly.
Cost — every test call costs money.
Latency — LLM calls slow.
Bias — needs detection.
Hallucination — AI can confidently wrong.

Test approaches:

1. Functional correctness.

For deterministic outputs (classification, score), traditional tests apply.
"Does this case classify as 'Billing' correctly?"

2. Quality assessment.

Sample AI outputs.
Rate against criteria (accuracy, helpfulness, tone).
Either human-graded or auto-graded against rubric.

3. Regression on tone / style.

AI output may shift over model updates.
Compare new outputs vs baseline.

4. Bias and fairness.

Test outputs across demographics.
Detect disparate treatment.

5. Hallucination detection.

Compare AI output against ground truth.
Flag inaccuracies.

6. Edge cases.

Adversarial inputs.
Edge of training distribution.
Out-of-scope queries.

7. Cost monitoring.

LLM tokens consumed per test.
Total test cost.

8. Performance.

Latency under load.
Concurrency.

Tools:

Salesforce Trust Layer — built-in checks for PII, toxicity.
External AI testing tools — emerging market (Galileo, Patronus, etc.).
Custom evaluation frameworks — task-specific.

Challenges in production:

Evolving model — Salesforce updates LLM; behavior shifts.
Prompt regressions — prompt change degrades quality.
Training data drift — input distribution changes over time.

Architecture for testability:

Versioned prompts — track changes.
A/B test new prompts before full rollout.
Automated regression suite comparing outputs.
Quality monitoring in production.

Common pitfalls:

Treating AI as deterministic — tests fail randomly.
No regression — quality drift unnoticed.
No cost tracking — surprise bills.
No bias auditing — disparate impact unmeasured.

Senior QA insight: AI testing is a new discipline. Existing patterns don't directly apply.

The senior framing: for AI, tests verify behaviour ranges, not exact outputs. Assess statistical quality, not deterministic correctness.

Important: Salesforce's Trust Layer handles much basic safety; QA focuses on use-case-specific quality.

Why this answer works

Modern senior. The quality-vs-correctness distinction and "new discipline" framing are mature.

Follow-ups to expect

What is the Einstein Trust Layer?
How do you handle non-deterministic AI in regression?
What is hallucination and how do you detect it?

Related dictionary terms