AI testing differs from deterministic testing. Same input may produce different outputs; need different approach.
Challenges:
- Non-deterministic — LLM responses vary.
- Quality is subjective — "good answer" hard to define exactly.
- Cost — every test call costs money.
- Latency — LLM calls slow.
- Bias — needs detection.
- Hallucination — AI can confidently wrong.
Test approaches:
1. Functional correctness.
- For deterministic outputs (classification, score), traditional tests apply.
- "Does this case classify as 'Billing' correctly?"
2. Quality assessment.
- Sample AI outputs.
- Rate against criteria (accuracy, helpfulness, tone).
- Either human-graded or auto-graded against rubric.
3. Regression on tone / style.
- AI output may shift over model updates.
- Compare new outputs vs baseline.
4. Bias and fairness.
- Test outputs across demographics.
- Detect disparate treatment.
5. Hallucination detection.
- Compare AI output against ground truth.
- Flag inaccuracies.
6. Edge cases.
- Adversarial inputs.
- Edge of training distribution.
- Out-of-scope queries.
7. Cost monitoring.
- LLM tokens consumed per test.
- Total test cost.
8. Performance.
- Latency under load.
- Concurrency.
Tools:
- Salesforce Trust Layer — built-in checks for PII, toxicity.
- External AI testing tools — emerging market (Galileo, Patronus, etc.).
- Custom evaluation frameworks — task-specific.
Challenges in production:
- Evolving model — Salesforce updates LLM; behavior shifts.
- Prompt regressions — prompt change degrades quality.
- Training data drift — input distribution changes over time.
Architecture for testability:
- Versioned prompts — track changes.
- A/B test new prompts before full rollout.
- Automated regression suite comparing outputs.
- Quality monitoring in production.
Common pitfalls:
- Treating AI as deterministic — tests fail randomly.
- No regression — quality drift unnoticed.
- No cost tracking — surprise bills.
- No bias auditing — disparate impact unmeasured.
Senior QA insight: AI testing is a new discipline. Existing patterns don't directly apply.
The senior framing: for AI, tests verify behaviour ranges, not exact outputs. Assess statistical quality, not deterministic correctness.
Important: Salesforce's Trust Layer handles much basic safety; QA focuses on use-case-specific quality.
