Graceful degradation = the system continues to provide value even when its dependencies fail.
Failure modes:
- External API down.
- External API slow (timing out).
- Network partition.
- Salesforce platform incident.
- Database constraint failure.
Patterns for graceful degradation:
1. Cached fallback.
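When the external service is down, serve last-known-good data with a staleness disclaimer.
```apex
// Fragment from inside a method; `req`, `key`, and `cache` are assumed to be in scope.
try {
    HttpResponse res = new Http().send(req);
    if (res.getStatusCode() == 200) {
        // Fresh data: parse it (a JSON body is assumed here), refresh the cache, return it.
        Object parsed = JSON.deserializeUntyped(res.getBody());
        cache.put(key, parsed);
        return parsed;
    }
} catch (CalloutException e) {
    /* fall through */
}
// Fall back to the cache (non-200 responses also end up here).
return cache.get(key, null); // or a default value
```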
2. Skip optional features.
Critical workflows must work even if optional services are down.
- Account record must load even if external "credit check" service is down.
- Show the record without the credit info; surface message "credit check unavailable".
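A minimal sketch of this pattern, assuming a hypothetical `CreditCheckClient` interface that wraps the external service. `AccountViewService`, `AccountView`, and the notice text are illustrative names, not part of any Salesforce API. The critical data is populated first, and the optional lookup is wrapped so that its failure only produces a notice:

```apex
public with sharing class AccountViewService {
    // Hypothetical seam around the optional external dependency.
    public interface CreditCheckClient {
        String fetchRating(Id accountId); // may throw, e.g. a CalloutException
    }

    // Hypothetical view model returned to the UI.
    public class AccountView {
        @AuraEnabled public Id accountId;
        @AuraEnabled public String creditRating; // stays null when the service is down
        @AuraEnabled public String notice;       // user-facing degradation message
    }

    public static AccountView load(Id accountId, CreditCheckClient client) {
        AccountView view = new AccountView();
        view.accountId = accountId; // critical data never depends on the callout
        try {
            view.creditRating = client.fetchRating(accountId);
        } catch (Exception e) {
            // Optional feature failed: degrade instead of failing the whole load.
            view.notice = 'Credit check unavailable';
        }
        return view;
    }
}
```

Passing the client in as an interface is a design choice of this sketch: it lets a test substitute a failing implementation without touching the network.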
3. Async backfill.
When the primary path fails, queue the work for retry. Retry until it succeeds or a maximum number of attempts is reached.
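One way to sketch this in Apex is a Queueable job that re-enqueues itself with an attempt counter. The class name `SyncRetryJob`, the limit of 5 attempts, and the named credential `Example_Service` are all assumptions made for illustration:

```apex
public class SyncRetryJob implements Queueable, Database.AllowsCallouts {
    private static final Integer MAX_ATTEMPTS = 5; // assumed limit; tune per integration
    private final Id recordId;
    private final Integer attempt;

    public SyncRetryJob(Id recordId, Integer attempt) {
        this.recordId = recordId;
        this.attempt = attempt;
    }

    // True while another attempt is still allowed after this one.
    public static Boolean shouldRetry(Integer attempt) {
        return attempt < MAX_ATTEMPTS;
    }

    public void execute(QueueableContext ctx) {
        try {
            pushToExternal();
        } catch (CalloutException e) {
            if (shouldRetry(attempt)) {
                // Re-queue the same work with an incremented attempt counter.
                System.enqueueJob(new SyncRetryJob(recordId, attempt + 1));
            } else {
                // Out of attempts: record the failure rather than retrying forever.
                System.debug(LoggingLevel.ERROR,
                    'Giving up on ' + recordId + ': ' + e.getMessage());
            }
        }
    }

    private void pushToExternal() {
        HttpRequest req = new HttpRequest();
        req.setEndpoint('callout:Example_Service/sync/' + recordId); // hypothetical endpoint
        req.setMethod('POST');
        req.setBody('{}');
        req.setTimeout(10000);
        HttpResponse res = new Http().send(req);
        if (res.getStatusCode() >= 500) {
            // Treat server errors like a failed callout so they are retried too.
            CalloutException e = new CalloutException();
            e.setMessage('Server error ' + res.getStatusCode());
            throw e;
        }
    }
}
```

The first attempt would be started with `System.enqueueJob(new SyncRetryJob(accountId, 1));`. This sketch retries immediately; a real integration would likely add a delay between attempts.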
4. Circuit breaker.
After N consecutive failures, stop calling the external service for a cooldown period. This avoids piling up calls during a known outage.
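The state machine behind a circuit breaker can be sketched as below. This is an illustration under assumptions, not a production implementation: `CircuitBreaker` and its method names are hypothetical, and the state lives in instance fields for clarity. Because Apex transactions do not share memory, a real version would need to persist the counters somewhere shared, such as Platform Cache or a custom setting. The current time is passed in so the logic can be exercised without waiting.

```apex
public class CircuitBreaker {
    private final Integer failureThreshold;
    private final Long cooldownMillis;
    private Integer consecutiveFailures = 0;
    private Long openedAt; // null while the circuit is closed

    public CircuitBreaker(Integer failureThreshold, Long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    // Closed circuit: always allow. Open circuit: allow a trial call only
    // once the cooldown has elapsed.
    public Boolean allowRequest(Long nowMillis) {
        if (openedAt == null) {
            return true;
        }
        return nowMillis - openedAt >= cooldownMillis;
    }

    // Any success closes the circuit and resets the failure count.
    public void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = null;
    }

    // Reaching the threshold opens the circuit; a failed trial call
    // after the cooldown restarts the cooldown from that moment.
    public void recordFailure(Long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis;
        }
    }
}
```

A caller would check `allowRequest(System.currentTimeMillis())` before each callout, skip straight to the fallback when it returns false, and report the outcome with `recordSuccess()` or `recordFailure(...)`.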
5. Timeout discipline.
Every callout should set an explicit, reasonable timeout (Apex defaults to 10 seconds and allows up to 120). Better to fail fast than to leave users waiting.
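As a small illustration, a request factory can make the timeout a required argument so that no callout silently inherits the default. `CalloutRequests` and its methods are hypothetical helper names; `HttpRequest.setTimeout` takes milliseconds and accepts values from 1 to 120000.

```apex
public class CalloutRequests {
    private static final Integer MAX_TIMEOUT_MS = 120000; // platform maximum per callout

    // Keep the requested timeout inside the range Apex accepts.
    public static Integer clampTimeout(Integer requestedMs) {
        return Math.min(Math.max(requestedMs, 1), MAX_TIMEOUT_MS);
    }

    // Build a GET request with an explicit timeout.
    public static HttpRequest build(String endpoint, Integer timeoutMs) {
        HttpRequest req = new HttpRequest();
        req.setEndpoint(endpoint);
        req.setMethod('GET');
        req.setTimeout(clampTimeout(timeoutMs));
        return req;
    }
}
```

For a user-facing path, something like `CalloutRequests.build(url, 3000)` keeps the worst-case wait to a few seconds; background jobs can afford longer values.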
6. Partial responses.
Composite calls fetching from N services: return what succeeded; flag what didn't.
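A rough sketch of the idea, with each service hidden behind a hypothetical `Source` interface. `CompositeFetcher` and `CompositeResult` are illustrative names as well. Each source is fetched independently, and a failure is recorded by name instead of aborting the whole call:

```apex
public class CompositeFetcher {
    // Hypothetical wrapper around one downstream service.
    public interface Source {
        Object fetch(); // may throw when the service is unavailable
    }

    public class CompositeResult {
        public Map<String, Object> data = new Map<String, Object>();
        public List<String> unavailable = new List<String>();

        public Boolean isPartial() {
            return !unavailable.isEmpty();
        }
    }

    public static CompositeResult gather(Map<String, Source> sources) {
        CompositeResult result = new CompositeResult();
        for (String name : sources.keySet()) {
            try {
                result.data.put(name, sources.get(name).fetch());
            } catch (Exception e) {
                // Flag the failed source and keep going with the rest.
                result.unavailable.add(name);
            }
        }
        return result;
    }
}
```

The caller can render whatever is in `data` and use `unavailable` to tell the user which sections are missing.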
7. Health check + UI banner.
If an external system is unhealthy, show a banner: "Some features temporarily unavailable."
8. Read-only mode.
If write path fails, allow reads. Users can still browse, just not modify.
9. Queue-based writes.
User actions are queued; processing happens asynchronously. If the system is overloaded, the queue absorbs the spike. The result is eventually consistent.
10. Dead-letter queue.
Failures route to a dead-letter queue for manual review.
Communication patterns:
- Banners on Lightning pages indicating known issues.
- Status page (internal or external).
- Slack notifications to ops team.
- Customer-facing communication for prolonged outages.
Testing graceful degradation:
- Failure injection in QA: kill an external dependency and verify Salesforce continues to work.
- Load tests under stress conditions.
- Disaster recovery drills annually.
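Failure injection can also be exercised in Apex unit tests by having an `HttpCalloutMock` throw, which simulates the dependency being down. The sketch below is self-contained for illustration: the endpoint and the `fetchRatingOrFallback` helper are hypothetical, and in a real org the helper would be production code in its own class.

```apex
@IsTest
private class OutageInjectionTest {
    // Simulates an outage: every callout fails with a CalloutException.
    private class OutageMock implements HttpCalloutMock {
        public HttpResponse respond(HttpRequest req) {
            CalloutException e = new CalloutException();
            e.setMessage('Simulated outage');
            throw e;
        }
    }

    // Stand-in for the code under test: callout with a degraded fallback.
    private static String fetchRatingOrFallback() {
        HttpRequest req = new HttpRequest();
        req.setEndpoint('https://credit.example.com/rating'); // hypothetical endpoint
        req.setMethod('GET');
        req.setTimeout(5000);
        try {
            HttpResponse res = new Http().send(req);
            if (res.getStatusCode() == 200) {
                return res.getBody();
            }
        } catch (CalloutException e) {
            // Dependency is down: fall through to the degraded value.
        }
        return 'UNAVAILABLE';
    }

    @IsTest
    static void degradesWhenServiceIsDown() {
        Test.setMock(HttpCalloutMock.class, new OutageMock());
        Test.startTest();
        String rating = fetchRatingOrFallback();
        Test.stopTest();
        // The workflow completes with a degraded value instead of an exception.
        System.assertEquals('UNAVAILABLE', rating);
    }
}
```

A second mock returning a 500 status would cover the failure case in which the service responds with an error instead of refusing the connection.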
Architectural priorities:
- Critical paths: must work always. Heavy investment in resilience.
- Important paths: should work usually; degrade gracefully.
- Nice-to-have: can fail without significant impact.
Don't engineer all paths to maximum resilience — costs add up. Triage based on user impact.
Senior architect insight: failure is normal at scale. Designing as if failures don't happen creates fragile systems. Designing for failure creates resilient ones.
The cost of graceful degradation is typically 10-20% of build effort. The cost of NOT having it is being one major incident away from reputational and business loss.
The senior framing: what does the user experience when X is down? If "broken UI", you have work to do. If "limited but functional", you've done it right.
