Resilience = the system continues to work (or degrades gracefully) when components fail.
Failure modes to design for:
- External system down (ERP, payment gateway, third-party API).
- Salesforce platform issue (rare but happens).
- Network failure between systems.
- Slow response or timeout from a dependency.
- Database-level failure (validation rule, record lock contention during sharing recalculation).
- Governor limit breach.
- Bad data propagating from one system to the next.
Patterns:
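1. Timeout + retry. Set an explicit timeout on every callout. Retry failed calls with exponential backoff, up to a fixed number of attempts.
2. Circuit breaker. Track the failure rate of each external service. After N consecutive failures, "open the circuit": stop calling for a cooldown period. Then let a single trial call through and close the circuit only if it succeeds.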
3. Fallback paths. When primary service fails, have a fallback:
- Display cached data instead of live.
- Submit to a queue for later processing.
- Show partial UI without the failing component.
- Use a default or heuristic "best guess" when the ML service is down.
4. Async over sync. Synchronous calls block the user; async calls are queued. An outage then causes a backlog instead of user-facing failures.
5. Bulkhead pattern. Isolate failure domains. One slow integration doesn't block the entire org.
6. Monitoring and alerting. Detect failures fast. Logs, alerts, dashboards.
7. Self-healing. Some failures auto-recover with retry. Don't surface every transient blip to humans.
8. Idempotency. Operations safe to retry. Idempotency keys prevent duplicate writes.
9. Compensating transactions. If a multi-step operation partially fails, undo the completed steps.
10. Read-replica / cache. When the external system is down, serve from cache and flag the data as possibly stale.
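
Patterns 1 and 2 can be sketched together in a few lines. The Python below is a minimal illustration, not Salesforce code: `CircuitBreaker`, `CircuitOpenError`, and `retry_with_backoff` are hypothetical names, the thresholds are arbitrary, and the wrapped function is assumed to enforce its own timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Opens after max_failures consecutive failures.

    After `cooldown` seconds one trial call is allowed (half-open state).
    """

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; skipping call")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func, waiting base_delay * 2**n seconds between failed attempts."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying is pointless while the circuit is open
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

One way to combine them is `retry_with_backoff(lambda: breaker.call(call_erp))`, where `call_erp` stands in for the real callout. Each attempt is counted by the breaker, and once the circuit opens the retry loop stops immediately instead of hammering a service that is already down.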
Specific Salesforce patterns:
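- Outbound Messages, which Salesforce retries automatically for up to 24 hours (at-least-once delivery, not guaranteed delivery).
- Platform Events with replay to catch up on events missed during downtime, within the event retention window.
- Off-platform compute (for example Heroku) as a fallback when Salesforce limits are hit.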
- Salesforce Connect with caching for external object resilience.
- Dead-letter queue in a custom object for failed integration messages.
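
The dead-letter idea pairs naturally with idempotency keys (pattern 8). The sketch below uses an in-memory Python class as a stand-in for the custom object; `IntegrationOutbox`, `DeadLetter`, and the `send` callback are hypothetical names chosen for illustration, and a real implementation would persist both the processed keys and the parked messages.

```python
from dataclasses import dataclass, field


@dataclass
class DeadLetter:
    """One parked message, analogous to a row in a dead-letter custom object."""
    idempotency_key: str
    payload: dict
    error: str
    attempts: int = 1


@dataclass
class IntegrationOutbox:
    """In-memory stand-in (assumption) for persisted integration state."""
    processed_keys: set = field(default_factory=set)
    dead_letters: dict = field(default_factory=dict)

    def deliver(self, key, payload, send):
        """Send payload at most once per key; park failures for later replay."""
        if key in self.processed_keys:
            return "duplicate"  # already delivered, so a retry is a no-op
        try:
            send(payload)
        except Exception as exc:
            entry = self.dead_letters.get(key)
            if entry is None:
                self.dead_letters[key] = DeadLetter(key, payload, str(exc))
            else:
                entry.attempts += 1
                entry.error = str(exc)
            return "dead-lettered"
        self.processed_keys.add(key)
        self.dead_letters.pop(key, None)  # clear any earlier failure record
        return "delivered"

    def replay(self, send):
        """Retry every parked message; return the keys that succeeded."""
        recovered = []
        for key, entry in list(self.dead_letters.items()):
            if self.deliver(key, entry.payload, send) == "delivered":
                recovered.append(key)
        return recovered
```

A scheduled job could call `replay` once the external system is healthy again. Because delivery is keyed, running the replay twice, or replaying a message that a user has already resubmitted by hand, does not create a duplicate write.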
Communication during failures:
- Banners on Lightning pages: "Some features temporarily unavailable."
- Status page (custom, or Salesforce Trust for platform incidents).
- Slack notifications to ops team.
- Customer-facing status if user-impacting.
Testing resilience:
- Chaos engineering — deliberately fail components in test environments.
- Failure injection in CI tests.
- Disaster recovery drills.
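
Failure injection in CI can be as simple as wrapping a dependency so that chosen calls fail, then asserting that the fallback path behaves as designed. The sketch below is a hypothetical example in Python: `get_price`, `FaultInjector`, and the cache dictionary are illustrative names, and the fallback shown is the cache-with-stale-flag pattern described earlier.

```python
def get_price(sku, fetch_live, cache):
    """Return (price, is_stale); fall back to the cache when the live call fails."""
    try:
        price = fetch_live(sku)
    except Exception:
        if sku in cache:
            return cache[sku], True   # serve cached value, flagged as stale
        raise                         # nothing cached: surface the failure
    cache[sku] = price                # refresh the cache on every success
    return price, False


class FaultInjector:
    """Wraps a callable and fails the calls whose 1-based index is in fail_on."""

    def __init__(self, func, fail_on):
        self.func = func
        self.fail_on = set(fail_on)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on:
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)
```

A test might wrap the live lookup as `FaultInjector(lookup, fail_on={2})`, call `get_price` three times, and check that only the second result is flagged stale. Choosing failures by call index keeps the test deterministic, unlike random fault rates, which suit longer-running chaos experiments better.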
Senior architects design for failure from day one. The maxim: "Everything fails eventually. Plan for it."
