How do you architect for resilience and graceful degradation?

Resilience = the system continues to work (or degrades gracefully) when components fail.

Failure modes to design for:

External system down (ERP, payment gateway, third-party API).
Salesforce platform issue (rare but happens).
Network failure between systems.
Slow response (timeout) from dependency.
Database write failure (validation rule, row lock, long-running sharing recalculation).
Governor limit breach.
Data quality issue propagating.

Patterns:

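1. Timeout + retry. Every callout has an explicit timeout. Failed calls retry with exponential backoff (plus jitter) for a bounded number of attempts, so retries don't pile onto a struggling dependency.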

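2. Circuit breaker. Track the failure rate of each external service. After N consecutive failures (or a failure-rate threshold), "open the circuit": stop calling for a cooldown period. Then resume gradually: let a single trial call through and close the circuit only if it succeeds.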

3. Fallback paths. When primary service fails, have a fallback:

Display cached data instead of live.
Submit to a queue for later processing.
Show partial UI without the failing component.
Use a precomputed or rule-based "best guess" when the ML service is down.

4. Async over sync. Synchronous calls block the user; async calls are queued. An outage then builds a backlog that drains on recovery instead of causing user-facing failures.

5. Bulkhead pattern. Isolate failure domains. One slow integration doesn't block the entire org.

6. Monitoring and alerting. Detect failures fast. Logs, alerts, dashboards.

7. Self-healing. Some failures auto-recover with retry. Don't surface every transient blip to humans.

8. Idempotency. Make operations safe to retry. Idempotency keys prevent duplicate writes when the same request arrives more than once.

9. Compensating transactions. If a multi-step operation partially fails, undo the completed steps.

10. Read-replica / cache. When the external system is down, serve from cache and tell the user the data may be stale.
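Patterns 1, 2, and 10 compose naturally, and a small sketch may make the interaction clearer. The Python below is illustrative only, not Salesforce code: CircuitBreaker, retry_with_backoff, and fetch_with_fallback are hypothetical names, the cache is a plain dict, and fetch_live is assumed to enforce its own timeout. The clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the dependency is not called."""


class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("cooling down, dependency not called")
            # Cooldown elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def retry_with_backoff(fn, attempts=3, base_delay_s=0.5, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying is pointless while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter: wait 0..base * 2^attempt seconds.
            sleep(random.uniform(0, base_delay_s * 2 ** attempt))


def fetch_with_fallback(breaker, fetch_live, cache, key, sleep=time.sleep):
    """Return (value, is_stale). Falls back to the cache when live calls fail."""
    try:
        value = retry_with_backoff(lambda: breaker.call(fetch_live), sleep=sleep)
    except Exception:
        if key in cache:
            return cache[key], True  # stale: the caller should say so in the UI
        raise
    cache[key] = value
    return value, False
```

The breaker sits inside the retry loop so that every failed attempt counts toward opening the circuit, and once it is open the caller goes straight to the cache without touching the dependency. On the platform itself the same state would have to live somewhere persistent (for example a custom setting or Platform Cache), which this sketch does not attempt.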

Specific Salesforce patterns:

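Outbound Messages, which Salesforce retries automatically for up to 24 hours, for at-least-once delivery.
Platform Events with replay (by replay ID, within the event retention window) for downtime recovery.
Off-platform compute such as a Heroku app as a fallback when Salesforce governor limits are hit.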
Salesforce Connect with caching for external object resilience.
Dead-letter queue in custom object for failed integrations.
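The dead-letter queue and the idempotency pattern work together: failed messages are parked rather than lost, and replaying them later must not create duplicates. Here is a minimal Python sketch of that flow, assuming an in-memory set and list where a real org would use custom objects. IntegrationInbox and its fields are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IntegrationInbox:
    handler: Callable[[dict], None]   # performs the actual write to the target
    max_attempts: int = 3
    processed_keys: set = field(default_factory=set)
    dead_letters: list = field(default_factory=list)

    def deliver(self, key, payload):
        """Return 'processed', 'duplicate', or 'dead-lettered'."""
        if key in self.processed_keys:
            return "duplicate"  # idempotency key already seen: skip the write
        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.handler(payload)
            except Exception as exc:
                last_error = exc
                continue
            self.processed_keys.add(key)
            return "processed"
        # Retries exhausted: park the message for later replay or inspection.
        self.dead_letters.append(
            {"key": key, "payload": payload, "error": str(last_error)}
        )
        return "dead-lettered"

    def replay_dead_letters(self):
        """Re-deliver parked messages, e.g. from a scheduled job."""
        pending, self.dead_letters = self.dead_letters, []
        return [self.deliver(m["key"], m["payload"]) for m in pending]
```

Storing the error text next to the payload gives the ops team enough context to decide whether a replay is safe or the record needs a manual fix first.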

Communication during failures:

Banners on Lightning pages: "Some features temporarily unavailable."
Status page (custom or via Salesforce Trust dashboard).
Slack notifications to ops team.
Customer-facing status if user-impacting.

Testing resilience:

Chaos engineering — deliberately fail components in test environments.
Failure injection in CI tests.
Disaster recovery drills.
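Failure injection in CI can be as simple as a test double that fails on demand. The Python sketch below assumes a hypothetical checkout function that degrades to a queue when the payment gateway times out; FailingGateway and checkout are illustrative names, not part of any library.

```python
import unittest


class FailingGateway:
    """Test double that raises a timeout for the first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def charge(self, amount):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("injected timeout")
        return {"status": "charged", "amount": amount}


def checkout(gateway, amount, queue):
    """Hypothetical code under test: queue the charge if the gateway fails."""
    try:
        return gateway.charge(amount)
    except TimeoutError:
        queue.append(amount)  # processed later, once the gateway is back
        return {"status": "queued", "amount": amount}


class CheckoutResilienceTest(unittest.TestCase):
    def test_outage_queues_instead_of_failing(self):
        queue = []
        result = checkout(FailingGateway(failures=1), 25, queue)
        self.assertEqual(result["status"], "queued")
        self.assertEqual(queue, [25])

    def test_healthy_gateway_charges_directly(self):
        queue = []
        result = checkout(FailingGateway(failures=0), 25, queue)
        self.assertEqual(result["status"], "charged")
        self.assertEqual(queue, [])
```

Run it with python -m unittest. The point is that the degraded path gets an assertion of its own, so a regression in the fallback breaks the build rather than surfacing during a real outage.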

Senior architects design for failure from day one. The maxim: "Everything fails eventually. Plan for it."
