Salesforce Dictionary - Free Salesforce Glossary
Salesforce Architect
medium

How do you architect for resilience and graceful degradation?

Resilience means the system keeps working, or degrades gracefully, when components fail.

Failure modes to design for:

  • External system down (ERP, payment gateway, third-party API).
  • Salesforce platform issue (rare, but it happens).
  • Network failure between systems.
  • Slow response (timeout) from a dependency.
  • Database-level failure (validation rule, sharing recalculation).
  • Governor limit breach.
  • Data quality issue propagating between systems.

Patterns:

1. Timeout + retry. Every callout sets an explicit timeout. Failed calls retry with exponential backoff, up to a capped number of attempts.
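To make the timeout + retry pattern concrete, here is a minimal sketch in Python (chosen for readability; it is not Apex). The names `retry_with_backoff`, `base_delay`, and `max_delay` are illustrative assumptions, and the injected `sleep` exists only so the delays can be observed in a test.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff.

    Hypothetical helper: `call` is assumed to enforce its own timeout and to
    raise TimeoutError or ConnectionError on a transient failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

In production code a random jitter is usually added to each delay so that many clients do not retry in lockstep. On the Salesforce platform itself, Apex has no sleep call, so the wait is typically achieved by re-enqueueing asynchronous work rather than blocking.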

2. Circuit breaker. Track the failure rate of each external service. After N consecutive failures, "open the circuit": stop calling for a cooldown period. Then resume gradually by letting a trial call through before restoring full traffic.
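A circuit breaker can be sketched as a small state holder that counts consecutive failures. The version below is an illustration in Python under simplifying assumptions: a single process, no persistence, and hypothetical names (`CircuitBreaker`, `CircuitOpenError`). In a Salesforce org the counters would have to live somewhere shared, which this sketch does not attempt.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open; skipping call")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a timeout, which is what lets the fallback path take over quickly.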

3. Fallback paths. When primary service fails, have a fallback:

  • Display cached data instead of live.
  • Submit to a queue for later processing.
  • Show partial UI without the failing component.
  • Use a heuristic "best guess" when an ML service is down.

4. Async over sync. Synchronous calls block the user; async calls queue. An outage then causes a backlog that drains later, instead of user-facing failures.
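The async-over-sync idea can be illustrated with an in-memory queue: the user-facing path only enqueues, and a separate worker delivers. This is a sketch in Python with hypothetical names (`OutboundQueue`, `submit`, `drain`); a real implementation would use a durable queue rather than a `deque`.

```python
from collections import deque
from typing import Callable, Deque


class OutboundQueue:
    def __init__(self, send: Callable[[str], None]) -> None:
        self.send = send  # assumed to raise ConnectionError during an outage
        self.pending: Deque[str] = deque()

    def submit(self, message: str) -> str:
        """User-facing path: never touches the external system."""
        self.pending.append(message)
        return "queued"

    def drain(self) -> int:
        """Worker path: deliver in order, stop at the first failure."""
        delivered = 0
        while self.pending:
            message = self.pending[0]
            try:
                self.send(message)
            except ConnectionError:
                break  # leave the backlog in place for the next run
            self.pending.popleft()
            delivered += 1
        return delivered
```

During an outage `submit` keeps succeeding and `drain` returns 0; once the dependency recovers, the next `drain` works through the backlog.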

5. Bulkhead pattern. Isolate failure domains. One slow integration doesn't block the entire org.
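One common way to build a bulkhead is to give each integration its own concurrency budget, so a slow dependency can exhaust only its own slots. The sketch below uses a semaphore in Python; `Bulkhead` and `BulkheadFullError` are hypothetical names, and the slot counts are arbitrary.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when an integration has no free slots left."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn: Callable[[], T]) -> T:
        # Fail fast instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no capacity for this integration")
        try:
            return fn()
        finally:
            self._slots.release()


# One bulkhead per failure domain (names are illustrative).
erp_bulkhead = Bulkhead(max_concurrent=2)
payments_bulkhead = Bulkhead(max_concurrent=5)
```

If every ERP slot is held by hung calls, further ERP calls are rejected immediately, while payment calls still have their own capacity.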

6. Monitoring and alerting. Detect failures fast. Logs, alerts, dashboards.

7. Self-healing. Some failures auto-recover with retry. Don't surface every transient blip to humans.

8. Idempotency. Make operations safe to retry. Idempotency keys prevent duplicate writes when the same request arrives twice.
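As an illustration of idempotency keys, the sketch below records the result of each request under a caller-supplied key and returns the stored result on a repeat. `PaymentStore` and its fields are hypothetical, and the dictionary stands in for durable storage.

```python
from typing import Dict, List


class PaymentStore:
    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> stored result
        self.charges: List[int] = []         # side effects actually performed

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retried request with the same key returns the original result
        # and performs no second write.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        result = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

Inside Salesforce, a similar effect is commonly achieved by upserting on an External ID field, so that a replayed message updates the existing record instead of creating a duplicate.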

9. Compensating transactions. If a multi-step operation partially fails, undo the completed steps.
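A compensating transaction can be sketched as a list of (action, undo) pairs: when a step fails, the undo functions of the steps that already completed run in reverse order. `run_saga` is a hypothetical name, and the sketch assumes that the undo functions themselves do not fail.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensate)


def run_saga(steps: List[Step]) -> None:
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo completed steps, most recent first, then re-raise.
            for undo in reversed(completed):
                undo()
            raise
        completed.append(compensate)
```

For example, if "reserve stock" and "charge card" succeed but "create shipment" fails, the card is refunded first and the stock is released second.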

10. Read replica / cache. When the external system is down, serve from cache and tell the user the data may be stale.
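Serving cached data with a staleness acknowledgement might look like the sketch below: every successful fetch refreshes the cache, and a failed fetch falls back to the last known value, flagged as stale with its age. `StaleAwareCache` and the result fields are illustrative assumptions.

```python
import time
from typing import Callable, Dict, Tuple


class StaleAwareCache:
    def __init__(
        self,
        fetch: Callable[[str], str],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.fetch = fetch  # assumed to raise ConnectionError when the source is down
        self.clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, fetched_at)

    def get(self, key: str) -> dict:
        try:
            value = self.fetch(key)
        except ConnectionError:
            if key not in self._entries:
                raise  # nothing cached: the failure cannot be hidden
            value, fetched_at = self._entries[key]
            return {"value": value, "stale": True,
                    "age_seconds": self.clock() - fetched_at}
        self._entries[key] = (value, self.clock())
        return {"value": value, "stale": False, "age_seconds": 0.0}
```

The `stale` flag and `age_seconds` give the UI what it needs to show a "last updated" notice instead of silently presenting old data as live.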

Specific Salesforce patterns:

  • Outbound Messages, which the platform retries automatically for up to 24 hours until the endpoint acknowledges them.
  • Platform Events, where a subscriber that was down can replay missed events within the retention window.
  • Heroku, or other off-platform compute, as a fallback when Salesforce limits are hit.
  • Salesforce Connect with caching for external object resilience.
  • A dead-letter queue in a custom object for failed integrations.
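The dead-letter queue idea from the list above can be sketched generically: a message that still fails after a fixed number of attempts is recorded with its error instead of being dropped or blocking the rest of the batch. This is an illustration in Python, not Apex; the `dead_letters` list stands in for the custom object, and all names are hypothetical.

```python
from typing import Callable, Iterable, List


def process_with_dead_letter(
    messages: Iterable[str],
    handler: Callable[[str], None],
    dead_letters: List[dict],
    max_attempts: int = 3,
) -> int:
    """Process each message; park repeated failures in `dead_letters`.

    Returns the number of messages handled successfully.
    """
    processed = 0
    for message in messages:
        last_error = ""
        for _attempt in range(max_attempts):
            try:
                handler(message)
                processed += 1
                break
            except Exception as exc:
                last_error = str(exc)
        else:
            # Every attempt failed: keep the payload and the error for review.
            dead_letters.append(
                {"payload": message, "error": last_error, "attempts": max_attempts}
            )
    return processed
```

Storing the payload and the last error together lets an operator fix the underlying cause and reprocess the record later.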

Communication during failures:

  • Banners on Lightning pages: "Some features temporarily unavailable."
  • Status page (custom, or the Salesforce Trust site for platform incidents).
  • Slack notifications to ops team.
  • Customer-facing status updates if the failure affects users.

Testing resilience:

  • Chaos engineering: deliberately fail components in test environments.
  • Failure injection in CI tests.
  • Disaster recovery drills.
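Failure injection in a CI test can be as simple as passing in a dependency that always fails and asserting that the caller degrades instead of crashing. The sketch below is in Python with hypothetical names (`load_account_summary`, `failing`); the same shape applies to any test framework that supports stubbing a callout.

```python
from typing import Callable, Optional


def load_account_summary(fetch_balance: Callable[[], int]) -> dict:
    """Return a summary, degrading gracefully if the balance service fails."""
    balance: Optional[int]
    try:
        balance = fetch_balance()
        degraded = False
    except (TimeoutError, ConnectionError):
        balance = None  # show the page without the failing component
        degraded = True
    return {"balance": balance, "degraded": degraded}


def failing(exc: Exception) -> Callable[[], int]:
    """Build a stub dependency that always raises `exc` (the injected failure)."""
    def _raise() -> int:
        raise exc
    return _raise


def test_summary_degrades_on_timeout() -> None:
    result = load_account_summary(failing(TimeoutError("injected")))
    assert result == {"balance": None, "degraded": True}
```

Tests like this pin down the intended degraded behavior, so a later change that turns a graceful fallback into an unhandled exception fails the build.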

Senior architects design for failure from day one. The maxim: "Everything fails eventually. Plan for it."

Why this answer works

Senior. The pattern catalogue and "design for failure from day one" framing are mature.
