How do you architect error recovery in a complex Apex system: retries, dead-letter, alerts? (Salesforce Developer interview prep)

Production Apex systems fail. The question is what happens next. A robust error recovery architecture has multiple layers.

1. Catch-and-classify errors

```apex
try {
    riskyOperation();
} catch (CalloutException e) {
    handleTransient(e);   // network / timeout
} catch (DmlException e) {
    handleData(e);        // validation, conflict
} catch (Exception e) {
    handleUnknown(e);
}
// System.LimitException cannot be caught: a governor limit breach ends the
// transaction. Guard against it up front with the Limits class
// (for example Limits.getQueries() against Limits.getLimitQueries()).
```

Different error categories deserve different recovery: transient errors retry; data errors don't.
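The classification step can also be pulled into a small helper so that callers branch on a category rather than on exception types. The following is a minimal sketch, not part of the original answer: `ErrorClassifier` and `ErrorCategory` are hypothetical names, and treating `UNABLE_TO_LOCK_ROW` as retryable is an assumption about what counts as transient.

```apex
public class ErrorClassifier {
    // Hypothetical categories. RETRYABLE is used because "transient" is an Apex keyword.
    public enum ErrorCategory { RETRYABLE, DATA_ERROR, UNKNOWN }

    public static ErrorCategory classify(Exception e) {
        if (e instanceof CalloutException) {
            return ErrorCategory.RETRYABLE;      // network / timeout
        }
        if (e instanceof DmlException) {
            DmlException dmlEx = (DmlException) e;
            // Assumption: lock contention is worth retrying, other DML failures are not
            if (dmlEx.getNumDml() > 0
                    && dmlEx.getDmlType(0) == StatusCode.UNABLE_TO_LOCK_ROW) {
                return ErrorCategory.RETRYABLE;
            }
            return ErrorCategory.DATA_ERROR;     // validation, conflict
        }
        return ErrorCategory.UNKNOWN;
    }
}
```

A caller would then retry only when `classify(e)` returns `RETRYABLE` and log or dead-letter everything else.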

2. Persistent error log

Every uncaught (and significant caught) exception logs to Error_Log__c:

```apex
Error_Log__c log = new Error_Log__c(
    Class__c = 'OrderProcessor',
    Method__c = 'processOrder',
    Message__c = e.getMessage(),
    Stack_Trace__c = e.getStackTraceString(),
    Record_Id__c = recordId,
    User_Id__c = UserInfo.getUserId(),
    Timestamp__c = DateTime.now()
);
insert log;
```

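A persistent record is queryable for forensics and reportable. Note that a plain insert only survives if the transaction commits: if the exception propagates and the transaction rolls back, the log row is rolled back with it. Logging that must outlive a failed transaction is usually done out of band, for example through a platform event configured to publish immediately.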
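One way to keep a log entry even when the surrounding transaction rolls back is to publish a platform event instead of inserting the record directly. The sketch below is an illustration under assumptions: `Error_Log_Event__e` is a hypothetical platform event whose publish behavior is set to "Publish Immediately" and whose fields mirror `Error_Log__c`, and `ErrorLogger` is a hypothetical class name. A trigger subscribed to the event would then create the `Error_Log__c` row in its own transaction.

```apex
public class ErrorLogger {
    // Assumption: Error_Log_Event__e is a hypothetical platform event with
    // "Publish Immediately" behavior, so it is delivered even if the
    // publishing transaction is rolled back afterwards.
    public static void log(String className, String methodName, Exception e, Id recordId) {
        Error_Log_Event__e evt = new Error_Log_Event__e(
            Class__c = className,
            Method__c = methodName,
            Message__c = e.getMessage(),
            Stack_Trace__c = e.getStackTraceString(),
            Record_Id__c = recordId,
            User_Id__c = UserInfo.getUserId()
        );
        Database.SaveResult result = EventBus.publish(evt);
        if (!result.isSuccess()) {
            // Last resort: leave a trace in the debug log
            System.debug(LoggingLevel.ERROR, 'Could not publish error event: ' + result.getErrors());
        }
    }
}
```

A catch block would call `ErrorLogger.log('OrderProcessor', 'processOrder', e, recordId);` and can then rethrow without losing the log entry.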

3. Retry with exponential backoff

For transient failures (HTTP timeout, record lock contention such as UNABLE_TO_LOCK_ROW):

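```apex
public class RetryableJob implements Queueable, Database.AllowsCallouts {
    private static final Integer MAX_ATTEMPTS = 5;
    private static final Integer MAX_DELAY_MINUTES = 10; // upper bound for a delayed Queueable
    private final Id recordId;
    private final Integer attempt; // zero-based

    public RetryableJob(Id recordId, Integer attempt) {
        this.recordId = recordId;
        this.attempt = attempt;
    }

    public void execute(QueueableContext ctx) {
        try {
            doWork(recordId); // performs the callout; body omitted here
        } catch (CalloutException e) {
            if (attempt < MAX_ATTEMPTS) {
                // Exponential backoff in minutes: 1, 2, 4, 8, then capped at 10.
                // System.scheduleBatch only accepts Batchable jobs, so the retry
                // is re-enqueued as a delayed Queueable instead.
                Integer delayMin = Math.min(Math.pow(2, attempt).intValue(), MAX_DELAY_MINUTES);
                System.enqueueJob(new RetryableJob(recordId, attempt + 1), delayMin);
            } else {
                sendToDeadLetter(recordId, e);
            }
        }
    }
}
```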

Each retry waits longer. After max attempts, give up and route to dead-letter.
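The backoff schedule itself is easy to isolate and check. A minimal sketch, assuming retries are re-enqueued with `System.enqueueJob(job, delayInMinutes)`, which accepts a delay of 0 to 10 minutes; `Backoff` is a hypothetical helper name, and longer waits would need a different mechanism such as a scheduled job.

```apex
public class Backoff {
    // Delayed Queueables accept at most a 10-minute delay
    private static final Integer MAX_DELAY_MINUTES = 10;

    // attempt is zero-based: 0 -> 1 min, 1 -> 2, 2 -> 4, 3 -> 8, then capped at 10
    public static Integer delayMinutes(Integer attempt) {
        Integer raw = Math.pow(2, attempt).intValue();
        return Math.min(raw, MAX_DELAY_MINUTES);
    }
}
```

Usage from inside the job's catch block would look like `System.enqueueJob(new RetryableJob(recordId, attempt + 1), Backoff.delayMinutes(attempt));`.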

4. Dead-letter queue

Failed jobs that exhausted retries go to Dead_Letter__c records. Admins review manually:

```apex
private static void sendToDeadLetter(Id recordId, Exception e) {
    Dead_Letter__c dl = new Dead_Letter__c(
        Original_Record_Id__c = recordId,
        Error__c = e.getMessage(),
        Last_Attempt__c = DateTime.now(),
        Status__c = 'Pending Review'
    );
    insert dl;
    notifyAdmins(dl);
}
```

Admins can retry manually (set Status__c to 'Retry', which a flow picks up to re-enqueue the job) or mark the record as resolved.
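The flow needs some Apex entry point to re-enqueue the job. One possible shape is an invocable action, sketched here under assumptions: `DeadLetterRetryAction` is a hypothetical class, the `'Retrying'` status value is invented for illustration, and `Original_Record_Id__c` is assumed to hold a value that can be cast to an `Id`.

```apex
public class DeadLetterRetryAction {
    // Hypothetical invocable action that the flow could call when Status__c becomes 'Retry'
    @InvocableMethod(label='Re-enqueue dead letters')
    public static void reenqueue(List<Id> deadLetterIds) {
        List<Dead_Letter__c> letters = [
            SELECT Id, Original_Record_Id__c
            FROM Dead_Letter__c
            WHERE Id IN :deadLetterIds AND Status__c = 'Retry'
        ];
        for (Dead_Letter__c dl : letters) {
            // Restart the retry cycle from attempt 0
            System.enqueueJob(new RetryableJob((Id) dl.Original_Record_Id__c, 0));
            dl.Status__c = 'Retrying'; // assumed extra picklist value
        }
        update letters;
    }
}
```

Each call to `System.enqueueJob` counts against the per-transaction Queueable limit, so a bulk retry of many dead letters would need batching rather than one job per record.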

5. Alerts and notifications

Email alert for high-priority errors.
Custom Notification to the operator.
Slack/Teams webhook for ops channel.
Aggregated daily digest for low-priority errors.
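As an illustration of the Custom Notification channel, the sketch below uses the standard `Messaging.CustomNotification` class. The class name `ErrorNotifier` and the notification type API name `Error_Alert` are assumptions; the type would have to be created in Setup first.

```apex
public class ErrorNotifier {
    // Assumption: a Custom Notification Type with API name 'Error_Alert' exists in Setup
    public static void notifyOperators(Id targetRecordId, String message, Set<String> userIds) {
        List<CustomNotificationType> types = [
            SELECT Id FROM CustomNotificationType
            WHERE DeveloperName = 'Error_Alert'
            LIMIT 1
        ];
        if (types.isEmpty()) {
            return; // nothing to send with; avoid failing the caller
        }
        Messaging.CustomNotification notification = new Messaging.CustomNotification();
        notification.setTitle('Apex error needs attention');
        notification.setBody(message.abbreviate(500)); // keep the body short; length is an arbitrary choice
        notification.setNotificationTypeId(types[0].Id);
        notification.setTargetId(targetRecordId);      // record opened when the user clicks the notification
        notification.send(userIds);
    }
}
```

A helper like the `notifyAdmins(dl)` call in the dead-letter code could delegate to this method, passing the `Dead_Letter__c` Id as the target.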

6. Circuit breaker pattern

If an external service is consistently failing, stop trying for a while:

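```apex
public class CircuitBreaker {
    public static Boolean isOpen(String serviceName) {
        // Query into a list: assigning an empty result to a single sObject
        // would throw a QueryException when no Circuit__c row exists yet.
        List<Circuit__c> circuits = [
            SELECT Failure_Count__c, Open_Until__c
            FROM Circuit__c
            WHERE Name = :serviceName
            LIMIT 1
        ];
        return !circuits.isEmpty()
            && circuits[0].Open_Until__c != null
            && circuits[0].Open_Until__c > DateTime.now();
    }

    public static void recordFailure(String serviceName) {
        // increment failure count; if > threshold, open circuit for 5 mins
    }
}
```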

Saves API quota and avoids piling up errors during a known outage.
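The failure-recording half can be sketched as follows. This is one possible implementation, not the original author's: the class name `CircuitFailureRecorder` and the threshold of 5 are assumptions, and `Circuit__c.Name` is assumed to be a writable text field rather than an auto-number.

```apex
public class CircuitFailureRecorder {
    private static final Integer FAILURE_THRESHOLD = 5; // assumed value
    private static final Integer OPEN_MINUTES = 5;

    public static void recordFailure(String serviceName) {
        List<Circuit__c> circuits = [
            SELECT Id, Failure_Count__c, Open_Until__c
            FROM Circuit__c
            WHERE Name = :serviceName
            LIMIT 1
        ];
        // Create the circuit row on first failure
        Circuit__c c = circuits.isEmpty()
            ? new Circuit__c(Name = serviceName, Failure_Count__c = 0)
            : circuits[0];
        Decimal current = c.Failure_Count__c == null ? 0 : c.Failure_Count__c;
        c.Failure_Count__c = current + 1;
        if (c.Failure_Count__c > FAILURE_THRESHOLD) {
            // Open the circuit and start counting afresh
            c.Open_Until__c = DateTime.now().addMinutes(OPEN_MINUTES);
            c.Failure_Count__c = 0;
        }
        upsert c;
    }
}
```

Because this performs DML, it should run after the last callout in the transaction; a callout attempted after uncommitted DML fails. A matching success path that resets `Failure_Count__c` is left out for brevity.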

7. Idempotency

Every retryable operation must be safe to repeat. Use idempotency keys (for example an External ID field combined with upsert) so that a repeated attempt updates the existing record instead of creating a duplicate.
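A minimal sketch of the external-ID approach, assuming a hypothetical object `Order_Sync__c` with an External ID field `External_Key__c` (neither appears in the original answer):

```apex
public class OrderSyncWriter {
    // Assumption: Order_Sync__c and its External ID field External_Key__c are hypothetical
    public static Database.UpsertResult[] save(List<Order_Sync__c> rows) {
        // Matching on the external key means a retried call updates the row
        // written by the first attempt instead of inserting a second one.
        // allOrNone = false lets the caller inspect per-row failures.
        return Database.upsert(rows, Order_Sync__c.Fields.External_Key__c, false);
    }
}
```

Running `save` twice with the same `External_Key__c` values leaves one row per key, which is what makes the enclosing job safe to retry.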

8. Observability

Dashboards showing error rate by class/method.
Trend reports: is the error rate rising?
Error log retention: purge old logs on a schedule, or archive them to a Big Object.
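The data behind such a dashboard can come straight from an aggregate query over the log object. The sketch below uses the `Error_Log__c` fields shown earlier; the class name `ErrorRateReport` and the 7-day window are assumptions.

```apex
public class ErrorRateReport {
    // Counts Error_Log__c rows per class and method over the last 7 days
    public static Map<String, Integer> countsByMethod() {
        Map<String, Integer> counts = new Map<String, Integer>();
        for (AggregateResult row : [
            SELECT Class__c cls, Method__c mth, COUNT(Id) total
            FROM Error_Log__c
            WHERE Timestamp__c = LAST_N_DAYS:7
            GROUP BY Class__c, Method__c
        ]) {
            String key = (String) row.get('cls') + '.' + (String) row.get('mth');
            counts.put(key, (Integer) row.get('total'));
        }
        return counts;
    }
}
```

The same grouping can be built declaratively as a summary report; the Apex version is useful when the counts feed an alert threshold or the daily digest.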

9. Runbook for operators

Document: "When this error happens, here's how to fix it." Saves hours during incidents.

A mature error architecture turns failures into a manageable ops experience, not midnight pages.
