Salesforce Dictionary - Free Salesforce Glossary
Salesforce Developer
hard

How do you architect error recovery in a complex Apex system — retries, dead-letter, alerts?

Production Apex systems fail. The question is what happens next. A robust error recovery architecture has multiple layers.

1. Catch-and-classify errors

try {
    riskyOperation();
} catch (CalloutException e) {
    handleTransient(e);  // network / timeout: worth retrying
} catch (DmlException e) {
    handleData(e);  // validation, conflict: retrying the same data won't help
} catch (Exception e) {
    handleUnknown(e);
}
// System.LimitException cannot be caught, so no catch block helps there.
// Guard against it up front by checking the Limits class before expensive work.

Different error categories deserve different recovery: transient errors retry; data errors don't.
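The category does not always map one-to-one to an exception type: a DmlException can carry a row-lock failure, which is transient. A minimal sketch of a classifier follows; the ErrorClassifier class and its Category enum are hypothetical names, not part of the design above.

```apex
public class ErrorClassifier {
    public enum Category { RETRYABLE, DATA_ERROR, UNKNOWN }

    // Hypothetical helper: maps an exception to a recovery category.
    public static Category classify(Exception e) {
        if (e instanceof CalloutException) {
            return Category.RETRYABLE; // network / timeout
        }
        if (e instanceof DmlException) {
            DmlException dmlEx = (DmlException) e;
            // A row lock held by another transaction usually clears on its own.
            for (Integer i = 0; i < dmlEx.getNumDml(); i++) {
                if (dmlEx.getDmlType(i) == StatusCode.UNABLE_TO_LOCK_ROW) {
                    return Category.RETRYABLE;
                }
            }
            return Category.DATA_ERROR; // validation rule, required field, duplicate
        }
        return Category.UNKNOWN;
    }
}
```

A caller would then branch on the returned category instead of on the exception type, which keeps the retry decision in one place.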

2. Persistent error log

Log every significant exception to Error_Log__c at the point where you catch it:

Error_Log__c log = new Error_Log__c(
    Class__c = 'OrderProcessor',
    Method__c = 'processOrder',
    Message__c = e.getMessage(),
    Stack_Trace__c = e.getStackTraceString(),
    Record_Id__c = recordId,
    User_Id__c = UserInfo.getUserId(),
    Timestamp__c = DateTime.now()
);
insert log;

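A persistent record is queryable for forensics and reportable. It persists only if the transaction commits, though: when the exception propagates and the transaction rolls back, the inserted log row is rolled back with it. Errors that abort the transaction need a channel that survives rollback, such as a platform event set to publish immediately.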
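One way to keep a log entry when the surrounding transaction fails is to publish it as a platform event. The sketch below assumes a hypothetical platform event Error_Event__e, with fields mirroring Error_Log__c and its publish behavior set to "Publish Immediately"; ErrorLogger is likewise an illustrative name.

```apex
public class ErrorLogger {
    // Assumes a hypothetical platform event Error_Event__e whose publish
    // behavior is "Publish Immediately", so the event is delivered even if
    // the publishing transaction later rolls back.
    public static void log(String className, String methodName, Exception e, Id recordId) {
        Error_Event__e evt = new Error_Event__e(
            Class__c = className,
            Method__c = methodName,
            Message__c = e.getMessage(),
            Stack_Trace__c = e.getStackTraceString(),
            Record_Id__c = recordId,
            User_Id__c = UserInfo.getUserId()
        );
        Database.SaveResult result = EventBus.publish(evt);
        if (!result.isSuccess()) {
            // Last resort: the debug log is all that is left at this point.
            System.debug(LoggingLevel.ERROR, 'Could not publish error event: ' + result.getErrors());
        }
    }
}
```

An after-insert trigger on Error_Event__e would then copy each event into an Error_Log__c record in its own transaction.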

3. Retry with exponential backoff

For transient failures (HTTP timeout, record lock contention):

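public class RetryableJob implements Queueable, Database.AllowsCallouts {
    private static final Integer MAX_ATTEMPTS = 5;
    private static final Integer MAX_DELAY_MINUTES = 10; // largest delay System.enqueueJob accepts
    private final Id recordId;
    private final Integer attempt;

    public RetryableJob(Id recordId, Integer attempt) {
        this.recordId = recordId;
        this.attempt = attempt;
    }

    public void execute(QueueableContext ctx) {
        try {
            doWork(recordId); // the actual callout logic, defined elsewhere in this class
        } catch (CalloutException e) {
            if (attempt < MAX_ATTEMPTS) {
                // exponential backoff: the delay doubles per attempt, capped at 10 minutes
                Integer delayMin = Math.min(Math.pow(2, attempt).intValue(), MAX_DELAY_MINUTES);
                System.enqueueJob(new RetryableJob(recordId, attempt + 1), delayMin);
            } else {
                sendToDeadLetter(recordId, e); // see section 4
            }
        }
    }
}

Each retry waits longer, up to the 10-minute maximum delay that System.enqueueJob accepts for a Queueable; longer waits need a scheduled job instead. (System.scheduleBatch only accepts Database.Batchable classes, so it cannot re-enqueue a Queueable.) After max attempts, give up and route to dead-letter.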

4. Dead-letter queue

Failed jobs that exhausted retries go to Dead_Letter__c records. Admins review manually:

private static void sendToDeadLetter(Id recordId, Exception e) {
    Dead_Letter__c dl = new Dead_Letter__c(
        Original_Record_Id__c = recordId,
        Error__c = e.getMessage(),
        Last_Attempt__c = DateTime.now(),
        Status__c = 'Pending Review'
    );
    insert dl;
    notifyAdmins(dl);
}

Admins can manually retry (set Status__c to 'Retry', which a record-triggered flow picks up and passes to an invocable Apex action that re-enqueues the job) or mark the record as resolved.
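A flow cannot enqueue a Queueable by itself, so the manual retry path needs a small Apex action. A minimal sketch, assuming the RetryableJob(Id, Integer) constructor used in section 3; the class name DeadLetterRetryAction and the 'Retrying' status value are hypothetical.

```apex
public class DeadLetterRetryAction {
    // Hypothetical invocable action for a record-triggered flow to call
    // when Status__c changes to 'Retry'.
    @InvocableMethod(label='Retry Dead Letters')
    public static void retry(List<Id> deadLetterIds) {
        List<Dead_Letter__c> letters = [
            SELECT Id, Original_Record_Id__c
            FROM Dead_Letter__c
            WHERE Id IN :deadLetterIds AND Status__c = 'Retry'
        ];
        for (Dead_Letter__c dl : letters) {
            // Restart the attempt counter for the manual retry.
            System.enqueueJob(new RetryableJob((Id) dl.Original_Record_Id__c, 1));
            // Move off 'Retry' so the flow does not fire again for this record.
            dl.Status__c = 'Retrying';
        }
        update letters;
    }
}
```

A synchronous transaction can enqueue at most 50 jobs, so bulk retries of more than 50 dead letters would need batching.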

5. Alerts and notifications

  • Email alert for high-priority errors.
  • Custom Notification to the operator.
  • Slack/Teams webhook for ops channel.
  • Aggregated daily digest for low-priority errors.
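As an illustration of the second channel, a custom notification can be sent from Apex with Messaging.CustomNotification. The sketch assumes a custom notification type with the hypothetical developer name Apex_Error_Alert exists in Setup; ErrorNotifier is an illustrative class name.

```apex
public class ErrorNotifier {
    // Assumes a custom notification type with the hypothetical developer
    // name 'Apex_Error_Alert' has been created in Setup.
    public static void notifyOperator(Id operatorUserId, Id targetRecordId, String summary) {
        List<CustomNotificationType> types = [
            SELECT Id FROM CustomNotificationType
            WHERE DeveloperName = 'Apex_Error_Alert' LIMIT 1
        ];
        if (types.isEmpty()) {
            return; // nothing to send with; alerting must never break the caller
        }
        Messaging.CustomNotification notification = new Messaging.CustomNotification();
        notification.setNotificationTypeId(types[0].Id);
        notification.setTitle('Apex error needs attention');
        notification.setBody(summary);
        notification.setTargetId(String.valueOf(targetRecordId)); // record opened on click
        try {
            notification.send(new Set<String>{ String.valueOf(operatorUserId) });
        } catch (Exception e) {
            System.debug(LoggingLevel.WARN, 'Notification failed: ' + e.getMessage());
        }
    }
}
```

Passing the Dead_Letter__c record Id as the target lets the operator jump straight from the notification to the failed item.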

6. Circuit breaker pattern

If an external service is consistently failing, stop trying for a while:

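public class CircuitBreaker {
    private static final Integer FAILURE_THRESHOLD = 5; // assumed value; tune per service

    public static Boolean isOpen(String serviceName) {
        // Query into a list: assigning to a single record throws if no row exists yet.
        List<Circuit__c> rows = [SELECT Open_Until__c FROM Circuit__c WHERE Name = :serviceName LIMIT 1];
        return !rows.isEmpty() && rows[0].Open_Until__c != null && rows[0].Open_Until__c > DateTime.now();
    }

    public static void recordFailure(String serviceName) {
        List<Circuit__c> rows = [SELECT Failure_Count__c FROM Circuit__c WHERE Name = :serviceName LIMIT 1];
        Circuit__c c = rows.isEmpty() ? new Circuit__c(Name = serviceName, Failure_Count__c = 0) : rows[0];
        c.Failure_Count__c += 1; // assumes the field defaults to 0 on existing rows
        if (c.Failure_Count__c > FAILURE_THRESHOLD) {
            c.Open_Until__c = DateTime.now().addMinutes(5); // open the circuit for 5 minutes
            c.Failure_Count__c = 0;
        }
        upsert c;
    }
}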

Failing fast while the circuit is open saves API quota and avoids piling up errors during a known outage.
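To show where the breaker sits, here is a sketch of a callout wrapper that checks it before calling out and records a failure afterwards. InventoryClient, the service name 'InventoryService', and the named credential Inventory_API are all hypothetical.

```apex
public class InventoryClient {
    public class CircuitOpenException extends Exception {}

    // 'InventoryService' and the named credential 'Inventory_API' are hypothetical.
    public static HttpResponse fetchStock(String sku) {
        if (CircuitBreaker.isOpen('InventoryService')) {
            // Fail fast without spending a callout on a service known to be down.
            throw new CircuitOpenException('InventoryService circuit is open; skipping callout');
        }
        HttpRequest req = new HttpRequest();
        req.setEndpoint('callout:Inventory_API/stock/' + EncodingUtil.urlEncode(sku, 'UTF-8'));
        req.setMethod('GET');
        req.setTimeout(10000); // milliseconds
        try {
            return new Http().send(req);
        } catch (CalloutException e) {
            // The failure count is saved by DML, so the caller must catch the
            // rethrown exception; otherwise the rollback discards the count.
            CircuitBreaker.recordFailure('InventoryService');
            throw e;
        }
    }
}
```

The DML in recordFailure runs only after the callout, which avoids the "uncommitted work pending" error that DML before a callout would cause.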

7. Idempotency

Every retryable operation must be safe to repeat. Use an idempotency key, for example a unique External ID field, and upsert on it so that a repeated attempt updates the existing record instead of inserting a duplicate.
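A minimal sketch of an idempotent write, assuming a hypothetical Order_Sync__c object whose unique External ID field Source_Key__c holds the idempotency key:

```apex
public class OrderSync {
    // Assumes a hypothetical Order_Sync__c object with a unique External ID
    // field Source_Key__c that carries the idempotency key.
    public static Integer save(List<Order_Sync__c> incoming) {
        // Matching on Source_Key__c means a repeated run updates the
        // existing rows instead of inserting duplicates.
        List<Database.UpsertResult> results =
            Database.upsert(incoming, Order_Sync__c.Source_Key__c, false);
        Integer failures = 0;
        for (Database.UpsertResult r : results) {
            if (!r.isSuccess()) {
                failures++;
                System.debug(LoggingLevel.ERROR, 'Upsert failed: ' + r.getErrors());
            }
        }
        return failures; // number of rows that could not be saved
    }
}
```

Running the same batch twice leaves one record per key, which is what makes the retry in section 3 safe.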

8. Observability

  • Dashboards showing error rate by class/method.
  • Trend reports: is the error rate going up?
  • Error log retention: purge old logs, or archive them to a Big Object.
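The per-class error rate behind such a dashboard can also be pulled with an aggregate query over Error_Log__c. This is a sketch under the assumption that Class__c and Method__c are plain text fields; ErrorStats is an illustrative name.

```apex
public class ErrorStats {
    // Returns error counts per class and method for the last 7 days,
    // highest first. Assumes Class__c and Method__c are plain text fields
    // (long text areas cannot be used in GROUP BY).
    public static List<AggregateResult> topOffenders() {
        return [
            SELECT Class__c, Method__c, COUNT(Id) total
            FROM Error_Log__c
            WHERE CreatedDate = LAST_N_DAYS:7
            GROUP BY Class__c, Method__c
            ORDER BY COUNT(Id) DESC
            LIMIT 20
        ];
    }
}
```

Each row exposes its values through get('Class__c'), get('Method__c'), and get('total'), which is enough to feed a daily digest email.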

9. Runbook for operators

Document: "When this error happens, here's how to fix it." Saves hours during incidents.

A mature error architecture turns failures into a manageable ops experience, not midnight pages.

Why this answer works

It shows senior-level architectural thinking. The layers (catch and classify, log, retry, dead-letter, alert, circuit breaker, idempotency, observability, runbook) cover a failure from detection through recovery.
