Production Apex systems fail. The question is what happens next. A robust error recovery architecture has multiple layers.
1. Catch-and-classify errors
    try {
        riskyOperation();
    } catch (CalloutException e) {
        handleTransient(e); // network failure or timeout
    } catch (DmlException e) {
        handleData(e); // validation failure, conflict
    } catch (Exception e) {
        handleUnknown(e);
    }
    // System.LimitException is deliberately absent: it cannot be caught,
    // and a governor-limit violation ends the transaction immediately.
Different error categories deserve different recovery: transient errors are retried; data errors are not, because resubmitting the same invalid data fails the same way.
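Governor-limit violations end the transaction instead of reaching a handler, so the only defense is to check headroom before starting an expensive step. A minimal sketch, assuming a hypothetical LimitGuard helper and arbitrary reserve values; the Limits methods it calls are standard Apex:

```apex
public class LimitGuard {
    // Hypothetical reserves; tune them for the workload.
    private static final Integer QUERY_RESERVE = 5;
    private static final Integer DML_RESERVE = 5;

    // True when enough SOQL queries and DML statements remain in the
    // current transaction to attempt another unit of work.
    public static Boolean hasHeadroom() {
        return hasHeadroom(Limits.getQueries(), Limits.getLimitQueries(), QUERY_RESERVE)
            && hasHeadroom(Limits.getDmlStatements(), Limits.getLimitDmlStatements(), DML_RESERVE);
    }

    // Pure comparison, kept separate so it can be tested without consuming limits.
    public static Boolean hasHeadroom(Integer used, Integer maxAllowed, Integer reserve) {
        return maxAllowed - used >= reserve;
    }
}
```

A loop over many records could call LimitGuard.hasHeadroom() on each iteration and hand the remaining records to an asynchronous job once it returns false.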
2. Persistent error log
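Every significant exception is logged to Error_Log__c:
    Error_Log__c log = new Error_Log__c(
        Class__c = 'OrderProcessor',
        Method__c = 'processOrder',
        Message__c = e.getMessage(),
        Stack_Trace__c = e.getStackTraceString(),
        Record_Id__c = recordId,
        User_Id__c = UserInfo.getUserId(),
        Timestamp__c = DateTime.now()
    );
    insert log;
A persisted record can be queried for forensics and used in reports. It survives only if the transaction commits: an uncaught exception rolls the log row back along with everything else. Failures in a transaction that will roll back therefore need a channel that is not tied to it, such as a platform event that is published immediately and written to Error_Log__c by its subscriber.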
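A plain insert is undone if the surrounding transaction rolls back. One way to keep the log is to route it through a platform event. The sketch below assumes a hypothetical Error_Log_Event__e platform event whose fields mirror Error_Log__c and whose publish behavior is set to "Publish Immediately"; the class and trigger names are also illustrative, and the two definitions live in separate files.

```apex
// Publisher: call from a catch block, before rethrowing.
public class ErrorLogger {
    public static void publish(String className, String methodName, Exception e, Id recordId) {
        Error_Log_Event__e evt = new Error_Log_Event__e(
            Class__c = className,
            Method__c = methodName,
            Message__c = e.getMessage(),
            Stack_Trace__c = e.getStackTraceString(),
            Record_Id__c = recordId,
            User_Id__c = UserInfo.getUserId() // captured here; the subscriber runs as a different user
        );
        Database.SaveResult result = EventBus.publish(evt);
        if (!result.isSuccess()) {
            System.debug(LoggingLevel.ERROR, 'Error log event not published: ' + result.getErrors());
        }
    }
}

// Subscriber: runs in its own transaction, so its insert is not
// rolled back together with the transaction that failed.
trigger ErrorLogEventTrigger on Error_Log_Event__e (after insert) {
    List<Error_Log__c> logs = new List<Error_Log__c>();
    for (Error_Log_Event__e evt : Trigger.new) {
        logs.add(new Error_Log__c(
            Class__c = evt.Class__c,
            Method__c = evt.Method__c,
            Message__c = evt.Message__c,
            Stack_Trace__c = evt.Stack_Trace__c,
            Record_Id__c = evt.Record_Id__c,
            User_Id__c = evt.User_Id__c,
            Timestamp__c = evt.CreatedDate
        ));
    }
    insert logs;
}
```

The trade-off is that the log row appears slightly later than the failure, since the subscriber runs asynchronously.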
3. Retry with exponential backoff
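For transient failures (HTTP timeout, record lock contention):
    public class RetryableJob implements Queueable, Database.AllowsCallouts {
        private static final Integer MAX_ATTEMPTS = 5;
        private static final Integer MAX_DELAY_MIN = 10; // longest delay System.enqueueJob accepts
        private Id recordId;
        private Integer attempt;

        public RetryableJob(Id recordId, Integer attempt) {
            this.recordId = recordId;
            this.attempt = attempt;
        }

        public void execute(QueueableContext ctx) {
            try {
                doWork(recordId);
            } catch (CalloutException e) {
                if (attempt < MAX_ATTEMPTS) {
                    // Exponential backoff in minutes: 1, 2, 4, 8, then capped.
                    Integer delayMin = Math.min(Math.pow(2, attempt).intValue(), MAX_DELAY_MIN);
                    System.enqueueJob(new RetryableJob(recordId, attempt + 1), delayMin);
                } else {
                    sendToDeadLetter(recordId, e);
                }
            }
        }
    }
Each retry waits longer than the last. After the maximum number of attempts, the job gives up and routes the record to the dead-letter queue.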
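The delay schedule is easier to reason about, and to test, when it sits in a small helper of its own. A minimal sketch, assuming a hypothetical Backoff class and a caller-supplied cap; the cap matters because the delayed form of System.enqueueJob only accepts delays of up to 10 minutes:

```apex
public class Backoff {
    // Delay in minutes before retry number `attempt` (0-based): 1, 2, 4, 8, ...
    // The result never exceeds capMinutes.
    public static Integer delayMinutes(Integer attempt, Integer capMinutes) {
        Integer uncapped = Math.pow(2, attempt).intValue();
        return Math.min(uncapped, capMinutes);
    }
}
```

With a cap of 10, attempts 0 through 3 wait 1, 2, 4, and 8 minutes, and every later attempt waits 10. Longer waits would need a different mechanism, such as a scheduled job.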
4. Dead-letter queue
Failed jobs that exhausted retries go to Dead_Letter__c records. Admins review manually:
    private static void sendToDeadLetter(Id recordId, Exception e) {
        Dead_Letter__c dl = new Dead_Letter__c(
            Original_Record_Id__c = recordId,
            Error__c = e.getMessage(),
            Last_Attempt__c = DateTime.now(),
            Status__c = 'Pending Review'
        );
        insert dl;
        notifyAdmins(dl);
    }
Admins can manually retry (mark Status='Retry', which a flow re-enqueues) or mark the record as resolved.
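The flow needs something to call in order to re-enqueue the job. One option is an invocable Apex action, sketched here under several assumptions: the class name, the 'Retrying' status value, and a RetryableJob constructor taking a record Id and an attempt number are all illustrative.

```apex
public class DeadLetterRetryAction {
    // Exposed to Flow as an action. The flow passes the Ids of the
    // Dead_Letter__c records whose Status__c was set to 'Retry'.
    @InvocableMethod(label='Re-enqueue dead letters')
    public static void reenqueue(List<Id> deadLetterIds) {
        List<Dead_Letter__c> letters = [
            SELECT Id, Original_Record_Id__c, Status__c
            FROM Dead_Letter__c
            WHERE Id IN :deadLetterIds AND Status__c = 'Retry'
        ];
        for (Dead_Letter__c dl : letters) {
            // Stop before hitting the per-transaction queueable limit;
            // remaining records stay in 'Retry' for a later run.
            if (Limits.getQueueableJobs() >= Limits.getLimitQueueableJobs()) {
                break;
            }
            // Restart the retry sequence from attempt 0.
            System.enqueueJob(new RetryableJob((Id) dl.Original_Record_Id__c, 0));
            dl.Status__c = 'Retrying'; // hypothetical status value
        }
        update letters;
    }
}
```

Moving the record out of 'Retry' as soon as the job is enqueued keeps the same dead letter from being picked up twice.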
5. Alerts and notifications
- Email alert for high-priority errors.
- Custom Notification to the operator.
- Slack/Teams webhook for ops channel.
- Aggregated daily digest for low-priority errors.
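As an illustration of the first channel, the sketch below sends a plain-text email for a logged error. The ErrorAlerts class and its method names are hypothetical, and the decision about what counts as high priority is left to the caller; the Messaging calls are standard Apex.

```apex
public class ErrorAlerts {
    // Sends one email describing the logged error to the given addresses.
    // Counts against the org's email limits and throws if sending fails.
    public static void emailHighPriority(Error_Log__c log, List<String> recipients) {
        Messaging.SingleEmailMessage mail = new Messaging.SingleEmailMessage();
        mail.setToAddresses(recipients);
        mail.setSubject('[High priority] ' + log.Class__c + '.' + log.Method__c + ' failed');
        mail.setPlainTextBody(buildBody(log));
        Messaging.sendEmail(new List<Messaging.SingleEmailMessage>{ mail });
    }

    // Builds the message text; separate from sending so it can be tested on its own.
    public static String buildBody(Error_Log__c log) {
        return 'Message: ' + log.Message__c + '\n'
            + 'Record: ' + log.Record_Id__c + '\n'
            + 'Stack trace:\n' + log.Stack_Trace__c;
    }
}
```

Low-priority errors would skip this path and be picked up by the daily digest instead.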
6. Circuit breaker pattern
If an external service is consistently failing, stop trying for a while:
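    public class CircuitBreaker {
        public static Boolean isOpen(String serviceName) {
            // Query into a list: assigning directly to a single Circuit__c
            // throws a QueryException when no row exists for the service.
            List<Circuit__c> rows = [SELECT Failure_Count__c, Open_Until__c FROM Circuit__c WHERE Name = :serviceName LIMIT 1];
            // No row means no failure has been recorded, so the circuit is closed.
            return !rows.isEmpty() && rows[0].Open_Until__c != null && rows[0].Open_Until__c > DateTime.now();
        }
        public static void recordFailure(String serviceName) {
            // increment failure count; if > threshold, open circuit for 5 mins
        }
    }
This saves API quota and avoids piling up errors during a known outage.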
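The failure-recording step is only described in a comment above. One possible shape for it is sketched here as a separate hypothetical class, CircuitFailureRecorder. The threshold of 5 is an assumption, and so is the choice to reset the counter when the circuit opens; the 5-minute window comes from the comment.

```apex
public class CircuitFailureRecorder {
    private static final Integer FAILURE_THRESHOLD = 5; // assumed value
    private static final Integer OPEN_MINUTES = 5;

    public static void recordFailure(String serviceName) {
        // FOR UPDATE locks the row so concurrent failures do not overwrite each other.
        List<Circuit__c> rows = [
            SELECT Failure_Count__c, Open_Until__c
            FROM Circuit__c
            WHERE Name = :serviceName
            LIMIT 1
            FOR UPDATE
        ];
        Circuit__c c = rows.isEmpty()
            ? new Circuit__c(Name = serviceName, Failure_Count__c = 0)
            : rows[0];
        Integer failures = (c.Failure_Count__c == null ? 0 : c.Failure_Count__c.intValue()) + 1;
        c.Failure_Count__c = failures;
        if (failures > FAILURE_THRESHOLD) {
            // Open the circuit and start counting afresh once it closes again.
            c.Open_Until__c = DateTime.now().addMinutes(OPEN_MINUTES);
            c.Failure_Count__c = 0;
        }
        upsert c;
    }
}
```

Because this performs DML, the caller has to handle the original exception rather than let it escape; otherwise the transaction rolls back and the failure count is lost. A matching success path that resets the counter is omitted here.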
7. Idempotency
Every retryable operation must be safe to repeat. Use an idempotency key (e.g., an External ID field) so that repeating the same insert updates the existing record instead of creating a duplicate.
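In Apex, the usual way to get this behavior is an upsert keyed on the External ID field. A minimal sketch, assuming a hypothetical Order__c object with an External ID field named External_Key__c:

```apex
public class OrderSync {
    // Saves incoming orders keyed on External_Key__c. Running this twice
    // with the same keys updates the existing rows rather than adding new ones.
    public static void save(List<Order__c> incoming) {
        // allOrNone = false lets valid rows commit even if others fail.
        List<Database.UpsertResult> results =
            Database.upsert(incoming, Order__c.Fields.External_Key__c, false);
        for (Integer i = 0; i < results.size(); i++) {
            if (!results[i].isSuccess()) {
                System.debug(LoggingLevel.ERROR,
                    'Upsert failed for key ' + incoming[i].External_Key__c + ': ' + results[i].getErrors());
            }
        }
    }
}
```

The key has to come from the source system or be derived from the request, not generated per attempt, or each retry would look like a new record.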
8. Observability
- Dashboards showing error rate by class/method.
- Trend reports: is the error rate going up?
- Error log retention: purge old logs, or archive them to a Big Object.
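The figures behind a dashboard of error rate by class and method can also be pulled with an aggregate query. A minimal sketch, assuming the Error_Log__c fields used earlier are groupable text fields; the ErrorStats class name is hypothetical:

```apex
public class ErrorStats {
    // Error counts per class and method over the last `days` days, highest first.
    public static List<AggregateResult> countsByMethod(Integer days) {
        DateTime since = DateTime.now().addDays(-days);
        return [
            SELECT Class__c, Method__c, COUNT(Id) total
            FROM Error_Log__c
            WHERE Timestamp__c >= :since
            GROUP BY Class__c, Method__c
            ORDER BY COUNT(Id) DESC
        ];
    }
}
```

Each row exposes its values through get, for example (String) row.get('Class__c') and (Integer) row.get('total'). Comparing the result for 7 days against the result for 14 gives a rough trend signal.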
9. Runbook for operators
Document: "When this error happens, here's how to fix it." Saves hours during incidents.
A mature error architecture turns failures into a manageable ops experience, not midnight pages.