Production Apex systems fail. The question is what happens next. A robust error recovery architecture has multiple layers.
1. Catch-and-classify errors
```apex
try {
    riskyOperation();
} catch (CalloutException e) {
    handleTransient(e);   // network failure / timeout
} catch (DmlException e) {
    handleData(e);        // validation failure, record conflict
} catch (System.LimitException e) {
    // governor LimitExceptions can't actually be caught at runtime;
    // prevent them upstream with Limits checks instead
} catch (Exception e) {
    handleUnknown(e);
}
```
Different error categories deserve different recovery: transient errors retry; data errors don't.
2. Persistent error log
Every uncaught exception (and every significant caught one) is logged to Error_Log__c:
```apex
Error_Log__c log = new Error_Log__c(
    Class__c       = 'OrderProcessor',
    Method__c      = 'processOrder',
    Message__c     = e.getMessage(),
    Stack_Trace__c = e.getStackTraceString(),
    Record_Id__c   = recordId,
    User_Id__c     = UserInfo.getUserId(),
    Timestamp__c   = DateTime.now()
);
insert log;
```
A persistent record outlives the request: it is queryable for forensics and reportable. One catch: a plain insert of the log rolls back along with the failing transaction, so the usual pattern is to route it through an immediately-published platform event, sketched below.
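A minimal sketch of that pattern, assuming a hypothetical Error_Event__e platform event (publish behavior set to Publish Immediately) with fields mirroring Error_Log__c; its subscriber trigger creates the Error_Log__c record in a separate transaction:

```apex
// Error_Event__e is an assumed platform event with Publish Immediately behavior,
// so the message is sent even if the surrounding transaction later rolls back.
Error_Event__e evt = new Error_Event__e(
    Class_Name__c  = 'OrderProcessor',
    Message__c     = e.getMessage(),
    Stack_Trace__c = e.getStackTraceString(),
    Record_Id__c   = recordId
);
EventBus.publish(evt);
// A trigger subscribed to Error_Event__e then inserts the Error_Log__c record
// in its own transaction, independent of the caller's success or failure.
```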
3. Retry with exponential backoff
For transient failures (HTTP timeout, SOQL row lock):
```apex
public class RetryableJob implements Queueable, Database.AllowsCallouts {
    private static final Integer MAX_ATTEMPTS = 5;
    private Id recordId;
    private Integer attempt;

    public RetryableJob(Id recordId, Integer attempt) {
        this.recordId = recordId;
        this.attempt = attempt;
    }

    public void execute(QueueableContext ctx) {
        try {
            doWork(recordId);
        } catch (CalloutException e) {
            if (attempt < MAX_ATTEMPTS) {
                // exponential backoff: 1, 2, 4, 8... minutes (the enqueueJob delay is capped at 10)
                Integer delayMinutes = Math.min(Math.pow(2, attempt).intValue(), 10);
                System.enqueueJob(new RetryableJob(recordId, attempt + 1), delayMinutes);
            } else {
                sendToDeadLetter(recordId, e);
            }
        }
    }
}
```
Each retry waits longer, up to the platform's queueable delay cap. After the maximum number of attempts, give up and route the work to the dead-letter queue.
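Kicking off the first attempt from the synchronous path is a single enqueue; here failedOrderId stands in for whichever record the job should process:

```apex
// Attempt 0; on failure the job re-enqueues itself with an increasing delay.
System.enqueueJob(new RetryableJob(failedOrderId, 0));
```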
4. Dead-letter queue
Jobs that have exhausted their retries become Dead_Letter__c records for an admin to review manually:
```apex
private static void sendToDeadLetter(Id recordId, Exception e) {
    Dead_Letter__c dl = new Dead_Letter__c(
        Original_Record_Id__c = recordId,
        Error__c              = e.getMessage(),
        Last_Attempt__c       = DateTime.now(),
        Status__c             = 'Pending Review'
    );
    insert dl;
    notifyAdmins(dl);
}
```
Admins can manually retry (set Status__c to 'Retry', which a record-triggered flow picks up and re-enqueues; a sketch follows) or mark the record resolved.
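A minimal sketch of that re-enqueue hook, assuming the record-triggered flow on Dead_Letter__c passes the updated records into an invocable Apex action (the class name and label are illustrative):

```apex
public with sharing class DeadLetterRetry {
    // Called by a record-triggered flow when Status__c is changed to 'Retry'.
    @InvocableMethod(label='Re-enqueue dead-lettered work')
    public static void retry(List<Dead_Letter__c> deadLetters) {
        for (Dead_Letter__c dl : deadLetters) {
            // Restart the retry cycle at attempt 0; the flow updates Status__c afterwards.
            System.enqueueJob(new RetryableJob(dl.Original_Record_Id__c, 0));
        }
    }
}
```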
5. Alerts and notifications
- Email alert for high-priority errors.
- Custom Notification to the operator (a sketch follows this list).
- Slack/Teams webhook for ops channel.
- Aggregated daily digest for low-priority errors.
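As a concrete example of the in-app channel, here is a minimal sketch using Messaging.CustomNotification; the 'Error_Alert' notification type and the choice of recipient are assumptions:

```apex
// Assumes a Custom Notification Type named 'Error_Alert' has been created in Setup.
CustomNotificationType notifType = [
    SELECT Id FROM CustomNotificationType WHERE DeveloperName = 'Error_Alert' LIMIT 1
];
Messaging.CustomNotification notification = new Messaging.CustomNotification();
notification.setNotificationTypeId(notifType.Id);
notification.setTitle('High-priority error in OrderProcessor');
notification.setBody(e.getMessage());
notification.setTargetId(recordId);                           // record the bell notification opens
notification.send(new Set<String>{ UserInfo.getUserId() });   // or a set of on-call user Ids
```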
6. Circuit breaker pattern
If an external service is consistently failing, stop trying for a while:
```apex
public class CircuitBreaker {
    private static final Integer FAILURE_THRESHOLD = 5;

    public static Boolean isOpen(String serviceName) {
        List<Circuit__c> circuits =
            [SELECT Open_Until__c FROM Circuit__c WHERE Name = :serviceName LIMIT 1];
        return !circuits.isEmpty()
            && circuits[0].Open_Until__c != null
            && circuits[0].Open_Until__c > DateTime.now();
    }

    public static void recordFailure(String serviceName) {
        // Increment the failure count; past the threshold, open the circuit for 5 minutes.
        List<Circuit__c> circuits =
            [SELECT Failure_Count__c FROM Circuit__c WHERE Name = :serviceName LIMIT 1];
        Circuit__c c = circuits.isEmpty()
            ? new Circuit__c(Name = serviceName, Failure_Count__c = 0)
            : circuits[0];
        if (c.Failure_Count__c == null) { c.Failure_Count__c = 0; }
        c.Failure_Count__c += 1;
        if (c.Failure_Count__c >= FAILURE_THRESHOLD) {
            c.Open_Until__c = DateTime.now().addMinutes(5);
            c.Failure_Count__c = 0;   // reset so the next window starts clean
        }
        upsert c;
    }
}
```
Saves API quota and avoids piling up errors during a known outage.
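At the call site, the breaker is consulted before the callout and a failure is recorded after. A usage sketch in which 'PaymentGateway', buildRequest(...), and processResponse(...) are hypothetical placeholders:

```apex
// 'PaymentGateway', buildRequest(...), and processResponse(...) are illustrative only.
if (CircuitBreaker.isOpen('PaymentGateway')) {
    return;   // skip the callout entirely while the circuit is open
}
try {
    HttpResponse res = new Http().send(buildRequest(recordId));
    processResponse(res);
} catch (CalloutException e) {
    CircuitBreaker.recordFailure('PaymentGateway');
    throw e;  // let the retry / dead-letter machinery take over
}
```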
7. Idempotency
Every retryable operation must be safe to repeat. Use idempotency keys (e.g., a unique External Id field) so that replaying the same insert doesn't create duplicates.
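A hedged illustration, assuming a hypothetical Invoice__c object with a unique External_Key__c External Id field: upserting on the key makes a replay update the existing row instead of creating a second one.

```apex
// External_Key__c is assumed to be a unique External Id field on a hypothetical Invoice__c object.
String idempotencyKey = 'ORD-1042-LINE-3';   // deterministic key for this unit of work
Invoice__c inv = new Invoice__c(
    External_Key__c = idempotencyKey,
    Amount__c       = 99.50
);
// A retried run hits the same key and updates the existing record rather than inserting a new one.
upsert inv External_Key__c;
```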
8. Observability
- Dashboards showing error rate by class/method.
- Trend reports — error rate going up?
- Error log retention: purge old logs on a schedule, or archive them to a Big Object (sketched below).
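A sketch of the archive-and-purge path, assuming a hypothetical Error_Log_Archive__b Big Object with fields mirroring Error_Log__c; Big Object rows are written with Database.insertImmediate, after which the batch deletes the archived source rows:

```apex
public class ErrorLogArchiveBatch implements Database.Batchable<SObject> {
    public Database.QueryLocator start(Database.BatchableContext bc) {
        // Archive error logs older than 90 days.
        return Database.getQueryLocator(
            'SELECT Class__c, Method__c, Message__c, Timestamp__c ' +
            'FROM Error_Log__c WHERE Timestamp__c < LAST_N_DAYS:90'
        );
    }

    public void execute(Database.BatchableContext bc, List<SObject> scope) {
        List<Error_Log_Archive__b> archive = new List<Error_Log_Archive__b>();
        for (SObject s : scope) {
            Error_Log__c log = (Error_Log__c) s;
            archive.add(new Error_Log_Archive__b(
                Class__c     = log.Class__c,
                Method__c    = log.Method__c,
                Message__c   = log.Message__c,
                Timestamp__c = log.Timestamp__c
            ));
        }
        Database.insertImmediate(archive);   // Big Object writes are immediate, not transactional
        delete scope;                        // purge the archived rows from the custom object
    }

    public void finish(Database.BatchableContext bc) {}
}
```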
9. Runbook for operators
Document: "When this error happens, here's how to fix it." Saves hours during incidents.
A mature error architecture turns failures into a manageable ops experience, not midnight pages.
