
503 Service Unavailable / SERVER_UNAVAILABLE / The Salesforce server is temporarily unavailable

Salesforce's instance hit a transient problem — usually maintenance, an incident, or an instance under load. Almost always a wait-and-retry situation. Check status.salesforce.com for your instance before opening a support case.

Also seen as: 503 Service Unavailable · SERVER_UNAVAILABLE · Salesforce server is temporarily unavailable · Service Unavailable salesforce

A payment integration that posts payment confirmations to Salesforce after every successful transaction starts logging 503 Service Unavailable responses around 2 PM on a Thursday. Throughput drops from 200 calls per minute to a handful. The integration team checks the Salesforce status page; nothing is reported. The payment processor's own systems are healthy. Whatever is happening is between the integration and the Salesforce edge.

What the platform is checking

A 503 Service Unavailable response means the platform is, for the moment, refusing to process the request. The server is up, the routing is up, but the application tier is either overloaded, in maintenance, or has applied a throttle to the calling client.

Salesforce returns 503 in several scenarios. The org may be in a brief maintenance window. The pod (the multi-tenant instance hosting the org) may be experiencing a load spike from another tenant. The API gateway may be applying a back-pressure throttle to the calling integration. A planned release may be rolling out and a subset of the platform may be temporarily unavailable.

Most 503 responses are transient. The platform recovers within seconds to minutes. The integration's job is to retry intelligently rather than treat the response as a hard failure.

The error body is sparse. Headers may include Retry-After with a suggested wait. When present, the value is normally a delay in seconds, although HTTP also allows an absolute date. The integration should honor it. When absent, the integration falls back to exponential backoff.
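Because HTTP permits two forms for Retry-After, a small parser keeps the retry code honest. The following is a minimal sketch; the function name parseRetryAfterMs and its null-on-failure convention are illustrative assumptions, not part of any Salesforce SDK.

```javascript
// Convert a Retry-After header value into a delay in milliseconds.
// Returns null when the header is missing or unparseable, so the caller
// can fall back to exponential backoff.
// (Hypothetical helper; the name and return convention are assumptions.)
function parseRetryAfterMs(value, nowMs = Date.now()) {
    if (value === null || value === undefined) return null;
    const trimmed = String(value).trim();
    if (trimmed === '') return null;

    // Form 1: delay-seconds, a non-negative integer such as "120".
    if (/^\d+$/.test(trimmed)) {
        return Number(trimmed) * 1000;
    }

    // Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    const dateMs = Date.parse(trimmed);
    if (Number.isNaN(dateMs)) return null;
    return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}
```

A caller would pass res.headers.get('Retry-After') and use the backoff schedule whenever the result is null.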

Salesforce's trust.salesforce.com page reports incidents at the pod level. A 503 that affects all callers to a pod usually appears on trust within minutes. A 503 that affects only the calling integration (because the integration is hitting a rate limit or has a connection-pool issue) does not appear on trust because it is not platform-wide.

What is happening upstream

Three causes cover most production 503 occurrences.

Platform maintenance and release windows. Salesforce ships three major releases a year plus more frequent patch releases, rolled out instance by instance. Pods undergo brief maintenance during these windows. The maintenance is short but visible: a few minutes of degraded availability per pod per release.

Pod-level capacity events. A multi-tenant pod hosts hundreds or thousands of orgs. If one tenant on the pod runs a sudden expensive workload (large batch, large data load), the pod can briefly throttle all tenants until the workload completes. This is rare and short-lived but produces 503s during the throttle window.

Per-client rate limiting. Salesforce applies per-integration rate limits to protect the platform from runaway clients. An integration that exceeds its quota receives 503 (sometimes 429 in newer APIs) with a back-off signal. The integration must slow down or pause before retrying.

The broken example

A Node.js integration that posts payment confirmations without retry logic:

async function recordPayment(payment, token) {
    const res = await fetch(`${INSTANCE_URL}/services/data/v60.0/sobjects/Payment__c`, {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${token}`,
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({
            Amount__c: payment.amount,
            Customer__c: payment.customerId,
            Reference__c: payment.reference
        })
    });

    if (!res.ok) {
        console.error('Failed:', res.status);
        return null;
    }
    return await res.json();
}

When Salesforce returns 503, the function logs and returns null. The calling code interprets null as a permanent failure. The payment is never recorded in Salesforce. The reconciliation team has to manually backfill from the payment processor's log.

The integration drops dozens of legitimate records during a five-minute platform incident because there is no retry. A better-built integration would absorb the incident transparently.

The fix, three paths

Add retry with exponential backoff. The simplest fix retries on 503 with increasing delays: the first retry waits one second, the second two, the third four. Eventually either the platform recovers or the retry budget is exhausted.

async function recordPaymentWithRetry(payment, token, maxRetries = 5) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const res = await fetch(`${INSTANCE_URL}/services/data/v60.0/sobjects/Payment__c`, {
            method: 'POST',
            headers: {
                'Authorization': `Bearer ${token}`,
                'Content-Type': 'application/json'
            },
            body: JSON.stringify({
                Amount__c: payment.amount,
                Customer__c: payment.customerId,
                Reference__c: payment.reference
            })
        });

        if (res.ok) {
            return await res.json();
        }

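        if (res.status === 503 || res.status === 429) {
            // Budget spent: skip the final sleep and fall through to the error below.
            if (attempt === maxRetries) break;
            // Prefer the server's Retry-After hint (seconds); otherwise back off 1s, 2s, 4s... capped at 30s.
            const retryAfter = parseInt(res.headers.get('Retry-After') || '0', 10);
            const delayMs = retryAfter > 0 ? retryAfter * 1000 : Math.min(1000 * Math.pow(2, attempt), 30000);
            await new Promise(resolve => setTimeout(resolve, delayMs));
            continue;
        }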

        throw new Error(`Non-retryable error: ${res.status}`);
    }
    throw new Error('Retry budget exhausted');
}

The function honors the Retry-After header when present, falls back to exponential backoff with a 30-second cap when absent, and gives up after a configurable number of attempts. Non-503/429 errors are not retried because they are typically permanent (bad request, unauthorized, validation rule failure).
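When many workers hit the same 503 at once, plain exponential backoff makes them all retry at the same instants. A common refinement, which is an addition to the approach above rather than part of it, is to randomize each delay ("full jitter"). A minimal sketch, assuming the same one-second base and 30-second cap; the function name and the injectable random source are hypothetical:

```javascript
// Delay before retry number `attempt` (0-based): a random value between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable so tests can be
// deterministic. (Illustrative helper, not taken from the original code.)
function jitteredBackoffMs(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
    const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
    return Math.floor(random() * ceiling);
}
```

The trade-off is that an individual retry may fire sooner than the deterministic schedule would allow, in exchange for spreading the aggregate load over the whole window.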

Use a message queue with at-least-once delivery. Production integrations should not retry indefinitely in-process. The integration writes the payment to a durable queue (SQS, RabbitMQ, Kafka, Pub/Sub). A separate worker consumes the queue and pushes to Salesforce. If Salesforce returns 503, the worker fails the message and the queue redelivers after a configured visibility timeout.

async function enqueuePayment(payment) {
    await sqs.sendMessage({
        QueueUrl: PAYMENT_QUEUE,
        MessageBody: JSON.stringify(payment),
        MessageAttributes: {
            CustomerId: { DataType: 'String', StringValue: payment.customerId }
        }
    }).promise();
}

async function processPaymentMessage(message) {
    const payment = JSON.parse(message.Body);
    try {
        await recordPaymentWithRetry(payment, await getToken());
        await sqs.deleteMessage({ QueueUrl: PAYMENT_QUEUE, ReceiptHandle: message.ReceiptHandle }).promise();
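    } catch (err) {
        // The receive count arrives as a string, and only when the receive call requests it.
        if (Number(message.Attributes?.ApproximateReceiveCount) > 10) {
            await sendToDLQ(message, err);
            // Delete the original so the queue stops redelivering a message that is now in the DLQ.
            await sqs.deleteMessage({ QueueUrl: PAYMENT_QUEUE, ReceiptHandle: message.ReceiptHandle }).promise();
        }
        // Otherwise do nothing: the message reappears after the visibility timeout.
    }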
}

The queue absorbs the incident. When the platform recovers, the worker drains the backlog automatically. The reconciliation step is built into the system rather than handled manually.

Use the Bulk API for batch loads. For workloads that are inherently batched (nightly data syncs, large catch-up loads), the Bulk API is designed for resilient throughput. It accepts a job, processes it asynchronously on the Salesforce side, and exposes job-level status. Transient 503s do not surface as individual record failures; they are absorbed within the platform's own batch execution.

async function bulkInsertPayments(payments, token) {
    const job = await createBulkJob('Payment__c', 'insert', token);
    await uploadBatch(job.id, payments, token);
    await closeBulkJob(job.id, token);
    return pollUntilComplete(job.id, token);
}

The Bulk API trades latency (the job processes over minutes) for reliability. For batch workloads, the trade is almost always worth it.
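The helpers createBulkJob, uploadBatch, closeBulkJob, and pollUntilComplete are not defined in the example. As a rough sketch of what they might send, assuming Bulk API 2.0 ingest jobs with a CSV payload, the following builds the CSV and describes the four HTTP calls. The function names are hypothetical, and the endpoint shapes should be checked against the Bulk API documentation for the API version in use.

```javascript
// Quote a CSV field when it contains a comma, a double quote, or a line break.
function csvField(value) {
    const text = value === null || value === undefined ? '' : String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build the CSV payload: a header row of field API names, then one row per record.
function toCsv(records, fields) {
    const rows = records.map(r => fields.map(f => csvField(r[f])).join(','));
    return [fields.join(','), ...rows].join('\n') + '\n';
}

// Describe the HTTP calls of an ingest job (assumed Bulk API 2.0 layout).
// Paths are relative to the instance URL; nothing is sent here.
function bulkIngestRequests(apiVersion, objectName, jobId) {
    const base = `/services/data/v${apiVersion}/jobs/ingest`;
    return {
        create: {
            method: 'POST', path: base, contentType: 'application/json',
            body: JSON.stringify({ object: objectName, operation: 'insert', contentType: 'CSV', lineEnding: 'LF' })
        },
        upload: { method: 'PUT', path: `${base}/${jobId}/batches`, contentType: 'text/csv' },
        close: {
            method: 'PATCH', path: `${base}/${jobId}`, contentType: 'application/json',
            body: JSON.stringify({ state: 'UploadComplete' })
        },
        status: { method: 'GET', path: `${base}/${jobId}` }
    };
}
```

Each of these calls is itself an HTTP request that can receive a 503, so they belong behind the same retry logic as any other call.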

The fixed example

An HTTP client with retry and a circuit breaker, intended to sit behind a queue worker like the one shown earlier:

class SalesforceClient {
    constructor() {
        this.maxRetries = 5;
        this.baseDelay = 1000;
        this.maxDelay = 60000;
        this.circuitOpen = false;
        this.circuitOpenedAt = null;
        this.circuitTimeout = 60000;
    }

    async post(path, body, token) {
        if (this.circuitOpen) {
            if (Date.now() - this.circuitOpenedAt > this.circuitTimeout) {
                this.circuitOpen = false;
            } else {
                throw new Error('Circuit open');
            }
        }

        for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
            const res = await fetch(`${INSTANCE_URL}${path}`, {
                method: 'POST',
                headers: {
                    'Authorization': `Bearer ${token}`,
                    'Content-Type': 'application/json'
                },
                body: JSON.stringify(body)
            });

            if (res.ok) return await res.json();

            if (this.isRetryable(res.status)) {
                if (attempt === this.maxRetries) {
                    this.circuitOpen = true;
                    this.circuitOpenedAt = Date.now();
                    throw new Error(`Retryable status ${res.status} after ${attempt + 1} attempts`);
                }
                const delay = this.computeDelay(res, attempt);
                await this.sleep(delay);
                continue;
            }

            throw new Error(`Non-retryable status ${res.status}`);
        }
    }

    isRetryable(status) {
        return status === 503 || status === 429 || status === 502 || status === 504;
    }

    computeDelay(res, attempt) {
        const retryAfter = parseInt(res.headers.get('Retry-After') || '0', 10);
        if (retryAfter > 0) return retryAfter * 1000;
        return Math.min(this.baseDelay * Math.pow(2, attempt), this.maxDelay);
    }

    sleep(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
    }
}

The client retries on 429, 502, 503, and 504, respects Retry-After, falls back to capped exponential backoff, and opens a circuit breaker when the retry budget is exhausted. The circuit breaker limits the thundering-herd effect during a sustained outage by failing fast for the configured timeout window before testing the platform again.
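The breaker in SalesforceClient closes fully as soon as the timeout elapses, so every waiting caller is released at once. One possible refinement is a half-open state that lets a single probe through first. The sketch below is a standalone illustration under that assumption; the class name, the failure threshold, and the injectable clock are all hypothetical.

```javascript
// A three-state circuit breaker: closed -> open -> half-open -> closed/open.
// `now` is injectable so the transitions can be tested without waiting.
class CircuitBreaker {
    constructor({ failureThreshold = 3, cooldownMs = 60000, now = Date.now } = {}) {
        this.failureThreshold = failureThreshold;
        this.cooldownMs = cooldownMs;
        this.now = now;
        this.state = 'closed';
        this.failures = 0;
        this.openedAt = 0;
    }

    // Returns true when a request may be sent.
    allowRequest() {
        if (this.state === 'closed') return true;
        if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
            this.state = 'half-open'; // let exactly one probe through
            return true;
        }
        return false; // still cooling down, or a probe is already in flight
    }

    recordSuccess() {
        this.state = 'closed';
        this.failures = 0;
    }

    recordFailure() {
        this.failures += 1;
        // A failed probe reopens immediately; otherwise open at the threshold.
        if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
            this.state = 'open';
            this.openedAt = this.now();
        }
    }
}
```

A caller checks allowRequest() before each request and reports the outcome with recordSuccess() or recordFailure().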

Edge cases and gotchas

Idempotency on retry. Retrying a POST can create duplicate records if the original request actually succeeded but the network dropped the response. Use external-id-based upserts where possible. PATCH against /sobjects/Payment__c/External_Id__c/{ref} is idempotent: a retry against the same external id updates the existing record rather than creating a duplicate.
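To make the upsert pattern concrete, here is a small sketch that builds such a request. The function name and its options object are hypothetical, and External_Id__c stands for whichever field is marked as an External ID on the object.

```javascript
// Build an upsert request keyed on an external id field.
// Returns the URL and fetch options without sending anything.
// (Illustrative helper; the parameter names are assumptions.)
function buildUpsertRequest({ instanceUrl, apiVersion, objectName, externalIdField, externalId, fields, token }) {
    const url = `${instanceUrl}/services/data/v${apiVersion}/sobjects/${objectName}` +
        `/${externalIdField}/${encodeURIComponent(externalId)}`;
    return {
        url,
        options: {
            method: 'PATCH',
            headers: {
                'Authorization': `Bearer ${token}`,
                'Content-Type': 'application/json'
            },
            // Only the remaining fields go in the body; the key is in the URL.
            body: JSON.stringify(fields)
        }
    };
}
```

The result can be passed straight to fetch(url, options). Because the payment reference identifies the record, sending the same request twice targets the same row.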

Authentication during outages. If the integration's OAuth token expires during the retry window, the next retry will return 401 (Unauthorized) rather than 503. The retry logic must handle re-authentication and then retry the original request.
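One way to handle this is to wrap the request so that a 401 triggers a single token refresh before the request is repeated. This is a minimal sketch; withReauth, request, and refreshToken are hypothetical names for caller-supplied functions, not part of any Salesforce library.

```javascript
// Run `request(token)` and, on a 401, refresh the token once and try again.
//   request(token)  -> Promise of a fetch-style response with a numeric `status`
//   refreshToken()  -> Promise of a fresh access token
// Both are supplied by the caller (assumed shapes).
async function withReauth(request, refreshToken, token) {
    const first = await request(token);
    if (first.status !== 401) return first;
    const freshToken = await refreshToken();
    return request(freshToken); // a second 401 is returned to the caller as-is
}
```

Refreshing only once avoids a loop when the credentials themselves have been revoked; in that case the second 401 surfaces as a permanent failure.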

Long-running batches. A Bulk API job that started before an outage may stay in progress through the outage. The job status reflects the platform's view, not the integration's. The integration should poll job status with its own retry logic.

Connect API and Wave Analytics. Specialized APIs have their own load profiles. The Connect API (Chatter, Communities) and Wave Analytics (CRM Analytics) can return 503 independently of the main data API. Retry logic should not assume a single failure mode for all APIs.

Maintenance windows are scheduled. Salesforce publishes the release schedule and pod maintenance windows on trust.salesforce.com. Integration runbooks should reference the schedule and avoid heavy batch loads during known windows.

Health check endpoints. Salesforce exposes /services/data as a lightweight endpoint for checking connectivity. An integration that pings this endpoint before starting a large batch can detect a platform issue early and abort gracefully.

Defensive habits

Build retry into every integration from day one. The cost of adding retry later, after a production incident, is high. Building it in from the start is a small effort with large reliability payoff.

Use queues for high-volume integrations. In-process retries work for low-volume real-time calls. High-volume systems need durable queuing to absorb sustained outages without losing data.

Respect Retry-After. The platform's hint is more accurate than the integration's guess. When the header is present, use it. When absent, use capped exponential backoff.

Monitor end-to-end. The integration's success rate, the average latency, and the error distribution are the metrics that matter. Alerts on a sustained drop in success rate catch incidents that the platform's own status page may miss.
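A success-rate alert can be as simple as a sliding window over recent call outcomes. The sketch below is one way to do it; the class name, the 60-second window, the 95% threshold, and the 20-sample minimum are all illustrative assumptions to be tuned per integration.

```javascript
// Track call outcomes over a sliding time window and flag a sustained drop.
// `now` is injectable so the window can be tested without waiting.
class SuccessRateWindow {
    constructor(windowMs = 60000, now = Date.now) {
        this.windowMs = windowMs;
        this.now = now;
        this.samples = []; // { at, ok } in arrival order
    }

    record(ok) {
        this.samples.push({ at: this.now(), ok });
        this.prune();
    }

    // Drop samples that have aged out of the window.
    prune() {
        const cutoff = this.now() - this.windowMs;
        while (this.samples.length > 0 && this.samples[0].at <= cutoff) {
            this.samples.shift();
        }
    }

    // Fraction of successful calls in the window, or null when there is no data.
    rate() {
        this.prune();
        if (this.samples.length === 0) return null;
        const okCount = this.samples.filter(s => s.ok).length;
        return okCount / this.samples.length;
    }

    // True when there is enough data and the success rate is below the threshold.
    shouldAlert(threshold = 0.95, minSamples = 20) {
        const r = this.rate();
        return r !== null && this.samples.length >= minSamples && r < threshold;
    }
}
```

The minimum sample count keeps a single failed call during a quiet period from paging anyone.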

Test for outages. A chaos test that injects 503 responses verifies the retry logic actually works. Most teams discover their retry logic has a bug only when they encounter a real outage.

Test patterns

A unit test that simulates a transient 503 followed by success:

test('retries after 503 and succeeds', async () => {
    const responses = [
        { ok: false, status: 503, headers: new Map([['Retry-After', '1']]) },
        { ok: true, status: 200, json: () => ({ id: '001xx000000001A' }) }
    ];
    const fetchMock = jest.fn().mockImplementation(() => Promise.resolve(responses.shift()));
    global.fetch = fetchMock;

    const client = new SalesforceClient();
    const result = await client.post('/services/data/v60.0/sobjects/Payment__c', { Amount__c: 100 }, 'token');

    expect(result.id).toBe('001xx000000001A');
    expect(fetchMock).toHaveBeenCalledTimes(2);
});

The test confirms one retry and one success. Additional tests cover sustained outages, non-retryable errors, and the circuit-breaker transitions.

Diagnosing in production

When 503 responses spike:

Check trust.salesforce.com for pod-level incidents. If reported, wait for resolution.
Check the integration's rate-limit headers. If the integration is rate-limited, slow it down.
Check whether the integration's source IP changed recently. A new source IP can trip security throttles.
Verify the OAuth token is still valid. Authentication failures sometimes surface as 503 in older API versions.
Open a support case if the incident persists beyond the platform's published recovery window.

Most 503 events resolve within minutes. The right action is to wait, retry, and observe. Persistent 503s warrant a support case.

Distinguishing 503 from other 5xx codes

The 5xx family includes 500 (Internal Server Error), 502 (Bad Gateway), 503 (Service Unavailable), and 504 (Gateway Timeout). Each has a slightly different meaning, but for retry purposes they are equivalent: the integration retries with backoff and treats success on retry as the indicator that the platform recovered.

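A 500 from Salesforce typically signals an unexpected server-side condition. The integration should retry as it would for 503. A 502 or 504 is a gateway-level issue, often during a routing reconfiguration. A 503 is the explicit "I cannot serve you right now" response. Treating all four with the same retry policy keeps the integration code simple and resilient. The SalesforceClient shown earlier retries 502, 503, and 504 but not 500; adding 500 to isRetryable brings it in line with this policy, provided the write is idempotent so that a retried request cannot create a duplicate.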
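The policy described in this section, together with the 401 case from the edge cases, fits in one small classifier. A sketch, with the function name and the action labels chosen here for illustration:

```javascript
// Map an HTTP status to the action the integration should take.
// Labels ('ok', 'reauth', 'retry', 'fail') are illustrative, not a standard.
function classifyStatus(status) {
    if (status >= 200 && status < 300) return 'ok';
    if (status === 401) return 'reauth'; // refresh the token, then repeat the request
    if (status === 429 || status === 500 || status === 502 || status === 503 || status === 504) {
        return 'retry'; // transient: back off and try again
    }
    return 'fail'; // e.g. 400 or 404: retrying will not help
}
```

Keeping the decision in one function means the retry loop, the queue worker, and the tests all agree on what counts as transient.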

Quick recovery checklist

Confirm the integration retries on 503.
Confirm Retry-After is honored.
Confirm a queue or buffer absorbs sustained outages.
Verify the circuit breaker prevents runaway retries.
Add the incident to the runbook with the recovery details.

503s are part of running an integration against any cloud platform. Building the retry layer once, then maintaining it, is far cheaper than treating each incident as a novel emergency.
