
How do you architect Apex code for orgs with Large Data Volumes (LDV)?

LDV means objects holding millions to hundreds of millions of records. At that scale, standard Apex patterns break.

Read patterns:

  1. Selective queries are mandatory. SOQL must filter on indexed fields. Salesforce auto-indexes Id, Name, OwnerId, lookup and master-detail fields, and external ID fields. Custom indexes can be requested via Salesforce Support.
  2. Use `Database.QueryLocator` for >50k rows — iterates lazily, doesn't materialise the full result set.
  3. Pagination via `LIMIT`/`OFFSET` is an anti-pattern past 2,000 rows — Salesforce has a hard OFFSET cap of 2,000. Use last-seen-Id (keyset) pagination instead:

```apex
List<Account> page = [
    SELECT Id, Name
    FROM Account
    WHERE Id > :lastSeenId
    ORDER BY Id
    LIMIT 1000
];
```

  4. Avoid GROUP BY on huge data — aggregate queries can't be paginated past 2,000 result rows via the API, and in Apex every record the aggregate scans counts toward the 50,000-row limit. Use external warehousing (Snowflake/BigQuery) for analytics over millions of rows.
  5. Skinny Tables — for very high-frequency reads on large standard or custom objects, Salesforce can provision a denormalised skinny table. Request via Support.
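The `Database.QueryLocator` pattern from item 2 is typically paired with Batch Apex. A minimal sketch, assuming a hypothetical `LdvAccountScan` job (the object and filter are illustrative):

```apex
// Minimal Batch Apex sketch: start() returns a QueryLocator so the
// platform streams rows lazily instead of materialising them all.
// The Account object and CreatedDate filter are illustrative.
public class LdvAccountScan implements Database.Batchable<SObject> {

    public Database.QueryLocator start(Database.BatchableContext bc) {
        // Filter on an indexed field (here CreatedDate) to keep the
        // query selective against a large table.
        return Database.getQueryLocator(
            'SELECT Id, Name FROM Account WHERE CreatedDate = LAST_N_DAYS:30'
        );
    }

    public void execute(Database.BatchableContext bc, List<Account> scope) {
        for (Account a : scope) {
            // process each record; this chunk has its own governor limits
        }
    }

    public void finish(Database.BatchableContext bc) {}
}
```

Kick it off with `Database.executeBatch(new LdvAccountScan(), 2000);` — the second argument is the chunk size per `execute()` call.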

Write patterns:

  1. Batch Apex for any bulk operation past the 10,000-row DML limit of a single transaction. Each execute() chunk runs in its own transaction with fresh governor limits.
  2. Bulk-safe DML — bulkify mercilessly. Handling 200 records per trigger invocation is a guideline at 100k records and a hard requirement at 100M.
  3. Defer Sharing Calculations during massive ownership changes (Setup -> Defer Sharing Calculations; the feature is enabled by Salesforce Support).
  4. Avoid Roll-Up Summary fields on LDV objects — they recalculate on every child change. Use periodic batch jobs to refresh aggregate fields instead.
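Item 4 — replacing a Roll-Up Summary with a periodic batch — can be sketched like this. `Open_Case_Count__c` is a hypothetical custom field; the job recomputes it in chunks instead of paying a recalculation on every child change:

```apex
// Nightly batch that recomputes an aggregate field instead of relying
// on a Roll-Up Summary. Open_Case_Count__c is a hypothetical field.
public class RefreshCaseCounts implements Database.Batchable<SObject>, Schedulable {

    public Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator('SELECT Id FROM Account');
    }

    public void execute(Database.BatchableContext bc, List<Account> scope) {
        // One aggregate query per chunk, grouped by the parent Id.
        Map<Id, Integer> counts = new Map<Id, Integer>();
        for (AggregateResult ar : [
            SELECT AccountId parent, COUNT(Id) c
            FROM Case
            WHERE AccountId IN :scope AND IsClosed = false
            GROUP BY AccountId
        ]) {
            counts.put((Id) ar.get('parent'), (Integer) ar.get('c'));
        }
        for (Account a : scope) {
            a.Open_Case_Count__c = counts.containsKey(a.Id) ? counts.get(a.Id) : 0;
        }
        update scope;
    }

    public void finish(Database.BatchableContext bc) {}

    public void execute(SchedulableContext sc) {
        Database.executeBatch(new RefreshCaseCounts(), 2000);
    }
}
```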

Integration patterns:

  1. Bulk API 2.0 for inbound loads — never single-record REST calls for high volume.
  2. Change Data Capture for outbound replication — push to Snowflake/BigQuery via middleware.
  3. External objects via Salesforce Connect for "we need to see it but not store it" cases.
  4. Big Objects for archive — billions of rows of historical data, queryable by indexed key.
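Querying a Big Object (item 4) looks like ordinary SOQL, but only its index fields are filterable, in index order, with equality on all but the last. A sketch — `Order_History__b` and its fields are hypothetical:

```apex
// Big Object SOQL must filter on the index fields. Order_History__b,
// Account__c and Order_Date__c are hypothetical names.
List<Order_History__b> history = [
    SELECT Account__c, Order_Date__c, Total__c
    FROM Order_History__b
    WHERE Account__c = :accountId
      AND Order_Date__c >= :startDate
    LIMIT 100
];
```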

Sharing model:

  1. Avoid Private OWD on LDV objects if business allows — sharing recalc takes hours.
  2. Apex Managed Sharing with surgical RowCause — finer-grained than Sharing Rules; lower recalc cost.
  3. Don't add sharing rules carelessly — each rule increases recalc time.
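Item 2 — Apex Managed Sharing with a custom RowCause — in sketch form. The custom object `Project__c` and its Apex sharing reason `Project_Team__c` are hypothetical:

```apex
// Grant a user read access with a custom sharing reason. Project__Share
// is the auto-generated share object for a hypothetical Project__c;
// Project_Team__c is a hypothetical Apex sharing reason defined on it.
Project__Share grant = new Project__Share(
    ParentId      = projectId,
    UserOrGroupId = teamMemberId,
    AccessLevel   = 'Read',
    RowCause      = Schema.Project__Share.RowCause.Project_Team__c
);
// Partial-success insert so one failure doesn't roll back the rest.
Database.SaveResult sr = Database.insert(grant, false);
```

Rows inserted with a custom RowCause survive ownership changes, unlike manual shares — one reason recalc cost stays lower.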

Testing:

  1. Full Sandbox is mandatory for performance testing. Dev sandbox with 100 records doesn't reveal LDV issues.
  2. Apex tests with 200+ records to confirm bulk safety.
  3. Production-like data volume testing — load realistic data to a Full sandbox before launch.
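Item 2 can be sketched as a bulk-safety test (trigger and names are illustrative):

```apex
// Confirms trigger logic survives a 200-record DML — the chunk size
// the platform uses per trigger invocation.
@IsTest
private class AccountTriggerBulkTest {
    @IsTest
    static void handles200Records() {
        List<Account> accts = new List<Account>();
        for (Integer i = 0; i < 200; i++) {
            accts.add(new Account(Name = 'Bulk ' + i));
        }
        Test.startTest();
        insert accts; // fires the trigger once with a 200-record batch
        Test.stopTest();
        System.assertEquals(
            200, [SELECT COUNT() FROM Account WHERE Name LIKE 'Bulk %']
        );
    }
}
```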

Monitoring:

  1. Event Monitoring — slow query logs.
  2. System Overview — row counts approaching LDV thresholds.
  3. Alerting when sharing recalc queues build up.
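Event Monitoring log files are themselves queryable from Apex or the API; a sketch pulling recent Apex execution logs (assumes the Event Monitoring add-on is licensed):

```apex
// EventLogFile rows are generated per event type when Event Monitoring
// is licensed; the LogFile field holds the CSV body for download.
List<EventLogFile> logs = [
    SELECT Id, EventType, LogDate, LogFileLength
    FROM EventLogFile
    WHERE EventType = 'ApexExecution'
      AND LogDate = LAST_N_DAYS:7
    ORDER BY LogDate DESC
];
```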

LDV-aware design is architect-level work. Many decisions made early (OWD = Private, Roll-Up Summaries on hot objects) are extremely expensive to reverse at scale. Plan for LDV during initial design, not in remediation.

Why this answer works

Senior architect-level. The pattern catalog (read, write, integration, sharing, test, monitor) is comprehensive.
