How-to guide

How to prepare training data for an Einstein predictive feature

Most predictive Einstein features handle the training pipeline themselves. The work for the admin or data team is upstream: making sure the historical data meets the thresholds and is clean enough to learn from.

By Dipojjal Chakrabarti · Founder & Editor, Salesforce Dictionary · Last updated May 16, 2026

  1. Identify the prediction target

    Pick the field that represents the outcome. For Lead Scoring it is Converted. For Case Classification it is the field to predict (Type, Reason, Priority). For Opportunity Scoring it is StageName. The target must be a known, populated field on historical records.

  2. Pull the historical window

    Run a report or query against the relevant object, filtered to the training period (typically the past 12 to 24 months). Count records by outcome class and compare the totals against the published threshold for the feature; a sketch of this check appears after this list.

  3. Audit for label leakage

    Walk through the candidate feature list. For each field, ask whether it could only have received a value after the outcome happened. Exclude any such field from training, and record the excluded fields in the feature documentation; a rough audit heuristic is sketched after this list.

  4. Check freshness and balance

    Verify that records are recent enough (most features look back 12 months by default) and that the class balance is no worse than 95-5; the same sketch after this list covers both checks. If the balance is worse, escalate to a custom path, because native Einstein may not handle it.

  5. Build the model and validate against a holdout

    Launch the predictive feature build. After completion, review the model card or scorecard the feature publishes. Compare against a manually held-out set of records and confirm the metrics match.
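
A rough illustration of steps 2 and 4, assuming the historical records have been exported to a CSV. The file name (lead_history.csv), the CreatedDate and Converted column names, and the MIN_RECORDS value are placeholders rather than published Salesforce thresholds; substitute the real names and numbers for your feature.

    # Sketch: check volume, class balance, and freshness of an exported training set.
    import pandas as pd

    MIN_RECORDS = 1000         # placeholder: look up the published threshold for your feature
    MAX_MAJORITY_SHARE = 0.95  # flag anything worse than a 95-5 split
    WINDOW_MONTHS = 24         # training window to audit

    df = pd.read_csv("lead_history.csv")                   # hypothetical export
    created = pd.to_datetime(df["CreatedDate"], utc=True)  # hypothetical column name
    cutoff = pd.Timestamp.now(tz="UTC") - pd.DateOffset(months=WINDOW_MONTHS)

    window = df[created >= cutoff]
    counts = window["Converted"].value_counts()            # hypothetical outcome column
    total = int(counts.sum())
    majority_share = counts.max() / total if total else 0.0

    print(f"Records in the last {WINDOW_MONTHS} months: {total} (need {MIN_RECORDS})")
    print(counts.to_string())
    print(f"Most recent record: {created.max()}")
    print(f"Majority class share: {majority_share:.1%}")

    if total < MIN_RECORDS:
        print("Below the volume threshold: the feature will likely refuse to build.")
    if majority_share > MAX_MAJORITY_SHARE:
        print("Balance is worse than 95-5: consider a custom path.")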

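For step 3, one cheap heuristic under the same assumptions: a field that is rarely filled on records without the outcome but almost always filled on records that have it probably receives its value after the outcome, and deserves a manual look before training. The gap threshold below is arbitrary; a flag is a leakage smell, not proof.

    # Sketch: flag candidate fields whose fill rate jumps once the outcome is known.
    import pandas as pd

    df = pd.read_csv("lead_history.csv")                  # hypothetical export
    outcome = df["Converted"].astype(bool)                # assumes a boolean/0-1 outcome column
    skip = {"Id", "Converted", "CreatedDate"}
    candidates = [c for c in df.columns if c not in skip]

    for col in candidates:
        fill_with_outcome = df.loc[outcome, col].notna().mean()
        fill_without = df.loc[~outcome, col].notna().mean()
        if fill_with_outcome - fill_without > 0.5:        # arbitrary gap; tune to taste
            print(f"Possible leakage: {col} "
                  f"(filled {fill_with_outcome:.0%} with the outcome vs {fill_without:.0%} without)")
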
Key options
Historical window

The look-back period the training pipeline uses. Default is 12 months for most features. Shorter windows react faster to change but lose signal in low-volume orgs.
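
To make that tradeoff concrete, a quick count (same hypothetical CSV export as above) shows whether a shorter window would still clear the volume threshold in a given org.

    # Sketch: compare how much training data survives different look-back windows.
    import pandas as pd

    df = pd.read_csv("lead_history.csv")                  # hypothetical export
    created = pd.to_datetime(df["CreatedDate"], utc=True)

    for months in (6, 12, 24):
        cutoff = pd.Timestamp.now(tz="UTC") - pd.DateOffset(months=months)
        print(f"{months:>2}-month window: {(created >= cutoff).sum()} records")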

Excluded fields

The list of fields the model is told to ignore, used to remove leaky or sensitive columns. Configured at feature build time.

Refresh cadence

How often the model retrains. Weekly for case features, monthly for lead and opportunity features by default. Custom paths set their own cadence.

Holdout set

The portion of training data set aside for evaluation. Salesforce manages this for native features. BYOM owners set it themselves.
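
For the manual check in step 5, or for a BYOM owner managing the split directly, the comparison can be as simple as the sketch below. The score export and the Converted and Score column names are assumptions about your setup, not a Salesforce-provided file.

    # Sketch: compare predicted scores against actual outcomes on a held-out slice.
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    holdout = pd.read_csv("holdout_scores.csv")   # hypothetical export: score plus real outcome
    auc = roc_auc_score(holdout["Converted"].astype(int), holdout["Score"])
    print(f"Holdout AUC: {auc:.3f} (compare with the feature's published scorecard)")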

Feedback loop

How resolved cases or converted leads flow back into the next training cycle. There is no explicit setting; the loop exists because the outcome field is populated before the next retrain.

Gotchas
  • Below the published volume threshold, predictive Einstein silently refuses to build a model. Check the threshold before promising a launch date.
  • Label leakage produces a model with great evaluation metrics and terrible production performance. Audit features before, not after, the first complaint.
  • Foundation model training data is not yours to inspect or customize. Customization happens via grounding and prompts, not by retraining the weights.
  • Class imbalance worse than 95-5 confuses native features. Worse than 99-1 usually needs a custom path with explicit imbalance handling (sketched after this list).
  • Models go stale. A 24-month-old model on a 12-month sales cycle is predicting a world that does not exist. Schedule a refresh review at least annually.
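
On the imbalance point above, "explicit imbalance handling" on a custom path usually means reweighting the rare class so the model cannot win by always predicting the majority outcome. The sketch below shows one common way to do that with scikit-learn; the export file and feature columns are placeholders, and this is an illustration rather than a Salesforce-prescribed recipe.

    # Sketch: a custom-path model that reweights the rare class.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("lead_history.csv")                  # hypothetical export
    X = df[["NumEmails", "NumCalls"]].fillna(0)           # placeholder feature columns
    y = df["Converted"].astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # class_weight="balanced" reweights examples inversely to class frequency,
    # so the rare positive class is not drowned out by the majority class.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")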

See the full Training Data entry

Training Data includes the definition, worked example, deep dive, related terms, and a quiz.