Training Data
Definition
Training data is the labeled or unlabeled content a machine learning model learns from during training. For a generative AI model, training data is the corpus of text the model reads to learn patterns of language and reasoning. For a predictive AI model in Salesforce (Einstein Lead Scoring, Einstein Case Classification, Opportunity Scoring), training data is the set of historical records the model uses to learn the relationship between input features and the outcome being predicted. The shape and quality of training data determine what the model can do at inference time. Garbage history produces a confident garbage scorer.
In Salesforce, training data lives in two distinct worlds. Foundation models behind Einstein Generative AI are trained on public and licensed corpora outside the customer's org. The customer never sees or controls this layer; it is the platform's responsibility. Predictive Einstein features train on the customer's own historical records (closed-won opportunities, resolved cases, completed tasks) and that data must meet volume, freshness, and label balance thresholds before the model can be built. The thresholds are not advisory. Predictive Einstein refuses to build a model when the data does not pass them.
How training data shapes predictive and generative AI in Salesforce
Training data, inference data, feedback data
Three datasets matter for a production model and they should not be confused. Training data is what the model learned from. Inference data is what the model sees at runtime when it makes a prediction. Feedback data is what came back after the prediction was made, like whether the lead converted or the case was resolved on the suggested article. Feedback data eventually rejoins the training set on the next refresh. Mixing them up leads to bad evaluation. A model that scores 95 percent accuracy on training data and 60 percent on inference data is overfit, not good.
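The gap between training accuracy and accuracy on unseen records can be checked with a few lines. This is a minimal sketch, not part of any Salesforce feature: the function names and the 0.10 gap tolerance are assumptions chosen for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def generalization_gap(train_accuracy, unseen_accuracy):
    """How much better the model looks on data it has already seen."""
    return train_accuracy - unseen_accuracy


def looks_overfit(train_accuracy, unseen_accuracy, max_gap=0.10):
    """Flag a model whose training score beats its unseen-data score by more
    than max_gap. The 0.10 default is an illustrative assumption."""
    return generalization_gap(train_accuracy, unseen_accuracy) > max_gap
```

With the numbers from the paragraph above, `looks_overfit(0.95, 0.60)` flags the model. The unseen-data score should come from records the model never trained on, such as a holdout set or scored feedback data.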
Volume, freshness, balance: what predictive Einstein requires
Salesforce predictive features publish hard thresholds. Einstein Lead Scoring needs at least 1,000 converted leads and 1,000 unconverted leads from the past 12 months. Einstein Case Classification needs at least 400 records per target class. Einstein Opportunity Scoring needs at least 200 closed-won and 200 closed-lost opportunities per record type. Below these numbers, the feature reports insufficient data instead of producing a model. The thresholds protect customers from publishing a model that has not learned anything stable.
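A pre-flight check against these numbers is easy to script. The sketch below hard-codes the per-class minimums exactly as quoted in this article; the dictionary keys and function name are hypothetical, and the figures should be confirmed against current Salesforce Help before anyone relies on them. The caller is assumed to have already filtered the counts to the right window (and, for opportunities, to a single record type).

```python
# Per-class minimums as quoted in this article. These are NOT read from
# Salesforce; verify them against current Salesforce Help.
ARTICLE_THRESHOLDS = {
    "lead_scoring": 1000,         # converted and unconverted, past 12 months
    "case_classification": 400,   # records per target class
    "opportunity_scoring": 200,   # closed-won and closed-lost, per record type
}


def classes_below_threshold(feature, class_counts):
    """Return {class_label: shortfall} for every class under the minimum.

    class_counts maps an outcome label to the number of historical records
    with that outcome, already filtered to the relevant window.
    """
    minimum = ARTICLE_THRESHOLDS[feature]
    return {
        label: minimum - count
        for label, count in class_counts.items()
        if count < minimum
    }
```

An empty result means every class clears the bar. For example, `classes_below_threshold("lead_scoring", {"converted": 640, "unconverted": 5200})` reports a shortfall of 360 converted leads.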
Label leakage and lookahead bias
Label leakage happens when a feature in the training set encodes the answer. An obvious case: training Opportunity Scoring with the field Closed Won Reason. A subtle case: training Lead Scoring with the field Last Activity Date, because converted leads were touched more recently than unconverted ones. The model learns the leak instead of the underlying pattern, scores beautifully in evaluation, and falls apart in production because the leaky field is not populated for new records. Audit feature lists for any column that could only have a value after the outcome occurred.
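One cheap first pass for the obvious kind of leak is to compare how often each field is populated per outcome. A field that is filled in for nearly every won record and almost never for lost ones probably only gets a value after the outcome. This is a heuristic sketch under that assumption: the function names and the 0.8 spread cutoff are illustrative, and it will not catch subtle leaks such as Last Activity Date, where the field is populated for both classes but its values differ.

```python
def fill_rate_by_outcome(records, field, outcome_field):
    """Return {outcome_value: share of records where `field` is populated}."""
    totals, filled = {}, {}
    for record in records:
        outcome = record[outcome_field]
        totals[outcome] = totals.get(outcome, 0) + 1
        # Treat None and empty string as "not populated".
        if record.get(field) not in (None, ""):
            filled[outcome] = filled.get(outcome, 0) + 1
    return {outcome: filled.get(outcome, 0) / n for outcome, n in totals.items()}


def suspicious_fields(records, candidate_fields, outcome_field, min_spread=0.8):
    """List fields whose fill rate differs across outcomes by min_spread or more.

    The 0.8 default is an illustrative cutoff, not a published rule.
    """
    flagged = []
    for field in candidate_fields:
        rates = fill_rate_by_outcome(records, field, outcome_field)
        if len(rates) > 1 and max(rates.values()) - min(rates.values()) >= min_spread:
            flagged.append(field)
    return flagged
```

Anything this flags still needs a human decision. The question from the paragraph above remains the real test: could the field only have a value after the outcome occurred?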
Class imbalance and what to do about it
Real-world outcomes are imbalanced. Maybe 5 percent of leads convert and 95 percent do not. A naive model can score 95 percent accuracy by predicting "never converts" for everyone. Salesforce predictive Einstein handles this internally with stratified sampling and weighted training, but only above the volume thresholds. Below the threshold the imbalance dominates and the feature refuses to train. For Bring Your Own Model paths, the team owns the imbalance handling: resampling, class weights, or threshold tuning.
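For a Bring Your Own Model path, the two numbers worth computing first are the accuracy of the do-nothing baseline and a set of class weights. The sketch below uses the common "balanced" weighting formula, n_samples / (n_classes * class_count). It illustrates the idea only and is not a description of what native Einstein does internally.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, common classes below 1, so each
    class contributes equally to a weighted training loss.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}
```

On a 5 percent conversion rate, the baseline accuracy is 0.95 and the converted class gets a weight of 10. Any model worth deploying has to beat that baseline on a metric that does not reward ignoring the rare class, such as precision and recall on converted leads.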
Foundation model training (the layer you do not control)
The LLMs behind Agentforce and Einstein Generative AI are trained on text the customer never sees. Salesforce uses foundation models from several providers (in-house, OpenAI, Anthropic), and the training corpus is the provider's responsibility. Customer data does not enter the training set; that is the contractual guarantee enforced by the Einstein Trust Layer. This means foundation models cannot be customized to a specific industry by retraining from inside Salesforce. Customization happens at the prompt and grounding layers, not the weights layer.
Bring Your Own Model and customer-controlled training
For predictive use cases that fall outside the standard Einstein features, Salesforce supports Bring Your Own Model via Data Cloud and Einstein Studio. The customer trains a model in their own environment (Databricks, Vertex, SageMaker), registers it in Einstein Studio, and exposes it through prompt templates or flows. Training data control is full in this path, and so is the responsibility for validation, monitoring, and refresh. Native Einstein hides the training pipeline. BYOM exposes it.
Refreshing training data over time
Models go stale. Customer behavior changes, products evolve, sales motions shift. A model trained 18 months ago on pre-pandemic data is not predicting the current world. Standard Einstein features refresh on a fixed schedule (typically weekly for case features, monthly for lead and opportunity features). The refresh re-runs the training pipeline against the latest data window. Custom models need an equivalent schedule. The right cadence depends on how fast the underlying behavior changes, not on how often the team wants new dashboards.
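For a custom model, one simple signal that behavior has moved is a shift in the outcome rate between the training window and recent records. The sketch below is an assumption-laden starting point: the function names and the 0.02 tolerance are hypothetical, and a shift in the outcome rate is only one of several kinds of drift.

```python
def positive_rate(labels):
    """Share of records with a positive outcome (for example, converted)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for y in labels if y) / len(labels)


def outcome_rate_shift(training_labels, recent_labels):
    """Recent positive rate minus the positive rate the model trained on."""
    return positive_rate(recent_labels) - positive_rate(training_labels)


def needs_refresh_review(training_labels, recent_labels, tolerance=0.02):
    """True when the outcome rate moved by more than `tolerance` either way.

    The 0.02 default is illustrative; pick a tolerance that matches how
    noisy the outcome rate normally is in the org.
    """
    return abs(outcome_rate_shift(training_labels, recent_labels)) > tolerance
```

A check like this ties the refresh cadence to observed change rather than to the calendar. If the conversion rate the model learned was 5 percent and recent leads convert at 10 percent, the model is describing an older world.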
How to prepare training data for an Einstein predictive feature
Most predictive Einstein features handle the training pipeline themselves. The work for the admin or data team is upstream: making sure the historical data meets the thresholds and is clean enough to learn from.
- Identify the prediction target
Pick the field that represents the outcome. For Lead Scoring it is Converted. For Case Classification it is the field to predict (Type, Reason, Priority). For Opportunity Scoring it is StageName. The target must be a known, populated field on historical records.
- Pull the historical window
Run a report or query against the relevant object filtered to the training period (typically the past 12 to 24 months). Count records by outcome class. Compare against the published threshold for the feature.
- Audit for label leakage
Walk through the candidate feature list. For each field, ask whether it could only have a value after the outcome happened. Exclude any such field from training. The list of excluded fields belongs in the feature documentation.
- Check freshness and balance
Verify that records are recent enough (most features look back 12 months by default). Verify the class balance is no worse than 95-5. If worse, escalate to a custom path because native Einstein may not handle it.
- Build the model and validate against a holdout
Launch the predictive feature build. After completion, review the model card or scorecard the feature publishes. Compare against a manually held-out set of records and confirm the metrics match.
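The manual holdout in the last step can be carved out before the build with a stratified split, so the held-out records keep the same class balance as the full history. This is a generic sketch, not a Salesforce API: `stratified_holdout`, its parameters, and the 20 percent default are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_holdout(records, label_of, holdout_fraction=0.2, seed=0):
    """Split records into (train, holdout), preserving the class mix.

    label_of is a function that returns the outcome label of a record.
    A fixed seed keeps the split reproducible between runs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    train, holdout = [], []
    for members in by_class.values():
        shuffled = members[:]          # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * holdout_fraction)
        holdout.extend(shuffled[:cut])
        train.extend(shuffled[cut:])
    return train, holdout
```

With 100 historical leads of which 10 converted, a 20 percent holdout contains 20 records and 2 of them are converted. Metrics computed on that holdout are the ones to compare with the scorecard the feature publishes.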
Training window: The look-back period the training pipeline uses. Default is 12 months for most features. Shorter windows react faster to change but lose signal in low-volume orgs.
Excluded fields: The feature list the model is told to ignore. Used to remove leaky or sensitive columns. Configured at feature build time.
Refresh cadence: How often the model retrains. Weekly for case features, monthly for lead and opportunity features by default. Custom paths set their own cadence.
Holdout set: The portion of training data set aside for evaluation. Salesforce manages this for native features. BYOM owners set it themselves.
Feedback loop: How resolved cases or converted leads flow back into the next training cycle. Configured implicitly by the field being populated after the outcome.
- Below the published volume threshold, predictive Einstein refuses to build a model and reports insufficient data instead. Check the threshold before promising a launch date.
- Label leakage produces a model with great evaluation metrics and terrible production performance. Audit features before, not after, the first complaint.
- Foundation model training data is not yours to inspect or customize. Customization happens via grounding and prompts, not by retraining the weights.
- Class imbalance worse than 95-5 confuses native features. Worse than 99-1 usually needs a custom path with explicit imbalance handling.
- Models go stale. A 24-month-old model on a 12-month sales cycle is predicting a world that does not exist. Schedule a refresh review at least annually.
Trust & references
Cross-checked against the following references.
- Einstein Lead Scoring data requirements (Salesforce Help)
- Einstein Case Classification (Salesforce Help)
Straight from the source - Salesforce's reference material on Training Data.
- Einstein for Sales (Salesforce Help)
- Einstein Studio (Salesforce Help)
About the Author
Dipojjal Chakrabarti is a B2C Solution Architect with 29 Salesforce certifications and over 13 years in the Salesforce ecosystem. He runs salesforcedictionary.com to help admins, developers, architects, and cert/interview candidates sharpen their fundamentals.