Salesforce Dictionary - Free Salesforce Glossary
How-to guide

How to use ML in a Salesforce org without overpromising

The successful pattern is: pick the right ML feature for the business decision, validate label quality, pilot on a small subset, monitor relentlessly. The failed pattern is: turn on a managed ML feature because it sounds impressive, watch users distrust the scores, turn it off six months later.

By Dipojjal Chakrabarti · Founder & Editor, Salesforce Dictionary · Last updated May 18, 2026

  1. Identify the business decision the ML feature will inform

    Write one sentence describing the decision the model output will drive. Route this case to that queue. Prioritize this lead. Forecast this opportunity. If you cannot finish the sentence, you are not ready to turn the feature on.

  2. Pick the matching Einstein feature or build custom

    Most decisions map to a shipped Einstein feature. Check Einstein Lead Scoring, Opportunity Scoring, Case Classification, etc. first. Build custom (Prediction Builder, Einstein Discovery) only when no shipped feature fits.

  3. Audit the underlying labels

    For supervised models, the training data labels drive everything. Pull a sample of historical records and confirm the labels are consistent. Inconsistent labels produce a model that learns the inconsistency.

  4. Train (or let Salesforce train) and review evaluation metrics

    Salesforce-managed models show evaluation metrics in the feature setup page. Custom models show metrics in Prediction Builder or Einstein Discovery. Review against the business decision, not against generic accuracy.

  5. Pilot on a small user or queue subset

    Two to four weeks of pilot data tells you whether the model holds up with real users; holdout evaluation alone does not. Roll out broadly only after the pilot confirms that it does.

  6. Schedule monitoring and a retraining cadence

    Set a recurring report that tracks the model output against ground truth (when ground truth becomes available). Schedule retraining monthly for new features and quarterly for stable ones. Add drift alerts.

  7. Document the model owner and review date

    Write down who owns each ML feature and when it gets reviewed next. Models accumulate without ownership and become operational liabilities.
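
The label audit in step 3 can be partly automated. The following is a minimal sketch, assuming you have exported a sample of historical records as a list of dictionaries; the field names (`product`, `reason`, `queue`) and the helper `find_conflicting_labels` are hypothetical and are not part of any Salesforce API. It groups records that look identical on the chosen fields and reports groups that carry more than one label.

```python
from collections import defaultdict


def find_conflicting_labels(records, key_fields, label_field):
    """Return {key: labels} for record groups that share key-field values
    but were given more than one distinct label."""
    labels_by_key = defaultdict(set)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        labels_by_key[key].add(record[label_field])
    return {key: labels for key, labels in labels_by_key.items() if len(labels) > 1}


# Hypothetical sample of closed cases; "queue" plays the role of the label.
sample = [
    {"product": "Billing", "reason": "Refund", "queue": "Finance"},
    {"product": "Billing", "reason": "Refund", "queue": "Tier 1"},
    {"product": "Login", "reason": "Password", "queue": "Tier 1"},
    {"product": "Login", "reason": "Password", "queue": "Tier 1"},
]

conflicts = find_conflicting_labels(sample, ["product", "reason"], "queue")
for key, labels in conflicts.items():
    print(key, "->", sorted(labels))
```

A conflicting group is not automatically a labeling error, since the chosen fields may not capture everything that justified the label. Treat the output as a list of records to review by hand.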

Key options
Feature type

Managed Einstein feature (no training required) or custom-built (Prediction Builder, Einstein Discovery). Drives setup work and flexibility.

Supervision type

Supervised (needs labeled training data), unsupervised (finds patterns without labels), reinforcement (learns from feedback). Drives data preparation.

Training cadence

Salesforce-managed models retrain automatically on a schedule. Custom models retrain only when you trigger it. Plan on monthly retraining for new models and quarterly for stable ones.

Bias detection

Einstein Discovery surfaces bias metrics across protected attributes. Critical for any model that affects users (hiring, lending, recommendations to customers).
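
To make the idea of a bias metric concrete, here is a minimal sketch of one generic check: comparing the rate of positive predictions across groups. It is an illustration only and is not how Einstein Discovery computes its metrics; the field names (`group`, `recommended`) and both helper functions are hypothetical.

```python
from collections import defaultdict


def positive_rate_by_group(rows, group_field, prediction_field):
    """Share of rows with a positive prediction, computed per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        group = row[group_field]
        totals[group] += 1
        if row[prediction_field]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Hypothetical model output: group A is recommended 3 times out of 4, group B once.
rows = (
    [{"group": "A", "recommended": True}] * 3
    + [{"group": "A", "recommended": False}]
    + [{"group": "B", "recommended": True}]
    + [{"group": "B", "recommended": False}] * 3
)

rates = positive_rate_by_group(rows, "group", "recommended")
ratio = rate_ratio(rates)
print(rates, round(ratio, 2))
```

A ratio well below 1.0 is a prompt to investigate, not a verdict. What counts as acceptable depends on the decision the model informs.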

Owner and review schedule

Named individual responsible for the model's continued accuracy and the next scheduled review date. Required to keep models from accumulating without an owner.

Gotchas
  • Label quality drives everything in supervised ML. A model trained on inconsistent labels produces inconsistent predictions. Audit labels before evaluating the model.
  • Monitoring is the most under-invested phase. A model that worked at launch can degrade for a quarter before anyone notices. Schedule monitoring before broad rollout.
  • Bias is inherited from training data. Removing biased features does not always remove bias because correlated features can encode the same signal. Bias detection is a sustained practice, not a one-time check.
  • Using LLMs for scoring tasks is usually worse than using fit-for-purpose discriminative models: LLMs tend to be slower, more expensive, and less accurate at scoring. Pick the right model type for the job.
  • Holdout evaluation rarely matches production performance. Two to four weeks of pilot data tells you what users actually experience.
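
The monitoring gotcha above can be turned into a small recurring check. The sketch below assumes you can export, per record, the week, whether the model flagged it, and the eventual ground-truth outcome. The tuple layout, the baseline of 0.75, the tolerance of 0.10, and both helper functions are hypothetical choices for illustration.

```python
from collections import defaultdict


def weekly_precision(outcomes):
    """outcomes: iterable of (week, predicted_positive, actually_positive).
    Returns, per week, the share of flagged records that turned out positive."""
    flagged = defaultdict(int)
    hits = defaultdict(int)
    for week, predicted, actual in outcomes:
        if predicted:
            flagged[week] += 1
            if actual:
                hits[week] += 1
    return {week: hits[week] / flagged[week] for week in sorted(flagged)}


def weeks_below_baseline(precision_by_week, baseline, tolerance):
    """Weeks whose precision fell more than `tolerance` below the baseline."""
    return [week for week, p in precision_by_week.items() if p < baseline - tolerance]


# Hypothetical export: week 1 matches the launch baseline, week 2 has degraded.
outcomes = [
    ("2026-W01", True, True), ("2026-W01", True, True),
    ("2026-W01", True, True), ("2026-W01", True, False),
    ("2026-W02", True, True), ("2026-W02", True, False),
    ("2026-W02", True, False), ("2026-W02", True, False),
    ("2026-W02", False, True),  # not flagged, so it does not affect precision
]

precision = weekly_precision(outcomes)
alerts = weeks_below_baseline(precision, baseline=0.75, tolerance=0.10)
print(precision, alerts)
```

Precision is only one candidate metric. Choose whichever measure reflects the business decision the model informs, and only score weeks for which ground truth has actually arrived.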

See the full Machine Learning entry

The Machine Learning entry includes the definition, a worked example, a deep dive, related terms, and a quiz.