The lifecycle matters more than the model. A great model deployed without a monitoring and retirement plan becomes a future incident. The sequence below makes the operational decisions explicit before training starts, which is where most teams underinvest.
- Define the business decision the model will inform
Write one sentence: "this model will let us decide X." If you cannot finish that sentence, you are not ready to train. Examples: route case Y to queue Z, score opportunity W on win probability.
- Pick the model type that matches the decision
Predictive for probabilities, classification for categories, embedding for similarity, generative for text, recommendation for ranked lists. The type drives training data needs and evaluation metrics.
- Audit training data for leakage
For every feature in the training data, ask: was this value knowable at the moment we want the prediction made? Features filled in after the outcome was decided are leakage and need to be removed.
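The audit question above can be sketched as a timestamp comparison. This is a minimal illustration, assuming you can recover when each feature value was recorded; the function name `find_leaky_features`, the field names, and the dates are all hypothetical.

```python
from datetime import datetime


def find_leaky_features(feature_recorded_at, prediction_time):
    """Return names of features whose value was recorded after prediction_time.

    feature_recorded_at maps a feature name to the time its value was filled in.
    Anything recorded after the prediction moment was not knowable then.
    """
    return sorted(
        name
        for name, recorded_at in feature_recorded_at.items()
        if recorded_at > prediction_time
    )


# Hypothetical opportunity record: when each field was last filled in.
recorded_at = {
    "lead_source": datetime(2024, 1, 5),
    "account_industry": datetime(2024, 1, 5),
    "close_reason": datetime(2024, 3, 20),    # filled in after the deal closed
    "final_discount": datetime(2024, 3, 18),  # negotiated after prediction time
}
prediction_time = datetime(2024, 2, 1)

leaky = find_leaky_features(recorded_at, prediction_time)
print(leaky)  # ['close_reason', 'final_discount']
```

In practice the recorded-at times would likely come from field history or audit data rather than a hand-written dictionary.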
- Pick evaluation metrics that map to the business decision
Skip generic accuracy. Use deflection rate, win rate, resolution time, or the closest direct measurement of the business decision. If you iterate on a faster proxy metric, validate that the proxy correlates with the metric you actually care about.
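One way to check that a proxy tracks the business metric is a plain Pearson correlation over matched observations, for example one pair per pilot week. The sketch below is illustrative only: the `pearson` helper and the weekly numbers are made up, and a correlation over a handful of points is weak evidence at best.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly values: offline proxy score vs. observed deflection rate.
proxy_scores = [0.71, 0.74, 0.78, 0.80, 0.83]
deflection_rates = [0.18, 0.20, 0.23, 0.22, 0.26]

r = pearson(proxy_scores, deflection_rates)
print(f"proxy vs. business metric correlation: {r:.2f}")
```

A high correlation supports iterating on the proxy; a low or negative one suggests the proxy is measuring something other than the decision.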
- Train, evaluate, and pilot before broad rollout
Train in a sandbox or scratch environment. Evaluate on a holdout set. Pilot on a small user or queue subset for two to four weeks before broad rollout. Pilot data is the only honest evaluation.
- Wire monitoring before broad rollout
Set up a recurring report that tracks prediction accuracy against ground truth, alerts on threshold breaches, and shows distribution shift in inputs. Without monitoring, drift goes unnoticed until users complain.
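The recurring report described above could look roughly like the following. It is a sketch under stated assumptions: inputs are categorical, shift is measured with the population stability index (PSI), and the thresholds `min_accuracy=0.8` and `max_psi=0.2` are illustrative defaults, not values taken from any Salesforce product.

```python
import math
from collections import Counter


def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the observed outcome."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("need non-empty series of equal length")
    return sum(p == t for p, t in zip(predictions, ground_truth)) / len(predictions)


def psi(baseline, current, floor=1e-4):
    """Population stability index between two samples of a categorical input."""
    if not baseline or not current:
        raise ValueError("need non-empty samples")
    base_counts, cur_counts = Counter(baseline), Counter(current)
    total = 0.0
    for category in set(baseline) | set(current):
        # Floor the shares so a category missing from one sample does not hit log(0).
        b = max(base_counts[category] / len(baseline), floor)
        c = max(cur_counts[category] / len(current), floor)
        total += (c - b) * math.log(c / b)
    return total


def monitoring_report(predictions, ground_truth, baseline_inputs, current_inputs,
                      min_accuracy=0.8, max_psi=0.2):
    """Summarize accuracy and input shift, and list any threshold breaches."""
    acc = accuracy(predictions, ground_truth)
    shift = psi(baseline_inputs, current_inputs)
    alerts = []
    if acc < min_accuracy:
        alerts.append("accuracy below threshold")
    if shift > max_psi:
        alerts.append("input distribution shift")
    return {"accuracy": acc, "psi": shift, "alerts": alerts}


# Hypothetical example: the channel mix moved from mostly email to an even split.
report = monitoring_report(
    predictions=[1, 0, 1, 1],
    ground_truth=[1, 1, 0, 1],
    baseline_inputs=["email"] * 90 + ["phone"] * 10,
    current_inputs=["email"] * 50 + ["phone"] * 50,
)
print(report["alerts"])  # ['accuracy below threshold', 'input distribution shift']
```

Numeric inputs would need to be binned before applying the same PSI calculation.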
- Register the model with owner and review cadence
Add the model to a registry (Model Manager for Einstein, your own spreadsheet for custom models). List the owner, last training date, performance metrics, and next review date. The registry is the discipline that prevents the model from becoming someone else's problem in two years.
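For custom models, the registry fields listed above fit in a very small data structure. The following is a hypothetical sketch, not the schema of Model Manager or any other tool; `ModelRecord`, `overdue_reviews`, and the sample entries are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ModelRecord:
    """One registry row: who owns the model and when it is next reviewed."""
    name: str
    owner: str
    last_trained: date
    next_review: date
    metrics: dict = field(default_factory=dict)


def overdue_reviews(registry, today):
    """Return the names of models whose review date has already passed."""
    return [record.name for record in registry if record.next_review < today]


# Hypothetical registry entries.
registry = [
    ModelRecord("case-routing", "support-ops", date(2024, 1, 10), date(2024, 4, 10),
                {"deflection_rate": 0.24}),
    ModelRecord("win-probability", "sales-analytics", date(2024, 3, 1), date(2024, 9, 1),
                {"win_rate_lift": 0.05}),
]

print(overdue_reviews(registry, today=date(2024, 6, 1)))  # ['case-routing']
```

Running a check like this on a schedule is one way to turn the review date into an actual prompt for the owner.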
Model source: Salesforce-managed, customer-trained, or brokered third-party. The source drives the governance and cost model.
Trust Layer: masking, residency, and logging rules applied to every model call. Non-optional for managed and brokered models; custom models require equivalent compliance work.
Retraining cadence: a fixed schedule (monthly or quarterly) plus on-demand retraining when drift is detected. Continuous retraining is rare in production.
Evaluation metrics: a business decision metric plus a proxy metric for fast iteration. Validate that the two correlate.
Owner: a named individual or team responsible for the model's training, evaluation, monitoring, and retirement. Required for registry hygiene.
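The combination of a fixed retraining cadence and drift-triggered retraining reduces to a simple rule. This is a minimal sketch; the function name `should_retrain` and the 90-day default (roughly quarterly) are assumptions made for illustration.

```python
from datetime import date, timedelta


def should_retrain(last_trained, today, cadence_days=90, drift_detected=False):
    """Retrain when the fixed cadence has elapsed or when drift has been flagged."""
    cadence_due = today - last_trained >= timedelta(days=cadence_days)
    return cadence_due or drift_detected


last_trained = date(2024, 1, 1)

# Inside the cadence window with no drift: no retraining needed.
print(should_retrain(last_trained, date(2024, 2, 1)))                       # False
# Drift detected early: retrain on demand.
print(should_retrain(last_trained, date(2024, 2, 1), drift_detected=True))  # True
# Cadence elapsed: retrain on schedule.
print(should_retrain(last_trained, date(2024, 4, 15)))                      # True
```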
- Data leakage in training features is the single most common cause of models that look great in evaluation and fail in production. Audit every feature by asking, "was this knowable at prediction time?"
- Academic accuracy metrics rarely match business goals. A 90 percent accurate model that fails on the 10 percent of high-impact cases is worse than a 75 percent accurate model that gets the important ones right.
- Drift is inevitable. Models trained today tend to perform worse next quarter, and worse still next year. Monitoring is the only way to know when to retrain.
- Models that should have been retired accumulate in production. Custom prediction connections, old Einstein Discovery stories, and brokered models from former vendors become liabilities. A registry with review dates prevents that accumulation.
- The Trust Layer is non-optional for Salesforce-managed and brokered models, but does not automatically apply to custom-trained models. Do the equivalent compliance work for custom models explicitly.
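The gap between academic accuracy and business value noted above can be made concrete by weighting each case by its impact. The numbers below are invented for illustration: 18 low-impact cases with weight 1 and 2 high-impact cases with weight 20.

```python
def accuracy(correct):
    """Plain accuracy: every case counts the same."""
    return sum(correct) / len(correct)


def weighted_accuracy(correct, weights):
    """Accuracy where each case counts in proportion to its business impact."""
    if len(correct) != len(weights):
        raise ValueError("need one weight per case")
    return sum(w for c, w in zip(correct, weights) if c) / sum(weights)


# Hypothetical impact weights: 18 routine cases, then 2 high-impact cases.
weights = [1] * 18 + [20] * 2

# Model A gets every routine case right but misses both high-impact cases.
model_a = [True] * 18 + [False] * 2
# Model B misses 5 routine cases but gets both high-impact cases right.
model_b = [True] * 13 + [False] * 5 + [True] * 2

print(f"A: accuracy={accuracy(model_a):.2f}, weighted={weighted_accuracy(model_a, weights):.2f}")
print(f"B: accuracy={accuracy(model_b):.2f}, weighted={weighted_accuracy(model_b, weights):.2f}")
# A: accuracy=0.90, weighted=0.31
# B: accuracy=0.75, weighted=0.91
```

Under these assumed weights the 75 percent model is clearly the better choice, which is the point of picking metrics that map to the decision.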