Most Salesforce features pick the NLP model for you. The work is providing good data and tuning the few model-level knobs the feature exposes.
- Identify the feature and its model type
Einstein Bots uses Einstein NLU for intent classification. Einstein Case Classification uses a per-org fine-tuned classifier. Agentforce uses a foundation LLM. The configuration surface differs by feature; check the docs for the specific feature.
- Provide labeled training data
Utterances per intent for bots, historical records per target class for classification, prompt templates and grounding for generative features. The data is the actual customization. The model is fixed.
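Since the data is the customization, it pays to sanity-check its shape before uploading. A minimal sketch, assuming a bot feature whose training input is an intent-to-utterances mapping; the intent names, phrasings, and the minimum-count floor are all illustrative, not from any real org:

```python
# Hypothetical shape of bot training data: intent label -> example utterances.
training_data = {
    "check_order_status": [
        "where is my order",
        "has my package shipped yet",
        "track my order",
    ],
    "cancel_order": [
        "cancel my order",
        "I want to stop my purchase",
    ],
}

# Quick sanity check before uploading: flag intents with too few
# distinct utterances to be learnable. The floor here is illustrative;
# real deployments typically want far more examples per intent.
MIN_UTTERANCES = 3
thin_intents = [
    intent for intent, utterances in training_data.items()
    if len(set(utterances)) < MIN_UTTERANCES
]
print(thin_intents)  # ['cancel_order']
```

A check like this catches the common failure of one well-fed intent crowding out an underfed one before any training run is spent.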
- Train, validate against a holdout
Launch the build or training pipeline. The feature returns metrics. Spot-check predictions against a hand-labeled holdout set to confirm the metrics match real behavior.
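The spot-check itself is small enough to script. A sketch under stated assumptions: `predict` is a placeholder for whatever prediction call the feature exposes, and the holdout records are made up for illustration:

```python
# Hand-labeled holdout: (utterance, expected intent) pairs that the
# model never saw during training.
holdout = [
    ("where is my package", "check_order_status"),
    ("cancel order please", "cancel_order"),
    ("has it shipped", "check_order_status"),
    ("refund my purchase", "cancel_order"),
]

def predict(utterance):
    # Placeholder: in practice, call the feature's prediction endpoint.
    if "ship" in utterance or "package" in utterance:
        return "check_order_status"
    return "cancel_order"

correct = sum(1 for text, label in holdout if predict(text) == label)
accuracy = correct / len(holdout)
print(f"holdout accuracy: {accuracy:.0%}")
```

If this number disagrees badly with the metrics the training pipeline reported, trust the holdout and go back to the data.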
- Tune the confidence threshold
Use the feature's threshold setting to trade off action rate versus accuracy. Start conservative (high threshold) and lower as confidence in the model grows.
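The trade-off is easy to see concretely by sweeping candidate thresholds over a labeled sample of past predictions. A sketch with invented numbers, where each pair is (model confidence, whether the prediction was correct):

```python
# (confidence, was_correct) pairs from a labeled sample of past predictions.
sample = [
    (0.95, True), (0.91, True), (0.88, True), (0.84, False),
    (0.79, True), (0.72, False), (0.65, False), (0.55, True),
]

for threshold in (0.9, 0.8, 0.7):
    acted = [(c, ok) for c, ok in sample if c >= threshold]
    action_rate = len(acted) / len(sample)
    accuracy = sum(ok for _, ok in acted) / len(acted)
    print(f"threshold {threshold}: acts on {action_rate:.0%}, "
          f"accuracy when acting {accuracy:.0%}")
```

Raising the threshold shrinks the action rate but raises accuracy on the actions taken; that curve is what "start conservative, then lower" is walking down.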
- Set a refresh schedule
For features that retrain automatically, verify the schedule is on. For static features, calendar a manual retrain at least quarterly.
- Model picker
Some features expose a model picker (Einstein NLU versus external). Most do not. When the picker exists, pick the model that matches the data language and domain.
- Confidence threshold
The probability floor for the feature to act. Tune per intent or per use case rather than once across the whole bot.
- Language
For multi-language deployments, configure one model per language. Multilingual foundation models handle generation natively, but classification still needs per-language data.
- Refresh schedule
How often the model retrains. Built into automatic features. Calendar manually for static ones.
- Fallback behavior
What the feature does below the confidence threshold. Hand off to a human, return a default, or ask a clarifier. Design this deliberately rather than accepting the default.
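A deliberate fallback policy can be stated in a few lines. A minimal sketch, assuming a hypothetical bot handler; the threshold value and response strings are illustrative:

```python
# Illustrative fallback policy for low-confidence predictions:
# one clarifier, then a human handoff, never an infinite clarifier loop.
THRESHOLD = 0.75

def respond(intent, confidence, clarifier_already_asked=False):
    if confidence >= THRESHOLD:
        return f"route:{intent}"
    # Below the floor: ask one clarifying question, then escalate.
    if not clarifier_already_asked:
        return "ask_clarifier"
    return "handoff_to_human"

print(respond("cancel_order", 0.92))        # route:cancel_order
print(respond("cancel_order", 0.40))        # ask_clarifier
print(respond("cancel_order", 0.40, True))  # handoff_to_human
```

Writing the policy out like this, even on a whiteboard, forces the design decision the last bullet asks for instead of inheriting whatever the feature ships with.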
- Switching the underlying model rarely fixes a misroute problem. Rewriting utterances and tuning thresholds almost always does. Spend effort on data, not on the model picker.
- Language coverage is per-model. A model trained on English utterances misclassifies French ones, even if the underlying foundation model speaks both. Train per language for classification.
- A confidence threshold tuned once at launch stops being right. Customer language shifts, intent boundaries blur, and the right threshold drifts. Revisit it quarterly.
- Static NLP models degrade silently. A bot that was 92 percent accurate at launch can be 78 percent a year later and nobody notices until customer complaints pile up.
- "Foundation model" and "NLP model" are not synonyms. Mixing the terms in design docs causes confusion about whether a feature can be fine-tuned or only prompted.
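The silent-degradation pitfall can be guarded against with a trivial quarterly check: re-score a fresh hand-labeled sample and alert when accuracy falls materially below the launch baseline. A sketch with made-up numbers; the baseline and alert margin are assumptions to tune per org:

```python
# Quarterly drift check against the launch baseline. Both constants are
# illustrative and would be set from the org's own launch metrics.
LAUNCH_ACCURACY = 0.92
ALERT_MARGIN = 0.05

def needs_retrain(correct, total):
    """True when current accuracy drops more than the margin below launch."""
    current = correct / total
    return current < LAUNCH_ACCURACY - ALERT_MARGIN

print(needs_retrain(90, 100))  # False: 0.90 is within the margin
print(needs_retrain(78, 100))  # True: 0.78 is the silent-drift scenario
```

This is the cheap alternative to discovering the 92-to-78 slide through customer complaints.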