NLP Model
Definition
An NLP model (natural language processing model) is a machine learning model trained to understand or generate human language. In Salesforce, NLP models power the language understanding behind Einstein Bots, the intent classifier behind Service Cloud routing features, the case classification engine that fills in Type and Reason fields automatically, and the foundation language layer that Agentforce builds on top of. The term covers both classical NLP models (intent classifiers, named entity recognizers, sentiment analyzers) and large language models (the transformer-based generative models behind Einstein Generative AI).
In an admin's day-to-day work, "NLP model" usually refers to the configurable language model behind a bot or text classification feature. The admin picks the model in Setup, points it at a labeled dataset (utterances for bots, historical cases for classification), and the platform handles training. The model produces a probability distribution over the possible labels, and the feature consumes the top match. The model is rarely the limiting factor in a project. Data quality and label design almost always are.
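The "probability distribution over possible labels" can be illustrated with a minimal sketch. It assumes the classifier emits one raw score per label and that a softmax turns those scores into probabilities; the intent names and score values are hypothetical, and this is not a description of how any Salesforce feature is implemented internally.

```python
import math

def softmax(scores):
    """Convert raw per-label scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(score - peak) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def top_match(scores):
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    label = max(probs, key=probs.get)
    return label, probs[label]

# Hypothetical raw scores for a single customer utterance.
raw_scores = {"Reset_Password": 2.0, "Check_Order": 0.5, "Cancel_Order": -1.0}
label, confidence = top_match(raw_scores)
print(label, round(confidence, 2))
```

The feature only ever sees the winning label and its probability, which is why two intents with overlapping utterances hurt so much: they split the probability mass and push the top match down.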
The NLP model layer in the Salesforce AI stack
Classification, sequence labeling, generation
NLP models fall into three families based on what they output. Classification models pick one label from a fixed set (intent classification, case priority, sentiment positive or negative). Sequence labeling models tag each token in the input (named entity recognition, part-of-speech tagging). Generation models produce free text (summarization, reply drafting, agent responses). Salesforce features use all three. Einstein Bots use classification for intent matching. Einstein Activity Capture uses sequence labeling to extract people and topics from email. Agentforce uses generation for everything customer-facing.
Einstein NLU and its place in the stack
Einstein NLU is the default natural language understanding model behind Einstein Bots. It is a classification model fine-tuned for intent prediction. The customer never sees the weights; they configure the model by providing utterances per intent and the platform handles training and serving. The model is multi-tenant, which means improvements to the underlying architecture roll out to all customers in a release, but each customer's intents and utterances stay isolated. Einstein NLU is one option among several. Salesforce also exposes external NLU options for multi-language and specialized domains.
Language coverage and the language-per-model rule
Most NLP models are language-specific. A model trained on English utterances cannot reliably classify French utterances; the words and structures it learned do not transfer. For multi-language Einstein Bots, the standard pattern is one bot per language, each with its own utterance set in the target language. Some foundation models support multilingual generation natively, which means an Agentforce agent can respond in the customer's language without a separate model per language. The classification layer underneath still needs language-aware training.
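The one-bot-per-language pattern amounts to a routing table keyed by language. The sketch below is illustrative only: the two keyword-based functions are stand-ins for separately trained intent classifiers, and the names `BOTS`, `route`, and `Handoff_To_Agent` are assumptions, not platform identifiers.

```python
def english_bot(text):
    """Stand-in for a classifier trained on English utterances."""
    return "Reset_Password" if "password" in text.lower() else "Unknown"

def french_bot(text):
    """Stand-in for a classifier trained on French utterances."""
    return "Reset_Password" if "mot de passe" in text.lower() else "Unknown"

# Hypothetical registry: one classifier per supported language.
BOTS = {"en": english_bot, "fr": french_bot}

def route(text, language):
    """Send the utterance to the bot for its language, or hand off if none exists."""
    bot = BOTS.get(language)
    if bot is None:
        return "Handoff_To_Agent"
    return bot(text)

french_utterance = "J'ai oublié mon mot de passe"
print(route(french_utterance, "fr"))  # handled by the French classifier
print(english_bot(french_utterance))  # the English classifier does not recognize it
```

Sending the French utterance to the English stand-in returns `Unknown`, which mirrors the point above: what one language's model learned does not transfer to another.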
Confidence scores and threshold tuning
Every classification model returns a confidence score alongside the predicted label. The score is a probability between 0 and 1. The confidence threshold is the floor: above it, the feature acts on the prediction; below it, the feature falls back (escalates to a human, returns a default response, or asks a clarifying question). Tuning the threshold is one of the few model-level controls customers actually have. Lower it and the bot acts more often, including on weak matches. Raise it and the bot falls back more often, including on solid matches.
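A small sketch makes the trade-off concrete. The predictions and intent names are made up, and treating a score exactly equal to the threshold as "act" is an assumption; the behavior at the boundary depends on the feature.

```python
def decide(label, confidence, threshold):
    """Act on the prediction at or above the threshold, otherwise fall back."""
    if confidence >= threshold:
        return ("act", label)
    return ("fallback", None)

def action_rate(predictions, threshold):
    """Fraction of predictions the feature would act on at this threshold."""
    acted = sum(
        1 for label, confidence in predictions
        if decide(label, confidence, threshold)[0] == "act"
    )
    return acted / len(predictions)

# Hypothetical (label, confidence) pairs from three conversations.
predictions = [("Reset_Password", 0.93), ("Check_Order", 0.71), ("Cancel_Order", 0.48)]

for threshold in (0.9, 0.6, 0.4):
    print(threshold, action_rate(predictions, threshold))
```

At 0.9 the bot acts on one of the three predictions, at 0.6 on two, and at 0.4 on all three, including the weak 0.48 match.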
Fine-tuning versus prompting
For classical NLP models, customization happens through fine-tuning: the customer's labeled data adjusts the model weights to fit the customer's intents. For large language models, customization happens primarily through prompting: the customer writes a prompt template that instructs the model to behave a certain way at inference time. Salesforce predictive features are fine-tuned per customer behind the scenes. Generative features are prompted, with grounding supplying the customer-specific facts. The two customization paths look very different in the configuration UI, even though both end with a model output that the feature consumes.
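The prompting path can be sketched as plain string templating: the instructions stay fixed and the grounding facts are merged in at inference time. This uses Python's standard `string.Template` purely for illustration; it is not Prompt Builder syntax, and the company name, record fields, and message are hypothetical.

```python
from string import Template

# Fixed instructions with placeholders for the grounding data.
PROMPT = Template(
    "You are a support assistant for $company.\n"
    "Answer using only the facts below.\n"
    "Facts:\n$facts\n"
    "Customer message: $message\n"
)

def build_prompt(company, record, message):
    """Merge customer-specific facts (grounding) into the fixed instructions."""
    facts = "\n".join(f"- {name}: {value}" for name, value in record.items())
    return PROMPT.substitute(company=company, facts=facts, message=message)

# Hypothetical record pulled from the org at request time.
record = {"Order status": "Shipped", "Carrier": "Example Freight"}
prompt = build_prompt("Example Co", record, "Where is my order?")
print(prompt)
```

No model weights change here. The only customer-specific parts are the template wording and the record data, which is the key difference from fine-tuning.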
NLP model versus foundation model
The terms overlap and people use them interchangeably, which causes confusion. NLP model is the general category: any model that handles language. Foundation model is a narrower term: a very large model pre-trained on a massive corpus and intended to be adapted to many downstream tasks via fine-tuning or prompting. Every foundation model that handles language is an NLP model. Not every NLP model is a foundation model. Einstein Case Classification is an NLP model that is not a foundation model. The LLM behind Agentforce is both.
Refresh, drift, and the long-term health of an NLP feature
NLP models drift. The language customers use to describe their problems changes over time, and a model trained six months ago has not seen the new phrasings. Salesforce features that retrain on a cadence (Einstein Case Classification weekly, Einstein Lead Scoring monthly) handle drift automatically as long as the underlying data refreshes. Static models, including a one-time-trained Einstein Bot that never gets new utterances, degrade silently. Build a retraining cadence into every NLP project rather than treating training as a one-time launch task.
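One way to catch silent degradation is to compare accuracy on a recent, hand-labeled sample against the accuracy measured at launch. The sketch below assumes such a sample exists; the function names, the 0.05 tolerance, and the data are illustrative choices, not platform settings.

```python
def accuracy(pairs):
    """Share of (predicted, actual) pairs where the model was right."""
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / len(pairs)

def drift_alert(baseline_accuracy, recent_pairs, max_drop=0.05):
    """Flag the model when recent accuracy falls more than max_drop below baseline."""
    recent = accuracy(recent_pairs)
    return (baseline_accuracy - recent) > max_drop, recent

# Hypothetical sample of ten recent conversations, reviewed by a human.
recent_sample = (
    [("Reset_Password", "Reset_Password")] * 4
    + [("Check_Order", "Check_Order")] * 3
    + [("Check_Order", "Track_Delivery")] * 3  # new phrasing the model never saw
)

needs_retrain, recent_accuracy = drift_alert(0.92, recent_sample)
print(needs_retrain, recent_accuracy)
```

Running a check like this on a schedule turns drift from something discovered through complaints into something that shows up in a report.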
How to pick and configure an NLP model for a Salesforce feature
Most Salesforce features pick the NLP model for you. The work is providing good data and tuning the few model-level knobs the feature exposes.
- Identify the feature and its model type
Einstein Bots uses Einstein NLU for intent classification. Einstein Case Classification uses a per-org fine-tuned classifier. Agentforce uses a foundation LLM. The configuration surface differs by feature; check the docs for the specific feature.
- Provide labeled training data
Utterances per intent for bots, historical records per target class for classification, prompt templates and grounding for generative features. The data is the actual customization; the model architecture itself is fixed.
- Train, validate against a holdout
Launch the build or training pipeline. The feature returns metrics. Spot-check predictions against a hand-labeled holdout set to confirm the metrics match real behavior.
- Tune the confidence threshold
Use the feature's threshold setting to trade off action rate against accuracy. Start conservative (high threshold) and lower it as confidence in the model grows.
- Set a refresh schedule
For features that retrain automatically, verify the schedule is on. For static features, calendar a manual retrain at least quarterly.
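The holdout spot-check in the validation step can be done with a few lines of plain Python. This sketch assumes the predictions and the hand-assigned labels have been exported as pairs; the labels and data are hypothetical.

```python
from collections import Counter

def per_label_metrics(holdout):
    """Compute precision and recall per label from (predicted, actual) pairs."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for predicted, actual in holdout:
        if predicted == actual:
            true_pos[actual] += 1
        else:
            false_pos[predicted] += 1  # predicted this label, but it was wrong
            false_neg[actual] += 1     # missed the label that was correct
    metrics = {}
    for label in set(true_pos) | set(false_pos) | set(false_neg):
        predicted_total = true_pos[label] + false_pos[label]
        actual_total = true_pos[label] + false_neg[label]
        metrics[label] = {
            "precision": true_pos[label] / predicted_total if predicted_total else 0.0,
            "recall": true_pos[label] / actual_total if actual_total else 0.0,
        }
    return metrics

# Hypothetical hand-labeled holdout: (model prediction, human label).
holdout = [
    ("Billing", "Billing"), ("Billing", "Billing"), ("Billing", "Technical"),
    ("Technical", "Technical"), ("Shipping", "Shipping"), ("Technical", "Shipping"),
]

for label, scores in sorted(per_label_metrics(holdout).items()):
    print(label, scores)
```

Per-label numbers are more useful than a single accuracy figure because they show which label is absorbing the mistakes, which usually points back to a data or label-design problem.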
- Model selection
Some features expose a model picker (Einstein NLU versus external). Most do not. When the picker exists, pick the model that matches the data language and domain.
- Confidence threshold
The probability floor for the feature to act. Tune it per intent or per use case rather than once across the whole bot.
- Language
For multi-language deployments, configure one model per language. Multilingual foundation models handle generation natively, but classification still needs per-language data.
- Refresh schedule
How often the model retrains. This is built into automatic features. Calendar it manually for static ones.
- Fallback behavior
What the feature does below the confidence threshold: hand off to a human, return a default, or ask a clarifier. Design this deliberately rather than accepting the default.
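Per-intent thresholds and a deliberate fallback can be pictured as a small settings object. Everything in this sketch is hypothetical: `BotSettings`, its fields, and the action names are illustrative, not a Salesforce API or metadata type.

```python
from dataclasses import dataclass, field

@dataclass
class BotSettings:
    """Illustrative container for the threshold and fallback choices."""
    default_threshold: float = 0.7
    per_intent_threshold: dict = field(default_factory=dict)
    fallback_action: str = "handoff_to_agent"

    def threshold_for(self, intent):
        """Use the intent-specific floor when one is set, else the default."""
        return self.per_intent_threshold.get(intent, self.default_threshold)

    def resolve(self, intent, confidence):
        """Return the intent to act on, or the configured fallback action."""
        if confidence >= self.threshold_for(intent):
            return intent
        return self.fallback_action

# A destructive intent gets a stricter floor than the rest of the bot.
settings = BotSettings(
    per_intent_threshold={"Cancel_Order": 0.9},
    fallback_action="ask_clarifying_question",
)

print(settings.resolve("Check_Order", 0.8))
print(settings.resolve("Cancel_Order", 0.8))
```

The same 0.8 confidence is enough to act on `Check_Order` but triggers the clarifying question for `Cancel_Order`, which is the point of tuning per intent rather than once for the whole bot.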
- Switching the underlying model rarely fixes a misroute problem. Rewriting utterances and tuning thresholds almost always does. Spend effort on data, not on the model picker.
- Language coverage is per-model. A model trained on English utterances misclassifies French ones, even if the underlying foundation model speaks both. Train per language for classification.
- A confidence threshold tuned once at launch stops being right. Customer language shifts, intent boundaries blur, and the right threshold drifts. Revisit it quarterly.
- Static NLP models degrade silently. A bot that was 92 percent accurate at launch can be 78 percent a year later and nobody notices until customer complaints pile up.
- Foundation model and NLP model are not synonyms. Mixing the terms in design docs causes confusion about whether the feature can be fine-tuned or only prompted.
Trust & references
Cross-checked against the following references.
- Einstein NLU and Bot Intents (Salesforce Help)
- Einstein Case Classification (Salesforce Help)
Straight from the source - Salesforce's reference material on NLP Model.
- Einstein platform overview (Salesforce Help)
- Einstein Studio (Salesforce Help)
About the Author
Dipojjal Chakrabarti is a B2C Solution Architect with 29 Salesforce certifications and over 13 years in the Salesforce ecosystem. He runs salesforcedictionary.com to help admins, developers, architects, and cert/interview candidates sharpen their fundamentals.