The authoring pattern that works starts from real data, varies phrases across length, vocabulary, and grammar, adds negative phrases for overlapping intents, and tunes on a cadence driven by the confusion matrix from production. Skipping any of those steps produces a bot that demos well and fails for real users.

Spend an hour pulling real phrases per intent
Mine email-to-case subject lines, web form submissions, and any prior chat logs. Aim for 100 real candidates per intent before writing a single invented phrase.
Pick 30 to 50 varied phrases per intent from the candidate pool
Vary length (short, medium, long), vocabulary (formal, casual, slang), and grammar (questions, statements, fragments, commands). Skip near-duplicates; each phrase should add genuine variety.
Strip out named entities from phrases
Replace specific values (customer names, order numbers, dates) with generic placeholders or remove them. The model should learn the pattern, not the specific value.
Add 3 to 5 negative training phrases per overlapping intent
For each pair of intents that share vocabulary, add explicit negative phrases that belong to the other intent. Skip negative training for intents with no overlap.
Train the model and test in the bot preview
Send the candidate messages back through the bot and confirm they classify as expected. Send variations the model has not seen. Watch for intents that match too broadly or too narrowly.
Pilot in production for two to four weeks
Real production conversations surface misclassifications that the preview misses. Pilot with a small user group before broad rollout.
Tune from the Intent Insights confusion matrix
Open the confusion matrix after two to four weeks. Off-diagonal cells, where the predicted intent differs from the actual one, are the targets. Add positive or negative training phrases to sharpen the boundaries between the confused intents. Retrain and observe.
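
The last step can be made concrete with a minimal sketch that ranks off-diagonal cells so the worst confusions are tuned first. It assumes the confusion matrix has been copied into a nested dictionary of counts keyed by actual and then predicted intent. That layout, the function name rank_confusions, and the sample numbers are all hypothetical and are not an Intent Insights export format.

```python
from typing import Dict, List, Tuple

# Assumed layout (hypothetical): matrix[actual_intent][predicted_intent] = count
ConfusionMatrix = Dict[str, Dict[str, int]]


def rank_confusions(matrix: ConfusionMatrix, min_count: int = 1) -> List[Tuple[str, str, int]]:
    """Return (actual, predicted, count) for off-diagonal cells, largest count first."""
    cells = [
        (actual, predicted, count)
        for actual, row in matrix.items()
        for predicted, count in row.items()
        if actual != predicted and count >= min_count
    ]
    # Sort by descending count; ties fall back to intent names for a stable order.
    return sorted(cells, key=lambda c: (-c[2], c[0], c[1]))


# Invented counts, for illustration only.
sample = {
    "Check Order Status": {"Check Order Status": 42, "Cancel Order": 7, "Return Item": 1},
    "Cancel Order": {"Check Order Status": 3, "Cancel Order": 30, "Return Item": 0},
    "Return Item": {"Check Order Status": 0, "Cancel Order": 5, "Return Item": 25},
}

for actual, predicted, count in rank_confusions(sample):
    print(f"{count:>3}  {actual} -> {predicted}")
```

Each row of the output names a pair of intents whose boundary needs work: the actual intent may need more positive phrases, or the predicted intent may need negative phrases drawn from the misclassified messages.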

Key options

Phrase count per intent

30 to 50 is the sweet spot. Below 20 the model under-fits; above 100 returns diminish quickly.
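
A quick audit can flag intents that fall outside these ranges before training. The sketch below assumes phrases are held in a plain dictionary keyed by intent name; the function name and the labels it returns are illustrative choices, not product terminology.

```python
from typing import Dict, List


def audit_phrase_counts(phrases_by_intent: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each intent by where its distinct phrase count falls against the guidance."""
    report = {}
    for intent, phrases in phrases_by_intent.items():
        # Exact duplicates (ignoring case and surrounding whitespace) are counted once.
        n = len({p.strip().lower() for p in phrases})
        if n < 20:
            report[intent] = "too few"
        elif n < 30:
            report[intent] = "below target"
        elif n <= 50:
            report[intent] = "on target"
        elif n <= 100:
            report[intent] = "above target"
        else:
            report[intent] = "diminishing returns"
    return report
```

The count only says whether there are enough phrases; it says nothing about whether they are varied, which needs a separate check.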

Source of phrases

Real conversation data (email-to-case, web forms, chat logs) outperforms invented phrases every time.

Variety dimensions

Length (short, medium, long), vocabulary (formal, casual, slang), grammar (question, statement, fragment, command). Each axis matters.
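
One way to check coverage along two of these axes is to bucket each phrase and count the buckets. The sketch below is a rough heuristic under stated assumptions: the word-count cutoffs, the word lists, and the function names are all invented for illustration, and the vocabulary axis (formal, casual, slang) is left out because it does not reduce to a simple rule.

```python
from collections import Counter
from typing import Dict, List

# Illustrative word lists; extend them for the bot's own domain.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "can", "could",
                  "is", "are", "do", "does", "did", "will", "would", "should"}
COMMAND_VERBS = {"cancel", "track", "show", "send", "update", "reset", "change",
                 "check", "give", "tell", "find", "help"}


def length_bucket(phrase: str) -> str:
    """Bucket by word count; the cutoffs (3 and 8) are arbitrary assumptions."""
    n = len(phrase.split())
    if n <= 3:
        return "short"
    if n <= 8:
        return "medium"
    return "long"


def grammar_form(phrase: str) -> str:
    """Guess the grammatical form from punctuation and the first word."""
    words = phrase.lower().strip().rstrip("?.!").split()
    if not words:
        return "fragment"
    if phrase.strip().endswith("?") or words[0] in QUESTION_WORDS:
        return "question"
    if words[0] in COMMAND_VERBS:
        return "command"
    if len(words) <= 3:
        return "fragment"
    return "statement"


def variety_profile(phrases: List[str]) -> Dict[str, Counter]:
    """Count how many phrases land in each length bucket and grammar form."""
    return {
        "length": Counter(length_bucket(p) for p in phrases),
        "grammar": Counter(grammar_form(p) for p in phrases),
    }
```

An intent whose profile shows every phrase in a single bucket, for example all medium-length questions, is a candidate for more varied phrases from the real-data pool.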

Negative training phrases

Three to five per overlapping intent pair. Skip for intents with no overlap.
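
Deciding which intent pairs overlap can be approximated before any production data exists. The sketch below scores shared vocabulary between intents with Jaccard similarity (shared words divided by all words across both intents). This is a stand-in chosen for illustration and not how the bot's model measures overlap; the stopword list, the 0.3 threshold, and the function names are assumptions.

```python
import re
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Small illustrative stopword list; common words would otherwise inflate every score.
STOPWORDS = {"a", "an", "the", "my", "i", "to", "is", "it", "of", "for",
             "and", "me", "you", "can", "do", "please"}


def vocabulary(phrases: List[str]) -> Set[str]:
    """Collect the distinct lowercase words used across an intent's phrases."""
    words: Set[str] = set()
    for phrase in phrases:
        words.update(re.findall(r"[a-z']+", phrase.lower()))
    return words - STOPWORDS


def overlapping_pairs(phrases_by_intent: Dict[str, List[str]],
                      threshold: float = 0.3) -> List[Tuple[str, str, float]]:
    """Return intent pairs whose vocabulary Jaccard score meets the threshold."""
    vocab = {intent: vocabulary(ps) for intent, ps in phrases_by_intent.items()}
    pairs = []
    for a, b in combinations(sorted(vocab), 2):
        union = vocab[a] | vocab[b]
        if not union:
            continue
        score = len(vocab[a] & vocab[b]) / len(union)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: -p[2])
```

Pairs returned here are the candidates for three to five negative phrases each; intents that never appear in the result can skip negative training until the production confusion matrix says otherwise.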

Tuning cadence

Two to four weeks of pilot data, then confusion-matrix-driven tuning. Iterate three or four rounds for production-ready quality.

Gotchas

Twenty near-identical phrases produce a brittle model. Variety matters more than quantity; add 30 varied phrases rather than 60 paraphrases.
Training phrases that include verbatim entity values teach the model to match the value, not the pattern. Strip or genericize specific values.
Phrases invented by the author capture how the author writes, not how users do. Real-data-first is the discipline that distinguishes bots that work from bots that demo.
Skipping negative training for overlapping intents produces a bot that flips its classification on near-identical messages. The fix is targeted negative phrases, not more positive ones.
Tuning is iterative, not one-time. Most production bots need three to four rounds of confusion-matrix-driven tuning to reach acceptable quality.
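
The first two gotchas can be partly guarded against with a cleanup pass over the candidate pool. The sketch below replaces a couple of entity formats with placeholders and drops near-duplicates using the standard library's difflib. The regular expressions, the placeholder names, and the 0.9 similarity threshold are illustrative assumptions; customer names in particular are not handled and would need their own approach.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Illustrative patterns only; real data needs patterns fitted to its own formats.
# Dates are replaced first so their digits are not mistaken for order numbers.
PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "{date}"),
    (re.compile(r"#?\b\d{5,}\b"), "{order_number}"),
]


def genericize(phrase: str) -> str:
    """Replace specific values with generic placeholders."""
    for pattern, placeholder in PATTERNS:
        phrase = pattern.sub(placeholder, phrase)
    return phrase


def drop_near_duplicates(phrases: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a phrase only if it is not too similar to any phrase already kept."""
    kept: List[str] = []
    for phrase in phrases:
        norm = phrase.lower().strip()
        if all(SequenceMatcher(None, norm, k.lower().strip()).ratio() < threshold
               for k in kept):
            kept.append(phrase)
    return kept
```

Running genericize before drop_near_duplicates matters: two phrases that differ only in an order number become identical once the value is replaced, so the second one is filtered out.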

How to author training phrases that hold up in production
