This phase is about making sure your AI agent can answer real customer questions correctly before you roll it out. By testing with sample questions, fixing gaps, and aiming for consistent accuracy, you’ll capture early wins and avoid embarrassing mistakes.
Think of this phase as your dress rehearsal before going live.
Steps
1. Build your test set
Collect 30–50 simple, product-related questions.
Why it matters: A broad sample helps you spot gaps and check consistency quickly.
Where to source real questions:
Support tickets, email inbox, or chat transcripts
Help center search queries (what customers type in)
Common “how-to” questions your support team knows by heart
How to generate questions if you lack history:
Turn document headings into questions (“reset your password” → “how do I reset my password?”); a scripted version of this is sketched below
Use an AI assistant (e.g., ChatGPT) with your documents to create 30–50 realistic, customer-style questions
Note: Keep scope tight. Use only general product and documentation questions. Skip billing, cancellations, refunds, and account-specific issues.
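If your documentation lives in markdown files, a small script can do the heading-to-question conversion in bulk. Below is a minimal sketch, assuming a local folder of .md exports; the ./help-docs path and the "how do I" phrasing heuristic are placeholders to adapt, and the output still needs a human pass.

```python
# Minimal sketch: turn markdown headings into customer-style questions.
# The folder path and the "how do I" heuristic are assumptions; review
# the output by hand, since not every heading describes an action.
import re
from pathlib import Path

def headings_to_questions(doc_dir: str) -> list[str]:
    questions = []
    for path in sorted(Path(doc_dir).glob("**/*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            match = re.match(r"#+\s+(.+)", line)
            if not match:
                continue
            heading = match.group(1).strip().rstrip("?").lower()
            # Mirror the example above: "reset your password" ->
            # "how do I reset my password?"
            heading = heading.replace("your", "my")
            questions.append(f"How do I {heading}?")
    return questions

if __name__ == "__main__":
    for question in headings_to_questions("./help-docs"):
        print(question)
```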
2. Test in the Playground
Ask each question manually and track the results.
Why it matters: Direct testing shows what customers will actually see.
How to do it:
Enter each question in the Playground
Record outcome: correct, partial, or incorrect
Capture the source document title (and URL or ID if available)
Log results in a simple sheet for team review
Suggested tracking sheet columns:
question | intent/topic | outcome (correct/partial/incorrect) | source doc | notes/fix needed | owner | retest result
Example Playground interface showing a sample question and the retrieved sources used for the AI’s reply.
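If you'd rather keep the log in a plain file than a spreadsheet, the suggested columns translate directly to a CSV. A minimal sketch, assuming a hypothetical file named playground_test_log.csv (the sample row is illustrative):

```python
# Minimal sketch: create the tracking sheet as a CSV using the columns
# suggested above. The file name and sample row are assumptions.
import csv

COLUMNS = ["question", "intent/topic", "outcome", "source doc",
           "notes/fix needed", "owner", "retest result"]

with open("playground_test_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "question": "How do I reset my password?",
        "intent/topic": "account access",
        "outcome": "partial",  # correct / partial / incorrect
        "source doc": "Reset your password",
        "notes/fix needed": "Answer omits SSO users",
        "owner": "dana",
        "retest result": "",
    })
```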
3. Identify gaps and issues
Review all incorrect or incomplete answers.
Why it matters: Understanding the cause prevents random fixes.
Common problems:
Conflicting documentation → update or consolidate overlapping pages
Information exists but wasn’t found → confirm it’s uploaded and well-titled; refine additional guidelines
Missing documentation → create or update the relevant article in the knowledge base
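Once results are logged, a quick tally of failures by topic shows where to focus first. A minimal sketch, assuming the CSV format from step 2:

```python
# Minimal sketch: count partial/incorrect answers per topic so the
# worst-covered areas surface first. Assumes the CSV from step 2.
import csv
from collections import Counter

failures = Counter()
with open("playground_test_log.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row["outcome"] in ("partial", "incorrect"):
            failures[row["intent/topic"]] += 1

for topic, count in failures.most_common():
    print(f"{topic}: {count} failing question(s)")
```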
4. Fix with the right method
Apply targeted fixes instead of patching blindly.
Fix options:
Update or add documentation in the knowledge base (preferred and scalable)
Use Q&A sparingly for sensitive or non-public information, or as a temporary bridge until docs are updated
Adjust additional guidelines if the agent isn’t prioritizing the right source
Example Knowledge Base Q&A entry screen for adding exceptions or sensitive information.
5. Retest until consistent
Run the same set of questions again after each fix.
Repeat testing until the agent answers at least 80% of questions correctly
Confirm that previous “partial” answers are now complete and useful
Keep the tracking sheet updated and visible to the team
Continuous testing loop — test, identify gaps, fix, and retest until accuracy stabilizes.
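To see whether a retest crossed the threshold, you can score each run straight from the tracking sheet. A minimal sketch, again assuming the CSV from step 2, that prefers the retest result over the original outcome:

```python
# Minimal sketch: compute the latest run's accuracy from the tracking
# sheet. Uses "retest result" when filled, else the original outcome.
import csv

total = correct = 0
with open("playground_test_log.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        outcome = row["retest result"] or row["outcome"]
        total += 1
        if outcome == "correct":
            correct += 1

accuracy = correct / total if total else 0.0
print(f"{correct}/{total} correct ({accuracy:.0%})")
print("Ready for step 6" if accuracy >= 0.80 else "Keep fixing and retesting")
```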
6. Confirm readiness
Run structured tests to know when you’re ready to go live.
Weekly checks:
Test with 50–100 real customer questions (golden set)
Score each answer: Correct / Partial / Incorrect / Refused appropriately
Track groundedness: every answer should cite the right source
Spot-audit tone and refusals
Pass gates — move forward only if all are true:
≥80% correct answers on the golden set
Fallback rate <20%
≥95% of escalations include transcript + key fields
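These gates can be checked mechanically once the weekly run is scored. A minimal sketch; the function, its inputs, and the sample numbers are all assumptions, so map them to however your team records results:

```python
# Minimal sketch: evaluate the three pass gates from a scored run.
# All inputs are hypothetical; adapt them to your own records.
def passes_gates(scores: list[str], fallbacks: int, total_chats: int,
                 escalations_complete: int, escalations_total: int) -> bool:
    accuracy = scores.count("correct") / len(scores)
    fallback_rate = fallbacks / total_chats
    escalation_quality = (escalations_complete / escalations_total
                          if escalations_total else 1.0)
    print(f"Accuracy: {accuracy:.0%} (gate: at least 80%)")
    print(f"Fallback rate: {fallback_rate:.0%} (gate: under 20%)")
    print(f"Complete escalations: {escalation_quality:.0%} (gate: at least 95%)")
    return (accuracy >= 0.80 and fallback_rate < 0.20
            and escalation_quality >= 0.95)

if __name__ == "__main__":
    # Hypothetical weekly numbers for illustration only.
    scores = ["correct"] * 85 + ["partial"] * 8 + ["incorrect"] * 4 + ["refused"] * 3
    print(passes_gates(scores, fallbacks=14, total_chats=100,
                       escalations_complete=19, escalations_total=20))
```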
Note: Don’t over-tune. Once you hit these thresholds, stop iterating and move to Phase 4 (Embedding & Rollout). Defer other refinements to later phases.
Best Practices / Tips
Involve your support team — they know real customer phrasing.
Balance the test set across FAQs, how-tos, and edge cases.
Fix knowledge base documentation before using Q&A.
Track results in a shared sheet for clarity and alignment.
Aim for progress, not perfection; at least 80% accuracy is enough to move forward.
Common Mistakes to Avoid
Testing with too few questions
Jumping to sensitive or account-specific cases too early
Ignoring duplicate or conflicting documents
Adding too many Q&As instead of fixing the main documents
Forgetting to retest after changes
Cross-references
Knowledge Base – Q&A (for handling exceptions)
✅ Expected outcome: Agent answers at least 80% of FAQs accurately in the Playground.
