Evaluation design scenarios — CCA-F Exam Prep
The AI passed every test. Then it failed in production.
A content summarization AI scored 95% on the eval suite. The team deployed with confidence. Within a week, customers complained: summaries were missing critical details, inventing facts, and contradicting the source material.
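The eval suite had 50 test cases, all of them short, well-structured articles. Production content included legal documents, medical reports, and rambling forum posts. The 95% score measured performance only on the kind of input the suite contained, so it said nothing about the rest. The AI was tested on easy mode and deployed on hard mode.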
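One way to catch this kind of mismatch before deployment is to compare the mix of content types in the eval suite against a labeled sample of production traffic. The sketch below is illustrative only: the function name `coverage_gaps`, the `min_share` threshold, the type labels, and the production counts are all assumptions made up for the example, not details from the scenario (which only states that the suite had 50 short-article cases).

```python
from collections import Counter


def coverage_gaps(eval_types, production_types, min_share=0.05):
    """Return production content types that are under-represented in the eval set.

    A type is flagged when it appears in production but makes up less than
    `min_share` of the eval set. `min_share` is an arbitrary illustrative cutoff.
    """
    eval_counts = Counter(eval_types)
    prod_counts = Counter(production_types)
    eval_total = len(eval_types)
    prod_total = len(production_types)

    gaps = {}
    for content_type, prod_count in prod_counts.items():
        # Counter returns 0 for types the eval set never saw.
        eval_share = eval_counts[content_type] / eval_total if eval_total else 0.0
        if eval_share < min_share:
            gaps[content_type] = {
                "eval_share": eval_share,
                "production_share": prod_count / prod_total,
            }
    return gaps


# 50 eval cases, all short articles, as in the scenario.
eval_types = ["short_article"] * 50

# Hypothetical production sample; this mix is invented for illustration.
production_types = (
    ["short_article"] * 40
    + ["legal_document"] * 25
    + ["medical_report"] * 20
    + ["forum_post"] * 15
)

gaps = coverage_gaps(eval_types, production_types)
for content_type in sorted(gaps):
    print(content_type, gaps[content_type])
```

With these made-up numbers, the legal, medical, and forum types are flagged because the eval set contains none of them. A check like this only reveals missing categories; it does not say how the model performs on them, which still requires adding test cases of each type.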