Evaluation design scenarios — CCA-F Exam Prep

MYSTERY

The AI passed every test. Then it failed in production.

A content summarization AI scored 95% on the eval suite. The team deployed with confidence. Within a week, customers complained: summaries were missing critical details, inventing facts, and contradicting the source material.

The eval suite had 50 test cases, all of them short, well-structured articles. Production content included legal documents, medical reports, and rambling forum posts. The AI was tested on easy mode and deployed on hard mode.
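
One way to catch this kind of mismatch before deployment is to compare the mix of content types in the eval suite against the mix seen in production. The sketch below is only an illustration under stated assumptions: the function name `coverage_gaps`, the content-type labels, the production counts, and the "less than half the production share" threshold are all hypothetical and do not come from the scenario, which gives only the 50 eval cases and the kinds of production content.

```python
from collections import Counter


def coverage_gaps(eval_types, production_types, min_share=0.05):
    """Return content types that matter in production but are thin in the eval set.

    A type is flagged when it makes up at least `min_share` of production
    traffic and its share of the eval set is less than half its production
    share. Both thresholds are illustrative assumptions.
    """
    eval_counts = Counter(eval_types)
    prod_counts = Counter(production_types)
    eval_total = sum(eval_counts.values())
    prod_total = sum(prod_counts.values())

    gaps = {}
    for content_type, count in prod_counts.items():
        prod_share = count / prod_total
        # An empty eval set covers nothing, so treat its share as zero.
        eval_share = eval_counts[content_type] / eval_total if eval_total else 0.0
        if prod_share >= min_share and eval_share < prod_share / 2:
            gaps[content_type] = (eval_share, prod_share)
    return gaps


# Hypothetical data: 50 eval cases, all short articles (as in the scenario),
# and an invented production sample of 100 documents.
eval_types = ["short_article"] * 50
production_types = (
    ["short_article"] * 40
    + ["legal_document"] * 25
    + ["medical_report"] * 20
    + ["forum_post"] * 15
)

for content_type, (eval_share, prod_share) in sorted(
    coverage_gaps(eval_types, production_types).items()
):
    print(f"{content_type}: eval {eval_share:.0%} vs production {prod_share:.0%}")
```

With this invented sample, the check flags legal documents, medical reports, and forum posts, because the eval set contains none of them. A check like this does not measure summary quality; it only shows which kinds of input the 95% score says nothing about.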