Your LLM works perfectly in demos but breaks down with real users. Sound familiar? You're not alone. Most AI teams struggle with the gap between "cool prototype" and "production-ready system."
The problem isn't your model or your data. It's that you're missing a systematic approach to evaluation (or "evals" as we call them in the field). Evals are the structured methods for testing, measuring, and improving how your AI system performs in the real world.
The "three gulfs" model (Comprehension, Specification, and Generalization) introduced by Hamel Husain & Shreya Shankar in their AI Evals for Engineers & PMs Course provides a framework for bridging this gap. Here's how to apply it to build more reliable LLM systems.
Gulf of Comprehension: Start with the Data
No matter how advanced your model is, you can't build a good AI product without truly comprehending your data. That means investigating not just what users type in, but also the outputs your model generates, the intermediate steps it takes, and the kinds of failures that appear. You can't inspect every single trace, but you should keep reading until you stop seeing new patterns; that's how you surface important properties and recurring issues.
Approach to use
- Regularly sample and read traces. Don't rely on metrics or public benchmarks alone.
- Use annotation to label failures and group similar issues together to spot patterns in failure modes (see the sketch below).
- Always stay skeptical: there's usually more variety and more surprises in real user data than you expect.
Think of it as building empathy with your data.
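To make that concrete, here's a minimal sketch of what reading and annotating traces can look like in code. It assumes traces are stored one JSON object per line with `input` and `output` fields, and the failure labels are ones you apply by hand while reading; adapt both to whatever your logging actually produces.

```python
import json
import random
from collections import Counter

# Load logged traces (assumes one JSON object per line with "input" and "output" keys).
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

# Sample a manageable batch to read end to end.
sample = random.sample(traces, k=min(30, len(traces)))

# Hand-applied labels from reading each trace: trace index -> list of failure modes.
# These labels come from you, not from code; the point is to record them somewhere.
annotations = {
    0: ["ignored_user_constraint"],
    7: ["wrong_format", "hallucinated_citation"],
    12: [],  # no issues found
}

# Group similar failures to see which modes recur most often.
failure_counts = Counter(label for labels in annotations.values() for label in labels)
for mode, count in failure_counts.most_common():
    print(f"{mode}: {count}")
```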
Gulf of Specification: Communicate with Care
Once you know your data, the next challenge is translating intent into precise instructions for your model. Vague prompts yield vague outputs. The real art is specifying exactly what the model should (and shouldn't) do, how it should respond, and what "good" looks like for your users.
A good set of rules to start with
- Start with user unhappiness: List everything that could frustrate or confuse the end user if the model's response doesn't capture it. Encode protections against those issues.
- Define boundaries: Specify what the model must always do and never do.
- Set output expectations: Clearly define formats and provide examples (see the sketch below).
- Inject your taste: What does a great answer look like for your audience? This is where your competitive moat lies.
Specification isn't just about prompt engineering—it's your product design in writing.
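As an illustration of those rules in practice, here's a sketch of a written specification: explicit boundaries, a defined output format, and one example of a good answer. The product, prompt text, and field names are placeholders for your own requirements, not a recommended spec.

```python
import json

# A hypothetical spec for a billing-support assistant: boundaries, format, and an example.
SYSTEM_PROMPT = """\
You are a support assistant for Acme's billing product.

Always:
- Answer only questions about billing, invoices, and refunds.
- Cite the relevant help-center article when one exists.

Never:
- Promise refunds or credits; only explain the refund policy.
- Invent order numbers, prices, or dates.

Respond as JSON with exactly these keys:
{"answer": "<plain-language answer>", "source": "<help article slug or null>"}

Example:
User: "Why was I charged twice this month?"
Response: {"answer": "Duplicate charges usually mean ...", "source": "billing/duplicate-charges"}
"""

def is_well_formed(raw_response: str) -> bool:
    """Cheap check that a response matches the specified output schema."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == {"answer", "source"}
```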
Gulf of Generalization: Engineer for Robustness
Even with perfect prompts, models often fail to generalize. Generalization is about making sure your LLM pipeline performs well outside the scenarios you've already seen. This is not just about alignment; it's about robustness.
Techniques you can adopt
- RAG (Retrieval-Augmented Generation): Supply relevant, external, or personalized context to reduce hallucinations and improve accuracy
- Pre/post-processing: Filter, sanitize, or reformat inputs/outputs to handle edge cases and maintain consistency
- Agent decomposition: Break complex tasks into smaller, more manageable steps that are easier to test and debug
- Context engineering: Fill the context window with just the right information for the next step
- Fallback mechanisms: Build graceful degradation for when the model fails or returns low-confidence responses (see the sketch below)
- Evals: Run systematic evaluations to compare approaches and understand what works best for your use cases
Looping back: generalization is about making the system work consistently, not just correctly.
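For example, a fallback mechanism can start as a small wrapper that degrades gracefully on errors or low-confidence answers. The `call_model` and `confidence` callables below are placeholders for your own pipeline and scoring method; this is a sketch of the pattern, not a particular library's API.

```python
FALLBACK_MESSAGE = (
    "I'm not confident I can answer that correctly. "
    "I've forwarded your question to a human agent."
)

def answer_with_fallback(question: str, call_model, confidence, threshold: float = 0.7) -> str:
    """Return the model's answer, or a graceful fallback on errors / low confidence.

    call_model: callable that takes a question and returns the model's raw answer.
    confidence: callable that scores an answer in [0, 1] (e.g. a judge model or heuristic).
    """
    try:
        answer = call_model(question)
    except Exception:
        # Model or network failure: degrade gracefully instead of surfacing a stack trace.
        return FALLBACK_MESSAGE

    if confidence(answer) < threshold:
        # Low-confidence answer: better to hand off than to guess.
        return FALLBACK_MESSAGE

    return answer
```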
The Loop: Analyze → Measure → Improve
Building LLM products is a loop, not a line. A framework to adopt:
- Analyze: Inspect representative data, annotate traces, and identify failure modes.
- Measure: Quantify those failure modes and their impact (a sketch follows below).
- Improve: Address specification or generalization issues with prompt tweaks or engineering changes.
- Repeat: Cycle through again as new issues emerge or goals shift.
You cycle through these repeatedly as you build and improve your AI system. And at each stage, putting yourself in the shoes of the end user is key.
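To ground the Measure step, here's a minimal sketch that turns hand-annotated failure modes (like the ones recorded in the earlier snippet) into per-mode failure rates, so you can compare a pipeline before and after a change. The label names and annotation format are assumptions, not a standard.

```python
from collections import Counter

def failure_rates(annotations: dict[int, list[str]]) -> dict[str, float]:
    """Fraction of annotated traces exhibiting each failure mode."""
    total = len(annotations)
    counts = Counter(label for labels in annotations.values() for label in labels)
    return {mode: count / total for mode, count in counts.items()}

# Compare the same annotation pass before and after a prompt or pipeline change.
before = failure_rates({0: ["wrong_format"], 1: [], 2: ["wrong_format", "off_topic"]})
after = failure_rates({0: [], 1: [], 2: ["off_topic"]})

for mode in sorted(set(before) | set(after)):
    print(f"{mode}: {before.get(mode, 0.0):.0%} -> {after.get(mode, 0.0):.0%}")
```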
Get Started with Evals
- Collect real user inputs or synthetic data. Focus on the former, as synthetic data often misses the complexity of real user interactions.
- Run those inputs through your model, logging every trace. Use an observability platform, OpenTelemetry, or simply store inputs/outputs to a file (see the sketch after this list).
- Read and annotate traces. Yes, you need to "Look at Your Data."
- Group similar failures together to identify patterns.
- Focus on the first failure in any long trace, rather than downstream errors.
- Repeat and refine your process.
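If you don't have an observability stack yet, trace logging can start as a few lines of code. This sketch writes one JSON object per request to a local file; `call_model` in the usage comment is a stand-in for your own pipeline, and the record schema is just one reasonable choice.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("traces.jsonl")

def log_trace(user_input: str, output: str, metadata: dict | None = None) -> None:
    """Append one trace (input, output, timestamp, metadata) as a JSON line."""
    record = {
        "timestamp": time.time(),
        "input": user_input,
        "output": output,
        "metadata": metadata or {},
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Usage: wrap whatever calls your model.
# output = call_model(user_input)        # your pipeline
# log_trace(user_input, output, {"model": "my-model-v1"})
```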
The Buckets: Quick Reference
To tie it all together, here's where common terms and tools fit within each gulf:
| Gulf of Comprehension | Gulf of Specification | Gulf of Generalization |
| --- | --- | --- |
| Data Traces | Prompt Engineering | LLM Chains |
| Observability | Roles/Objectives | RAG |
| Annotation | Few-shot Examples | Agentic Workflow |
| "Look at Your Data" | Metaprompting | Memory |
| Open/Axial Coding | Chain-of-Thought | Finetuning |
| Synthetic Data Generation | Guardrails | Fallback Mechanism |
| Failure Analysis | Output Schema | Context Engineering |
Final Thought
Evaluating and improving LLM systems is all about bridging these gulfs and cycling through them with a critical, user-first mindset. Success depends not just on technical skills, but on your willingness to deeply understand, specify, and engineer for the messy, real world.
As Hamel puts it: "If you are not willing to look at some data manually on a regular cadence you are wasting your time with evals. Furthermore, you are wasting your time more generally."