A Story of Precision, Recall, and the Path Forward
By Shanu Vashishtha, Deep Learning Engineer, Kahuna Labs
The Problem
Imagine you’re a support engineer staring at a freshly opened ticket. A customer’s backup job has failed. The disk is showing strange behavior: high utilization and a rare error code, but overall the puzzle is missing pieces. You’ve seen hundreds of these cases, yet each one turned out to be unique in its journey. What do you do next?
With GenAI systems now widely deployed, AI promises to help in this situation. Depending on the maturity of the assistant in place, the system helps the support engineer by analyzing the case history, matching it to the thousands of resolved tickets it has learned from, and suggesting next steps: “Check the cluster logs for errors,” “Ask the customer about recent configuration changes,” “Run diagnostic command X to gather more information.”
But here’s the million-dollar question: How do we know if the steps suggested by the AI are actually accurate?
In this blog post, we describe one of our attempts at answering this ‘precisely’ by walking through the components of an AI-powered evaluation system.
Ground Truth in the Wild
It always starts with a deceptively simple idea: take real support cases that have been resolved, look at what the predictions were early on, and compare them to what actually happened. The ground truth is right there in the past ticket conversations: the questions engineers asked, the actions they took, the steps they recommended to customers.
Sounds straightforward, right?
Well, maybe.
Challenge #1: Extracting Signal from Noise
Support conversations are messy. They’re filled with:
- “Thanks for the update!”
- “Can we schedule a Zoom call?”
- “I’m out of office until Monday”
- Email signatures, legal disclaimers, and marketing footers
But buried in that noise are the gems:
- Support Engineer Actions: “I will review the cluster logs and apply the latest patch”
- Customer Actions: “Please run `<command with arguments>` and share the output”
- Probing Questions: “When did this issue first occur? Which version of the product are you on?”
The first component of the Eval system is about becoming an archaeologist: carefully sifting through email threads to extract these three types of elements from the ground truth. Apart from removing the noise, this also requires stitching in ticket-related content that lives outside the ticketing system (e.g. some of the steps may have come from a Zoom call transcript or a Slack conversation with a Senior Support Engineer).
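To make the extraction target concrete, here is a minimal sketch of the three element types and a crude noise filter. Everything in it is an assumption made for illustration: the names `Category`, `GroundTruthItem`, `NOISE_PATTERNS` and `drop_noise` are hypothetical, and a handful of regular expressions is only a stand-in for whatever extraction logic a real system would use.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The three kinds of elements extracted from a resolved ticket."""
    ENGINEER_ACTION = "support_engineer_action"
    CUSTOMER_ACTION = "customer_action"
    PROBING_QUESTION = "probing_question"


@dataclass(frozen=True)
class GroundTruthItem:
    category: Category
    text: str
    source: str  # hypothetical field, e.g. "email", "zoom_transcript", "slack"


# Hypothetical noise patterns taken from the examples above.
# A real filter would need far broader coverage than this.
NOISE_PATTERNS = [
    re.compile(r"^thanks for the update", re.IGNORECASE),
    re.compile(r"schedule a (zoom )?call", re.IGNORECASE),
    re.compile(r"out of office", re.IGNORECASE),
]


def is_noise(line: str) -> bool:
    """Return True when the line matches any known operational-noise pattern."""
    return any(pattern.search(line) for pattern in NOISE_PATTERNS)


def drop_noise(lines: list[str]) -> list[str]:
    """Keep only non-empty lines that do not look like operational noise."""
    return [line for line in lines if line.strip() and not is_noise(line)]
```

Calling `drop_noise` on a thread keeps lines such as “Please run the diagnostic and share the output” and discards the scheduling chatter; assigning each surviving line to a `Category` is the harder step and is not attempted here.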
But the challenge doesn’t end there. Once we’ve filtered out the operational noise, we face two more critical considerations:
Privacy First: Support tickets often contain sensitive information—customer names, email addresses, system credentials, IP addresses. Before we can use these tickets for evaluation, we need to scrub all personally identifiable information (PII). This isn’t optional; it’s foundational to building trustworthy AI systems.
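As a rough illustration of the scrubbing step, the sketch below masks two of the PII types mentioned above, email addresses and IPv4 addresses, using regular expressions. The function name `scrub_pii` and the placeholder tokens are assumptions, and the patterns are deliberately simple: customer names and credentials need more than regexes and are not handled here.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_pii(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    return text
```

For example, `scrub_pii("Contact jane.doe@example.com from 10.0.0.12 please")` returns `"Contact <EMAIL> from <IP_ADDRESS> please"`.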
Quality Matters: Not all tickets are created equal. We need to assess:
- Credibility: Is this ticket from last month or five years ago? Best practices change, and more recent tickets generally reflect current reality better than older ones.
- Completeness: Does the ticket actually document the resolution steps? Or does it say “Resolved on a call with the customer” with no details? A ticket that ends with “Issue resolved, closing ticket” without explaining how doesn’t help us evaluate anything.
These quality signals become crucial filters. We’re not just looking for any ground truth—we’re looking for credible, complete, privacy-respecting ground truth that can actually teach us something about what works.
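One way to picture these filters is as two simple predicates over a ticket record. This is a sketch under stated assumptions: the `Ticket` fields, the one-year age cutoff and the 15-word minimum are all hypothetical values chosen for the example, and word count is only a crude proxy for whether the resolution is actually documented.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Ticket:
    ticket_id: str
    closed_on: date
    resolution_notes: str


MAX_AGE = timedelta(days=365)  # assumed cutoff for "recent enough"
MIN_RESOLUTION_WORDS = 15      # assumed floor for "documents the steps"


def is_credible(ticket: Ticket, today: date) -> bool:
    """Recent tickets are treated as a better reflection of current practice."""
    return today - ticket.closed_on <= MAX_AGE


def is_complete(ticket: Ticket) -> bool:
    """Very short notes such as 'Resolved on a call' carry no usable steps."""
    return len(ticket.resolution_notes.split()) >= MIN_RESOLUTION_WORDS


def usable_tickets(tickets: list[Ticket], today: date) -> list[Ticket]:
    """Keep only tickets that pass both the credibility and completeness checks."""
    return [t for t in tickets if is_credible(t, today) and is_complete(t)]
```

A ticket whose notes read “Resolved on a call with the customer” fails `is_complete`, and a five-year-old ticket fails `is_credible`, so neither reaches the evaluation set.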
Challenge #2: Precision vs. Recall
Here’s where it gets interesting. One could theoretically build a prediction system that suggests “everything”:
> “Check the logs. Ask about their configuration. Verify their credentials. Review the network settings. Inspect the firewall rules. Check for disk space issues. Investigate memory availability. Look at CPU utilization…”
This system would have high recall: it would capture almost every action that engineers eventually take. But it would overwhelm users with a firehose of generic suggestions, most of which aren’t relevant to the specific case at hand.
Or one could build an ultra-conservative AI system:
> “Check the logs.”
This would game the Eval framework by scoring high on precision: when it makes a suggestion, it’s probably relevant. But it would miss too many important steps, leaving engineers to figure out the rest on their own.
The sweet spot is in the middle: suggesting the correct next steps without overwhelming or underwhelming the engineers who use the prediction system.
The second component of the eval system, then, computes precision and recall for the system’s predictions against the extracted ground truths. But before we compute the numbers, we need to define what counts as a ‘match’ between a prediction and a ground truth.
Challenge #3: What Does a “Match” Really Mean?
This is where human judgment traditionally enters the equation. Consider this example:
- Prediction: “Please try reinstalling the Backup software’s patch on the server as you mentioned”
- Ground Truth (from actual resolution): “Thank you. It worked after the re-installation of the patch.”
These match. The system under evaluation identified the correct action, even if the phrasing differs.
Now consider this:
- Prediction: “Wait until the retention period expires to successfully unregister the bucket.”
- Ground Truth: “We will not be able to unregister external targets if they are referenced by any data locks. So for now we will not be able to unregister the bucket until all the data locks expire.”
Again, a match! The prediction captured the essence: you need to wait for the data locks to expire, even though the ground truth provides more technical context.
But here’s a non-match:
- Prediction: “Check if you can log in to the cluster using the same credentials.”
- Ground Truth: “Please advise of any recent changes made to your cluster configuration.”
Both are reasonable diagnostic steps, but they’re exploring different hypotheses. One is verifying authentication; the other is checking for configuration drift.
Challenge #4: Real World Worry
When evaluating a real support scenario, suggesting the right probing question at the right time can be the difference between:
- A case that resolves in 2 hours (because you asked the customer about that recent configuration change that caused everything)
- The same case dragging on for 2 days (because you went down three wrong diagnostic paths first).
But here’s the paradox: there are often multiple valid paths to resolution.
The prediction might suggest: “Check if port 443 is open.”
The engineer might ask: “When did you last update your firewall rules?”
Both could lead to discovering the same root cause. Are these a match from the eval framework’s perspective? Sometimes yes, sometimes no—it depends on the context. When it is not a match, we head deeper into the realm of a ‘usefulness’ evaluation of this discovery, something we will explore in a future blog post.
For now, we evaluate three categories separately:
1. Support Engineer Actions: What will the Support Engineer do to help?
2. Customer Actions: What do we need to ask them to do?
3. Probing Questions: What information do we need to gather?
Each category has its own precision-recall tradeoff.
Enter the LLM-Judge in the Loop
To determine these matches in each of the categories, the eval framework comprises an LLM judgement component—a judge that understands technical context and semantic similarity. It evaluates each pairwise comparison:
“Given the case context about backups failing and the attempts that didn’t work so far, does the predicted action of ‘reinstalling the backup software’s correct version’ match the ground truth action mentioned in the resolution?”
The judge returns a binary verdict: match or no match. From a collection of these judgments, we build our precision and recall scores.
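A minimal sketch of how binary verdicts could roll up into the two scores is shown below. It assumes one particular aggregation: a prediction counts as correct if the judge matches it to at least one ground-truth item, and a ground-truth item counts as covered if at least one prediction matches it. The `Judge` signature and `score_with_judge` are hypothetical names, and `make_stub_judge` is a test double standing in for the LLM call, not a real judge.

```python
from typing import Callable, Sequence

# A judge takes (prediction, ground_truth, case_context) and returns match / no match.
Judge = Callable[[str, str, str], bool]


def score_with_judge(
    predictions: Sequence[str],
    ground_truths: Sequence[str],
    context: str,
    judge: Judge,
) -> tuple[float, float]:
    """Run every pairwise comparison and aggregate into (precision, recall)."""
    verdicts = [[judge(p, g, context) for g in ground_truths] for p in predictions]
    # A prediction is correct if it matches any ground-truth item.
    matched_predictions = sum(1 for row in verdicts if any(row))
    # A ground-truth item is covered if any prediction matches it.
    covered_truths = sum(
        1 for col in range(len(ground_truths)) if any(row[col] for row in verdicts)
    )
    precision = matched_predictions / len(predictions) if predictions else 0.0
    recall = covered_truths / len(ground_truths) if ground_truths else 0.0
    return precision, recall


def make_stub_judge(matching_pairs: set[tuple[str, str]]) -> Judge:
    """Build a fake judge from a fixed set of pairs; stands in for an LLM call."""
    def judge(prediction: str, ground_truth: str, context: str) -> bool:
        return (prediction, ground_truth) in matching_pairs
    return judge
```

Running this once per category (Support Engineer Actions, Customer Actions, Probing Questions) would yield a separate precision and recall pair for each. Swapping the stub for a real judge only requires a callable with the same three-argument shape.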
The caveat here: who judges the judges? How do we capture the essence of a match when asking an LLM to judge? A single judge may not work for every instance, product or company. We need to understand the current state of a ticket, the nature of the product, and the domain the company operates in to identify an optimal set of judges.
Challenge #5: Do the numbers tell us the story we are looking for?
When we run our evaluation:
```
Overall SRE Actions Precision: 0.742
Overall SRE Actions Recall: 0.681
Overall Probing Questions Precision: 0.658
Overall Probing Questions Recall: 0.591
```
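These numbers tell us that our predictions get about 3 out of 4 suggested actions and about 2 out of 3 suggested probing questions right. They also miss about 1 in 3 of the actions engineers eventually take and about 2 in 5 of the questions they eventually ask.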
Is this good enough?
- For a busy support engineer – having 2-3 relevant suggestions immediately available might save 20 minutes of thinking through initial diagnostic steps.
- For a new engineer – this could be invaluable guidance on where to start.
The Path Forward
Balancing the Scales – We’re constantly tuning this precision-recall balance. The answer isn’t to maximize both at once (in practice, pushing one up usually pulls the other down). The answer is to understand the cost of each type of error:
- False positives (low precision): Suggesting irrelevant actions wastes engineer time and attention
- False negatives (low recall): Missing critical actions delays resolution and frustrates customers
Different contexts might demand different balances. For critical P1 incidents? Maybe we want higher recall: suggest more possibilities, don’t miss anything. For routine cases? Higher precision might be better: just surface the most likely next step.
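One standard way to encode such a preference is the F-beta score, which folds precision and recall into a single number and weights recall more heavily when beta is above 1 and precision more heavily when beta is below 1. This post does not say that F-beta is part of the framework, so treat the sketch below as an illustration only. Configuration A uses the Support Engineer Actions numbers reported above, and configuration B is a hypothetical, more recall-heavy alternative invented for the comparison.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall (beta > 1 favors recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


config_a = (0.742, 0.681)  # precision, recall as reported for SRE Actions
config_b = (0.60, 0.85)    # hypothetical recall-heavy configuration

# P1 incidents: assume recall matters more, so use beta = 2.
p1_prefers_b = f_beta(*config_b, beta=2.0) > f_beta(*config_a, beta=2.0)

# Routine cases: assume precision matters more, so use beta = 0.5.
routine_prefers_a = f_beta(*config_a, beta=0.5) > f_beta(*config_b, beta=0.5)
```

With these numbers both flags come out `True`: the recall-heavy configuration scores higher under beta = 2, and the reported configuration scores higher under beta = 0.5. The choice of beta is itself a judgment about the relative cost of the two error types.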
Our Learnings
This evaluation framework isn’t just about measuring model performance. It’s about understanding the nature of technical support itself.
Every time we analyze a batch of cases, we learn something new about how to evolve the evaluation we have in place so that it better serves the entire support process. We discover patterns in what works, what doesn’t, and why.
Conclusion: The Quest for “Right Enough”
As we continue to refine our evaluation methods, we’re not chasing perfection. We’re chasing usefulness. We’re asking:
- “Do these precision-recall numbers translate into more meaningful suggestions for the Support Team?”
- “Does this capture the essential next steps without drowning users in noise?”
The precision-recall tradeoff forces us to think critically about what matters, to understand the costs of different types of errors, and to build systems that are genuinely helpful in the messy, complex reality of technical support.
And that’s a story worth telling.
