Photo by Dhaya Eddine Bentaleb on Unsplash
When we think about why some enterprise AI projects never make it to production, it’s easy to point fingers at the tech: not accurate enough, too unpredictable, doesn’t “understand” what we want. But according to a recent deep dive by Databricks, the real problem isn’t the intelligence of the model. It’s figuring out how we judge whether that model is doing a good job in the first place.
And surprise… it’s not just a technical problem. It’s a people problem.
Let me explain.
What Exactly Is an AI Judge?
Picture this: You’re testing your AI system and need to know if it’s giving you good answers. Enter the “AI judge”—an AI system whose job is to evaluate the output of another AI. Yep, it’s AI judging AI.
Databricks calls their approach Judge Builder—part of their Agent Bricks toolkit—and it’s not just another tool. It’s a full framework that helps businesses build these judges to evaluate their own AI systems accurately.
The tricky bit? Teaching a judge what “good” looks like isn’t simple. And it’s not just about writing code—it’s about getting teams to align on what quality even means.
Photo by Brett Jordan on Unsplash
The Real Bottleneck: Asking the Right Human Questions
“We’re not blocked by how smart the models are,” Jonathan Frankle, Chief AI Scientist at Databricks, said. “The real challenge is getting models to behave the way we want—and knowing when they actually did.”
It turns out, defining quality is messy. What one team member sees as a great customer service response might sound “too blunt” to another. Even subject-matter experts often disagree.
Databricks has learned this firsthand while helping customers build good judges. Through deployments and direct user feedback, they’ve discovered three major people-focused challenges:
- Getting alignment on what quality really means
- Extracting expert knowledge from domain specialists
- Scaling evaluations reliably across production systems
To tackle this, they now guide teams through collaborative workshops where everyone gets on the same page before building anything. It’s structured, practical, and surprisingly quick to get results.
The Ouroboros Problem: Can AI Judge AI?
Here comes the brain twister. Since we’re using AI to judge AI, how do we know that the judge is good?
Pallavi Koppol, the Databricks research scientist behind Judge Builder, calls it the “Ouroboros problem”—like the snake eating its own tail. You build an AI system to evaluate your AI system, which raises a new question: who’s evaluating the evaluator?
Their approach to solving this is clever yet grounded. Judges are trained to mirror how real human experts would rate outputs. The closer the AI judge’s score is to the human expert’s judgment, the better the judge.
So the judge isn’t just working through a generic checklist; it’s capturing real-world nuance grounded in actual expertise.
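To make that concrete, here’s a minimal sketch in Python of what “mirroring the experts” boils down to: score the judge by how often its verdicts match the expert’s. The data and the toy judge below are made up for illustration; a real setup would swap in an actual model call.

```python
# A minimal sketch of the "mirror the experts" idea: measure how often a
# judge lands on the same verdict as a human expert on the same outputs.
from typing import Callable

def agreement_rate(examples: list[dict], judge_verdict: Callable[[str], int]) -> float:
    """Fraction of examples where the judge's rating matches the expert's."""
    matches = sum(
        1 for ex in examples
        if judge_verdict(ex["model_output"]) == ex["expert_rating"]
    )
    return matches / len(examples)

# Hand-labeled examples (ratings on a 1-5 scale), purely illustrative:
examples = [
    {"model_output": "Refund issued, ticket closed.", "expert_rating": 4},
    {"model_output": "Please contact support.",       "expert_rating": 2},
]

def toy_judge(output: str) -> int:
    # Stand-in for a real LLM judge call.
    return 4 if "issued" in output.lower() else 2

print(f"Judge/expert agreement: {agreement_rate(examples, toy_judge):.0%}")
```

The metric can be fancier (correlation, chance-corrected agreement), but the principle is the same: the expert labels are the ground truth the judge is held to.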
Photo by Jagjeet Singh on Unsplash
Three Lessons from the Field
Over time, Databricks has uncovered some hard truths from their work with enterprise customers. If you’re building AI judges (or thinking about it), these might sound familiar:
1. Your Experts Don’t Always Agree
Even within the same company, experts interpret quality differently. One might rate a piece of text a ‘1’, another a ‘5’, and another ‘neutral’. The fix? Small-scale group annotations and agreement checks help surface disagreements early—and clean them up before they pollute training data.
They found that with this method, teams could get inter-rater alignment scores of 0.6 (compared to 0.3 from outside services). That kind of consistency makes a massive difference in building better AI judges.
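Databricks doesn’t spell out exactly how that alignment number is computed, so here’s a rough sketch, assuming something like plain Cohen’s kappa between two annotators, of how you might check agreement among your own experts before any judge gets trained.

```python
# A sketch of chance-corrected agreement (Cohen's kappa) between two raters.
# This is an illustrative stand-in, not necessarily the metric behind the 0.6 figure.
from collections import Counter

def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement if both raters labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())

    return (observed - expected) / (1 - expected)

# Two experts labeling the same ten responses (made-up data):
expert_1 = ["good", "good", "bad", "neutral", "good", "bad", "good", "neutral", "good", "bad"]
expert_2 = ["good", "bad",  "bad", "neutral", "good", "bad", "neutral", "neutral", "good", "good"]

print(f"Cohen's kappa: {cohens_kappa(expert_1, expert_2):.2f}")
```

Running a check like this early is how disagreements get surfaced and resolved before they ever touch the judge’s training data.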
2. Make Things Specific, Not Vague
Don’t try to cram everything into one judge. Instead of asking if something is “relevant, factual, and concise” all at once, make three judges—one for each. This makes it way easier to pinpoint what went wrong when something fails.
One fascinating example? A customer noticed correct answers often cited the top two search results. Boom—new judge created. Simple, data-informed, and easy to run in production.
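Here’s a rough sketch of what that decomposition can look like in practice. The prompt templates and the ask_llm() helper below are hypothetical, not the actual Judge Builder API; the point is one narrow criterion per judge, plus a cheap deterministic check like the “cites the top results” one from the example.

```python
# A sketch of splitting one vague judge into several narrow ones,
# so a failure points directly at the criterion that broke.
JUDGE_PROMPTS = {
    "relevance":   "Does the answer address the user's question? Reply PASS or FAIL.\n\nQuestion: {question}\nAnswer: {answer}",
    "factuality":  "Is every claim in the answer supported by the context? Reply PASS or FAIL.\n\nContext: {context}\nAnswer: {answer}",
    "conciseness": "Is the answer free of filler and repetition? Reply PASS or FAIL.\n\nAnswer: {answer}",
}

def ask_llm(prompt: str) -> str:
    # Placeholder for a call to whatever model backs your judges.
    return "PASS"

def run_judges(question: str, context: str, answer: str, retrieved_ids: list[str]) -> dict:
    verdicts = {
        name: ask_llm(template.format(question=question, context=context, answer=answer))
        for name, template in JUDGE_PROMPTS.items()
    }
    # The data-informed judge from the example: did the answer cite either of
    # the top two retrieved documents? No LLM needed for this one.
    verdicts["cites_top_results"] = "PASS" if any(
        doc_id in answer for doc_id in retrieved_ids[:2]
    ) else "FAIL"
    return verdicts
```

Cheap, deterministic judges like that last one are easy to keep running in production alongside the LLM-based ones.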
3. You Don’t Need Tons of Data
Turns out, you only need about 20–30 carefully chosen edge cases to train a good judge. The key is picking the examples that actually surface disagreements.
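One simple way to find those examples, sketched below with made-up data and field names: rank candidate outputs by how much your experts already disagree on them, and keep the messiest 20–30 for calibration.

```python
# A sketch of selecting high-disagreement edge cases for judge calibration:
# the bigger the spread in expert scores, the more useful the example.
from statistics import pstdev

candidates = [
    {"output": "We can't refund that.",             "expert_scores": [1, 5, 3]},
    {"output": "Refund processed, have a nice day", "expert_scores": [4, 4, 5]},
    {"output": "Policy section 4.2 applies here.",  "expert_scores": [2, 5, 2]},
]

# Sort by score spread and keep the top slice as calibration edge cases.
edge_cases = sorted(candidates, key=lambda c: pstdev(c["expert_scores"]), reverse=True)[:25]

for case in edge_cases:
    print(f"{pstdev(case['expert_scores']):.2f}  {case['output']}")
```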
Some teams got started in under three hours. Not bad for something this foundational.
So What’s the Payoff?
Here’s where things get interesting. Judge Builder isn’t just an internal evaluation tool—it ends up driving real business impact.
Databricks tracks success in three ways:
- Do customers keep using it?
- Do they spend more on AI?
- Are they trying more advanced techniques?
In short: yes, yes, and yes.
One company built a dozen judges after using the workshop once. Others went from hesitant experimentation to full-on seven-figure AI deployments. Some even moved from prompt engineering into reinforcement learning—because now they had solid data to measure improvement.
“If you don’t know whether something got better, why would you invest in making it better?” Frankle said. Simple point. Huge implications.
What Should Enterprises Do Now?
If you’re serious about AI, you probably need to start thinking seriously about your judges. Databricks has a few solid tips:
- Start small but smart: Focus on one regulatory need and one known failure case.
- Work with your experts: Just a few hours of thoughtful annotation can get a judge up and running.
- Keep evolving: As your system changes, your judges should too. Set up regular reviews using production data.
At its heart, a judge isn’t just a scorekeeper. It’s a living reflection of what your team values—and a tool you can use to actually improve your AI systems.
As Frankle put it, “Once you have a judge that represents your human taste in an empirical form, you can use it in 10,000 different ways.”
And that’s where this stops being just a tech story. It’s a story about how we, as people, define quality—and how we teach that to our machines.
Photo by Sam Szuchan on Unsplash