Photo by Dhaya Eddine Bentaleb on Unsplash
When we think about why some enterprise AI projects never make it to production, it’s easy to point fingers at the tech: not accurate enough, too unpredictable, doesn’t “understand” what we want. But according to a recent deep dive by Databricks, the real problem isn’t the intelligence of the model. It’s figuring out how we judge whether that model is doing a good job in the first place.
And surprise… it’s not just a technical problem. It’s a people problem.
Let me explain.
What Exactly Is an AI Judge?
Picture this: You’re testing your AI system and need to know if it’s giving you good answers. Enter the “AI judge”—an AI system whose job is to evaluate the output of another AI. Yep, it’s AI judging AI.
Databricks calls their approach Judge Builder—part of their Agent Bricks toolkit—and it’s not just another tool. It’s a full framework that helps businesses build these judges to evaluate their own AI systems accurately.
The tricky bit? Teaching a judge what “good” looks like isn’t simple. And it’s not just about writing code—it’s about getting teams to align on what quality even means.
Photo by Brett Jordan on Unsplash
The Real Bottleneck: Asking the Right Human Questions
“We’re not blocked by how smart the models are,” Jonathan Frankle, Chief AI Scientist at Databricks, said. “The real challenge is getting models to behave the way we want—and knowing when they actually did.”
It turns out, defining quality is messy. What one team member sees as a great customer service response might sound “too blunt” to another. Even subject-matter experts often disagree.
Databricks has learned this firsthand while helping customers build good judges. Through deployments and direct user feedback, they’ve discovered three major people-focused challenges:
- Getting alignment on what quality really means
- Extracting expert knowledge from domain specialists
- Scaling evaluations reliably across production systems
To tackle this, they now guide teams through collaborative workshops where everyone gets on the same page before building anything. It’s structured, practical, and surprisingly quick to get results.
The Ouroboros Problem: Can AI Judge AI?
Here comes the brain twister. Since we’re using AI to judge AI, how do we know that the judge is good?
Pallavi Koppol, the Databricks research scientist behind Judge Builder, calls it the “Ouroboros problem”—like the snake eating its own tail. You build an AI system to evaluate your AI system, which raises a new question: who’s evaluating the evaluator?
Their approach to solving this is clever yet grounded. Judges are trained to mirror how real human experts would rate outputs. The closer the AI judge’s score is to the human expert’s judgment, the better the judge.
So the judge isn’t just working through a generic checklist; it’s capturing real-world nuance grounded in actual expertise.
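To make that concrete, here’s a minimal sketch in Python of what “mirroring the experts” boils down to: score the judge by how often its verdicts match the expert’s. The data and the toy judge below are made up for illustration; a real setup would swap in an actual model call.

```python
# A minimal sketch of the "mirror the experts" idea: measure how often a
# judge lands on the same verdict as a human expert on the same outputs.
from typing import Callable

def agreement_rate(examples: list[dict], judge_verdict: Callable[[str], int]) -> float:
    """Fraction of examples where the judge's rating matches the expert's."""
    matches = sum(
        1 for ex in examples
        if judge_verdict(ex["model_output"]) == ex["expert_rating"]
    )
    return matches / len(examples)

# Hand-labeled examples (ratings on a 1-5 scale), purely illustrative:
examples = [
    {"model_output": "Refund issued, ticket closed.", "expert_rating": 4},
    {"model_output": "Please contact support.",       "expert_rating": 2},
]

def toy_judge(output: str) -> int:
    # Stand-in for a real LLM judge call.
    return 4 if "issued" in output.lower() else 2

print(f"Judge/expert agreement: {agreement_rate(examples, toy_judge):.0%}")
```

The metric can be fancier (correlation, chance-corrected agreement), but the principle is the same: the expert labels are the ground truth the judge is held to.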
Photo by Jagjeet Singh on Unsplash
Three Lessons from the Field
Over time, Databricks has uncovered some hard truths from their work with enterprise customers. If you’re building AI judges (or thinking about it), these might sound familiar:
1. Your Experts Don’t Always Agree
Even within the same company, experts interpret quality differently. One might rate a piece of text a ‘1’, another a ‘5’, and another ‘neutral’. The fix? Small-scale group annotations and agreement checks help surface disagreements early—and clean them up before they pollute training data.
They found that with this method, teams could get inter-rater alignment scores of 0.6 (compared to 0.3 from outside services). That kind of consistency makes a massive difference in building better AI judges.
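Databricks doesn’t spell out exactly how that alignment number is computed, so here’s a rough sketch, assuming something like plain Cohen’s kappa between two annotators, of how you might check agreement among your own experts before any judge gets trained.

```python
# A sketch of chance-corrected agreement (Cohen's kappa) between two raters.
# This is an illustrative stand-in, not necessarily the metric behind the 0.6 figure.
from collections import Counter

def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement if both raters labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())

    return (observed - expected) / (1 - expected)

# Two experts labeling the same ten responses (made-up data):
expert_1 = ["good", "good", "bad", "neutral", "good", "bad", "good", "neutral", "good", "bad"]
expert_2 = ["good", "bad",  "bad", "neutral", "good", "bad", "neutral", "neutral", "good", "good"]

print(f"Cohen's kappa: {cohens_kappa(expert_1, expert_2):.2f}")
```

Running a check like this early is how disagreements get surfaced and resolved before they ever touch the judge’s training data.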
2. Make Things Specific, Not Vague
Don’t try to cram everything into one judge. Instead of asking if something is “relevant, factual, and concise” all at once, make three judges—one for each. This makes it way easier to pinpoint what went wrong when something fails.
One fascinating example? A customer noticed correct answers often cited the top two search results. Boom—new judge created. Simple, data-informed, and easy to run in production.
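Here’s a rough sketch of what that decomposition can look like in practice. The prompt templates and the ask_llm() helper below are hypothetical, not the actual Judge Builder API; the point is one narrow criterion per judge, plus a cheap deterministic check like the “cites the top results” one from the example.

```python
# A sketch of splitting one vague judge into several narrow ones,
# so a failure points directly at the criterion that broke.
JUDGE_PROMPTS = {
    "relevance":   "Does the answer address the user's question? Reply PASS or FAIL.\n\nQuestion: {question}\nAnswer: {answer}",
    "factuality":  "Is every claim in the answer supported by the context? Reply PASS or FAIL.\n\nContext: {context}\nAnswer: {answer}",
    "conciseness": "Is the answer free of filler and repetition? Reply PASS or FAIL.\n\nAnswer: {answer}",
}

def ask_llm(prompt: str) -> str:
    # Placeholder for a call to whatever model backs your judges.
    return "PASS"

def run_judges(question: str, context: str, answer: str, retrieved_ids: list[str]) -> dict:
    verdicts = {
        name: ask_llm(template.format(question=question, context=context, answer=answer))
        for name, template in JUDGE_PROMPTS.items()
    }
    # The data-informed judge from the example: did the answer cite either of
    # the top two retrieved documents? No LLM needed for this one.
    verdicts["cites_top_results"] = "PASS" if any(
        doc_id in answer for doc_id in retrieved_ids[:2]
    ) else "FAIL"
    return verdicts
```

Cheap, deterministic judges like that last one are easy to keep running in production alongside the LLM-based ones.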
3. You Don’t Need Tons of Data
Turns out, you only need about 20–30 carefully chosen edge cases to train a good judge. The key is picking the examples that actually surface disagreements.
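One simple way to find those examples, sketched below with made-up data and field names: rank candidate outputs by how much your experts already disagree on them, and keep the messiest 20–30 for calibration.

```python
# A sketch of selecting high-disagreement edge cases for judge calibration:
# the bigger the spread in expert scores, the more useful the example.
from statistics import pstdev

candidates = [
    {"output": "We can't refund that.",             "expert_scores": [1, 5, 3]},
    {"output": "Refund processed, have a nice day", "expert_scores": [4, 4, 5]},
    {"output": "Policy section 4.2 applies here.",  "expert_scores": [2, 5, 2]},
]

# Sort by score spread and keep the top slice as calibration edge cases.
edge_cases = sorted(candidates, key=lambda c: pstdev(c["expert_scores"]), reverse=True)[:25]

for case in edge_cases:
    print(f"{pstdev(case['expert_scores']):.2f}  {case['output']}")
```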
Some teams got started in under three hours. Not bad for something this foundational.
So What’s the Payoff?
Here’s where things get interesting. Judge Builder isn’t just an internal evaluation tool—it ends up driving real business impact.
Databricks tracks success in three ways:
- Do customers keep using it?
- Do they spend more on AI?
- Are they trying more advanced techniques?
In short: yes, yes, and yes.
One company built a dozen judges after using the workshop once. Others went from hesitant experimentation to full-on seven-figure AI deployments. Some even moved from prompt engineering into reinforcement learning—because now they had solid data to measure improvement.
“If you don’t know whether something got better, why would you invest in making it better?” Frankle said. Simple point. Huge implications.
What Should Enterprises Do Now?
If you’re serious about AI, you probably need to start thinking seriously about your judges. Databricks has a few solid tips:
- Start small but smart: Focus on one regulatory need and one known failure case.
- Work with your experts: Just a few hours of thoughtful annotation can get a judge up and running.
- Keep evolving: As your system changes, your judges should too. Set up regular reviews using production data.
At its heart, a judge isn’t just a scorekeeper. It’s a living reflection of what your team values—and a tool you can use to actually improve your AI systems.
As Frankle put it, “Once you have a judge that represents your human taste in an empirical form, you can use it in 10,000 different ways.”
And that’s where this stops being just a tech story. It’s a story about how we, as people, define quality—and how we teach that to our machines.
Photo by Sam Szuchan on Unsplash