Terminal-Bench 2.0 Launches with Harbor: A Smarter, Scalable Way to Test AI Agents Inside Containers


A new benchmark and framework set out to make testing AI agents more reliable, reproducible, and ready for the real world

If you’ve ever tried evaluating autonomous AI agents running in terminal environments, you know it can feel like chasing smoke. Some tasks break overnight. Others depend on flaky third-party services. And at scale? It’s a mess.

That’s the problem the team behind Terminal-Bench set out to fix with the release of Terminal-Bench 2.0—and they didn’t stop there. Alongside it, they introduced something new: Harbor, a container-based runtime built to make running and improving agents at cloud scale way more practical.

Let’s break it down.


What Is Terminal-Bench?

Terminal-Bench is a benchmark suite built specifically to test how well autonomous AI agents perform real-world developer-style tasks via the command line.

Think of it as a playground—but also a proving ground—for agents that interact with systems like a human developer might. No fluff, just hard challenges and performance tracking.


Version 1.0 came out in May 2025 and was adopted surprisingly quickly, becoming the go-to standard for testing command-line agents.

But it had its share of flaws.

Some tasks were too loosely defined, or broke because they depended on external services that kept changing. That made it hard to trust the results.


What’s New in Version 2.0?

The new 2.0 release tackles all that head-on:

  • 89 tasks, each carefully validated by both humans and LLMs
  • Clearer task definitions, tested for solvability and realism
  • Unstable tasks, like the old download-youtube one, removed or redesigned
  • Higher difficulty, but cleaner and more consistent data

Co-creator Alex Shaw summed it up well: Even though version 2.0 is intentionally harder, agents are performing about the same as before. That’s a sign the new tasks are just better designed.

The benchmark is already being used in active research around agent reasoning, code generation, and tool use. A full preprint on the design and validation process is currently in the works.


Meet Harbor: The Backend They Wish They Had

When you’re testing AI agents at scale, you need tight infrastructure. Enter Harbor.

Harbor is a new runtime framework launched alongside Terminal-Bench 2.0. It’s designed to do the dirty work of deploying, managing, and evaluating agents at scale across cloud containers.

You can use it to:

  • Run any container-installable agent
  • Scale supervised fine-tuning or reinforcement learning pipelines
  • Create or customize your own benchmarks
  • Fully integrate with Terminal-Bench 2.0

Internally, Harbor was used to run tens of thousands of agent rollouts during the development of the new benchmark. Now it’s available for everyone at harborframework.com, complete with docs, CLI tools, and a public leaderboard.

If you want to submit your own agent, just install Harbor and run:

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
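
For instance, a concrete invocation might look like this; the model identifier, agent name, and output path below are illustrative placeholders rather than official values, so substitute whatever your setup actually uses:

# placeholder model and agent identifiers; swap in your own
harbor run -d terminal-bench@2.0 -m "openai/gpt-5" -a "codex" --n-attempts 5 --jobs-dir ./tb2-runs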

Package your results, send them in… and let the leaderboard show you where you stand.


Who’s Performing Best So Far?

Early leaderboard results are already up, and GPT-5-powered agents are making a strong showing:

Top 5 Agents on Terminal-Bench 2.0 (Success Rate):

  1. Codex CLI (GPT-5) — 49.6%
  2. Codex CLI (GPT-5-Codex) — 44.3%
  3. OpenHands (GPT-5) — 43.8%
  4. Terminus 2 (GPT-5-Codex) — 43.4%
  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

Nobody’s cracked the halfway mark yet. That tells us the benchmark is doing its job—it’s hard, but fair.

And it shows how close the competition is at the edge of agent performance. Different models, different flavors of GPT-5, and even Claude-based agents all cluster near the top.


Why It Matters

As more teams build autonomous agents designed to run in your dev tools or back-end systems, the need for proper evaluation has only grown.

You can’t rely on quick-and-dirty GitHub scripts anymore. You need repeatable, scalable tests. You need real benchmarking backed by a clean, containerized runtime. That’s what Terminal-Bench 2.0 and Harbor aim to provide.

They don’t promise miracles. But they do offer something more important: infrastructure that works, so you can focus on building better agents.

And maybe, just maybe, finally know when they’re actually getting better.

Want to dig in for yourself?
Terminal-Bench 2.0: tbench.ai
Harbor framework: harborframework.com

Keywords: Terminal-Bench 2.0, Harbor, AI agents, container-based runtime, cloud scale, autonomous agents, benchmarking


Read more of our stuff here!
