Chronosphere’s AI Doesn’t Just Spot Outages — It Explains Them, Step by Step

For engineers drowning in telemetry and frustrated by hidden bugs, Chronosphere’s latest move might feel like a breath of fresh (and actually useful) air.

The New York-based startup, now valued at $1.6 billion, just rolled out something it calls AI-Guided Troubleshooting — a suite of tools built to help devs figure out not just what’s broken in their systems, but why. And maybe even how to fix it.

I know, we’ve all heard about yet another “AI-powered” feature before. But this one takes a different approach.


So, what does Chronosphere actually do?

Chronosphere builds observability tools — stuff that helps companies monitor everything happening under the hood of their cloud applications. Think crash reports, service dependencies, and an ocean of logs.

But while writing code is getting faster thanks to AI (a 13.5% increase in weekly code commits, according to MIT and UPenn), debugging? Not so much. It’s still a time-consuming, manual slog. That’s where Chronosphere is stepping in.

Their new AI-Guided Troubleshooting features aim to speed up failure diagnosis without guessing or going behind engineers’ backs.

Here’s what it offers:

  • Suggestions – A ranked list of possible root causes, backed by actual data.
  • Temporal Knowledge Graph – A live, time-aware model of the entire system, tracking what’s changed and when.
  • Investigation Notebooks – Step-by-step logs of every troubleshooting move, automatically created.
  • Natural Language Querying – Ask system questions in plain language.

But here’s the key: the AI doesn’t make decisions for you. It shows its work, leaves a trail of evidence, and lets you — the human engineer — decide what to do next.


Why transparency matters now more than ever

A lot of AI observability tools today lean on pattern-matching or showing you a dashboard of anomalies. But they can fall apart during real incidents.

Chronosphere’s method dives deeper. Its Temporal Knowledge Graph doesn’t just draw maps of which services depend on what — it tracks how those relationships change over time. It ties changes (like feature flag updates or deployments) directly to real system behavior.
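To make the "time-aware" idea concrete, here is a minimal Python sketch of a dependency graph whose edges carry validity intervals, with change events stored alongside them. Everything in it (the class names, fields, timestamps, and the "legacy-billing" service) is an assumption made up for illustration. It is not Chronosphere's data model, just one simple way such a structure could look.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dependency:
    """A directed edge 'source calls target', valid for a time interval."""
    source: str
    target: str
    valid_from: int                 # hypothetical timestamp, in seconds
    valid_to: Optional[int] = None  # None means the edge still exists


@dataclass(frozen=True)
class ChangeEvent:
    """A deployment, feature flag update, or similar change to a service."""
    service: str
    kind: str
    at: int


class TemporalGraph:
    def __init__(self) -> None:
        self.edges: list[Dependency] = []
        self.changes: list[ChangeEvent] = []

    def dependencies_at(self, service: str, t: int) -> list[str]:
        """Return the services that `service` depended on at time t."""
        return sorted(
            e.target
            for e in self.edges
            if e.source == service
            and e.valid_from <= t
            and (e.valid_to is None or t < e.valid_to)
        )

    def changes_between(self, service: str, start: int, end: int) -> list[ChangeEvent]:
        """Return changes to `service` with start <= at <= end, oldest first."""
        return sorted(
            (c for c in self.changes if c.service == service and start <= c.at <= end),
            key=lambda c: c.at,
        )


graph = TemporalGraph()
graph.edges.append(Dependency("checkout", "payment", valid_from=0))
graph.edges.append(Dependency("checkout", "legacy-billing", valid_from=0, valid_to=100))
graph.changes.append(ChangeEvent("payment", "feature_flag_update", at=180))

print(graph.dependencies_at("checkout", 50))    # both edges are live at t=50
print(graph.dependencies_at("checkout", 200))   # the legacy-billing edge has expired
print(graph.changes_between("payment", 150, 200))
```

The point of the interval on each edge is that "what did checkout depend on?" becomes a question about a specific moment, so an investigation can look at the system as it was when the errors started rather than as it is now.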

Here’s an example shared by Martin Mao, one of Chronosphere’s founders:

Imagine a checkout system starts throwing errors. Chronosphere surfaces a suggestion: the payment service had a memory spike right after a flag update. The engineer clicks in, confirms the sequence, and decides to roll back that flag. All that detective work — the clues, the logic trail, the impact — gets saved in an Investigation Notebook.

No need to jump between five tools and scroll through ancient Slack threads to remember what happened.
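As a rough illustration of that flow, the Python sketch below ranks recent changes by how shortly they preceded the start of an anomaly, then writes each suggestion and the engineer's decision to an append-only notebook. The time-proximity score is a deliberately naive stand-in chosen for this example; the article does not say how Chronosphere actually ranks suggestions, and all names and timestamps here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    cause: str
    evidence: str
    score: float


@dataclass
class Notebook:
    """Append-only record of investigation steps (a hypothetical stand-in)."""
    steps: list[str] = field(default_factory=list)

    def record(self, step: str) -> None:
        self.steps.append(step)


def rank_suggestions(changes, anomaly_start, window=600):
    """Score each change by how shortly it preceded the anomaly.

    `changes` is a list of (description, timestamp) pairs. Changes that happen
    after the anomaly, or more than `window` seconds before it, are ignored.
    """
    suggestions = []
    for description, at in changes:
        gap = anomaly_start - at
        if 0 <= gap <= window:
            score = 1.0 - gap / window  # closer in time gives a higher score
            evidence = f"occurred {gap}s before errors began"
            suggestions.append(Suggestion(description, evidence, score))
    return sorted(suggestions, key=lambda s: s.score, reverse=True)


changes = [
    ("payment: feature flag update", 940),
    ("checkout: deployment", 500),
    ("search: config change", 1200),  # after the anomaly, so excluded
]

notebook = Notebook()
ranked = rank_suggestions(changes, anomaly_start=1000)
for s in ranked:
    notebook.record(f"suggestion: {s.cause} ({s.evidence}, score={s.score:.2f})")

# The code only records the decision; the human engineer makes it.
notebook.record(f"decision: engineer rolls back '{ranked[0].cause}'")

for step in notebook.steps:
    print(step)
```

Note that nothing here acts on the system. The sketch surfaces candidates with their evidence and keeps a trail, which mirrors the "show your work, let the engineer decide" approach described above.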


Isn’t this a crowded space?

It absolutely is. Chronosphere is going up against big names like Datadog, Dynatrace, and Splunk. These companies already offer platforms that claim to give a single-pane view of everything.

But Mao argues those platforms don’t go deep enough — especially for cloud-native systems running custom code. He says most of them rely on standardized telemetry (like Kubernetes metrics), which means they miss unusual but important app-specific signals.

That gap leads to what Mao calls “confident-but-wrong guidance” — essentially, tools that sound helpful but send engineers on a wild goose chase.

Chronosphere’s competitive edge? It’s tailoring its analysis to every system’s specific quirks and showing its work along the way. That’s earning trust among technical teams.


Customers are seeing real impact

Chronosphere says its approach isn’t just about clarity — it’s also helping teams cut observability costs. According to the company, customers are:

  • Reducing observability data volumes by up to 84%
  • Cutting critical incidents by up to 75%
  • Improving Mean Time to Detection by as much as 4x

Some real-world results:

  • Robinhood reports a 5x improvement in reliability.
  • DoorDash used it to standardize monitoring practices.
  • Affirm handled 10x load during Black Friday without issues.
  • Astronomer slashed its costs by more than 85%.

That cost story is especially important right now, with companies throwing mountains of money at observability tools — much of it just to store logs that are never even queried.


It’s not trying to do everything — on purpose

Instead of building a giant all-in-one platform, Chronosphere has taken a more modular approach. Recently, they announced integrations with five specialist vendors:

  • Arize for LLM monitoring
  • Embrace for real user monitoring
  • Polar Signals for continuous profiling
  • Checkly for synthetic monitoring
  • Rootly for incident management

Why not just build it all in-house? Mao says big enterprises want best-in-class tools in each category, not one-size-fits-all.

Right now, customers often handle contracts separately with each vendor. But Chronosphere plans to simplify this with unified agreements in the future, making the ecosystem easier to manage.


The backstory: born from Halloween outages at Uber

Chronosphere’s roots go back to 2019, when founders Martin Mao and Rob Skillington left Uber. They’d built the ride-hailing company’s internal observability platform after its earlier tooling developed a bad habit of failing on Halloween and New Year’s Eve, Uber’s busiest nights.

Their solution worked so well that, once the industry started moving toward Kubernetes and cloud-native architectures, they realized other companies would face the exact same observability headaches. So they built Chronosphere.

To date, the company has raised over $343 million, and its clients now include tech heavyweights like DoorDash, Snap, Zillow, Robinhood, and Affirm.


What’s available now — and what’s coming

Chronosphere’s AI-Guided Troubleshooting, including Suggestions and Investigation Notebooks, is now in limited availability for select customers. Full rollout is planned for 2026.

One feature you can use immediately if you’re a Chronosphere customer is its Model Context Protocol (MCP) Server, which plugs directly into AI-enabled development environments. That means you can start querying observability data inside the tools you’re already working in.
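For a sense of what happens under the hood, MCP clients talk to servers using JSON-RPC 2.0 messages, and invoking a server-side tool uses the `tools/call` method. The Python sketch below only builds such a request. The tool name `query_metrics` and its arguments are hypothetical placeholders, since the article doesn't list the tools Chronosphere's server exposes, and in practice your IDE or assistant would construct and send this message for you.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical tool name and arguments; a real server's tools may differ.
payload = build_tool_call(
    1,
    "query_metrics",
    {"query": "error rate for checkout over the last hour"},
)
print(payload)
```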

Bottom line?

Chronosphere isn’t just trying to surface more data. It’s trying to make it understandable. Verifiable. Actionable.

In a world where AI is often expected to be a magical fix, this startup is betting on something simpler: show your reasoning, stick to the facts, and let engineers stay in control.

It’s not flashy. But it just might be what teams actually need.

