AI SRE Landscape

Companies building AI SRE agents, organized by approach.

Agentic AI SRE platforms

Standalone AI agents built specifically to investigate and act on production issues across multiple tools and systems.

Cleric - Investigates daily alerts and production incidents across observability, code, infrastructure, and communication tools. Verifies its own diagnoses against post-resolution outcomes to self-improve. Founded by reinforcement learning and platform engineers.
Deductive - Investigates production incidents using code-aware reasoning across codebases, telemetry, and internal documentation. Multi-agent architecture for hypothesis generation and testing.
Resolve AI - General-purpose operational AI platform spanning incident investigation, remediation, finops, and architecture recommendations. Founded by former Splunk executives.
Traversal - Applies causal machine learning to model how failures propagate through distributed systems. Designed for complex cascading incidents where the root cause is multiple hops upstream from the symptom. Founded by academics from Columbia and Cornell.

Incident management platforms with AI

Platforms that started as incident coordination tools and added AI investigation on top.

FireHydrant - AI for incident summarization, meeting transcription, and retrospective generation. Acquired by Freshworks in December 2025.
incident.io - Slack-native incident management with an AI agent that searches code changes, logs, metrics, traces, and historical incidents to surface root causes.
PagerDuty - Added AI agents for triage, diagnostics, and remediation to its alerting and on-call platform. Its AI SRE Agent acts as a virtual responder.
Rootly - Slack/Teams-native incident management with AI investigation, root cause suggestions, and automated retrospective generation.

Observability platforms with AI

Large observability vendors adding AI investigation with direct access to the telemetry they already collect.

Datadog (Bits AI SRE) - Autonomous agent embedded in Datadog that investigates alerts across metrics, logs, traces, RUM, and profiling. Strong at analyzing the telemetry Datadog already collects, but limited to that dataset: investigation context that lives outside Datadog (Slack conversations, deployment tools, code changes) is out of scope. Supports remediation via Datadog Workflows.
New Relic (SRE Agent) - AI agent within New Relic’s observability platform that diagnoses incidents using topology-aware root cause analysis. Combines service dependency graphs with probabilistic ranking to narrow down root causes. Integrates with Slack and Zoom for triage.
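
To make "topology-aware root cause analysis" concrete, here is a minimal sketch of one way to combine a service dependency graph with a probabilistic ranking. It is an illustrative heuristic, not any vendor's actual algorithm. The function name `rank_root_causes`, the anomaly scores, and the example services are all assumptions made for this sketch. The idea is that a service is a more likely root cause when it is anomalous and its own dependencies are healthy.

```python
from collections import deque


def rank_root_causes(depends_on, anomaly, symptom):
    """Rank services reachable from `symptom` as root-cause candidates.

    depends_on: dict mapping a service to the services it calls (hypothetical data).
    anomaly:    dict mapping a service to an anomaly score in [0, 1] (hypothetical data).
    Returns (service, probability) pairs, most likely first.
    """
    # Breadth-first walk from the symptomatic service over dependency edges.
    reachable = {symptom}
    queue = deque([symptom])
    while queue:
        service = queue.popleft()
        for dep in depends_on.get(service, []):
            if dep not in reachable:
                reachable.add(dep)
                queue.append(dep)

    # Heuristic score: the service's own anomaly, discounted by every
    # anomalous dependency that could explain the anomaly away.
    raw = {}
    for service in reachable:
        score = anomaly.get(service, 0.0)
        for dep in depends_on.get(service, []):
            score *= 1.0 - anomaly.get(dep, 0.0)
        raw[service] = score

    total = sum(raw.values())
    if total == 0:
        return []
    # Normalize so the scores can be read as a rough probability ranking.
    ranked = [(service, score / total) for service, score in raw.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Example topology: checkout is alerting, but the database two hops away is the culprit.
depends_on = {"checkout": ["payments", "cart"], "payments": ["db"], "cart": ["db"], "db": []}
anomaly = {"checkout": 0.9, "payments": 0.8, "cart": 0.1, "db": 0.95}

for service, probability in rank_root_causes(depends_on, anomaly, "checkout"):
    print(f"{service}: {probability:.2f}")
```

In this example the database ranks first even though the alert fired on checkout, because nothing downstream of the database is anomalous enough to explain its behavior. A real system would also need to handle noisy scores, missing topology edges, and time ordering of anomalies.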

General-purpose agents (DIY)

General-purpose AI agents (Claude Code, Cursor, or custom builds) connected to observability and infrastructure APIs. Often the fastest way to get something running: a working prototype can come together in days, and for well-documented systems with contained architectures, these agents deliver real value on investigation tasks. They are best suited to environments where the problem space is well scoped. Without broader system understanding or a way to verify results, accuracy tends to plateau for most teams regardless of the underlying model.
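
For teams going the DIY route, one simple way to "verify results" is to compare each agent diagnosis with the cause confirmed after the incident was resolved. The sketch below assumes diagnoses and confirmed causes are recorded as short labels. The `Investigation` record and `diagnosis_accuracy` function are hypothetical names for illustration, not part of any tool mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Investigation:
    incident_id: str
    agent_diagnosis: str            # label the agent proposed (hypothetical schema)
    confirmed_cause: Optional[str]  # label from the postmortem, None if not yet confirmed


def diagnosis_accuracy(investigations):
    """Return the share of confirmed incidents where the agent's label matched.

    Incidents without a confirmed cause are skipped. Returns None when
    there is nothing to score yet.
    """
    scored = [inv for inv in investigations if inv.confirmed_cause is not None]
    if not scored:
        return None
    correct = sum(
        1
        for inv in scored
        if inv.agent_diagnosis.strip().lower() == inv.confirmed_cause.strip().lower()
    )
    return correct / len(scored)


history = [
    Investigation("inc-1", "db connection pool exhausted", "db connection pool exhausted"),
    Investigation("inc-2", "bad deploy", "expired TLS certificate"),
    Investigation("inc-3", "dns failure", None),  # still open, not scored
]
print(diagnosis_accuracy(history))
```

Exact label matching is deliberately crude. In practice a team would likely need a shared taxonomy of causes or a human review step, but even a rough score tracked over time shows whether accuracy is improving or has plateaued.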

For background on the category, see "What is AI SRE?"