AI SRE

Articles, talks, and writing about AI agents in production.
https://aisre.com/en-us

Fixing Claude with Claude: Anthropic reports on AI SRE
https://www.theregister.com/2026/03/19/anthropic_claude_sre/
When Claude produces a postmortem report, it delivers 'an 80 percent story that's pretty, readable and convincing.' Anthropic's experience using Claude for SRE internally.
Thu, 19 Mar 2026 00:00:00 GMT

Your Data is Made Powerful By Context
https://charity.wtf/2026/03/09/your-data-is-made-powerful-by-context-so-stop-destroying-it-already-xpost/
Agentic workflows will make automated validation techniques easier and more widely used. Context is the key to making data useful.
Mon, 09 Mar 2026 00:00:00 GMT

We Automated Everything Except Knowing What's Going On
https://eversole.dev/blog/we-automated-everything/
AI collapsed the cost of building software but not the cost of understanding it. When AI agents outnumber engineers 50-to-1, the gap between deployment speed and comprehension becomes dangerous.
Mon, 02 Mar 2026 00:00:00 GMT

Everyone is a junior engineer in the age of AI
https://thenewstack.io/hightower-ai-open-source-kubecon/
Kelsey Hightower on AI, open source sustainability, and career resilience for engineers. KubeCon Europe 2026 keynote coverage.
Sun, 01 Mar 2026 00:00:00 GMT

Why Your On-Call Engineer Is Your Most Expensive Bottleneck
https://medium.com/@pranavkumarshil/why-your-on-call-engineer-is-your-most-expensive-bottleneck-4bb00dff32eb
AI adoption correlates positively with throughput yet negatively with stability.
Organizations are accelerating into more failures, not fewer.
Sun, 01 Mar 2026 00:00:00 GMT

Practical Considerations for AI Incident Reviews
https://fgj.codes/posts/ai-incident-reviews/
LLM-generated incident reviews often fail due to poor input data and misunderstanding of why reviews matter. Incident reviews are fundamentally a socio-technical process, and AI should enhance analyst capacity rather than replace human engagement.
Sun, 01 Mar 2026 00:00:00 GMT

The Picture They Paint of You
https://ferd.ca/the-picture-they-paint-of-you.html
AI coding assistants are framed as augmenting engineers. AI SRE tools are framed as replacing them. The disparity reveals how organizations actually value reliability work.
Mon, 23 Feb 2026 00:00:00 GMT

Building An Elite AI Engineering Culture In 2026
https://cjroth.com/blog/2026-02-18-building-an-elite-engineering-culture
AI-augmented teams merged 98% more PRs but saw 91% longer review times. Senior engineers get 5x the productivity gains of juniors.
Wed, 18 Feb 2026 00:00:00 GMT

Lots of AI SRE, no AI incident management
https://surfingcomplexity.blog/2026/02/14/lots-of-ai-sre-no-ai-incident-management/
AI SRE tools excel at diagnosis and mitigation but lack coordination capabilities.
Individual AI agents suffer from fixation bias and can't maintain the common ground that human teams build during incidents.
Sat, 14 Feb 2026 00:00:00 GMT

Are bugs and incidents inevitable with AI coding agents?
https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/
Analysis of 470 codebases shows AI-generated code produces 1.7x more bugs than human-written code, with 75% more logic errors and 8x more excessive I/O operations.
Wed, 28 Jan 2026 00:00:00 GMT

Bring Back Ops Pride
https://charity.wtf/2026/01/19/bring-back-ops-pride-xpost/
Operations teams got renamed to DevOps, SRE, infrastructure, production engineering, platform engineering. The identity crisis of the people who run production.
Mon, 19 Jan 2026 00:00:00 GMT

Software engineering when the machine writes the code
https://www.shayon.dev/post/2026/19/software-engineering-when-the-machine-writes-code/
When production breaks at 2 AM, developers are reverse-engineering code they didn't write. AI coding erodes the mental models engineers need to debug complex systems under pressure.
Mon, 19 Jan 2026 00:00:00 GMT

How we built an AI SRE agent that investigates like a team of engineers
https://www.datadoghq.com/blog/building-bits-ai-sre/
Datadog's Bits AI SRE uses hypothesis-driven investigation: forming hypotheses, testing them against telemetry, and recursively investigating multi-component issues.
Early versions drowned in information overload; refined design prioritizes causal connections.
Mon, 12 Jan 2026 00:00:00 GMT

Software Acceleration and Desynchronization
https://ferd.ca/software-acceleration-and-desynchronization.html
On the speed mismatch when software ships faster than teams can absorb the consequences.
Mon, 05 Jan 2026 00:00:00 GMT

Your AI SRE needs better observability, not bigger models
https://clickhouse.com/blog/ai-sre-observability-architecture
An AI agent enters a Chain of Thought loop, firing up to 27 queries in a short time period to map dependencies, check outliers, and validate.
Thu, 01 Jan 2026 00:00:00 GMT

Building internal agents
https://lethain.com/agents-series/
Series on building internal agents. Forward-looking on what changes if AI-enhanced techniques continue to improve.
Thu, 01 Jan 2026 00:00:00 GMT

Human-Centred AI for SRE: Multi-Agent Incident Response without Full Automation
https://www.infoq.com/news/2026/01/opsworker-ai-sre/
A thoughtful, detailed methodology for teams looking to integrate AI agents into their incident workflows while keeping humans in the loop.
Thu, 01 Jan 2026 00:00:00 GMT

Tribal Knowledge Kills On-Call
https://medium.com/@a_pomorska/tribal-knowledge-kills-on-call-574863bf3eaf
Tribal knowledge is a single point of failure because it centralizes critical context in humans.
Humans are not reliable infrastructure.
Thu, 01 Jan 2026 00:00:00 GMT

End-of-Year Observability Retrospective with Charity Majors
https://horovits.medium.com/end-of-year-observability-retrospective-with-charity-majors-94f80fff77e8
Observability for AI Workloads: lessons from 2025 and insights for building observable AI systems in production.
Mon, 01 Dec 2025 00:00:00 GMT

Facilitating AI adoption at Imprint
https://lethain.com/company-ai-adoption/
Real practitioner experience with LLM-tooling and agent adoption. The formula is deep partnership, not 'build a platform and they will come.'
Mon, 01 Dec 2025 00:00:00 GMT

AI and the Ironies of Automation
https://ufried.com/blog/ironies_of_ai_1/
Applies Bainbridge's 1983 automation ironies to modern AI. When AI handles incident response, operators lose the skills needed to intervene when AI fails. Future engineers who never built manual expertise cannot oversee AI systems.
Fri, 21 Nov 2025 00:00:00 GMT

Notes from the 2025 'AI Agents in Production' Conference
https://markptorres.com/ai_workflows/2025-11-18-ai-agents-in-production-conference-notes
Practitioner notes from MLOps Community conference. Error recovery, context engineering, metrics that predict trust and task completion.
Tue, 18 Nov 2025 00:00:00 GMT

From 4 Hours to 8 Minutes with AI Agents That Transform SRE
https://www.usenix.org/conference/srecon25emea/presentation/jausovec
Building three core agents that form a modern reliability engineering backbone.
SREcon25 EMEA talk.
Wed, 01 Oct 2025 00:00:00 GMT

Ongoing Tradeoffs, and Incidents as Landmarks
https://ferd.ca/ongoing-tradeoffs-and-incidents-as-landmarks.html
Incidents as navigation points for understanding systems. How incidents shape operational knowledge.
Sat, 20 Sep 2025 00:00:00 GMT

The Future of AI in SRE: Preventing Failures, Not Fixing Them
https://thenewstack.io/the-future-of-ai-in-sre-preventing-failures-not-fixing-them/
The shift from reactive to preventive AI in site reliability engineering.
Sun, 01 Jun 2025 00:00:00 GMT

The naked truth about AI-assisted coding
https://krasimirtsonev.com/blog/article/the-naked-truth-about-ai-assisted-coding
AI tools optimize for speed of code production. But the hard problems in software have never been about producing code fast enough.
Sun, 01 Jun 2025 00:00:00 GMT

On-Call Is Ruining My Life and Other Tales
https://www.youtube.com/watch?v=NWcXm9wnH-U
SREcon25 Americas talk on the reality of on-call life for engineering teams.
Sun, 01 Jun 2025 00:00:00 GMT

Stop Building AI Tools Backwards
https://hazelweakly.me/blog/stop-building-ai-tools-backwards/
A critique of how the industry is approaching AI tooling for infrastructure and operations.
Sun, 01 Jun 2025 00:00:00 GMT

Another observability 3.0 appears on the horizon
https://charity.wtf/2025/03/24/another-observability-3-0-appears-on-the-horizon/
Response to Matt Klein's observability 3.0 post.
Forward-looking on where observability is headed.
Mon, 24 Mar 2025 00:00:00 GMT

What Progress In Learning From Incidents Actually Looks Like
https://www.adaptivecapacitylabs.com/2025/02/28/what-progress-in-learning-from-incidents-actually-looks-like/
Keynote from the first Learning From Incidents conference. What real progress looks like in organizational learning from failure.
Fri, 28 Feb 2025 00:00:00 GMT

Observability: the present and future, with Charity Majors
https://newsletter.pragmaticengineer.com/p/observability-the-present-and-future
Deep interview on the Pragmatic Engineer newsletter about the future of observability.
Wed, 01 Jan 2025 00:00:00 GMT

LLMs won't save us
https://blog.relyabilit.ie/llms-wont-save-us/
The AI wave is passing over SRE/DevOps tooling. What of genuine value will be left behind? Skeptical, grounded perspective from a Google SRE book co-author.
Thu, 12 Dec 2024 00:00:00 GMT

Learning from Major Incidents: The Opportunities We're Missing
https://www.pagerduty.com/blog/incident-management-response/learning-from-major-incidents-the-opportunities-were-missing/
Post-incident analysis could be more than a tool for SREs. It could be a way to understand how organizations actually operate.
Mon, 22 Jul 2024 00:00:00 GMT

Generative AI is not going to build your engineering team for you
https://charity.wtf/2024/06/10/generative-ai-is-not-going-to-build-your-engineering-team-for-you/
AI code generation doesn't solve the production operations problem.
It has far more to do with your ability to understand, maintain, and manage software in production over time.
Mon, 10 Jun 2024 00:00:00 GMT

Alert on symptoms, not causes
https://varoa.net/2024/03/06/alert-on-symptoms-not-causes.html
Practitioner essay on alerting philosophy. 'I aspire to make operational toil so small that on-call feels like a free bonus.'
Wed, 06 Mar 2024 00:00:00 GMT