Fail Small, IaC Control Planes, and Automated RCA

Author: Teller's Tech - DevOps, SRE and Cloud Podcast
January 3, 2026
Duration: 17:45

This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.

We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.
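One way to picture "fail small" is a staged rollout that checks health before widening to the next stage. The sketch below is only an illustration of that idea, not Cloudflare's tooling: `staged_rollout`, `apply_change`, `is_healthy`, and the target names are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Apply a change one stage at a time and stop at the first unhealthy stage.

    Returns the targets that received the change before the rollout stopped.
    All names here are hypothetical placeholders for your own deploy tooling.
    """
    changed: list[str] = []
    for stage in stages:
        for target in stage:
            apply_change(target)
            changed.append(target)
        # Check health before widening the blast radius to the next stage.
        if not all(is_healthy(target) for target in stage):
            break
    return changed


if __name__ == "__main__":
    # Hypothetical targets: one canary first, then progressively larger groups.
    stages = [["canary-1"], ["region-a-1", "region-a-2"], ["region-b-1", "region-b-2"]]
    touched = staged_rollout(
        stages,
        apply_change=lambda target: None,                  # stand-in for a real deploy call
        is_healthy=lambda target: target != "region-a-2",  # simulate one bad host
    )
    print(touched)  # region-b-* never receives the change
```

The point of the design is that a bad change reaches at most one stage beyond the last healthy one, so the problem stays local while someone stops the bleeding.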

Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.

Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.
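As a rough illustration of turning the first few minutes of an investigation into software, the sketch below runs a fixed list of checks and puts the suspicious findings first. It is not DrP's API: the `Finding` type, the two checks, the context keys, and the thresholds (30 minutes, 2x baseline) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    check: str
    suspicious: bool
    detail: str


Check = Callable[[dict], Finding]


def recent_deploy(ctx: dict) -> Finding:
    # Hypothetical check: flag a deploy within the last 30 minutes (assumed threshold).
    minutes = ctx.get("minutes_since_last_deploy")
    hit = minutes is not None and minutes <= 30
    return Finding("recent_deploy", hit, f"last deploy {minutes} min ago")


def error_rate_spike(ctx: dict) -> Finding:
    # Hypothetical check: flag an error rate above 2x baseline (assumed threshold).
    rate = ctx.get("error_rate", 0.0)
    baseline = ctx.get("baseline_error_rate", 0.0)
    hit = baseline > 0 and rate > 2 * baseline
    return Finding("error_rate_spike", hit, f"rate={rate}, baseline={baseline}")


def triage(ctx: dict, checks: list[Check]) -> list[Finding]:
    """Run every check and list suspicious findings first (stable order otherwise)."""
    findings = [check(ctx) for check in checks]
    return sorted(findings, key=lambda finding: not finding.suspicious)


if __name__ == "__main__":
    # In practice this context would come from your deploy and metrics systems.
    ctx = {"minutes_since_last_deploy": 5, "error_rate": 0.01, "baseline_error_rate": 0.01}
    for finding in triage(ctx, [error_rate_spike, recent_deploy]):
        print(finding)
```

Because every responder runs the same checks in the same order, the first pass looks the same no matter who is holding the pager.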

In the lightning round: a follow-up on GitHub Actions direction (and a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repos on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.

We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. Faster systems require better brakes, better observability, and easier rollback.

Links from this episode

SRE Weekly issue 503 (source roundup - Cloudflare) https://sreweekly.com/sre-weekly-issue-503/

Pulumi: all IaC, including Terraform and HCL https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/

Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/

GitHub Actions: “Let’s talk about GitHub Actions” https://github.blog/news-insights/product-news/lets-talk-about-github-actions/

Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

AWS ECR: create repositories on push https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/

DriftHound https://drifthound.io/

Superset https://superset.sh/

More episodes, contact info, and further details on this episode are on our website: https://shipitweekly.fm


For anyone building or running modern systems, the sheer volume of news, tools, and incident reports can be overwhelming. Ship It Weekly cuts through that noise. This isn't a surface-level scan of headlines. Host Brian Teller digs into the latest significant outages, major software releases, and insightful post-mortems, focusing squarely on the practical implications for DevOps, SRE, and platform engineering work. Each episode of the podcast breaks down a couple of key stories, providing the crucial context often missing from tech news. You'll hear analysis that translates events into actionable insights, answering the "so what?" for your own infrastructure and processes. The show also includes a quick rundown of tools or updates actually worth your attention, saving you hours of browsing. The tone is direct and informed, favoring depth over breadth. It’s designed for engineers and technical leaders who need a concise, reliable filter for the week's most relevant developments. Listen to this podcast for a focused recap that prioritizes what actually matters, delivered without fluff. You get the news, plus the necessary interpretation to understand how it might affect your systems, your team, and your on-call rotation. It's a weekly briefing that respects your time while aiming to make you more effective.
Language: English
Episodes: 37

Ship It Weekly - DevOps, SRE, Platform and Cloud Engineering News
Podcast Episodes
GitHub Runner Pricing Pause, Terraform Cloud Limits, and AI in CI

Duration: 12:06
This week on Ship It Weekly, Brian looks at how the “platform tax” is showing up everywhere: pricing model shifts, CI dependencies, and new security boundaries thanks to AI agents. We start with GitHub Actions. GitHub ann…
IBM Buys Confluent, React2Shell, and Netflix on Aurora

Duration: 16:14
In this episode of Ship It Weekly, Brian powers through a cold and digs into a very “infra grown-up” week in DevOps. First up, IBM is buying Confluent for $11B. We talk about what that means if you’re on Confluent Cloud t…
Kubernetes Shake-ups, Platform Reality, and AI-Native SRE

Duration: 15:53
In this episode of Ship It Weekly, Brian digs into 3 big themes for anyone running Kubernetes or building internal platforms. First, Kubernetes is officially retiring Ingress NGINX and moving it into best-effort maintenan…