Case Study

Running an Ops Platform as Code

How hundreds of services across dozens of teams stay owned, monitored, and on-call ready by treating the whole platform as code, so every change to it lands as a reviewed pull request.

The Challenge

HMH runs hundreds of services owned by dozens of teams. At that size, the records that keep operations healthy do not stay healthy on their own. Ownership entries, monitoring coverage, and on-call readiness all drift the moment someone reorganizes a team, ships a new service, or tweaks a schedule.

The only thing I have found that holds up over time is to manage this the way we manage software. Ownership, alerting, and staffing are declared in code and validated in CI. The platform provisions and watches them from a single source of truth. If a service exists, its ownership, its monitors, and its on-call posture are all derived from that source instead of kept by hand and hoped to be correct.

The Approach

The platform I built and run at HMH stands on three legs. Each one turns a thing that used to drift into a thing you can review in a pull request.

Service catalog as code. Hundreds of service definitions live as YAML, one file per service in per-team workspaces. Terraform fans each definition out into the Datadog Software Catalog entry, the PagerDuty service, the on-call schedule integration, and the alert-routing handle. One source of truth keeps Datadog, PagerDuty, and the tracker in sync, so a single reviewed change updates everywhere a service is registered. My most recent headline here was making escalation-policy overrides self-service: I moved them out of infrastructure variables and into validated YAML blocks, so teams change their own on-call behavior through a reviewed PR instead of a ticket queue. I migrated the whole estate over without breaking anyone's on-call.

Staffing health, measured hourly. A pipeline pulls team membership, service ownership, escalation policies, and rendered schedule coverage, then classifies every service on two independent questions: does a real team staff it, and is its on-call actually ready, meaning real humans, no coverage gaps, and around-the-clock first level. The results publish to a filterable status page and push into Datadog scorecards, where ops teams already live. This one started life as a Claude skill I invoked by hand. Once it proved its value, I productionized it into a standalone pipeline on an hourly schedule, which is what made the scorecards possible. The hardest part was honest signals: I merge escalation coverage across overlapping schedules with interval math, so a gap in one schedule slice stops flagging services that are actually covered.

Monitors as code. Golden-signal monitors, SLOs, and error-budget alerts for hundreds of services across four infrastructure types are declared as Terraform and immutable outside it: UI edits get reverted. Thresholds are recommended from months of historical metrics rather than guesses. The repo ships its own Claude Code skills as well: one analyzes history and proposes threshold changes as PRs with supporting data, one onboards a new service by discovering its signals and generating the config, and one audits every service quarterly.

The Impact

Hundreds

of services cataloged, monitored, and health-scored

Hourly

staffing and on-call health, where ops teams already look

Self-service

on-call changes by reviewed PR instead of ticket queues

Data-driven

thresholds derived from months of history

When ownership, monitoring, and staffing health live in code, drift becomes a diff. Teams see their own posture in the tools they already use, and the platform team reviews changes instead of making them all.

Lessons Learned

A single source of truth beats synchronization scripts. Provisioning from the catalog, rather than reconciling against it, means there is nothing to drift back into disagreement. The catalog declares reality, and every system downstream is derived from it.

Prototype agentically, productionize deliberately. The staffing-health work earned its pipeline: it ran as a Claude skill I invoked by hand until it had clearly proven its value, and only then did I invest in the standalone hourly version that the scorecards depend on. The prototype told me what was worth building, and building it properly came second.

Trustworthy alerts come from governance plus statistics. Immutable monitors and historically derived thresholds beat hand-tuned ones nobody dares touch. When the numbers come from months of real data and the config cannot be quietly edited away, people act on the alerts.

Cole Conrad

Cole Conrad

Principal Platform Engineer

I build platforms teams rely on to ship. If this work maps to a problem you are trying to solve, I would enjoy the conversation. The chat in the corner can also go deeper on anything in this study.