Running an Ops Platform as Code

How hundreds of services across dozens of teams stay owned, monitored, and on-call ready by treating the whole platform as code, so every change to it lands as a reviewed pull request.

Role

Principal DevOps Engineer

Organization

HMH Education

Timeframe

Ongoing

Stack

Datadog · PagerDuty · Terraform · Python

Hundreds

of services cataloged and health-scored

2% to 83%

ownership coverage in two weeks

Self-service

on-call changes by reviewed PR

№ 01 · The Challenge

Nothing stays healthy on its own.

HMH runs hundreds of services owned by dozens of teams. At that size, the records that keep operations healthy do not stay healthy on their own. They drift the moment someone reorganizes a team, ships a new service, or tweaks a schedule. The cost was measurable. At one point roughly 40% of our Datadog team records were unstaffed or misstaffed, and when a service with no clear owner broke, getting the right people onto the incident call could take an hour or two.

The only thing I have found that holds up over time is to manage this the way we manage software. Ownership, alerting, and staffing are declared in code and validated in CI. The platform provisions and watches them from a single source of truth. If a service exists, its monitors and its on-call posture derive from that source instead of being kept by hand and hoped correct.

№ 02 · The Approach

The platform stands on three legs.

The platform I built and run at HMH stands on three legs. Each one turns a thing that used to drift into a thing you can review in a pull request.

Service catalog as code. Hundreds of service definitions live as YAML, one file per service in per-team workspaces. Terraform fans each definition out into the Datadog Software Catalog entry, the PagerDuty service, the on-call schedule integration, and the alert-routing handle. One source of truth keeps Datadog and PagerDuty in sync, so a single reviewed change updates everywhere a service is registered. My most recent headline here was making escalation-policy overrides self-serviceable: I moved them out of infrastructure variables and into validated YAML blocks, so teams change their own on-call behavior through a reviewed PR instead of a ticket queue. I migrated the whole estate over without breaking anyone's on-call.

Staffing health, measured hourly. A Python pipeline pulls team membership, service ownership, escalation policies, and rendered schedule coverage, then classifies every service on two independent questions: does a real team staff it, and is its on-call actually ready. Ready means real humans on the schedule, no coverage gaps, around the clock. The results publish to a filterable status page and push into Datadog scorecards, where ops teams already live. Publishing the numbers worked. The share of services with verified ownership, a staffed team plus a live on-call, went from 2% to 83% in two weeks once teams could see where they stood. This one started life as a Claude skill I invoked by hand. Once it proved its value, I productionized it into a GitHub Actions pipeline that runs on an hourly schedule, which is what made the scorecards possible. The hardest part was honest signals. I merge escalation coverage across overlapping schedules with interval math, so a gap in one schedule slice stops flagging services that are actually covered.

Monitors as code. Golden-signal monitors, SLOs, and error-budget alerts for hundreds of services across four infrastructure types are declared as Terraform and immutable outside it: UI edits get reverted. Latency thresholds sit just above each service's observed P99, derived from months of its own history. Low-traffic services taught us to keep the statistics honest. A service taking a few requests per second can show a 4% error spike in a five-minute window from a handful of bad requests, so for those we widen the evaluation window and leave the threshold alone. The tooling also refuses to touch a service until a Datadog team and a PagerDuty on-call exist for it. A human who will actually get paged vouches for every change.

The repo ships its own Claude Code skills as well. The threshold-analysis skill began as a teammate's proof of concept, and productionizing it was the real work: a batch run once silently lost 40% of its data to API rate limiting, and the fix layered in rate-aware retries, coverage tracking, and honest loss disclosure. Another skill onboards a new service by discovering its signals and generating the config. A third audits every service quarterly.

№ 03 · The Impact

Drift becomes a diff.

Hundreds

of services cataloged, monitored, and health-scored

2% to 83%

service ownership coverage in two weeks

Hourly

staffing and on-call health, where ops teams already look

Self-service

on-call changes by reviewed PR

When ownership, monitoring, and staffing health live in code, drift becomes a diff. Teams see their own posture in the tools they already use, and the platform team reviews changes instead of making them all.

№ 04 · Lessons Learned

Prototype agentically, productionize deliberately.

A single source of truth beats synchronization scripts. Provisioning from the catalog, rather than reconciling against it, means there is nothing to drift back into disagreement. The catalog declares reality, and every system downstream is derived from it.

Prototype agentically, productionize deliberately. The staffing-health skill had to prove its value by hand before it earned the standalone hourly pipeline the scorecards depend on. The prototype told me what was worth building properly.

I would have loved an internal developer portal as the single pane of glass, and we could not stand one up quickly, so the status page is Confluence while the portal stays on the roadmap. Ship what works today, build what is right long-term.

Trustworthy alerts come from governance plus statistics. Immutable monitors and historically derived thresholds beat hand-tuned ones nobody dares touch. When the numbers come from months of real data and the config cannot be quietly edited away, people act on the alerts.

Previous · № 04

Codifying Agentic Engineering for a Platform Team

Next · № 01

Building a Kubernetes Platform with Zero Customer-Impacting Incidents

Want drift to become a diff?

I build platforms teams rely on to ship. If you need infrastructure at enterprise scale, let's talk.

cole@coursecode.net Back to the homepage

linkedin.com/in/dabigcgithub.com/dabigcMore case studies →