Building a Kubernetes platform with zero customer-impacting incidents.

How a cross-cutting team took HMH from Mesos and Aurora to a multi-cluster Kubernetes platform that most of the company now runs on, with zero customer-impacting incidents since 2021.

Role

Principal DevOps Engineer

Organization

HMH Education

Timeframe

2020 → present

Stack

EKS · Vault · Consul · Ambassador · Terraform

Zero

customer-impacting incidents since 2021

~1 yr

first patterns to general availability

400+

engineers building on the platform

№ 01 · The Challenge

A platform built for products that can't go down.

HMH serves tens of millions of students and teachers, and the busiest week of the year is non-negotiable: Back to School arrives on the calendar's terms. The infrastructure underneath had grown up on Mesos and Aurora. It was stable, and we kept it that way, but doing so was real work, and by late 2020 Kubernetes was part of every serious conversation about the platform's future.

The brief was twofold and slightly contradictory. Give 400+ engineers a faster, simpler path to production, and make the whole thing more reliable at the same time. More speed usually buys you more incidents. We needed both speed and quiet.

№ 02 · The Approach

Make the safe path the easy path.

The core decision was to treat the platform as a product. A cross-cutting team, the SkyPilots, carried it from first patterns to general availability in about a year, with a single principle: a team should be able to ship without becoming a Kubernetes expert.

Everything dangerous lives behind one hardened entry point. Vault handles secrets, with the credential and access patterns teams need already laid down. Consul carries configuration and the service mesh. Ambassador fronts all ingress behind an enterprise web application firewall, and autoscaling is baked in from day one. None of it is something a product team has to assemble: they get a paved road, and the cliffs have guardrails.

The interface to all of it is a template. You copy it, fill in what's yours, and deploy:

# deploy/values.yaml · copied from the hello-world example chart
name: "my-namespace"
applicationName: "my-service"
dockerRegistry: "registry.internal/my-team"
vault: { authentication: "KUBERNETES" }  # dynamic, least-privilege
consul: { host: consul.internal, port: 443 }
ports: { container: 8080, management: 8081 }
ingressMappings:
  - ambassador_id: internal
# per-environment overrides:
#   values-dev → values-int → values-cert → values-prod
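To make the per-environment overrides concrete, here is a minimal Python sketch of how a base values file and one environment file could be layered. It assumes Helm-style merging, where nested maps merge key by key and everything else is replaced outright. The `merge_values` helper and the `consul.prod.internal` hostname are hypothetical, used only for illustration, and this is not the platform's actual tooling.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Layer override on top of base: nested maps merge, all other values are replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Base values, mirroring the template above.
BASE = {
    "name": "my-namespace",
    "applicationName": "my-service",
    "consul": {"host": "consul.internal", "port": 443},
    "ports": {"container": 8080, "management": 8081},
    "ingressMappings": [{"ambassador_id": "internal"}],
}

# Hypothetical values-prod override: only the keys that differ from the base.
PROD_OVERRIDE = {"consul": {"host": "consul.prod.internal"}}

prod_values = merge_values(BASE, PROD_OVERRIDE)
```

The point of the layering is that an environment file stays small: it names what changes and inherits the rest, so the port and ingress settings above carry through untouched.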

Reliability here is the absence of the failure modes we designed out. Every platform change climbs a promotion ladder from development through integration to certification before it touches production. A synthetic service exercises Vault, the mesh, ingress, and AWS credentials in one request, so a regression announces itself in seconds. Migrations run with zero downtime as the bar, and the templates encode the lessons so nobody re-learns them the hard way.
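The promotion ladder and the synthetic check fit together roughly like the Python sketch below. It is an illustration under stated assumptions, not the real service: the check callables stand in for live calls to Vault, the mesh, ingress, and AWS, and the names `run_synthetic`, `next_environment`, and `broken_ingress` are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Rungs of the ladder, named after the values-* files in the template.
LADDER = ["dev", "int", "cert", "prod"]

Check = Callable[[], bool]


def run_synthetic(checks: Dict[str, Check]) -> List[str]:
    """Run every dependency check and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises counts as a failure rather than crashing the run.
            ok = False
        if not ok:
            failed.append(name)
    return failed


def next_environment(current: str, failed: List[str]) -> Optional[str]:
    """Return the next rung, or None if promotion is blocked or current is the top."""
    if failed:
        return None
    index = LADDER.index(current)
    return LADDER[index + 1] if index + 1 < len(LADDER) else None


def broken_ingress() -> bool:
    """Stand-in for an ingress probe that cannot reach its target."""
    raise ConnectionError("ingress unreachable")


# Stand-in checks; a real synthetic service would exercise each dependency for real.
healthy = {"vault": lambda: True, "mesh": lambda: True,
           "ingress": lambda: True, "aws": lambda: True}
regressed = dict(healthy, ingress=broken_ingress)
```

With all four checks passing, a change sitting in "cert" is cleared for "prod". With the ingress check failing, `run_synthetic` names the broken dependency and `next_environment` returns None, so the change stays where it is.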

The platform's best feature is how rarely anyone has to think about it.

the whole case, in one line

№ 03 · The Impact

Quiet, at scale.

The platform now carries the majority of HMH's workloads, and the headline number is the one that doesn't move: zero customer-impacting incidents since 2021. Time to first deploy fell from days to hours. Teams that used to file tickets now self-serve.

Zero customer-impacting incidents since 2021, across every Back to School
Time to first deploy down from days to hours for teams that moved over
Most of HMH now runs on the platform, onboarded through templates instead of tickets
400+ engineers ship on it without becoming Kubernetes experts

№ 04 · Lessons Learned

What I'd tell the version of me who started it.

Ship thin slices, fast. A small team had outsized impact because we delivered standardized patterns in small pieces and iterated.

The paved road only works if it's genuinely easier. If the safe path is slower than the shortcut, people take the shortcut. Most of the engineering went into making the right thing the path of least resistance.

Guardrails are layers, and any single one of them can fail. The discipline is designing so the next layer catches what the last one missed. The same instinct keeps this site's chatbot safe.

Want this kind of quiet?

I build platforms teams rely on to ship. If you need infrastructure at enterprise scale, let's talk.

cole@coursecode.net

linkedin.com/in/dabigc · github.com/dabigc