Case Study

Building a Kubernetes Platform with Zero Customer-Impacting Incidents

How a cross-cutting team took HMH from Mesos and Aurora to a Kubernetes platform that has not had a customer-impacting incident since launch.

The Challenge

When I joined HMH in 2019, our workloads ran on Mesos and Aurora. The stack was stable, and our team kept it that way, but the industry had clearly moved on. Kubernetes was where the ecosystem, the tooling, and the hiring pool were heading, and by late 2020 it was part of every serious conversation about the platform's future.

In 2021 we committed to building it for real. The question was never just whether we could run Kubernetes. It was whether we could build a platform that teams would actually want to move to.

The Approach

I joined the initial team chartered with taking Kubernetes to general availability for the organization: the SkyPilots. From the start the effort was deliberately cross-cutting. Engineers, architects, and the stakeholders who would actually use the platform participated, gave feedback, and dogfooded what we built as we built it. Over the next year we iterated until it was ready for everyone.

What started as Kubernetes grew into a full platform:

  • HashiCorp Vault for secrets, with the credential and access patterns teams needed already laid down
  • Consul KV for configuration management and Consul service mesh for east-west traffic. I laid the foundations here: team-scoped configuration namespaces with access governance, and service-to-service authorization managed as code
  • Ambassador Edge Stack for north-south traffic, with all ingress routed through an enterprise web application firewall and a single API layer
  • Cluster autoscaling and resource management baked in from day one
  • Deployment templates that teams copy, fill out, and ship, so getting onto the platform does not require understanding every Kubernetes primitive
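
To make the copy, fill out, and ship idea concrete, here is a minimal sketch of what a fill-in deployment template could look like. It is an illustration only, not the actual HMH templates: the field names, the defaults, the `render_deployment` helper, and the example registry URL are all assumptions made for this example.

```python
from string import Template
from typing import Mapping

# Hypothetical template: a team supplies a few values and never has to
# write the Deployment structure itself.
DEPLOYMENT_TEMPLATE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $service
  namespace: $team
spec:
  replicas: $replicas
  selector:
    matchLabels:
      app: $service
  template:
    metadata:
      labels:
        app: $service
    spec:
      containers:
        - name: $service
          image: $image
          resources:
            requests:
              cpu: $cpu
              memory: $memory
""")

# Assumed split between what a team must provide and what the platform defaults.
REQUIRED_FIELDS = ("service", "team", "image")
DEFAULTS = {"replicas": "2", "cpu": "250m", "memory": "256Mi"}


def render_deployment(values: Mapping[str, str]) -> str:
    """Fill the template, failing early if a required field is missing."""
    missing = [field for field in REQUIRED_FIELDS if not values.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Team-supplied values override the platform defaults.
    return DEPLOYMENT_TEMPLATE.substitute({**DEFAULTS, **values})


if __name__ == "__main__":
    print(render_deployment({
        "service": "checkout",
        "team": "payments",
        "image": "registry.example.com/checkout:1.4.2",
    }))
```

The point of the sketch is the division of labor: the platform owns the manifest structure and sensible resource defaults, and a team supplies only the handful of values that are specific to its service.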

The Impact

  • Zero customer-impacting platform incidents since launch in 2021
  • 100+ services migrated onto the platform
  • The majority of what runs at HMH today runs on these clusters
  • 1 year from first patterns to general availability

The platform primitives we laid down in 2021 are still what the majority of HMH workloads run on today. Teams deploy through templates instead of tickets, and the platform has stayed out of the headlines in the best possible way: it just works.

Lessons Learned

Platform adoption is a prioritization problem. Teams that are not given room to prioritize migration will choose features every time, which is why Mesos and Aurora are still in play. Finishing those migrations is likely a project in my future, and it will succeed by making the move cheap for teams.

Zero customer-impacting incidents since 2021 came from discipline applied release after release. We test what we build. Nothing gets thrown over the wall. The bar for what reaches production stays high. That record comes from a culture the team takes pride in more than from anything in the architecture.

Cole Conrad

Principal Platform Engineer

I build platforms teams rely on to ship. If this work maps to a problem you are trying to solve, I would enjoy the conversation.