Case Study
Architecting ML Infrastructure for AI-Powered Education
A multi-tenant ML Ops platform built under an immovable Back-to-School deadline, with model serving for an AI-powered personalization product and tenant onboarding cut from months to under two weeks.
The Challenge
HMH was building an AI-powered personalization product for K-8 education, and its machine learning models needed a real path to production. The data science work lived in a legacy account where everything was hand-built. There was no governed route from training to registry to serving, and no isolation between experimentation and production. Onboarding new ML work took months to a year.
The launch date made it harder. Back to School is the one deadline in education that does not move. The models had to be serving in production before the school year started.
The Approach
We built a multi-tenant ML Ops platform on AWS SageMaker, designed so the first tenant would prove the pattern for every tenant after it. I led the design and implementation of:
- Dedicated training and inference accounts, isolated by design and defined entirely as Infrastructure-as-Code
- Declarative tenancy: a new tenant is an entry in a list plus one Terraform file. Modules generate the SageMaker domains, least-privilege IAM, model registries, CI runner roles, and secrets wiring
- A standardized CI pipeline covering train, register, promote, and serve, with models promoted from development through integration to production so every model moves the same way
- Self-service by default, with extensive user documentation so teams could onboard without a ticket queue
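To illustrate the declarative tenancy and promotion ideas in the list above, here is a minimal Python sketch. The platform itself used Terraform modules, so this is only an analogy: the `Tenant` fields, the `mlops-` naming scheme, and the example tenant are hypothetical, not taken from the real codebase. Only the resource kinds and the three stage names come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Promotion order as named in the text; every model moves through it the same way.
STAGES = ("development", "integration", "production")


@dataclass(frozen=True)
class Tenant:
    """One entry in the tenant list. Field names are illustrative assumptions."""
    name: str
    team: str


def resources_for(tenant: Tenant) -> Dict[str, str]:
    """Derive the per-tenant resource names a module might generate.

    The naming scheme is hypothetical; the real modules may differ.
    """
    prefix = f"mlops-{tenant.name}"
    return {
        "sagemaker_domain": f"{prefix}-domain",
        "iam_role": f"{prefix}-execution-role",
        "model_registry": f"{prefix}-models",
        "ci_runner_role": f"{prefix}-ci-runner",
        "secrets_path": f"/{prefix}/",
    }


def next_stage(current: str) -> Optional[str]:
    """Return the stage a model is promoted to next, or None at production."""
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


# Adding a tenant means adding one entry here (hypothetical example tenant).
TENANTS = [Tenant(name="example", team="example-team")]

if __name__ == "__main__":
    for tenant in TENANTS:
        for kind, name in resources_for(tenant).items():
            print(f"{tenant.name}: {kind} -> {name}")
```

The point of the pattern is that everything a tenant needs is derived from one small declaration, so onboarding stays a reviewable one-line change instead of hand-built setup.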
I was the technical lead and program driver on the platform side. No project manager was ever assigned, so I ran the cross-team Scrum of Scrums with the Learning Science Engineers and Data Scientists, authored the phased acceptance criteria, ran every phase demo and handoff, and presented at architecture review. When SageMaker fought us on SDK inconsistencies, endpoint configuration, and service limits, I worked directly with AWS specialists until we had answers, building workarounds for the inconsistencies and opening feature requests with AWS.
The Impact
- 2 weeks for tenant onboarding, down from months
- 3 models in production for the product launch
- On time for Back to School, the deadline that does not move
- Handed off to platform operations with full documentation
The first three models shipped on time and powered the personalization product through its beta school year. The platform became the governed pattern for ML at the organization.
Lessons Learned
No one owned the coordination across teams, so I picked it up. Running the Scrum of Scrums, publishing every session, and writing acceptance criteria mattered as much to shipping as any Terraform module.
Hard lines protect architecture under pressure. With the deadline bearing down, the tempting shortcut was to reach back into the legacy account from the new environments. Saying no to that, clearly and early, kept the isolation model intact and made the platform trustworthy after launch.
Self-service has a boundary. Enabling a team means giving them paved roads, docs, and templates. Holding that line kindly when they ask you to take over their backlog is part of the job.