Architecting ML Infrastructure for AI-Powered Education

A multi-tenant ML Ops platform built under an immovable Back-to-School deadline, with model serving for an AI-powered personalization product and tenant onboarding cut from a year to under two weeks.

Role

Technical lead and program driver, platform side

Organization

HMH Education

Timeframe

12 months, POC through production

Stack

SageMaker · Terraform · cross-account IAM · KMS · VPC endpoints

2 weeks

tenant onboarding, down from a year

3

models in production for launch

On time

for Back to School

№ 01 · The Challenge

No governed path to production.

HMH was building an AI-powered personalization product for K-8 education, the product that ships today as HMH Pulse™. Its machine learning models needed a real path to production. The data science work lived in a legacy account where everything was hand-built. There was no governed route from training to registry to serving, and no isolation between experimentation and production. Onboarding new ML work took months to a year.

The launch date made it harder. Back to School is the one deadline in education that does not move. The models had to be serving in production before the school year started.

№ 02 · The Approach

The first tenant laid the foundation.

We built a multi-tenant ML Ops platform on AWS SageMaker, designed so the first tenant would prove the pattern for every tenant after it. I spent months embedded with the Learning Sciences Engineering teams, learning how their scientists actually work while writing the Terraform implementation. That engagement almost certainly saved us two years of rework. Over the course of the project, I led the design and implementation of:

Dedicated training and inference accounts (training-prod, inference-nonprod, inference-prod), isolated by design and defined entirely as Infrastructure-as-Code
Declarative tenancy: a new tenant is an entry in a list plus one Terraform file. Modules generate the SageMaker domains, model registries, CI runner roles, secrets wiring, and least-privilege IAM (10+ roles per tenant)
Tenant isolation enforced with cross-account IAM, KMS encryption, and VPC endpoints, so one tenant's models and data never cross into another's
A standardized CI pipeline covering train, register, promote, and serve. Models promote from development through integration to production, and there is no path that skips a stage
Self-service by default, with 50+ pages of user docs and operator runbooks so teams could onboard without a ticket queue
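
Two items in the list above, declarative tenancy and stage-by-stage promotion, can be illustrated with a minimal Python sketch. It is not the platform's Terraform. The tenant name, the role kinds, and the naming scheme are hypothetical placeholders; only the three account names and the development, integration, production order come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Account names as listed above; real account IDs are not shown in the source.
ACCOUNTS = ("training-prod", "inference-nonprod", "inference-prod")

# Hypothetical role kinds for illustration; the platform generates 10+ per tenant.
ROLE_KINDS = ("ci-runner", "training-execution", "endpoint-execution")

# Promotion order described in the text.
STAGES = ("development", "integration", "production")


@dataclass(frozen=True)
class Tenant:
    """A tenant is declared once; everything else is derived from it."""
    name: str


def expand_tenant(tenant: Tenant) -> Dict[str, List[str]]:
    """Derive the per-account role names from a single tenant entry."""
    return {
        account: [f"{tenant.name}-{account}-{kind}" for kind in ROLE_KINDS]
        for account in ACCOUNTS
    }


def next_stage(current: Optional[str]) -> Optional[str]:
    """Return the only stage a model may move to from `current`."""
    if current is None:
        return STAGES[0]
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    return STAGES[index + 1] if index + 1 < len(STAGES) else None


def promote(current: Optional[str], target: str) -> str:
    """Allow a promotion only to the immediately following stage."""
    allowed = next_stage(current)
    if target != allowed:
        raise ValueError(
            f"cannot promote from {current!r} to {target!r}; "
            f"next allowed stage is {allowed!r}"
        )
    return target
```

Calling `expand_tenant(Tenant("example-tenant"))` yields one list of role names per account, which is the point of the declarative model: adding a tenant changes the input list, not the generation logic. `promote("development", "production")` raises, because the only legal target from development is integration.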

I was the technical lead and program driver on the platform side. No project manager was ever assigned, so I picked the coordination up myself. I ran the cross-team Scrum of Scrums with the Learning Sciences engineers and data scientists, and recorded and published every session. I authored the acceptance criteria for all six phases, ran every demo and handoff, and presented the architecture review.

When the SVP needed a single board slide nine months in, I framed the platform as three capabilities: tenant isolation, cross-account model promotion, and self-service provisioning. The investment held, and I have watched the SVP reuse that framing since.

When SageMaker fought us on SDK inconsistencies, endpoint configuration, and service limits, I worked with AWS specialists directly until we had answers. We built workarounds for the inconsistencies and opened feature requests with AWS for the rest.

№ 03 · The Impact

Shipped on time, then the standard.

2 weeks

tenant onboarding, down from a year

3

models in production for the product launch

On time

for Back to School, the deadline that does not move

Handed off

to platform operations with full documentation

The first three models shipped on time and powered the personalization product through its beta school year. Other models have since been developed on the same foundation, and the platform became the governed pattern for ML at the organization.

№ 04 · Lessons Learned

Hard lines hold under pressure.

No one owned the coordination across teams, so I picked it up. Running the Scrum of Scrums, publishing every session, and writing acceptance criteria mattered as much to shipping as any Terraform module.

Hard lines protect architecture under pressure. With the deadline bearing down, the tempting shortcut was to reach back into the legacy account from the new environments. Saying no to that, clearly and early, kept the isolation model intact and made the platform trustworthy after launch.
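
A rule like this is easier to hold when a check enforces it. The sketch below is an assumption about how such a guard could look, not the platform's actual tooling. It scans an IAM policy document for any ARN or principal that points at a forbidden account. The account ID is a made-up placeholder and the function names are hypothetical.

```python
from typing import Any, List, Set

# Hypothetical placeholder; the real legacy account ID is not in the source.
LEGACY_ACCOUNT_ID = "111111111111"


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single value or a list; normalize to a new list."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def _account_of(reference: str) -> str:
    """Extract the account ID from an ARN or a bare 12-digit principal."""
    if len(reference) == 12 and reference.isdigit():
        return reference
    parts = reference.split(":", 5)  # arn:partition:service:region:account:resource
    if len(parts) == 6 and parts[0] == "arn":
        return parts[4]  # empty for ARNs without an account, such as S3 buckets
    return ""


def referenced_accounts(policy: dict) -> Set[str]:
    """Collect every account ID named in Resource or Principal.AWS."""
    accounts: Set[str] = set()
    for statement in _as_list(policy.get("Statement")):
        references = _as_list(statement.get("Resource"))
        principal = statement.get("Principal")
        if isinstance(principal, dict):
            references += _as_list(principal.get("AWS"))
        for reference in references:
            if isinstance(reference, str):
                account = _account_of(reference)
                if account:
                    accounts.add(account)
    return accounts


def violates_isolation(policy: dict, forbidden: str = LEGACY_ACCOUNT_ID) -> bool:
    """True if the policy reaches into the forbidden account."""
    return forbidden in referenced_accounts(policy)
```

Run against rendered policies in CI, a check like this turns "no" from a conversation into a failing build. It covers only explicit account references in `Resource` and `Principal.AWS`, so wildcards and condition keys would need separate handling.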

Self-service has a boundary. Enabling a team means giving them working defaults, docs, and templates. Holding that line kindly when they ask you to take over their backlog is part of the job.


Shipping against a deadline that does not move?

I build platforms teams rely on to ship. If you need infrastructure at enterprise scale, let's talk.

cole@coursecode.net · linkedin.com/in/dabigc · github.com/dabigc