What we learned when our ML team outgrew its infrastructure

The classic scaling inflection: great data scientists, great models, and infrastructure that can't keep up.

I spent nearly five years at Bolt working on what they called the "Model Lifecycle" platform. My part was the ML pipelines, operations, automation, and data-science tooling -- the machinery that let scientists go from notebook to production. When I started: fewer than ten data scientists, fewer than ten ML pipelines. When I left: 30+ data scientists, 200+ ML projects, roughly 100,000 model deployments a month.

The starting point everyone recognises

  1. Data scientist writes a model locally
  2. Data scientist manually runs query scripts to pull data from the warehouse
  3. Data scientist manually triggers training jobs -- one per city, one per group -- from their laptop
  4. Data scientist manually runs merge scripts to combine model tarballs
  5. Data scientist manually creates deployment config and updates the production endpoint

Five steps, five opportunities for someone to fat-finger something at 11pm and not notice until morning.

Retraining happened quarterly, at best, and required a week of babysitting a laptop. Each pipeline needed more than one data scientist, because the process was too fragile for one person to run alone. And the models were stale long before they got refreshed.

Quarterly retraining for a pricing model in a market that changes daily.

I'll let you sit with that one.

The signs your team has hit the wall

I see these at every company with 5-30 data scientists. If three or more apply, your infrastructure is the bottleneck, not your team:

  • Deploys require a specific person -- only one or two people know how to get a model to production, so everyone else files a ticket and waits.
  • Retraining is an event, not a process -- quarterly cycles with manual babysitting, models decaying between them, and nobody has visibility into how much.
  • Training jobs run on someone's MacBook because the cloud path is too painful. "It works on my machine" has become a deployment strategy.
  • Build times are measured in lunch breaks -- 40+ minute Docker builds, experimentation cycles measured in half-days rather than minutes.
  • No integration tests for models, or the tests are so flaky that everyone ignores failures.
  • Nobody knows what each model costs to run -- the cloud bill grows 20% every quarter and finance asks questions that nobody can answer.
  • Adding a new city or segment is a project in itself -- your pricing model runs in 6 cities, the business wants 60, but each city requires manual work, so scaling is linear in engineering effort.

A common reaction that rarely helps

The typical first move is to hire more data engineers. But this tends to create more people to coordinate, not more leverage -- more people doing the hand-holding.

There are teams where the ratio of data engineers to data scientists ended up nearly 1:1.

The DEs weren't building infrastructure. They were running scripts on behalf of data scientists who weren't allowed to run them themselves.

What self-service actually looks like

I built a framework where data scientists own their pipeline all the way from notebook to production: they understand it, they configure it, they deploy it, they fix it when it breaks. The pipeline config is a single file that describes everything: what data goes in, what comes out, how it trains, how it deploys, what resources it needs.

Same containers, laptop to production

Data scientists prototype locally, using the same Docker containers that production runs. One command to build. One command to run any step on the cloud with real data. The gap between "it works locally" and "it works in production" shrinks to almost nothing.
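A sketch of what "same containers everywhere" can look like. The paths, entrypoint, and step names here are illustrative, not Bolt's actual setup:

```dockerfile
# One image for every environment: laptop, CI, and the cloud.
# The step to run (extract, train, deploy) is chosen at runtime,
# so the image itself never diverges between local and production.
FROM python:3.11-slim

WORKDIR /pipeline
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Same entrypoint locally and on the scheduler; only the
# arguments change (e.g. "train --sample" vs "train --full").
ENTRYPOINT ["python", "-m", "pipeline.run"]
```

Locally a data scientist might run `docker run my-pipeline train --sample`; the scheduled cloud job runs the identical image with production arguments, which is what closes the "works on my machine" gap.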

One config file instead of a ticket queue

A single config file defines the whole pipeline instead of scripts that someone runs manually. I wrote about what this looks like in detail here -- the short version is that a data scientist describes what their pipeline does in a config, merges a PR, and Airflow picks it up as a scheduled DAG. Adding a new city is adding one line.
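As a sketch, such a config might look like the following. The schema is invented for this post, not Bolt's actual format:

```yaml
# pipeline.yaml -- one file, picked up by Airflow as a scheduled DAG
name: ride-pricing
schedule: "0 3 * * *"           # retrain daily, not quarterly

input:
  warehouse_table: rides.completed
  lookback_days: 30

training:
  entrypoint: pricing.train
  instance_type: ml.m5.2xlarge
  cities:                       # adding a city is adding one line
    - berlin
    - tallinn
    - riga

deploy:
  endpoint: pricing-prod
  shadow_traffic_percent: 0     # new models start silent
```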

Production traffic as a free test suite

New models deploy silently alongside the current production model at 0% traffic. Data quality checks run via Great Expectations, a data quality testing framework -- input distributions, prediction ranges, missing values. Then "replay" testing: recent real production traffic captured via SageMaker Data Capture gets replayed against the new model. Any request that succeeded on the old model shouldn't fail on the new one. This catches regressions even when the data scientist hasn't written a single test -- production traffic becomes the test suite for free.
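The checks themselves are simple in shape. This is a plain-Python stand-in for what Great Expectations expresses declaratively -- thresholds and column names are illustrative:

```python
# Simplified stand-ins for the deploy-time data quality checks:
# missing values in inputs, plausible ranges on predictions.

def check_missing_values(rows, column, max_missing_ratio=0.01):
    """Fail if too many rows are missing a required feature."""
    missing = sum(1 for r in rows if r.get(column) is None)
    return missing / len(rows) <= max_missing_ratio

def check_prediction_range(predictions, low, high):
    """Fail if any prediction falls outside the plausible range."""
    return all(low <= p <= high for p in predictions)

# Toy data: one ride in three is missing its distance.
rows = [{"distance_km": 3.2}, {"distance_km": 8.1}, {"distance_km": None}]
preds = [4.1, 9.9, 6.0]

inputs_ok = check_missing_values(rows, "distance_km")  # 1/3 missing -> fails
range_ok = check_prediction_range(preds, 0.0, 100.0)   # all in range -> passes
```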

If the replay or quality checks fail, the deploy blocks automatically. The data scientist can override this in the Airflow UI if they have good reason to -- maybe the model intentionally rejects a category that the old one accepted. But the default is safe: prove you don't break anything, then you get traffic.
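The replay gate reduces to a small loop. This is a sketch of the idea, not the production code; the model functions and the override flag are illustrative:

```python
def replay_check(captured_requests, old_predict, new_predict):
    """Return requests that succeeded on the old model but fail on the new."""
    regressions = []
    for request in captured_requests:
        try:
            old_predict(request)
        except Exception:
            continue  # the old model failed too, so this is not a regression
        try:
            new_predict(request)
        except Exception:
            regressions.append(request)  # worked before, fails now
    return regressions

def gate_deploy(quality_ok, regressions, manual_override=False):
    """Block the deploy unless checks pass or a human explicitly overrides."""
    return manual_override or (quality_ok and not regressions)

# Toy example: the new model rejects long trips the old one accepted.
def old_predict(req):
    return req["distance_km"] * 1.2

def new_predict(req):
    if req["distance_km"] > 50:
        raise ValueError("distance out of supported range")
    return req["distance_km"] * 1.1

captured = [{"distance_km": 3.0}, {"distance_km": 120.0}]
regressions = replay_check(captured, old_predict, new_predict)
blocked = not gate_deploy(quality_ok=True, regressions=regressions)
```

The override path corresponds to the Airflow UI escape hatch: `gate_deploy(..., manual_override=True)` lets a deliberate behaviour change through while keeping the safe default.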

The proof: it works without you

The best test of self-service infrastructure is what happens when you stop paying attention. I wrote about this in a separate post -- the short version is that ML projects went from 100 to 200 in three months without DE involvement. Not migrations inflating the count -- those were git branches, merged or retired. These were genuinely new experiments: new alphas, improved strategies on existing ones. Many never hit production, and that was the point.

The business case nobody bothers to make

Engineering leaders struggle to justify ML infrastructure investment because the returns are indirect:

Going from quarterly to daily retraining means models reflect current reality, not last quarter's. For one pricing model, scaling from a handful of cities to several hundred -- with frequent retraining -- was estimated by the team lead's back-of-the-envelope maths to produce a 9-figure increase in annual gross merchandise value. Infrastructure investment for that model paid for itself thousands of times over.

Then there's experimentation velocity: if testing a new idea takes two weeks of DE time, your DS team will test fewer ideas. If it takes one afternoon of their own time, they'll test ten times as many. Some of those ideas will be the ones that matter.

And every data scientist you hire produces value faster if they can self-serve -- that's hiring leverage you never see on a balance sheet but feel in every quarter.