Nobody files a ticket that says "make our builds 8x faster." The slowness creeps in: builds that took 5 minutes when you started now take 40, everyone has adjusted, and the CI queue is always backed up.
At Bolt, where I spent five years building the ML platform, I inherited 40+ minute build times. Every push, every project, on a shared Jenkins server. The maths was grim: 200 projects, multiple builds per day, all queuing behind 40-minute builds on limited compute.
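The arithmetic is worth spelling out. The numbers below are illustrative assumptions (builds per project per day and executor count are mine, not measured figures), but they show why the queue could never drain:

```python
projects = 200
builds_per_project_per_day = 3   # assumption, not a measured figure
build_minutes = 40
executors = 10                   # assumption: a handful of shared Jenkins executors

demand = projects * builds_per_project_per_day * build_minutes  # agent-minutes of work per day
capacity = executors * 24 * 60                                  # agent-minutes available per day

print(demand, capacity)  # 24000 vs 14400: demand exceeds capacity, so the queue grows without bound
```

When daily demand exceeds daily capacity, no amount of patience fixes the queue; only shorter builds or more executors do.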
I got them down to about 5 minutes. Here's what was wrong and what I did.
Why ML builds are uniquely slow
Web app Docker builds are usually fast because the dependency footprint is small. A typical Node.js or Python web app installs a few dozen packages, copies some source files, done. ML builds are different:
- Fat dependencies. `numpy`, `scipy`, `scikit-learn`, `pandas`, `torch` -- the core ML stack is enormous. PyTorch alone can be 2GB+.
- Compilation steps. Many ML libraries have C/Fortran extensions that compile during install. `scipy` builds can take 10+ minutes from source.
- Monolithic base images. Teams start with a "kitchen sink" base image containing everything anyone might need. 8GB images. Every build pulls the whole thing.
- No caching discipline. Dockerfiles written by data scientists (no shade -- it's not their core skill) invalidate the cache at the first step. One line in the wrong order means reinstalling everything every time.
The anatomy of a 40-minute build
The original Dockerfiles looked something like this:
```dockerfile
FROM ubuntu:20.04

# Install system deps, Python, everything at once
RUN apt-get update && apt-get install -y \
    python3 python3-pip curl wget git \
    libopenblas-dev liblapack-dev gfortran \
    && rm -rf /var/lib/apt/lists/*

# Copy everything (including source code)
COPY . /app
WORKDIR /app

# Install Python deps
RUN pip3 install -r requirements.txt

# Install the project itself
RUN pip3 install -e .

CMD ["python3", "run.py"]
```
- COPY . /app before pip install. Any source code change invalidates the Docker layer cache for everything after it. Changed one line of Python? Reinstall all 200 packages.
- Building from ubuntu:20.04. Installing Python, system libraries, and build tools from scratch every time the cache busts. This alone can take 5-10 minutes.
- Single-stage build. Build tools, compilers, development headers -- all of these end up in the final image. A 6GB image that's mostly unused build artefacts.
- pip install -r requirements.txt without pinning or caching. Every build resolves dependencies from scratch. Different builds get different versions. Slow and non-reproducible.
Six fixes, 40 minutes back
Six changes, each small on its own:
1. Layer ordering: dependencies before source code
Takes 30 seconds:
```dockerfile
# BEFORE: cache busts on every source change
COPY . /app
RUN pip3 install -r requirements.txt

# AFTER: deps cached until requirements change
COPY requirements.txt /app/requirements.txt
RUN pip3 install -r /app/requirements.txt
COPY . /app
```
This alone took many builds from 40 minutes to 10.
2. Pre-built base images
Instead of building from a bare OS image every time, I created base images with the common ML stack pre-installed:
```dockerfile
# Base image (rebuilt weekly or when deps change)
FROM python:3.10-slim
RUN pip install --no-cache-dir \
    numpy==1.24.0 \
    scipy==1.10.0 \
    scikit-learn==1.2.0 \
    pandas==2.0.0 \
    boto3==1.26.0
```

```dockerfile
# Project Dockerfile
FROM our-registry/ml-base:latest
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
COPY . /app
```
Project builds install only their specific dependencies on top -- usually 30 seconds to 2 minutes.
3. Multi-stage builds
Many ML libraries need compilers and dev headers to build, but not to run. Multi-stage builds let you compile in one stage and copy only the built artifacts to a slim final image:
```dockerfile
# Build stage
FROM python:3.10 AS builder
RUN apt-get update && apt-get install -y build-essential gfortran
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

# Runtime stage
FROM python:3.10-slim
COPY --from=builder /install /usr/local
COPY . /app
WORKDIR /app
CMD ["python3", "run.py"]
```
Final image sizes went from 6-8GB to 1-2GB.
4. Dependency pinning and lock files
Without pinning, every build resolves versions from scratch -- network round-trips to PyPI, and potentially different outcomes on different days.
```shell
# Generate a lock file (once, or when deps change)
pip-compile requirements.in -o requirements.txt
```

```dockerfile
# In Dockerfile: install from lock file with exact versions
RUN pip install --no-cache-dir --no-deps -r requirements.txt
```
The --no-deps flag tells pip "don't resolve dependencies, just install exactly what's listed." Skips the resolver entirely, saves minutes on large dependency trees.
5. Parallel builds with autoscaling
All builds ran on a single Jenkins instance. Fine with 10 projects. At 100+, the queue was permanently backed up. I moved to an autoscaling EC2 build cluster.
This didn't make individual builds faster, but it killed the queue. A build that finishes in 5 minutes but waits 25 minutes in the queue is still a 30-minute build from the engineer's perspective.
6. Build-time smoke tests
Catch failures early, before expensive steps. If you're going to fail, fail in 30 seconds, not 35 minutes.
```dockerfile
# Quick syntax and import check before building
RUN python3 -c "import myproject; print('imports ok')"

# Quick test with minimal data (--timeout requires the pytest-timeout plugin)
RUN python3 -m pytest tests/smoke/ -x --timeout=60
```
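A smoke suite doesn't need to be clever; it needs to fail fast. Something like this hypothetical `tests/smoke/test_basics.py` (module names and the tiny pipeline are placeholders, not the real project's tests) is enough to catch a broken install or import in seconds:

```python
# tests/smoke/test_basics.py -- hypothetical minimal smoke suite
import importlib

# In a real project: your own package plus its heaviest dependencies.
CRITICAL_MODULES = ["json", "csv", "sqlite3"]

def test_critical_imports():
    # Fails in seconds if an install step silently broke a dependency.
    for name in CRITICAL_MODULES:
        importlib.import_module(name)

def test_tiny_pipeline():
    # A trivial end-to-end code path with minimal data: no network, no GPU.
    rows = [{"x": 1}, {"x": 2}]
    assert sum(r["x"] for r in rows) == 3
```

The point is placement: these run before the expensive layers, so a doomed build dies at the top of the Dockerfile.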
Before and after
- Median build time: 40 minutes → 5 minutes
- P95 build time: 1 hour+ → 15 minutes
- Final image size: 6-8 GB → 1-2 GB
- Cache hit rate: ~0% → ~80% of builds use cached dependency layers
- Queue wait time: permanently backed up → near zero with autoscaling
The business impact nobody expected
I optimised builds because slow builds annoyed me. The first version was a Makefile that would make you cry. It wasn't on any roadmap -- I just started fixing things.
When a build takes 40 minutes, data scientists batch up changes and test once a day. When it takes 5 minutes, they test after every meaningful change. More iterations per day means faster convergence on good models. That faster feedback loop also caught bugs sooner -- an issue introduced at 10am was found at 10:05am, not at 5pm when the daily build finally finished.
With builds this fast, daily CI became practical for the first time. Every ML project was built automatically by Jenkins every day, with all unit and integration tests run. A daily report went to a dedicated Slack channel, tagging the owners of any failing projects. If nothing failed, a randomly selected message was posted instead. A few examples:
```
EmptyArrayException: there should be at least one failing project
```

```
File "ci.py", line 69420:
    failed_projects[0]
IndexError: list index out of range
```

```
Did the tests actually pass, or have you just deleted them all?
```

```
We don't want the things to get broken.
But we want to dispel the illusion that things work.
Today, the illusion wins.
```

Data scientists actually looked forward to these. Nobody wanted to be the person who replaced them with a failure notification tagged to their name.
The cost savings compounded from there. Smaller images meant cheaper storage, faster pulls, faster job startup. But the real serving optimisation was the pre-fork pattern for inference servers. A hundred city models packed into one "super-model" per server. The idea: load all models once in the parent process, then fork workers that share the memory. Instead of N models times M workers doing N×M loads, you get N loads and shared memory across workers. This cut both startup time and memory cost dramatically.
How it worked under the hood
After loading and warming the models with test requests in the parent process, a call to gc.freeze() moved every tracked object into a permanent generation that the garbage collector never scans. The forked workers then shared the parent's heap via copy-on-write: because the collector never touches the frozen objects, their memory pages stay clean and therefore stay shared, instead of being duplicated per worker. O(N) loads and O(N) memory instead of O(N×M).
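A minimal sketch of the pattern (names and sizes are illustrative stand-ins; the real inference servers were of course more involved):

```python
import gc
import os

def load_models():
    # Stand-in for loading N city models; in reality this is GBs of weights.
    return {f"city_{i}": list(range(1000)) for i in range(100)}

models = load_models()   # 1. load everything once, in the parent
# ... warm up with test requests here ...
gc.collect()             # 2. collect load-time garbage before freezing
gc.freeze()              # 3. move survivors to the permanent generation

workers = []
for _ in range(4):       # 4. fork workers that share the heap copy-on-write
    pid = os.fork()
    if pid == 0:
        # Child: would serve requests against the shared, frozen models.
        os._exit(0)
    workers.append(pid)

for pid in workers:
    os.waitpid(pid, 0)
```

The `gc.collect()` before `gc.freeze()` matters: anything collectable at freeze time would otherwise be pinned forever in every worker.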
Measuring the pain (and the progress)
You can't improve what you can't see. I built a build metrics dashboard -- initially bash scripts curling the Airflow and Jenkins APIs, piped into matplotlib. Eventually a Databricks notebook generating a weekly dashboard. It tracked best, p50, and p95 build times per day with moving averages. Build success rate was tracked day-by-day with a moving minimum -- if 10% of builds failed one day due to infra issues, the moving minimum stayed at 90% for the next week. To inflict more visual pain.
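The moving minimum is worth spelling out, because it behaves differently from a moving average: a single bad day drags the line down for the whole window. A sketch (the 7-day window is an assumption):

```python
def moving_min(values, window=7):
    """Worst value in the trailing window -- one bad day sticks around for `window` days."""
    return [min(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]

success_rate = [1.0, 1.0, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(moving_min(success_rate))
# the 0.9 on day 3 pins the line at 0.9 until it leaves the trailing window
```

A moving average would smooth the 0.9 away almost immediately; the moving minimum keeps the wound visible, which was the point.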
One other fix worth mentioning: the streaming merge step. The original model merge downloaded all city model tarballs, extracted them, re-archived, compressed, and uploaded -- O(N) disk, staged discretely. I replaced it with a streaming tar+gz implementation: O(1) disk, pipelined instead of staged. Faster and cheaper.
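The idea can be sketched with Python's `tarfile` in stream mode: members are copied one at a time from source archives into the output, so nothing is ever extracted to disk. This is a sketch of the approach, not the original implementation (error handling and metadata concerns omitted):

```python
import tarfile

def merge_tars_streaming(sources, out_path):
    """Merge several tar archives into one tar.gz without extracting to disk.

    "w|gz" and "r|*" open the archives in non-seekable stream mode, so only
    one member's data is in flight at a time: O(1) disk, pipelined rather
    than staged.
    """
    with tarfile.open(out_path, "w|gz") as out:
        for src in sources:
            with tarfile.open(src, "r|*") as t:
                for member in t:
                    # extractfile() on the current member works in stream mode
                    fileobj = t.extractfile(member) if member.isreg() else None
                    out.addfile(member, fileobj)
```

The same shape works with S3 streams on both ends, which is where the real savings were: no local staging of N tarballs at all.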
And the LD_PRELOAD hack. Docker containers on Mac had TCP timeout issues -- macOS would silently kill long-running builds by closing idle TCP connections. I intercepted the system calls to force keepalives on all TCP sockets. Not elegant, but it eliminated months of intermittent build flakes.
The details
I wrote an LD_PRELOAD shim that hooked glibc's socket() to inject setsockopt keepalives on every TCP socket at creation time. This prevented macOS from timing out Docker's connections during long builds.
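The shim itself is C, but the per-socket setup it injected is roughly equivalent to this Python (only the portable `SO_KEEPALIVE` flag is shown; I'm not reproducing the shim's exact probe intervals):

```python
import socket

def force_keepalive(sock):
    # What the shim did for every TCP socket at creation time: enable
    # SO_KEEPALIVE so the kernel sends periodic probes on idle connections,
    # preventing the macOS/Docker networking layer from silently dropping them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

s = force_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # nonzero when enabled
s.close()
```

The reason for LD_PRELOAD rather than application code: the sockets belonged to Docker and its build tooling, not to anything we could patch directly.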
Most of it is standard Docker best practice. Some of it is knowing how the OS and language runtime actually work.