
How we gave data scientists ownership of their own deployments

The deployment ticket is where model quality tends to die. Here's what we built instead.

I want to tell you about a data scientist I'll call Nymphadora. She'd built a pricing model that was genuinely good -- better feature engineering, cleaner training pipeline, the works. She was excited about it. She filed a deployment ticket on a Tuesday.

The ticket sat in the backlog for a week because the DE team had 15 other tickets ahead of it. When a DE finally picked it up, they asked clarifying questions. Nymphadora was in a different timezone. Three more days. The DE deployed it. Something broke in staging because Nymphadora had never tested in that environment -- why would she? She didn't have access. More debugging. Three more days.

By the time her model went live, it was three weeks old. The data had drifted. The model was already stale. Her excitement had turned into the resigned shrug of someone who's been through this before.

A 5-minute task had become a 2-week saga.

Multiply that by 30 data scientists and 200 models, and your DE team isn't doing engineering any more -- it's working a ticket queue.

How you end up here

One or two data scientists, one data engineer. The DE knows the infrastructure. The DE deploys the model. Works fine. Then the team grows and nobody changes the process; they just add more people to it.

And the justifications for keeping it this way always sound reasonable. "Data scientists don't understand infrastructure" -- they don't need to understand Kubernetes; they need to understand their pipeline config and their tests. "We need a gatekeeper for production" -- you do, but it should be quality gates, not human gates. Automated tests and gradual rollouts tend to be better gatekeepers than a person context-switching between 15 projects. "What if they break something?" -- they will. That's what rollbacks are for. A broken deployment that auto-rolls back in 5 minutes costs far less than a 2-week deployment queue.
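The rollback half of that argument is mechanical enough to sketch. This is an illustrative gate, not any platform's real API: `client` stands in for whatever endpoint interface you have, and the thresholds mirror the config shown later in this post.

```python
import time

# Hedged sketch of "quality gates, not human gates": shift traffic to
# the new variant in steps, and roll back automatically the moment the
# observed error rate crosses a threshold. All names are illustrative.

def gradual_rollout(client, variant, step_percent=10,
                    step_interval_s=1800, max_error_rate=0.05):
    """Shift traffic to `variant` in increments; roll back on errors."""
    traffic = 0
    while traffic < 100:
        traffic = min(traffic + step_percent, 100)
        client.set_traffic(variant, traffic)
        time.sleep(step_interval_s)  # let error metrics accumulate
        if client.error_rate(variant) > max_error_rate:
            client.set_traffic(variant, 0)  # instant, automatic rollback
            return False
    return True
```

No human in the loop: a bad deploy costs one rollout step plus a rollback, not a two-week queue.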

What declarative self-service looks like

At Bolt, we replaced this with something fundamentally different. A data scientist's entire pipeline -- ingestion through training through deployment -- lived in a declarative config in the same repo as their model code.

Here's what Nymphadora's workflow would have looked like. She develops locally, using the same containers that production runs. She tests on cloud with real data -- one command. She updates the pipeline config, opens a PR, gets it reviewed by a peer (not a DE). Merges it. The config generates an Airflow DAG automatically -- the production orchestrator picks it up, deploys the new model alongside the current one at 0% traffic, runs data quality checks via Great Expectations, then replays recent live traffic from SageMaker Data Capture against the silent model to catch regressions. If anything that worked before now fails, the deploy blocks. If everything passes, traffic shifts gradually.
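The config-to-DAG step can be sketched roughly like this. It is illustrative, not Bolt's actual generator: walk the config and emit a linear chain of task definitions in the shape an Airflow DAG builder would consume.

```python
# Illustrative sketch (assumed names, not real platform code) of
# turning a declarative pipeline config into an ordered task chain.

PIPELINE_STEPS = ["ingest", "train", "deploy"]

def generate_dag(config):
    """Map a pipeline config dict onto a linear chain of tasks."""
    dag = {"dag_id": config["name"],
           "schedule": config["schedule"],
           "tasks": []}
    for step in PIPELINE_STEPS:
        dag["tasks"].append({
            "task_id": f"{config['name']}.{step}",
            # each task depends on the previous one in the chain
            "upstream": dag["tasks"][-1]["task_id"] if dag["tasks"] else None,
        })
    return dag
```

The point is that the data scientist only ever touches the config dict; the task graph, scheduling, and dependencies fall out mechanically.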

The replay step deserves elaboration. By default, up to 10K requests from the past month were replayed against the new model. Any request that didn't error on the previous model shouldn't error on the new one. Data scientists could supply Python callables for custom validation checks beyond pass/fail -- distribution comparisons, latency bounds, whatever made sense for their model. The number of requests, the timeframe, and the custom checks were all configurable in the pipeline config, with sane defaults if you didn't touch them. Most data scientists never changed the defaults. They didn't need to.
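The replay gate's core logic fits in a few lines. This is a minimal sketch under assumed names: `requests` are captured live payloads, `old_predict`/`new_predict` stand in for the current and silent endpoints, and `custom_checks` are the user-supplied callables from the pipeline config.

```python
# Hedged sketch of the replay gate: any request that succeeded on the
# old model must succeed on the new one; custom checks can compare the
# two outputs beyond pass/fail. Names are illustrative.

def replay_gate(requests, old_predict, new_predict,
                custom_checks=(), max_requests=10_000):
    """Replay captured traffic; return (passed, failures)."""
    failures = []
    for req in requests[:max_requests]:
        try:
            old_out = old_predict(req)
        except Exception:
            continue  # errored on the old model too: not a regression
        try:
            new_out = new_predict(req)
        except Exception as exc:
            failures.append((req, f"new model errored: {exc}"))
            continue
        for check in custom_checks:  # e.g. distribution or latency bounds
            ok, reason = check(req, old_out, new_out)
            if not ok:
                failures.append((req, reason))
    return len(failures) == 0, failures
```

A blocked deploy hands the data scientist the failing requests directly, which is far more actionable than a DE relaying a staging stack trace.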

No ticket. No handoff. No waiting.

The config is the contract

The config replaces the ticket. Instead of writing "please deploy my model with these settings" in Jira and hoping the DE interprets it correctly, the DS writes the config directly. It's a shared contract -- narrow enough to learn in an afternoon, stable enough to build on for years:

pipeline:
  name: demand-forecast
  schedule: "0 2 * * 1"          # Weekly, Monday 2am
  groups:
    generate: true                # Groups auto-discovered from data
    source: "warehouse.cities"
  training:
    instance_type: ml.m5.xlarge
    timeout_hours: 4
  deployment:
    type: endpoint
    integration_tests: true
    rollout:
      strategy: gradual
      initial_traffic: 0
      step_percent: 10
      step_interval_minutes: 30
    rollback:
      on_error_rate_above: 0.05

Version-controlled, code-reviewed, self-documenting.

Same code, laptop to cloud

Self-service only works if data scientists can test before hitting production. Local development has to feel like production:

# Run a pipeline step locally (native Python, fast iteration)
./pipeline run preprocess --local --group=london

# Run the same step in a production-like Docker container
./pipeline run preprocess --docker --group=london

# Run on cloud with real data (for final validation)
./pipeline run preprocess --cloud --group=london
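One way the three targets above can share a single entrypoint is to keep the step command identical and only swap the execution wrapper around it. A sketch, with an invented `cloud-runner` CLI standing in for the real cloud submitter:

```python
# Illustrative sketch of one entrypoint, three backends: the pipeline
# step command never changes; only the wrapper does. `cloud-runner`
# is a hypothetical placeholder, not a real tool.

def build_command(step, group, target="local", image="pipeline:latest"):
    """Return the argv to run a pipeline step on the given target."""
    step_cmd = ["python", "-m", "pipeline.steps", step, "--group", group]
    if target == "local":
        return step_cmd                                     # native Python
    if target == "docker":
        return ["docker", "run", "--rm", image, *step_cmd]  # prod image
    if target == "cloud":
        return ["cloud-runner", "submit", "--image", image, "--", *step_cmd]
    raise ValueError(f"unknown target: {target}")
```

Because the inner command is byte-for-byte identical everywhere, "it worked locally" actually means something by the time the step runs in production.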

When data scientists start thinking like engineers

The ones who thrived started thinking of themselves as ML engineers, not just modellers. They cared about pipeline reliability, model latency, deployment cost. Not because someone told them to, but because they could finally see these things.

When DE involvement IS needed

Self-service doesn't mean no-support. Legitimate reasons to involve DE:

  • New infrastructure requirements. "I need GPU training and we don't have GPU instances set up." That's an infrastructure problem, not a deployment problem.
  • Unusual data access. "I need data from a system we've never connected to." Setting up new ETL pipelines is DE work.
  • Cost optimisation. "My pipeline costs EUR 5K/month and I think it could be cheaper." Cost-aware architecture is a DE speciality.
  • Security and permissions. "I need access to PII data for a new model." This should go through a proper review process, not self-service.
  • Platform improvements. "The base Docker image is missing a library I need." Changes to shared infrastructure affect everyone and should be coordinated.

What actually changed

Experiments that would have died in a backlog made it to production and created real business value. I wrote about the numbers in a separate post -- the short version is that ML projects went from 100 to 200 in three months without DE involvement. Not migrations -- genuinely new experiments, many of which never hit production. The point was fast, cheap experimentation.

But the number that stuck with me wasn't the project count. It was the look on a data scientist's face the first time they merged a PR and watched their model go live in minutes. That's when I knew the old process hadn't just been slow -- it had been demoralising.

Getting there from here

We started with one pipeline, one cooperative data scientist, and a lot of duct tape. Converted their deployment from ticket-based to config-driven. Showed it worked. Then the next team wanted in, and the next.

The hardest part isn't the tooling. It's redefining what DE success looks like. DEs who've been gatekeepers can feel diminished when the gate goes away. That cultural shift has to be named out loud, repeatedly, before people start to believe it.