← All posts

What happens when data engineering becomes the platform, not the gateway

The gatekeeper pattern feels safe. We found it was a bottleneck in disguise.

It took three months to build the first version of the platform that replaced Bolt's manual deployment process. Three months of no visible output -- no features shipped, no tickets closed. Just infrastructure work that wouldn't pay off until it was done. Once the prototype landed, four of the six pipelines were ported to it, and data scientists could define their pipeline in a config file, open a PR, and have it running in production without filing a single ticket.

The gatekeeper pattern

[Diagram: Data Science -> tickets -> Data Engineering -> tickets -> SRE + Infra]
The gatekeeper model. Every request flows through DE as a serial bottleneck.

DS needs to deploy a model. Ticket to DE. DE needs a new instance type. Ticket to infra. DS needs to change a retraining schedule. Ticket to DE. DE becomes a ticket queue, not an engineering team. And every ticket is a context switch, a handoff, a potential miscommunication, and a delay measured in days or weeks.

DE understands the infrastructure. DE can catch mistakes before they hit production. DE provides quality control. All of that value erodes when the cost is that data scientists can't ship.

I've written before about what it looks like when you remove the deployment ticket. But I didn't fully explain the structural insight behind why it works. That's what this post is about.

Three months of silence

I've written about the manual deployment process elsewhere -- the short version is laptops, scripts, quarterly retraining, a week of babysitting per cycle. My assignment was "improve the ML pipelines." I decided the right move was a platform -- a declarative config system where a data scientist describes what their pipeline does, and the system handles how.
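To make "declarative" concrete, here is a minimal sketch of what such a config system could look like. The field names, schema, and defaults are my own illustration, not Bolt's actual format:

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """A hypothetical declarative pipeline definition: the data scientist
    states *what* the pipeline does; the platform decides *how* to run it."""
    name: str
    train_script: str                  # entry point the data scientist owns
    schedule: str                      # cron expression for retraining
    instance_type: str = "m5.xlarge"   # platform-chosen default, overridable

def parse_config(raw: dict) -> PipelineConfig:
    """Validate a raw dict (e.g. loaded from a YAML file in the PR)."""
    missing = {"name", "train_script", "schedule"} - raw.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return PipelineConfig(**raw)

cfg = parse_config({
    "name": "demand-forecast",
    "train_script": "train.py",
    "schedule": "0 3 * * 1",  # retrain every Monday at 03:00
})
```

Everything operational -- where the job runs, how it's monitored, how it rolls back -- stays on the platform's side of that boundary.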

That first prototype was rough -- bash scripts and Python glue. I've written about why the rough implementation survived and why that was fine.

Three circles and a platform

About a year in, I started drawing diagrams to explain what we'd built. The standard explanation -- "we automated the DE tasks" -- wasn't right. We'd changed the relationship between three groups entirely. The platform sat at the intersection -- a shared interface to the infrastructure that both DS and DE could use directly, without depending on each other.

[Venn diagram: three circles -- Data Science, Data Engineering, SRE + Infra. The DS/DE overlap is labelled "where the friction was"; the DE/SRE+Infra overlap "natural, healthy"; the DS/SRE+Infra overlap "unnatural unless infra team includes DE". At the centre of all three: "Platform / Automation".]
The shared interface model. The platform sits at the centre, not DE.

Three overlaps, each with a different character:

The DS/DE overlap -- deployment, retraining, config changes -- was where the friction lived. Every ticket in this zone was a delay.

The DE/SRE+Infra overlap -- infrastructure, monitoring, scaling -- was healthy and natural.

The DS/SRE+Infra overlap is the interesting one. Data scientists doing infrastructure directly is unnatural -- it happens when there's no DE team, or when DE is so backed up that DS starts going around them. It looks like data scientists writing Terraform, managing their own Kubernetes deployments, debugging network policies. They can do it. They shouldn't have to. It's a symptom of a missing platform layer, not a solution. And it's expensive: every hour a data scientist spends fighting infrastructure is an hour they're not spending on the model work you hired them for.

And in the centre, where all three circles meet: the platform. It's an interface to the infrastructure and operations that both DS and DE can interact with directly -- neither depends on the other. SRE mostly doesn't need to know the details of it. They manage the infrastructure underneath; the platform handles the translation.

Serial vs parallel

The gatekeeper model fails because it's serial. Every request passes through DE, even when DE adds no value. A data scientist changing a retraining schedule doesn't need a data engineer -- they need a config change. A data scientist deploying a retrained model doesn't need a data engineer -- they need an automated deployment pipeline with rollbacks.
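As a sketch of what "automated deployment pipeline with rollbacks" can mean in practice -- the registry structure and function names here are hypothetical, not the platform's actual API:

```python
def deploy(version: str, registry: dict, health_check) -> None:
    """Promote `version` to live; roll back automatically if it is unhealthy."""
    previous = registry.get("live")
    registry["live"] = version
    if not health_check(version):
        registry["live"] = previous  # automatic rollback -- no ticket, no DE
        raise RuntimeError(f"{version} failed its health check; rolled back")

registry = {"live": "model-v7"}
deploy("model-v8", registry, health_check=lambda v: True)
# A failing health check would leave registry["live"] at the previous version.
```

The point is that the safety net the gatekeeper provided -- a human checking the deploy -- is replaced by a mechanical check the data scientist can trigger themselves.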

The shared interface model works because it's parallel. DS interacts with the platform for deployment and retraining. DE interacts with it for pipeline design and platform improvements. SRE+Infra interacts with it for monitoring and scaling.
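The same point as code: a single shared surface with role-specific entry points, so no group's request waits in another group's queue. All names are hypothetical:

```python
class Platform:
    """Sketch of a shared interface: each group calls it directly,
    in parallel, instead of routing requests through another team."""
    def __init__(self):
        self.state = {}

    # Data science: ship and reschedule without a ticket.
    def deploy(self, pipeline: str, version: str):
        self.state.setdefault(pipeline, {})["version"] = version

    def set_schedule(self, pipeline: str, cron: str):
        self.state.setdefault(pipeline, {})["schedule"] = cron

    # Data engineering: evolve what pipelines can express.
    def register_step_type(self, name: str, handler):
        self.state.setdefault("_steps", {})[name] = handler

    # SRE + infra: change what runs underneath, invisibly to DS.
    def set_instance_type(self, pipeline: str, instance_type: str):
        self.state.setdefault(pipeline, {})["instance_type"] = instance_type

p = Platform()
p.deploy("demand-forecast", "v8")                     # DS, no DE in the loop
p.set_instance_type("demand-forecast", "m6i.xlarge")  # infra, no DS in the loop
```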

At Bolt, while I spent three months on infrastructure, ML projects doubled without DE involvement. I wrote about the numbers here.

The structural implication

When a DE team is processing deployment tickets all day, that's usually not a staffing problem -- it's an architecture problem. Ours was spending its days running scripts on behalf of data scientists who weren't allowed to run them themselves. More DEs would have meant more people doing the same low-leverage work.

Figuring this out was supposed to cost me my job. It didn't, for reasons that had nothing to do with me.