Fixing GPU Starvation in Large-Scale Distributed Training

Author: Demetrios | Date: April 3, 2026 | Duration: 52:48

Kashish Mittal is a Staff Software Engineer at Uber, working on large-scale distributed systems and core backend infrastructure.


Fixing GPU Starvation in Large-Scale Distributed Training // MLOps Podcast #367 with Kashish Mittal, Staff Software Engineer at Uber


Join the Community: https://go.mlops.community/YTJoinIn

Get the newsletter: https://go.mlops.community/YTNewsletter

MLOps GPU Guide: https://go.mlops.community/gpuguide


// Abstract

Kashish zooms out to discuss a universal industry pattern: how infrastructure—specifically data loading—is almost always the hidden constraint for ML scaling.


The conversation dives deep into a recent architectural war story. Kashish walks through the full-stack profiling and detective work required to solve a massive GPU starvation bottleneck. By redesigning the Petastorm caching layer to bypass CPU transformation walls and uncovering hidden distributed race conditions, his team boosted GPU utilization to 60%+ and cut training time by 80%. Kashish also shares his philosophy on the fundamental trade-offs between latency and efficiency in GPU serving.
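To make the caching idea concrete, here is a minimal Python sketch of one way a local cache can take CPU transformation off the hot path: each sample is transformed once, the result is written to local disk, and later epochs read the stored result instead of recomputing it. This is an illustration under stated assumptions, not the design discussed in the episode and not Petastorm's API. `TransformedSampleCache`, `iterate_epoch`, and the pickle-per-sample file layout are all hypothetical names and choices made for the example. The write-then-rename step is likewise an assumption about how to keep concurrent readers from seeing a partially written file.

```python
import os
import pickle
from typing import Any, Callable, Iterator, Sequence


class TransformedSampleCache:
    """Hypothetical cache that stores post-transform samples on local disk."""

    def __init__(self, cache_dir: str, transform: Callable[[Any], Any]) -> None:
        self.cache_dir = cache_dir
        self.transform = transform
        self.transform_calls = 0  # counts how often the CPU transform actually ran
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, index: int) -> str:
        return os.path.join(self.cache_dir, f"sample_{index}.pkl")

    def get(self, index: int, raw_sample: Any) -> Any:
        path = self._path(index)
        if os.path.exists(path):
            # Cache hit: skip the transform and load the stored result.
            with open(path, "rb") as f:
                return pickle.load(f)

        # Cache miss: pay the transformation cost once.
        self.transform_calls += 1
        result = self.transform(raw_sample)

        # Write to a process-specific temp file, then rename atomically so a
        # concurrent reader never observes a half-written cache entry.
        tmp_path = f"{path}.{os.getpid()}.tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(result, f)
        os.replace(tmp_path, path)
        return result


def iterate_epoch(raw_samples: Sequence[Any], cache: TransformedSampleCache) -> Iterator[Any]:
    """Yield transformed samples for one epoch, using the cache where possible."""
    for index, raw_sample in enumerate(raw_samples):
        yield cache.get(index, raw_sample)
```

In this sketch the first epoch runs the transform for every sample, and every later epoch over the same data is served entirely from disk, so `transform_calls` stops growing after epoch one. Whether that trade (disk space and read bandwidth in exchange for CPU time) pays off depends on how expensive the transform is relative to local reads.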


// Bio

Kashish Mittal is a Staff Software Engineer at Uber, where he architects the hyperscale machine learning infrastructure that powers Uber’s core mobility and delivery marketplaces. Prior to Uber, Kashish spent nearly a decade at Google building highly scalable, low-latency distributed ML systems for flagship products, including YouTube Ads and Core Search Ranking. His engineering expertise lies at the intersection of distributed systems and AI—specifically focusing on large-scale data processing, eliminating critical I/O bottlenecks, and maximizing GPU efficiency for petabyte-scale training pipelines. When he isn't hunting down distributed race conditions, he is a passionate advocate for open-source architecture and building reproducible, high-throughput ML systems.


// Related Links

Website: https://www.uber.com/

Getting Humans Out of the Way: How to Work with Teams of Agents // MLOps Podcast #368 with Rob Ennals, the Creator of Broomy: https://www.youtube.com/watch?v=ie1M8p-SVfM


~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~

Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore

Join our Slack community: https://go.mlops.community/slack

Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)

Sign up for the next meetup: https://go.mlops.community/register

MLOps Swag/Merch: https://shop.mlops.community/


Connect with Demetrios on LinkedIn: /dpbrinkm

Connect with Kashish on LinkedIn: /kashishmittal/


Timestamps:

[00:00] Local dataset caching

[00:30] Engineers Evolving Roles

[04:44] GPU Resource Management

[10:21] GPU Utilization Issues

[21:49] More GPU War Stories

[32:12] Model Serving Issues

[39:58] Reflective Learning in Coding

[43:23] Workflow and Reflective Skills

[52:30] Wrap up


Hosted by Demetrios, MLOps.community is a space for honest, meandering talks about the real work of making artificial intelligence systems actually work. This isn't about hype or theoretical papers; it's about the messy, practical, and often surprising journey of taking models from a notebook into a live environment. You'll hear from engineers and practitioners who are in the trenches, discussing the tools, the frustrations, and the occasional breakthroughs that define the day-to-day. The conversations are deliberately relaxed, covering everything from traditional machine learning pipelines to the new world of large language models and even the intangible "vibes" of team culture and process. Each episode peels back a layer on what "production" really means, whether that involves deploying a predictive service, managing an agentic system, or maintaining reliability as everything scales. Tuning into this podcast feels like grabbing a coffee with colleagues who aren't afraid to dig into the technical nitty-gritty while keeping the tone conversational and accessible. It's for anyone who builds, manages, or is just curious about the operational backbone that allows AI to deliver value, offering a grounded perspective often missing from the broader conversation.
