Fixing GPU Starvation in Large-Scale Distributed Training

Author: Demetrios

April 3, 2026

Duration: 52:48

Kashish Mittal is a Staff Software Engineer at Uber, working on large-scale distributed systems and core backend infrastructure.

Fixing GPU Starvation in Large-Scale Distributed Training // MLOps Podcast #367 with Kashish Mittal, Staff Software Engineer at Uber

Join the Community: https://go.mlops.community/YTJoinIn

Get the newsletter: https://go.mlops.community/YTNewsletter

MLOps GPU Guide: https://go.mlops.community/gpuguide

// Abstract

Kashish zooms out to discuss a universal industry pattern: how infrastructure—specifically data loading—is almost always the hidden constraint for ML scaling.

The conversation dives deep into a recent architectural war story. Kashish walks through the full-stack profiling and detective work required to solve a massive GPU starvation bottleneck. By redesigning the Petastorm caching layer to bypass CPU transformation walls and uncovering hidden distributed race conditions, his team boosted GPU utilization to 60%+ and cut training time by 80%. Kashish also shares his philosophy on the fundamental trade-offs between latency and efficiency in GPU serving.

// Bio

Kashish Mittal is a Staff Software Engineer at Uber, where he architects the hyperscale machine learning infrastructure that powers Uber’s core mobility and delivery marketplaces. Prior to Uber, Kashish spent nearly a decade at Google building highly scalable, low-latency distributed ML systems for flagship products, including YouTube Ads and Core Search Ranking. His engineering expertise lies at the intersection of distributed systems and AI—specifically focusing on large-scale data processing, eliminating critical I/O bottlenecks, and maximizing GPU efficiency for petabyte-scale training pipelines. When he isn't hunting down distributed race conditions, he is a passionate advocate for open-source architecture and building reproducible, high-throughput ML systems.

// Related Links

Website: https://www.uber.com/

Getting Humans Out of the Way: How to Work with Teams of Agents // MLOps Podcast #368 with Rob Ennals, the Creator of Broomy: https://www.youtube.com/watch?v=ie1M8p-SVfM

~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~

Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore

Join our Slack community: https://go.mlops.community/slack

Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)

MLOps Swag/Merch: https://shop.mlops.community/

Connect with Demetrios on LinkedIn: /dpbrinkm

Connect with Kashish on LinkedIn: /kashishmittal/

Timestamps:

[00:00] Local dataset caching

[00:30] Engineers Evolving Roles

[04:44] GPU Resource Management

[10:21] GPU Utilization Issues

[21:49] More GPU War Stories

[32:12] Model Serving Issues

[39:58] Reflective Learning in Coding

[43:23] Workflow and Reflective Skills

[52:30] Wrap up

MLOps.community

Hosted by Demetrios, MLOps.community is a space for honest, meandering talks about the real work of making artificial intelligence systems actually work. This isn't about hype or theoretical papers; it's about the messy, practical, and often surprising journey of taking models from a notebook into a live environment. You'll hear from engineers and practitioners who are in the trenches, discussing the tools, the frustrations, and the occasional breakthroughs that define the day-to-day. The conversations are deliberately relaxed, covering everything from traditional machine learning pipelines to the new world of large language models and even the intangible "vibes" of team culture and process. Each episode peels back a layer on what "production" really means, whether that involves deploying a predictive service, managing an agentic system, or maintaining reliability as everything scales. Tuning into this podcast feels like grabbing a coffee with colleagues who aren't afraid to dig into the technical nitty-gritty while keeping the tone conversational and accessible. It's for anyone who builds, manages, or is just curious about the operational backbone that allows AI to deliver value, offering a grounded perspective often missing from the broader conversation.

Author: Demetrios

Language: en-us

Episodes: 100

Podcast Episodes

Building Out GPU Clouds // Mohan Atreya // #317

24.05.2025

Duration: 47:57

Demetrios and Mohan Atreya break down the GPU madness behind AI — from supply headaches and sky-high prices to the rise of nimble GPU clouds trying to outsmart the giants. They cover power-hungry hardware, failed experim…

A Candid Conversation Around MCP and A2A // Rahul Parundekar and Sam Partee // #316 SF Live

21.05.2025

Duration: 1:04:42

Demetrios, Sam Partee, and Rahul Parundekar unpack the chaos of AI agent tools and the evolving world of MCP (Model Context Protocol). With sharp insights and plenty of laughs, they dig into tool permissions, security qu…

AI in M&A: Building, Buying, and the Future of Dealmaking // Kison Patel // #315

16.05.2025

Duration: 55:32

AI in M&A: Building, Buying, and the Future of Dealmaking // MLOps Podcast #315 with Kison Patel, CEO and M&A Science at DealRoom. Join the Community: https://go.mlops.community/YTJoinIn Get the newsletter: https://go.mlop…

AI, Marketing, and Human Decision Making // Fausto Albers // #313

14.05.2025

Duration: 49:40

AI, Marketing, and Human Decision Making // MLOps Podcast #313 with Fausto Albers, AI Engineer & Community Lead at AI Builders Club. Join the Community: https://go.mlops.community/YTJoinIn Get the newsletter: https://go.m…

MLOps with Databricks // Maria Vechtomova // #314

13.05.2025

Duration: 52:43

MLOps with Databricks // MLOps Podcast #314 with Maria Vechtomova, MLOps Tech Lead | Founder at Ahold Delhaize | Marvelous MLOps. Join the Community: https://go.mlops.community/YTJoinIn Get the newsletter: https://go.mlop…

Making AI Reliable is the Greatest Challenge of the 2020s // Alon Bochman // #312

06.05.2025

Duration: 1:01:37

Making AI Reliable is the Greatest Challenge of the 2020s // MLOps Podcast #312 with Alon Bochman, CEO of RagMetrics. Join the Community: https://go.mlops.community/YTJoinIn Get the newsletter: https://go.mlops.community/…

Behavior Modeling, Secondary AI Effects, Bias Reduction & Synthetic Data // Devansh Devansh // #311

02.05.2025

Duration: 1:01:35

Behavior Modeling, Secondary AI Effects, Bias Reduction & Synthetic Data // MLOps Podcast #311 with Devansh Devansh, Head of AI at Stealth AI Startup. Join the Community: https://go.mlops.community/YTJoinIn Get the newsle…

GraphBI: Expanding Analytics to All Data Through the Combination of GenAI, Graph, & Visual Analytics // Paco Nathan & Weidong Yang // #310

29.04.2025

Duration: 1:14:01

GraphBI: Expanding Analytics to All Data Through the Combination of GenAI, Graph, & Visual Analytics // MLOps Podcast #310 with Paco Nathan, Principal DevRel Engineer at Senzing & Weidong Yang, CEO of Kineviz. Join the Co…

AI Data Engineers - Data Engineering After AI // Vikram Chennai // #309

25.04.2025

Duration: 49:40

AI Data Engineers - Data Engineering after AI // MLOps Podcast #309 with Vikram Chennai, Founder/CEO of Ardent AI. Join the Community: https://go.mlops.community/YTJoinIn Get the newsletter: https://go.mlops.community/YTN…