Speculative Decoding and Efficient LLM Inference with Chris Lott - #717

Author: Sam Charrington

February 4, 2025

Duration: 1:16:30

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Today, we're joined by Chris Lott, senior director of engineering at Qualcomm AI Research, to discuss accelerating large language model inference. We explore the challenges presented by LLM encoding and decoding (aka generation) and how these interact with hardware constraints such as FLOPS, memory footprint, and memory bandwidth to limit key inference metrics such as time-to-first-token, tokens per second, and tokens per joule. We then dig into a variety of techniques that can be used to accelerate inference, such as KV compression, quantization, pruning, speculative decoding, and leveraging small language models (SLMs). We also discuss future directions for enabling on-device agentic experiences, such as parallel generation and software tools like Qualcomm AI Orchestrator. The complete show notes for this episode can be found at https://twimlai.com/go/717.

Hosted by industry analyst and commentator Sam Charrington, The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence) serves as a vital conduit between cutting-edge research and its real-world implications. This isn't just a series of technical lectures; it's a series of conversations that unpack how AI and machine learning are actively reshaping industries and societal structures. Each episode connects you directly with leading researchers, engineers, and innovative thinkers who are defining the frontiers of the field. The discussions go beyond abstract theory to explore the practical challenges, ethical considerations, and business transformations driven by these technologies. Whether you're a data scientist deep in the code, a tech-savvy leader strategizing implementation, or simply fascinated by the future of intelligent systems, this podcast provides the context and depth needed to stay informed. By focusing on the people behind the algorithms and the ideas powering the platforms, Sam creates a resource that is both intellectually substantive and genuinely engaging, building a thoughtful community around one of the most significant technological shifts of our time.

Author: Sam Charrington

Language: English

Episodes: 100

Podcast Episodes

Simplifying On-Device AI for Developers with Siddhika Nevrekar - #697

12.08.2024

Duration: 46:37

Today, we're joined by Siddhika Nevrekar, AI Hub head at Qualcomm Technologies, to discuss on-device AI and how to make it easier for developers to take advantage of device capabilities. We unpack the motivations for AI…

Genie: Generative Interactive Environments with Ashley Edwards - #696

05.08.2024

Duration: 46:51

Today, we're joined by Ashley Edwards, a member of technical staff at Runway, to discuss Genie: Generative Interactive Environments, a system for creating ‘playable’ video environments for training deep reinforcement lea…

Bridging the Sim2real Gap in Robotics with Marius Memmel - #695

30.07.2024

Duration: 57:21

Today, we're joined by Marius Memmel, a PhD student at the University of Washington, to discuss his research on sim-to-real transfer approaches for developing autonomous robotic agents in unstructured environments. Our c…

Building Real-World LLM Products with Fine-Tuning and More with Hamel Husain - #694

24.07.2024

Duration: 1:20:05

Today, we're joined by Hamel Husain, founder of Parlance Labs, to discuss the ins and outs of building real-world products using large language models (LLMs). We kick things off discussing novel applications of LLMs and…

Mamba, Mamba-2 and Post-Transformer Architectures for Generative AI with Albert Gu - #693

17.07.2024

Duration: 57:54

Today, we're joined by Albert Gu, assistant professor at Carnegie Mellon University, to discuss his research on post-transformer architectures for multi-modal foundation models, with a focus on state-space models in gene…

Decoding Animal Behavior to Train Robots with EgoPet with Amir Bar - #692

09.07.2024

Duration: 43:16

Today, we're joined by Amir Bar, a PhD candidate at Tel Aviv University and UC Berkeley, to discuss his research on visual-based learning, including his recent paper, “EgoPet: Egomotion and Interaction Data from an Animal…

How Microsoft Scales Testing and Safety for Generative AI with Sarah Bird - #691

01.07.2024

Duration: 57:12

Today, we're joined by Sarah Bird, chief product officer of responsible AI at Microsoft. We discuss the testing and evaluation techniques Microsoft applies to ensure safe deployment and use of generative AI, large langua…

Long Context Language Models and their Biological Applications with Eric Nguyen - #690

25.06.2024

Duration: 45:41

Today, we're joined by Eric Nguyen, PhD student at Stanford University. In our conversation, we explore his research on long context foundation models and their application to biology, particularly Hyena, and its evolutio…

Accelerating Sustainability with AI with Andres Ravinet - #689

18.06.2024

Duration: 47:46

Today, we're joined by Andres Ravinet, sustainability global black belt at Microsoft, to discuss the role of AI in sustainability. We explore real-world use cases where AI-driven solutions are leveraged to help tackle en…

Gen AI at the Edge: Qualcomm AI Research at CVPR 2024 with Fatih Porikli - #688

11.06.2024

Duration: 1:10:41

Today, we’re joined by Fatih Porikli, senior director of technology at Qualcomm AI Research. In our conversation, we covered several of the Qualcomm team’s 16 accepted main track and workshop papers at this year’s CVPR co…