903: LLM Benchmarks Are Lying to You (And What to Do Instead), with Sinan Ozdemir

903: LLM Benchmarks Are Lying to You (And What to Do Instead), with Sinan Ozdemir

Author: Jon Krohn July 8, 2025 Duration: 1:28:20
Has AI benchmarking reached its limit, and what do we have to fill this gap? Sinan Ozdemir speaks to Jon Krohn about the lack of transparency in training data and the necessity of human-led quality assurance to detect AI hallucinations, when and why to be skeptical of AI benchmarks, and the future of benchmarking agentic and multimodal models. Additional materials: ⁠⁠⁠⁠⁠www.superdatascience.com/903⁠⁠⁠⁠ This episode is brought to you by Trainium2, the latest AI chip from AWS, by ⁠⁠Adverity, the conversational analytics platform⁠⁠ and by the ⁠⁠Dell AI Factory with NVIDIA⁠⁠. Interested in sponsoring a SuperDataScience Podcast episode? Email natalie@superdatascience.com for sponsorship information. In this episode you will learn: (16:48) Sinan’s new podcast, Practically Intelligent (21:54) What to know about the limits of AI benchmarking (53:22) Alternatives to AI benchmarks (1:01:23) The difficulties in getting a model to recognize its mistakes

Hosted by Dr. Jon Krohn, Super Data Science: ML & AI Podcast with Jon Krohn is a deep and accessible exploration of how artificial intelligence and machine learning are reshaping our world. Each episode features conversations with leading researchers, engineers, and entrepreneurs from both academia and industry, breaking down complex ideas into something tangible and relevant. You'll hear firsthand about emerging techniques, practical applications, and the evolving landscape of data-driven careers. The sheer volume of data in our world is growing at a staggering rate, and this podcast serves as a guide to understanding that expansion and finding your place within it. Rather than offering abstract theory, these discussions focus on real-world impact, from cutting-edge algorithms to the human stories behind major breakthroughs. Tune in for a thoughtful, nuanced look at the tools and trends that are defining the future, all through the lens of experts who are building that future every day. Whether you're actively working in the field or simply curious about the forces driving technological change, this podcast provides a consistent source of insight and inspiration, demystifying the science that is quietly transforming every aspect of our lives.
Author: Language: English Episodes: 100

Super Data Science: ML & AI Podcast with Jon Krohn
Podcast Episodes
974: When Will The AI Bubble Burst? How Bad Will It Be? [not-audio_url] [/not-audio_url]

Duration: 13:56
In this week’s Five-Minute Friday, Jon Krohn holds the AI bubble up to the light. He points to the deep greyzone found in AI startups like Cluely that are established on dubious ideas (Cluely’s tagline was “cheat on ever…
973: AI Systems Performance Engineering, with Chris Fregly [not-audio_url] [/not-audio_url]

Duration: 1:12:10
No one should be manually writing code in 2026, thinks Chris Fregly, Jon Krohn’s guest on this week’s episode. In this interview about Chris’ latest book, AI Systems Performance Engineering, he explains why it’s so impor…
972: In Case You Missed It in February 2026 [not-audio_url] [/not-audio_url]

Duration: 26:44
Jon Krohn recaps the month of February in this episode of In Case You Missed It. Across four interviews with Will Falcon (Episode 965), Tom Griffiths (Episode 969), Antje Barth (Episode 963), and Praveen Murugesan (Episo…
970: The “100x Engineer”: How to Be One, But Should You? [not-audio_url] [/not-audio_url]

Duration: 14:37
Working with code-gen models and Claude Code: In this Five-Minute Friday, Jon Krohn addresses how AI superstars like Andrej Karpathy are using AI agents in their coding work, the outlook for code-gen in 2026, and how you…
968: Is AI Automating Away All Coding Jobs? [not-audio_url] [/not-audio_url]

Duration: 14:56
Now that AI agents can develop new apps from product development to delivery, do AI developers have reason to worry about their careers? Podcast host Jon Krohn addresses the stark predictions that AI could “eliminate hal…
967: AI for the Physical World, with Samsara's Praveen Murugesan [not-audio_url] [/not-audio_url]

Duration: 55:10
VP of Engineering at Samsara Praveen Murugesan talks to Jon Krohn about processing 20 trillion data points covering 90 billion miles across private and public sectors, how the company helps truckers who operate long hour…