Lessons from Transcribing and Indexing 3.5 Million Podcasts with Arvid Kahl

Author: Software Huddle
Date: July 8, 2025
Duration: 1:18:00
Big-time guest today as Arvid Kahl joins us. Arvid is my favorite type of guest -- a deeply technical founder that can talk about both the technical and business challenges of a startup. Lots to enjoy from this episode.

Arvid is known as the Bootstrapped Founder and has documented his path to selling Feedback Panda back in 2019. He's now building Podscan and sharing his journey as he goes.

Podscan is a fascinating project. It's making the content of *every* podcast episode around the world fully searchable. He currently has 3.5 million episodes transcribed and adds another 30,000 - 50,000 episodes every day. This involves a ton of technical challenges, including how to get the best transcription results from the latest LLMs, whether you should use APIs from public providers or run your own LLMs, and how to efficiently provide full-text search across terabytes of transcription data. Arvid shares the lessons he's learned and the various strategies he's tried over the years.

But there are also unique business challenges. For most technical businesses, your infrastructure costs grow in line with your customers. More customers == more data == more servers. With Podscan, Arvid has to index the entire podcast ecosystem regardless of how many customers he has. This means a lot of upfront investment as he looks to grow his customer base. Arvid tells us how he's optimized his infrastructure to account for this unique challenge.

Every week on Software Huddle, Alex DeBrie and Sean Falconer sit down with a different expert from across the tech landscape. The conversations are less about quick tips and more about substantive discussions, digging into the real challenges and decisions behind building software, launching products, and navigating the industry's constant shifts. You'll hear from practitioners who have been in the trenches, offering perspectives that blend deep technical knowledge with hard-won business and entrepreneurial experience. Alex brings his specialized expertise as the author of The DynamoDB Book and an AWS Data Hero, while Sean contributes a unique viewpoint shaped by over two decades as an engineer, founder, and marketing executive, recognized as a Snowflake Data Superhero. Together, they create a space where complex topics in software development and technology trends become accessible and genuinely engaging. This podcast is for anyone who wants to move beyond surface-level news and understand the "why" behind the tools and strategies shaping our digital world. Tune in for a thoughtful huddle that feels more like a candid conversation between colleagues than a formal interview.
Language: en-us
Episodes: 79

Software Huddle
Podcast Episodes
Enterprise-grade Dev Environments with Ivan Burazin

Duration: 51:33
Today’s guest is Ivan Burazin, the co-founder and CEO of Daytona and the creator of the Shift Developer Conference, which he sold some time ago to Infobip. Ivan has tons of experience building developer tools, he has be…
Operational Data Warehouse with Nikhil Benesch

Duration: 1:05:56
Today's episode is with Nikhil Benesch, who's the co-founder and CTO at Materialize, an Operational Data Warehouse. Materialize gets you the best of both worlds, combining the capabilities of your data warehouse with the…
Multi-tenancy with Khawaja Shams

Duration: 1:09:04
Today's episode is with Khawaja Shams. Khawaja is the CEO and co-founder of Momento, which is a Serverless Cache. He used to lead the DynamoDB team at AWS and now he's doing Momento. We talk about a lot of different thin…
All about Rust with Tim McNamara

Duration: 1:51:56
In today's episode with Tim McNamara, we talk all about Rust. Tim is one of the leading educators in the Rust space. He wrote the Rust in Action book, which is probably the best Rust book out there. He…
Becoming an Epic Web Developer with Kent C Dodds

Duration: 55:39
Today, we have Kent C Dodds on the show. If you don't know Kent, he's a well-known expert in JavaScript, Web Development and Teaching. His courses like Testing JavaScript, Epic React, and Epic Web Dev have helped countle…
SQL Meets Vector Search with Linpeng Tang of MyScale

Duration: 1:01:38
Welcome back to an episode where we're talking Vectors, Vector Databases, and AI with Linpeng Tang, CTO and co-founder of MyScale. MyScale is a super interesting technology. They're combining the best of OLAP databases w…
What is a Vector Database with Yujian Tang

Duration: 50:44
Today's guest is Yujian Tang from Zilliz, one of the big players in the vector database market. This is the first episode in a series of episodes we’re doing on vectors and vector databases. We start with the basics, wha…
Serverless Clickhouse with Tyler Wells

Duration: 1:12:12
Today's episode is with Tyler Wells. Tyler is the CTO and co-founder at Propel. He was an early employee at Skype (and Microsoft after the acquisition) as well as Twilio. While at Twilio, Tyler helped build a data platfo…
Elasticsearch Fundamentals with Philipp Krenn

Duration: 1:20:09
Today, we have Philipp Krenn on the show. He's the head of DevRel for Elastic, and we took a deep dive into all the Elasticsearch stuff like Indexes, Mappings, Shards and Replicas and how to think about performance and all…
Building a Better C with Loris Cro from Zig Software Foundation

Duration: 1:10:25
Zig is a new programming language with big ambitions: to be a better C. Loris Cro is the VP of Community at the Zig Software Foundation, and he takes us through the ins and outs of Zig -- how it was created, what problem…