The Challenge of AI Model Evaluations with Ankur Goyal

Author: softwareengineeringdaily.com | June 10, 2025 | Duration: 45:22
Evaluations are critical for assessing the quality, performance, and effectiveness of software during development. Common evaluation methods include code reviews and automated testing, which can help identify bugs, ensure compliance with requirements, and measure software reliability. However, evaluating LLMs presents unique challenges due to their complexity, versatility, and potential for unpredictable behavior.

Ankur Goyal is the CEO and Founder of Braintrust Data, which provides an end-to-end platform for AI application development and focuses on making LLM development robust and iterative. Ankur previously founded Impira, which was acquired by Figma, and he later ran the AI team at Figma. Ankur joins the show to talk about Braintrust and the unique challenges of developing evaluations in a non-deterministic context.

Sean's been an academic, startup founder, and Googler. He has published works covering a wide range of topics from AI to quantum computing. Currently, Sean is an AI Entrepreneur in Residence at Confluent, where he works on AI strategy and thought leadership. You can connect with Sean on LinkedIn.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

For anyone curious about how the code running our world actually gets built, Software Engineering Daily offers a clear and consistent look behind the curtain. This isn't about hype cycles or surface-level news; it's a deep, technical conversation with the engineers, architects, and thinkers who are shaping our digital infrastructure. Each episode focuses on a specific technology, practice, or problem, breaking down complex systems into understandable parts. You'll hear detailed discussions on everything from database architectures and programming language design to the organizational challenges of scaling teams and the real-world trade-offs made in production systems. Hosted by softwareengineeringdaily.com, the podcast serves as a reliable source for developers who want to stay informed and inspired, translating the rapid pace of technological change into substantive, lasting knowledge. It’s for professionals who believe that understanding the "how" and "why" is just as important as knowing the "what." By dedicating time to thorough exploration, this podcast provides context that shorter formats simply cannot, making it an essential resource for anyone building the future, one line of code at a time. Tune in to hear unfiltered insights from the people on the front lines, discussing the tools and decisions that define modern software engineering.
Language: en-us | Episodes: 100

Software Engineering Daily
Podcast Episodes
pnpm with Zoltan Kochan

Duration: 37:26
Traditional package management systems for JavaScript have faced several inefficiencies related to dependency storage, resolution, and project performance. pnpm is a fast, disk-efficient package manager for JavaScript an…
Angular with Jessica Janiuk

Duration: 52:37
Modern web development faces several challenges, particularly when building scalable, maintainable, and high-performance applications. As applications grow, managing complex user interfaces, and ensuring efficient data h…
Context-Aware SQL and Metadata with Shinji Kim

Duration: 41:37
A common challenge in data-rich organizations is that critical context about the data is often hard to capture and even harder to keep up to date. As more people across the organization use data and data models get more…
Modern Data Visualization with Robert Kosara

Duration: 50:48
Data visualization is increasingly important as organizations prioritize data-driven decision-making. Tools that transform complex datasets into intuitive, interpretable visualizations are arguably just as critical as th…
A Conversation with Amazon CTO Werner Vogels

Duration: 49:32
Werner Vogels is the Chief Technology Officer at Amazon, where he has played a pivotal role in shaping the company’s technology vision for over two decades. Before joining Amazon in 2004, Werner was a research scientist…
Redis and AI Agent Memory with Andrew Brookins

Duration: 48:35
A key challenge with designing AI agents is that large language models are stateless and have limited context windows. This requires careful engineering to maintain continuity and reliability across sequential LLM intera…
Complex Workload Deployment with Will Stewart

Duration: 40:33
Deploying and managing cloud workloads is a complex task that requires developers to handle infrastructure, scaling, CI/CD pipelines, and database hosting. Configuring and maintaining Kubernetes, ensuring smooth deployme…