The Challenge of AI Model Evaluations with Ankur Goyal

Author: softwareengineeringdaily.com June 10, 2025 Duration: 45:22

Evaluations are critical for assessing the quality, performance, and effectiveness of software during development. Common evaluation methods include code reviews and automated testing, which help identify bugs, ensure compliance with requirements, and measure software reliability. However, evaluating LLMs presents unique challenges due to their complexity, versatility, and potential for unpredictable behavior.

Ankur Goyal is the CEO and founder of Braintrust Data, which provides an end-to-end platform for AI application development with a focus on making LLM development robust and iterative. Ankur previously founded Impira, which was acquired by Figma, and he later ran the AI team at Figma. Ankur joins the show to talk about Braintrust and the unique challenges of developing evaluations in a non-deterministic context.

Sean has been an academic, startup founder, and Googler. He has published works covering a wide range of topics from AI to quantum computing. Currently, Sean is an AI Entrepreneur in Residence at Confluent, where he works on AI strategy and thought leadership. You can connect with Sean on LinkedIn.

Please click here to see the transcript of this episode.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Software Engineering Daily

For anyone curious about how the code running our world actually gets built, Software Engineering Daily offers a clear and consistent look behind the curtain. This isn't about hype cycles or surface-level news; it's a deep, technical conversation with the engineers, architects, and thinkers who are shaping our digital infrastructure. Each episode focuses on a specific technology, practice, or problem, breaking down complex systems into understandable parts. You'll hear detailed discussions on everything from database architectures and programming language design to the organizational challenges of scaling teams and the real-world trade-offs made in production systems. Hosted by softwareengineeringdaily.com, the podcast serves as a reliable source for developers who want to stay informed and inspired, translating the rapid pace of technological change into substantive, lasting knowledge. It’s for professionals who believe that understanding the "how" and "why" is just as important as knowing the "what." By dedicating time to thorough exploration, this podcast provides context that shorter formats simply cannot, making it an essential resource for anyone building the future, one line of code at a time. Tune in to hear unfiltered insights from the people on the front lines, discussing the tools and decisions that define modern software engineering.

Author: softwareengineeringdaily.com Language: en-us Episodes: 100

Podcast Episodes

Carbon and Modernizing C++ with Chandler Carruth

14.08.2025

Duration: 1:03:37

Carbon is a programming language developed by Google as a successor to C++, and it aims to provide modern safety features while maintaining high performance. It's designed to offer seamless interoperability with C++ whil…

Podman with Brent Baude

12.08.2025

Duration: 43:48

Podman is an open-source container management tool that allows developers to build, run, and manage containers. Unlike Docker, it supports rootless containers for improved security and is fully compatible with standards…

SED News: Meta’s AI Gambit, Windsurf Shake‑Up, and the UK VPN Surge

07.08.2025

Duration: 47:24

SED News is a monthly podcast from Software Engineering Daily where hosts Gregor Vand and Sean Falconer unpack the biggest stories shaping software engineering, Silicon Valley, and the broader tech industry. In this epis…

Electron and Desktop App Engineering with Shelley Vohr

05.08.2025

Duration: 52:04

Electron is a framework for building cross-platform desktop applications using web technologies like JavaScript, HTML, and CSS. It allows developers to package web apps with a native-like experience by bundling them with…

Modal and Scaling AI Inference with Erik Bernhardsson

31.07.2025

Duration: 40:55

Modal is a serverless compute platform that's specifically focused on AI workloads. The company’s goal is to enable AI teams to quickly spin up GPU-enabled containers, and rapidly iterate and autoscale. It was founded by…

RxJS with Ben Lesh

29.07.2025

Duration: 50:53

RxJS is an open-source library for composing asynchronous and event-based programs. It provides powerful operators for transforming, filtering, combining, and managing streams of data, from user input and web requests to…

Small AI Models with Yoeven Khemlani

24.07.2025

Duration: 42:20

JigsawStack is a startup that develops a suite of custom small models for tasks such as scraping, forecasting, vOCR, and translation. The platform is designed to support collaborative knowledge work, especially in resear…

Streamlining Cloud Infrastructure Deployments with Jake Cooper

22.07.2025

Duration: 43:25

Railway is a software company that provides a popular platform for deploying and managing applications in the cloud. It automates tasks such as infrastructure provisioning, scaling, and deployment and is particularly kno…

Building Open Infrastructure for AI with Illia Polosukhin

17.07.2025

Duration: 50:12

Illia Polosukhin is a veteran AI researcher and one of the original authors of the landmark Transformer paper, Attention is All You Need, which he co-authored during his time at Google Research. He has a deep background…

TypeScript with Jake Bailey

15.07.2025

Duration: 48:10

TypeScript is a statically typed superset of JavaScript that adds optional type annotations and modern language features to improve developer productivity and code safety. The TypeScript compiler performs type checking a…