Multimodal Evals:🦄 #34

YouTube Video•Boundary•12,371 words

Content Summary

Discussion & Opinion


TL;DR

This episode demonstrates how to build a reliable multimodal eval system for receipt data extraction without a golden data set, using runtime invariant checks (like verifying subtotals add up to grand totals) instead of labeled ground truth. Kevin Gregory shows that by designing self-validating evals, iterating on data models based on discovered edge cases, and switching to Gemini Flash (3:20), he built a high-accuracy receipt extraction pipeline in roughly 3-4 hours. The key thesis is that understanding your data deeply and designing good eval infrastructure upfront makes multimodal AI problems dramatically easier than expected.
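As a rough illustration of what such a runtime invariant check could look like, here is a minimal Python sketch. The `Receipt` and `LineItem` models, their field names, the separate `tax` field, and the one-cent tolerance are all assumptions made for this example; they are not taken from the episode's actual data model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    amount: Decimal


@dataclass
class Receipt:
    # Hypothetical shape of one extracted receipt.
    items: list[LineItem]
    subtotal: Decimal
    tax: Decimal
    grand_total: Decimal


def check_invariants(receipt: Receipt, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return the violated invariants; an empty list means the extraction is self-consistent."""
    failures = []

    # Invariant 1: the line items should add up to the subtotal.
    items_sum = sum((item.amount for item in receipt.items), Decimal("0"))
    if abs(items_sum - receipt.subtotal) > tolerance:
        failures.append(f"items sum to {items_sum}, but subtotal is {receipt.subtotal}")

    # Invariant 2: subtotal plus tax should equal the grand total.
    expected_total = receipt.subtotal + receipt.tax
    if abs(expected_total - receipt.grand_total) > tolerance:
        failures.append(f"subtotal + tax is {expected_total}, but grand total is {receipt.grand_total}")

    return failures


good = Receipt(
    items=[LineItem("coffee", Decimal("4.50")), LineItem("bagel", Decimal("3.25"))],
    subtotal=Decimal("7.75"),
    tax=Decimal("0.62"),
    grand_total=Decimal("8.37"),
)
# Same receipt, but with one digit of the grand total misread.
misread = Receipt(good.items, good.subtotal, good.tax, Decimal("8.87"))

print(check_invariants(good))
print(check_invariants(misread))
```

No labeled answer is needed here: the receipt carries its own consistency check, so a misread digit shows up as a failed invariant. `Decimal` is used rather than `float` so that currency sums are not thrown off by binary rounding.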

ELI5

Imagine you're checking your friend's math homework, but you don't have the answer key. You can still check if they're right by adding up all the small numbers to see if they equal the big number at the bottom! That's what Kevin did with receipts: he told the computer to read them and then checked whether all the numbers added up correctly, without ever needing someone to tell him the right answers first.

Quick Actions

Design runtime evals based on mathematical invariants rather than requiring golden labeled data
Always look at your raw data before writing any extraction code
Use Gemini Flash for multimodal OCR/extraction tasks
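To apply the first action across a whole batch, one option is to compute a pass rate and list the receipts that fail, so those images can be inspected by hand and the data model adjusted. The sketch below assumes each extraction is a plain dict with hypothetical `item_amounts` and `total` keys holding amounts as strings; the file names and the `run_eval` helper are invented for illustration.

```python
from decimal import Decimal


def totals_consistent(extraction: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when the extracted item amounts add up to the extracted total."""
    items_sum = sum((Decimal(a) for a in extraction["item_amounts"]), Decimal("0"))
    return abs(items_sum - Decimal(extraction["total"])) <= tolerance


def run_eval(extractions: dict[str, dict]) -> tuple[float, list[str]]:
    """Return the fraction of self-consistent extractions and the names of the failing ones."""
    failing = [name for name, ex in extractions.items() if not totals_consistent(ex)]
    if not extractions:
        return 0.0, failing
    pass_rate = (len(extractions) - len(failing)) / len(extractions)
    return pass_rate, failing


# Hypothetical model outputs keyed by image file name.
extractions = {
    "a.jpg": {"item_amounts": ["4.50", "3.25"], "total": "7.75"},
    "b.jpg": {"item_amounts": ["12.00", "1.99"], "total": "18.99"},
    "c.jpg": {"item_amounts": ["20.00"], "total": "20.00"},
}

pass_rate, failing = run_eval(extractions)
print(f"pass rate: {pass_rate:.0%}, inspect: {failing}")
```

A failing receipt does not always mean the model misread something. It can also mean the data model is missing a field such as a discount or tip, which is the kind of edge case that only surfaces from looking at the raw images.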
