Learn RAG from Scratch

Do Not Ship RAG Without This (Evaluation Metrics)

Video 7 of 9 · 8:30

Chapters

  • 0:00 Why you need RAG metrics
  • 0:40 Faithfulness
  • 1:40 Answer relevance
  • 2:35 Context recall
  • 3:20 Putting it together

Transcript

Auto-generated by YouTube. Click any timestamp to jump to that moment.

0:03 It returns an answer, and it looks right. So, you ship it. Two weeks later, support tickets roll in. "The bot told me I can cancel any time, but the policy says 30-day notice." Your system retrieved the wrong document. The LLM made up a detail, and nobody caught it. "It seems to work" is not a shipping standard. You need to measure RAG quality the same way you measure uptime or latency. Three numbers tell you whether your RAG is production ready: faithfulness, answer relevance, and context recall.
0:36 First metric: faithfulness. Did the answer actually come from the retrieved context, or did the LLM invent a detail? Here is a concrete example. A user asks, "What is the refund policy for annual plans?" Your retriever pulls the right document. It says: pro-rated refund based on remaining months, minus a $50 processing fee. The LLM generates: "Full refund within 30 days, no questions asked." That sounds great, except the source never said it. The LLM invented the 30-day window and dropped the processing fee. Faithfulness scores 0.2 out of one. That is a hard fail. A faithful answer would say: pro-rated refund for remaining months, with a $50 processing fee. Every claim traces back to the source. That would score 0.95.

1:35 You measure this by breaking the answer into individual claims, then checking each claim against the retrieved context. If nine out of 10 claims are supported, your faithfulness is 0.9.
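If you want to see the mechanics, here is a minimal sketch of that calculation. It assumes an LLM judge has already split the answer into claims and returned a supported/unsupported verdict for each; the judge itself is not shown.

```python
# Minimal sketch: faithfulness as the fraction of supported claims.
# Assumes an LLM judge already produced a True/False verdict per claim.
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Nine of ten claims supported -> 0.9, as in the example above.
print(faithfulness_score([True] * 9 + [False]))  # 0.9
```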
1:48 Second metric: answer relevance. Does the answer address what the user actually asked? A user asks, "How do I set up SSO for my team?" The LLM replies: "SSO stands for single sign-on. It allows users to authenticate with one set of credentials across multiple applications. SSO improves security and user experience." Technically accurate, but the user asked how to set it up, not what it is. The answer relevance score is low: the answer is factual but misses the intent. A relevant answer would say: go to Settings, then Security, then SSO. Choose your provider. Enter your SAML endpoint URL. Upload the certificate and click Enable. Direct, and it matches what the user asked. Answer relevance score: 0.92.

2:40 You measure this by generating a question from the answer, then comparing the generated question to the original. If someone reading only the answer would ask the same question, relevance is high.
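A sketch of that idea, assuming an embedding model and an LLM call that proposes questions the answer could be answering; both `embed` and `generate_questions` here are hypothetical stand-ins, and cosine similarity is one common way to compare the questions.

```python
import numpy as np

# Sketch: relevance as similarity between the original question and
# questions reverse-generated from the answer. `embed` and
# `generate_questions` are hypothetical stand-ins for your models.
def answer_relevance(question: str, answer: str, embed, generate_questions) -> float:
    q = np.asarray(embed(question))
    sims = []
    for gen_q in generate_questions(answer):  # e.g. "What question does this answer?"
        g = np.asarray(embed(gen_q))
        sims.append(np.dot(q, g) / (np.linalg.norm(q) * np.linalg.norm(g)))
    return float(np.mean(sims))  # close to 1.0 when the answer matches intent
```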
2:50 Third metric: context recall. Did you even retrieve the right documents? This one is about the retriever, not the LLM. A user asks, "What are the system requirements for the enterprise tier?" Your ground truth answer lists four things: 16 GB RAM, four CPU cores, Ubuntu 20.04 or later, and PostgreSQL 14. Your retriever pulls back three chunks. Chunk one covers RAM and CPU. Chunk two covers supported operating systems. Chunk three covers pricing tiers, which is irrelevant. Two of the four ground truth items are covered; PostgreSQL 14 is missing entirely. Context recall: 0.5.

3:37 Only half the information the user needs is in the retrieved context. Even a perfect LLM cannot answer correctly if the right documents are not retrieved. That is why context recall matters. You measure it against a golden answer: for each statement in the golden answer, check if the retrieved context contains that information.
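As a rough sketch, assuming a `supported_by` judge (an LLM call or embedding match, not shown) that decides whether one golden statement is attributable to the retrieved chunks:

```python
# Sketch: context recall as the fraction of golden-answer statements
# found in the retrieved chunks. `supported_by` is a hypothetical
# judge (LLM call or embedding match).
def context_recall_score(golden_statements: list[str],
                         chunks: list[str], supported_by) -> float:
    if not golden_statements:
        return 0.0
    hits = sum(1 for s in golden_statements if supported_by(s, chunks))
    return hits / len(golden_statements)

# Two of the four requirements found -> 0.5, as in the example above.
```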
3:56 If you want to learn how to do this yourself, I run free live sessions every Friday at noon Eastern. Scan the QR code on screen to join. I would love to see you there.
4:12 Now you know the three metrics, but you do not have to build the scoring from scratch. Ragas is an open-source evaluation framework that measures all three, faithfulness, answer relevance, and context recall, plus a few more. You give Ragas a question, the retrieved context, the generated answer, and a ground truth answer. It returns scores from zero to one for each metric. Think of it as unit tests for your RAG pipeline. You would not ship a feature without tests. Do not ship RAG without evaluation.

4:51 Ragas also tracks context precision, which measures whether the most relevant chunks are ranked highest, and answer correctness, which compares the generated answer against a known good answer.
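In code, a minimal Ragas run looks roughly like this. It follows the classic Ragas interface (column names and metric imports vary across versions, so check the docs for your release); the refund example data is just the one from earlier in the video.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# One row per test question; column names follow the classic Ragas schema.
data = Dataset.from_dict({
    "question": ["What is the refund policy for annual plans?"],
    "contexts": [["Annual plans: pro-rated refund for remaining months, "
                  "minus a $50 processing fee."]],
    "answer": ["Pro-rated refund for remaining months, with a $50 processing fee."],
    "ground_truth": ["Pro-rated refund for remaining months, minus a $50 processing fee."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)  # one zero-to-one score per metric
```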
5:04 What scores should you target? Here is a practical benchmark from a PM's perspective. Faithfulness: aim for 0.85 or higher. Below 0.7 means the LLM is making up details. That is a support ticket factory. Answer relevance: aim for 0.8 or higher. Below that, your users are getting technically correct but useless answers, and they will stop trusting the system. Context recall: aim for 0.75 or higher. Below 0.5 means your retriever is missing half the relevant information, and no LLM can fix bad retrieval. Green means ship it. Amber means investigate and improve. Red means stop and fix before going live. These are starting points. Your specific thresholds depend on your domain: medical and legal need higher faithfulness; search-heavy products need higher context recall.
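One way to encode those bands, with the caveat that the amber/red boundary for answer relevance was not specified above, so the 0.6 cutoff here is an assumption:

```python
# Traffic-light bands from the benchmark above. The answer-relevance red
# cutoff (0.6) is an assumed value; tune all of these for your domain.
THRESHOLDS = {
    "faithfulness":     {"green": 0.85, "red": 0.70},
    "answer_relevance": {"green": 0.80, "red": 0.60},  # red cutoff assumed
    "context_recall":   {"green": 0.75, "red": 0.50},
}

def traffic_light(metric: str, score: float) -> str:
    band = THRESHOLDS[metric]
    if score >= band["green"]:
        return "green"  # ship it
    if score >= band["red"]:
        return "amber"  # investigate and improve
    return "red"        # stop and fix before going live
```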
6:08 Metrics are useless without a test set. Here is how to build one. Start with 20 to 30 golden question-answer pairs. These are questions your users actually ask, paired with the correct answers you have manually verified. For each pair, store the question, the golden answer, and the expected source documents. Run your RAG pipeline on each question. Collect the retrieved context and generated answer. Feed all of it into Ragas. You get a scorecard with faithfulness, answer relevance, and context recall for each question. Average them for your overall scores. Store these as your baseline. Every time you change your chunking strategy, swap an embedding model, or update your prompt, rerun the test suite. If scores drop, revert. If scores improve, ship.
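Here is a sketch of that harness, assuming a `rag_pipeline(question)` function of your own that returns the retrieved chunks and the generated answer, plus the Ragas imports from the earlier snippet; the `golden_set.json` filename is illustrative.

```python
import json
# Reuses Dataset, evaluate, and the three metrics from the Ragas snippet above.

# 20-30 manually verified question/answer pairs; filename is illustrative.
with open("golden_set.json") as f:
    golden = json.load(f)

rows = {"question": [], "contexts": [], "answer": [], "ground_truth": []}
for item in golden:
    chunks, answer = rag_pipeline(item["question"])  # your pipeline, not shown
    rows["question"].append(item["question"])
    rows["contexts"].append(chunks)
    rows["answer"].append(answer)
    rows["ground_truth"].append(item["golden_answer"])

scorecard = evaluate(Dataset.from_dict(rows),
                     metrics=[faithfulness, answer_relevancy, context_recall])
print(scorecard)  # average per metric; store as your baseline
```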
7:02 The final piece: evaluation is not a one-time check. It is a continuous loop. You start with baseline scores, then you make a change. Maybe you switch from fixed-size chunks to semantic chunking. You rerun the test suite. Context recall goes from 0.62 to 0.78. Ship it. Next change. Faithfulness stays at 0.88, and answer relevance improves from 0.74. Ship it. Then you try a new prompt. Faithfulness drops from 0.88 to 0.71. Revert. Every change goes through the same loop: change, evaluate, compare, ship or revert. This is how you go from "it seems to work" to "I can prove it works." That is the difference between a demo and a product.
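The compare step can be as simple as a gate against the stored baseline; this is a sketch, and the 0.02 tolerance is an arbitrary choice to absorb run-to-run noise in LLM-judged metrics.

```python
# Sketch of the ship-or-revert gate. Tolerance absorbs run-to-run noise;
# 0.02 is an arbitrary starting value.
def gate(baseline: dict[str, float], current: dict[str, float],
         tolerance: float = 0.02) -> str:
    regressed = [m for m, b in baseline.items() if current[m] < b - tolerance]
    return "revert" if regressed else "ship"

print(gate({"faithfulness": 0.88}, {"faithfulness": 0.71}))  # revert
```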
8:04 That's the full picture. If you want to go deeper, join my free live session this Friday at noon Eastern. I walk through this hands-on, answer questions, and show you how to run it yourself. Scan the QR code to join.

Want the next one in your inbox?

Join 1,000+ Product Managers getting one deep dive every Friday.