Learn RAG from Scratch
Do not Ship RAG Without This (Evaluation Metrics)
Video 7 of 9 · 8:30
Chapters
- 0:00 Why you need RAG metrics
- 0:40 Faithfulness
- 1:40 Answer relevance
- 2:35 Context recall
- 3:20 Putting it together
Transcript
Auto-generated by YouTube. Click any timestamp to jump to that moment.
- 0:03 It returns an answer, and it looks
- 0:05 right. So, you ship it. Two weeks
- 0:08 later, support tickets roll in. "The
- 0:10 bot told me I can cancel any time, but
- 0:12 the policy says 30-day notice." Your
- 0:15 system retrieved the wrong document. The
- 0:17 LLM made up a detail, and nobody caught
- 0:20 it. "It seems to work" is not a shipping
- 0:22 standard. You need to measure RAG quality the
- 0:25 same way you measure uptime or
- 0:27 latency. Three numbers tell you whether
- 0:30 your RAG is production ready:
- 0:31 faithfulness, answer relevance, and
- 0:34 context recall.
- 0:36 First metric: faithfulness. Did
- 0:39 the answer actually come from the
- 0:41 retrieved context, or did the LLM invent
- 0:44 a detail? Here is a concrete
- 0:47 example. A user asks, "What is the
- 0:50 refund policy for annual plans?" Your
- 0:53 retriever pulls the right document. It
- 0:56 says: pro-rated refund based on remaining
- 0:59 months, minus a $50 processing fee. The
- 1:02 LLM generates: full refund within 30
- 1:05 days, no questions asked. That sounds
- 1:09 great, except the source never said
- 1:11 that. The LLM invented the 30-day window
- 1:15 and dropped the processing fee.
- 1:17 Faithfulness scores 0.2 out of one. That
- 1:20 is a hard fail. A faithful answer would
- 1:23 say: pro-rated refund for remaining
- 1:26 months, with a $50 processing fee. Every
- 1:29 claim traces back to the source.
- 1:32 Faithfulness score: 0.95.
- 1:35 You measure this by breaking the answer
- 1:37 into individual claims, then checking
- 1:39 each claim against the retrieved
- 1:41 context. If nine out of 10 claims are
- 1:44 supported, your faithfulness is 0.9.
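That claim-by-claim loop is simple enough to sketch in code. This is an illustrative outline, not Ragas internals: `call_llm`, `extract_claims`, and `claim_is_supported` are hypothetical helpers, and a production-grade judge prompt needs more care.

```python
# Faithfulness sketch: split the answer into claims, then ask an LLM judge
# whether each claim is supported by the retrieved context.

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client (OpenAI, Anthropic, local model, ...)."""
    raise NotImplementedError

def extract_claims(answer: str) -> list[str]:
    prompt = f"Break this answer into standalone factual claims, one per line:\n{answer}"
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def claim_is_supported(claim: str, context: str) -> bool:
    prompt = (
        f"Context:\n{context}\n\nClaim: {claim}\n"
        "Answer YES if the context supports the claim, otherwise NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")

def faithfulness(answer: str, context: str) -> float:
    claims = extract_claims(answer)
    if not claims:
        return 0.0
    supported = sum(claim_is_supported(c, context) for c in claims)
    return supported / len(claims)  # 9 of 10 supported -> 0.9
```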
- 1:48 Second metric: answer relevance. Does
- 1:51 the answer address what the user
- 1:53 asked? A user asks, "How do I set
- 1:56 up SSO for my team?" The LLM says,
- 1:59 "SSO stands for single sign-on. It
- 2:02 allows users to authenticate with one set
- 2:04 of credentials across multiple applications.
- 2:06 SSO improves security and user
- 2:09 experience." Technically accurate, but
- 2:12 the user asked how to set it up, not what
- 2:14 it is. The answer relevance score is low.
- 2:17 The answer is factual but misses the
- 2:20 intent. A relevant answer would say: go
- 2:23 to Settings, then Security, then SSO.
- 2:25 Choose your provider.
- 2:28 Enter your SAML endpoint URL. Upload the
- 2:31 certificate and click enable. Direct,
- 2:33 actionable, and it matches what the user asked.
- 2:36 Answer relevance score: 0.92.
- 2:40 You measure this by generating a question
- 2:41 from the answer, then comparing the
- 2:43 generated question to the original. If
- 2:46 someone reading only the answer would ask
- 2:48 the same question, relevance is high.
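In code, that reverse-question trick comes down to embeddings and cosine similarity. A rough sketch, reusing the hypothetical `call_llm` helper from the faithfulness sketch above; `embed` is likewise a stand-in for whatever embedding model you use.

```python
import math

def embed(text: str) -> list[float]:
    """Placeholder for your embedding model; swap in a real API call."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer_relevance(question: str, answer: str, n: int = 3) -> float:
    # Reverse-engineer n candidate questions from the answer, then measure
    # how close each one is to the question the user actually asked.
    generated = [
        call_llm(f"What question is this answer responding to?\n\n{answer}")
        for _ in range(n)
    ]
    q_vec = embed(question)
    return sum(cosine(q_vec, embed(g)) for g in generated) / n
```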
- 2:50 Third metric: context recall. Did you
- 2:54 even retrieve the right documents? This
- 2:56 one is about the retriever, not the LLM.
- 2:59 A user asks, "What are the system requirements
- 3:02 for the enterprise plan?"
- 3:03 Your ground truth answer lists
- 3:05 four things: 16 GB RAM, four CPU
- 3:09 cores, Ubuntu 20.04 or later, and PostgreSQL 14. Your
- 3:15 retriever pulls back three chunks. Chunk
- 3:17 one covers RAM and CPU requirements.
- 3:19 Chunk two covers supported operating
- 3:22 systems. Chunk three covers pricing
- 3:24 tiers, which is irrelevant. Two out
- 3:27 of four ground truth items are covered.
- 3:29 PostgreSQL 14 is
- 3:32 missing entirely. Context recall: 0.5.
- 3:37 Only half the information the user needs is
- 3:39 in the retrieved context. Even a perfect
- 3:42 LLM cannot answer correctly if the
- 3:44 right documents are not retrieved. That
- 3:47 is why context recall matters. You measure
- 3:50 it against a golden answer. For each
- 3:52 statement in the golden answer, check
- 3:56 if the retrieved context contains that information.
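Note how this mirrors faithfulness, just pointed the other way: instead of checking the answer against the context, you check the golden answer against the context. A sketch, reusing the hypothetical `extract_claims` and `claim_is_supported` helpers from the faithfulness sketch:

```python
def context_recall(golden_answer: str, retrieved_context: str) -> float:
    # Decompose the golden answer into statements, then count how many
    # can be attributed to the retrieved context.
    statements = extract_claims(golden_answer)
    if not statements:
        return 0.0
    found = sum(claim_is_supported(s, retrieved_context) for s in statements)
    return found / len(statements)  # 2 of 4 requirements found -> 0.5
```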
- 3:59 If you want to learn how to do this yourself, I run free live
- 4:02 sessions every Friday at noon Eastern.
- 4:06 Scan the QR code on screen to join.
- 4:10 Would love to see you there.
- 4:12 Now you know the three metrics, but you do
- 4:15 not have to build the scoring from scratch.
- 4:17 Ragas is an open-source evaluation
- 4:20 framework that measures all three:
- 4:23 faithfulness, answer relevance, and
- 4:26 context recall, plus a few more. You give
- 4:30 Ragas a question, the retrieved context,
- 4:32 the generated answer, and
- 4:35 a ground truth answer. It returns
- 4:38 scores from zero to one for each metric.
- 4:40 Think of it as unit tests for your
- 4:43 RAG pipeline. You would not ship a feature
- 4:46 without tests. Do not ship RAG without
- 4:49 evaluation.
- 4:51 Ragas also tracks context precision, which
- 4:54 measures whether the most relevant chunks
- 4:56 are ranked highest, and answer correctness,
- 4:59 which compares the generated answer
- 5:01 against a known good answer.
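Here is roughly what that looks like in code. A minimal sketch assuming the ragas 0.1-style `evaluate()` interface (column and metric names have shifted between releases, and Ragas needs an LLM judge behind the scenes, e.g. an OpenAI key in the environment), using the refund example from earlier as the single test row:

```python
# Minimal Ragas sketch. Assumes the ragas 0.1-style API; check the current
# docs for exact column names (ground_truth vs ground_truths) and metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

dataset = Dataset.from_dict({
    "question": ["What is the refund policy for annual plans?"],
    "contexts": [[
        "Annual plans: pro-rated refund based on remaining months, "
        "minus a $50 processing fee."
    ]],
    "answer": ["Pro-rated refund for remaining months, minus a $50 processing fee."],
    "ground_truth": ["Pro-rated refund based on remaining months, minus a $50 processing fee."],
})

scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(scores)  # mapping of metric name -> averaged 0-1 score
```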
- 5:04 What scores should you target? Here is a
- 5:07 practical benchmark from a PM perspective.
- 5:09 Faithfulness: aim for 0.85 or
- 5:14 higher. Below 0.7 means the LLM is
- 5:18 making up details. That is a support
- 5:21 ticket factory. Answer relevance:
- 5:23 aim for 0.8 or higher. Lower scores
- 5:27 mean your users are getting technically
- 5:29 correct but useless answers. They
- 5:32 will stop trusting the system. Context
- 5:35 recall: aim for 0.75 or higher. Below
- 5:39 0.5 means your retriever is missing
- 5:42 half the relevant information. No
- 5:44 LLM can fix bad retrieval. Green means
- 5:48 ship it. Amber means investigate and
- 5:50 improve. Red means stop and fix before
- 5:53 going live. These are starting points.
- 5:56 Your specific thresholds depend
- 5:59 on your domain. Medical and legal need
- 6:02 higher faithfulness. Search-heavy apps
- 6:05 need higher context recall.
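Encoded as a simple gate, those traffic-light bands look like this. The cutoffs below come from the numbers in the video, except the 0.6 red line for answer relevance, which the video leaves unstated and is my assumption:

```python
# Red/amber/green gate for RAG eval scores. Green at or above the "green"
# cutoff, red below the "red" cutoff, amber in between. The 0.6 red cutoff
# for answer relevance is an assumption; the video only gives the 0.8 target.
THRESHOLDS = {
    "faithfulness":     {"green": 0.85, "red": 0.70},
    "answer_relevance": {"green": 0.80, "red": 0.60},
    "context_recall":   {"green": 0.75, "red": 0.50},
}

def status(metric: str, score: float) -> str:
    t = THRESHOLDS[metric]
    if score >= t["green"]:
        return "green"  # ship it
    if score >= t["red"]:
        return "amber"  # investigate and improve
    return "red"        # stop and fix before going live

assert status("faithfulness", 0.88) == "green"
assert status("context_recall", 0.62) == "amber"
```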
- 6:08 Metrics are useless without a test set.
- 6:10 Here is how to build one. Start with
- 6:14 20 to 30 golden question-answer pairs.
- 6:16 These are questions your users actually
- 6:19 ask, paired with the correct answers
- 6:21 you have manually verified. For each
- 6:24 pair, store the question, the correct
- 6:26 answer, and the expected source documents.
- 6:29 Run your RAG pipeline on each
- 6:32 question. Collect the retrieved context
- 6:35 and generated answer. Feed all of
- 6:38 it into Ragas. You get a scorecard:
- 6:42 faithfulness, answer relevance, context
- 6:45 recall, for each question. Average them for
- 6:48 your overall scores. Store these as a baseline.
- 6:51 Every time you change your chunking
- 6:53 strategy, swap an embedding model,
- 6:55 or update your prompt, rerun the test
- 6:58 suite. If scores drop, revert. If they
- 7:02 improve, ship.
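A sketch of that suite, assuming a `pipeline(question)` function of your own that returns retrieved contexts plus a generated answer, with the same Ragas version caveat as above:

```python
import json
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

def run_eval_suite(pipeline, golden_set):
    """golden_set: list of {"question": ..., "ground_truth": ...} dicts.
    pipeline(question) is assumed to return (contexts: list[str], answer: str)."""
    rows = {"question": [], "contexts": [], "answer": [], "ground_truth": []}
    for item in golden_set:
        contexts, answer = pipeline(item["question"])
        rows["question"].append(item["question"])
        rows["contexts"].append(contexts)
        rows["answer"].append(answer)
        rows["ground_truth"].append(item["ground_truth"])
    result = evaluate(Dataset.from_dict(rows),
                      metrics=[faithfulness, answer_relevancy, context_recall])
    return dict(result)  # the result behaves like metric -> averaged score

def compare_to_baseline(scores, path="baseline_scores.json"):
    # Print each metric against the stored baseline: ship on improvement,
    # revert on regression.
    with open(path) as f:
        baseline = json.load(f)
    for metric, new in scores.items():
        old = baseline[metric]
        verdict = "ok, ship" if new >= old else "REGRESSED, revert"
        print(f"{metric}: {old:.2f} -> {new:.2f} ({verdict})")
```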
- 7:06 The final piece: evaluation is not a one-time check. It
- 7:10 is a continuous loop. You start with your
- 7:12 baseline scores, then you make a change.
- 7:15 Maybe you switch from fixed-size
- 7:17 chunks to semantic chunking. You
- 7:20 rerun the test suite. Context recall goes
- 7:23 from 0.62 to 0.78.
- 7:27 Ship it. Next change: you add a new
- 7:31 component. Faithfulness stays at 0.88.
- 7:35 Answer relevance goes up from 0.74. Ship
- 7:40 it. Then you try a new prompt template.
- 7:43 Faithfulness drops from 0.88 to
- 7:46 0.71. Revert.
- 7:48 Every change goes through the same
- 7:51 loop: change, evaluate, compare, decide.
- 7:55 This is how you go from "it seems to
- 7:58 work" to "I can prove it works." That is the
- 8:02 difference between a demo and a product.
- 8:04 That's the full picture. If you want
- 8:07 to go deeper, join my free live session
- 8:10 this Friday at noon Eastern.
- 8:12 I walk through this hands-on, answer
- 8:15 questions, and show you how to run
- 8:17 it yourself. Scan the QR code to join.
Previous: Hybrid Search for RAG: BM25 + Vector Search Explained
Next: RAG Multi-Query, HyDE & Fusion (Complete Guide)
Want the next one in your inbox?
Join 1,000+ Product Managers getting one deep dive every Friday.