Learn RAG from Scratch

Do Not Ship RAG Without This (Evaluation Metrics)

Video 7 of 9 · 8:30

Chapters

  • 0:00 Why you need RAG metrics
  • 0:40 Faithfulness
  • 1:40 Answer relevance
  • 2:35 Context recall
  • 3:20 Putting it together

Transcript

Auto-generated by YouTube. Click any timestamp to jump to that moment.

0:03 It returns an answer, and it looks right. So, you ship it. Two weeks later, support tickets roll in. "The bot told me I can cancel any time, but the policy says 30-day notice." Your system retrieved the wrong document. The LLM made up a detail, and nobody caught it. "It seems to work" is not a shipping standard. You need to measure RAG quality the same way you measure uptime or latency. Three numbers tell you whether your RAG is production ready: faithfulness, answer relevance, and context recall.
0:36 First metric: faithfulness. Did the answer actually come from the retrieved context, or did the LLM invent a detail? Here is a concrete example. A user asks, "What is the refund policy for annual plans?" Your retriever pulls the right document. It says: pro-rated refund based on remaining months, minus a $50 processing fee. The LLM generates: "Full refund within 30 days, no questions asked." That sounds great, except the source never said it. The LLM invented the 30-day window and dropped the processing fee. Faithfulness scores 0.2 out of one. That is a hard fail. A faithful answer would say: pro-rated refund for remaining months, with a $50 processing fee. Every claim traces back to the source. That would score 0.95.

1:35 You measure this by breaking the answer into individual claims, then checking each claim against the retrieved context. If nine out of 10 claims are supported, your faithfulness is 0.9.
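If you want to see the mechanics, here is a minimal sketch of that calculation. It assumes an LLM judge has already split the answer into claims and returned a supported/unsupported verdict for each; the judge itself is not shown.

```python
# Minimal sketch: faithfulness as the fraction of supported claims.
# Assumes an LLM judge already produced a True/False verdict per claim.
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Nine of ten claims supported -> 0.9, as in the example above.
print(faithfulness_score([True] * 9 + [False]))  # 0.9
```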
1:48 Second metric: answer relevance. Does the answer address what the user actually asked? A user asks, "How do I set up SSO for my team?" The LLM replies: "SSO stands for single sign-on. It allows users to authenticate with one set of credentials across multiple applications. SSO improves security and user experience." Technically accurate, but the user asked how to set it up, not what it is. The answer relevance score is low: the answer is factual but misses the intent. A relevant answer would say: go to Settings, then Security, then SSO. Choose your provider. Enter your SAML endpoint URL. Upload the certificate and click Enable. Direct, and it matches what the user asked. Answer relevance score: 0.92.

2:40 You measure this by generating a question from the answer, then comparing the generated question to the original. If someone reading only the answer would ask the same question, relevance is high.
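A sketch of that idea, assuming an embedding model and an LLM call that proposes questions the answer could be answering; both `embed` and `generate_questions` here are hypothetical stand-ins, and cosine similarity is one common way to compare the questions.

```python
import numpy as np

# Sketch: relevance as similarity between the original question and
# questions reverse-generated from the answer. `embed` and
# `generate_questions` are hypothetical stand-ins for your models.
def answer_relevance(question: str, answer: str, embed, generate_questions) -> float:
    q = np.asarray(embed(question))
    sims = []
    for gen_q in generate_questions(answer):  # e.g. "What question does this answer?"
        g = np.asarray(embed(gen_q))
        sims.append(np.dot(q, g) / (np.linalg.norm(q) * np.linalg.norm(g)))
    return float(np.mean(sims))  # close to 1.0 when the answer matches intent
```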
2:50 Third metric: context recall. Did you even retrieve the right documents? This one is about the retriever, not the LLM. A user asks, "What are the system requirements for the enterprise tier?" Your ground truth answer lists four things: 16 GB RAM, four CPU cores, Ubuntu 20.04 or later, and PostgreSQL 14. Your retriever pulls back three chunks. Chunk one covers RAM and CPU. Chunk two covers supported operating systems. Chunk three covers pricing tiers, which is irrelevant. Two of the four ground truth items are covered; PostgreSQL 14 is missing entirely. Context recall: 0.5.

3:37 Only half the information the user needs is in the retrieved context. Even a perfect LLM cannot answer correctly if the right documents are not retrieved. That is why context recall matters. You measure it against a golden answer: for each statement in the golden answer, check if the retrieved context contains that information.
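As a rough sketch, assuming a `supported_by` judge (an LLM call or embedding match, not shown) that decides whether one golden statement is attributable to the retrieved chunks:

```python
# Sketch: context recall as the fraction of golden-answer statements
# found in the retrieved chunks. `supported_by` is a hypothetical
# judge (LLM call or embedding match).
def context_recall_score(golden_statements: list[str],
                         chunks: list[str], supported_by) -> float:
    if not golden_statements:
        return 0.0
    hits = sum(1 for s in golden_statements if supported_by(s, chunks))
    return hits / len(golden_statements)

# Two of the four requirements found -> 0.5, as in the example above.
```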
3:56 If you want to learn how to do this yourself, I run free live sessions every Friday at noon Eastern. Scan the QR code on screen to join. I would love to see you there.
4:12 Now you know the three metrics, but you do not have to build the scoring from scratch. Ragas is an open-source evaluation framework that measures all three, faithfulness, answer relevance, and context recall, plus a few more. You give Ragas a question, the retrieved context, the generated answer, and a ground truth answer. It returns scores from zero to one for each metric. Think of it as unit tests for your RAG pipeline. You would not ship a feature without tests. Do not ship RAG without evaluation.

4:51 Ragas also tracks context precision, which measures whether the most relevant chunks are ranked highest, and answer correctness, which compares the generated answer against a known good answer.
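In code, a minimal Ragas run looks roughly like this. It follows the classic Ragas interface (column names and metric imports vary across versions, so check the docs for your release); the refund example data is just the one from earlier in the video.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# One row per test question; column names follow the classic Ragas schema.
data = Dataset.from_dict({
    "question": ["What is the refund policy for annual plans?"],
    "contexts": [["Annual plans: pro-rated refund for remaining months, "
                  "minus a $50 processing fee."]],
    "answer": ["Pro-rated refund for remaining months, with a $50 processing fee."],
    "ground_truth": ["Pro-rated refund for remaining months, minus a $50 processing fee."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)  # one zero-to-one score per metric
```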
5:04 What scores should you target? Here is a practical benchmark from a PM's perspective. Faithfulness: aim for 0.85 or higher. Below 0.7 means the LLM is making up details. That is a support ticket factory. Answer relevance: aim for 0.8 or higher. Below that, your users are getting technically correct but useless answers, and they will stop trusting the system. Context recall: aim for 0.75 or higher. Below 0.5 means your retriever is missing half the relevant information, and no LLM can fix bad retrieval. Green means ship it. Amber means investigate and improve. Red means stop and fix before going live. These are starting points. Your specific thresholds depend on your domain: medical and legal need higher faithfulness; search-heavy products need higher context recall.
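One way to encode those bands, with the caveat that the amber/red boundary for answer relevance was not specified above, so the 0.6 cutoff here is an assumption:

```python
# Traffic-light bands from the benchmark above. The answer-relevance red
# cutoff (0.6) is an assumed value; tune all of these for your domain.
THRESHOLDS = {
    "faithfulness":     {"green": 0.85, "red": 0.70},
    "answer_relevance": {"green": 0.80, "red": 0.60},  # red cutoff assumed
    "context_recall":   {"green": 0.75, "red": 0.50},
}

def traffic_light(metric: str, score: float) -> str:
    band = THRESHOLDS[metric]
    if score >= band["green"]:
        return "green"  # ship it
    if score >= band["red"]:
        return "amber"  # investigate and improve
    return "red"        # stop and fix before going live
```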
6:08 Metrics are useless without a test set. Here is how to build one. Start with 20 to 30 golden question-answer pairs. These are questions your users actually ask, paired with the correct answers you have manually verified. For each pair, store the question, the golden answer, and the expected source documents. Run your RAG pipeline on each question. Collect the retrieved context and generated answer. Feed all of it into Ragas. You get a scorecard with faithfulness, answer relevance, and context recall for each question. Average them for your overall scores. Store these as your baseline. Every time you change your chunking strategy, swap an embedding model, or update your prompt, rerun the test suite. If scores drop, revert. If scores improve, ship.
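Here is a sketch of that harness, assuming a `rag_pipeline(question)` function of your own that returns the retrieved chunks and the generated answer, plus the Ragas imports from the earlier snippet; the `golden_set.json` filename is illustrative.

```python
import json
# Reuses Dataset, evaluate, and the three metrics from the Ragas snippet above.

# 20-30 manually verified question/answer pairs; filename is illustrative.
with open("golden_set.json") as f:
    golden = json.load(f)

rows = {"question": [], "contexts": [], "answer": [], "ground_truth": []}
for item in golden:
    chunks, answer = rag_pipeline(item["question"])  # your pipeline, not shown
    rows["question"].append(item["question"])
    rows["contexts"].append(chunks)
    rows["answer"].append(answer)
    rows["ground_truth"].append(item["golden_answer"])

scorecard = evaluate(Dataset.from_dict(rows),
                     metrics=[faithfulness, answer_relevancy, context_recall])
print(scorecard)  # average per metric; store as your baseline
```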
7:02 The final piece: evaluation is not a one-time check. It is a continuous loop. You start with baseline scores, then you make a change. Maybe you switch from fixed-size chunks to semantic chunking. You rerun the test suite. Context recall goes from 0.62 to 0.78. Ship it. Next change. Faithfulness stays at 0.88, and answer relevance improves from 0.74. Ship it. Then you try a new prompt. Faithfulness drops from 0.88 to 0.71. Revert. Every change goes through the same loop: change, evaluate, compare, ship or revert. This is how you go from "it seems to work" to "I can prove it works." That is the difference between a demo and a product.
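The compare step can be as simple as a gate against the stored baseline; this is a sketch, and the 0.02 tolerance is an arbitrary choice to absorb run-to-run noise in LLM-judged metrics.

```python
# Sketch of the ship-or-revert gate. Tolerance absorbs run-to-run noise;
# 0.02 is an arbitrary starting value.
def gate(baseline: dict[str, float], current: dict[str, float],
         tolerance: float = 0.02) -> str:
    regressed = [m for m, b in baseline.items() if current[m] < b - tolerance]
    return "revert" if regressed else "ship"

print(gate({"faithfulness": 0.88}, {"faithfulness": 0.71}))  # revert
```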
8:04 That's the full picture. If you want to go deeper, join my free live session this Friday at noon Eastern. I walk through this hands-on, answer questions, and show you how to run it yourself. Scan the QR code to join.

Want the next one in your inbox?

Join 1,000+ Product Managers getting one deep dive every Friday.