Learn RAG from Scratch

RAG Explained in 10 Mins

Video 1 of 9 · 9:51

Chapters

  • 0:00 Why RAG matters
  • 0:40 How basic RAG works
  • 1:50 The problem with basic RAG
  • 2:30 The full architecture
  • 3:15 Query Construction (Text-to-SQL)
  • 4:30 Query Translation (Multi-Query)
  • 5:45 Routing
  • 6:30 Indexing (Chunk Optimization)
  • 7:30 Retrieval (Re-ranking)
  • 8:30 Generation (Self-RAG)
  • 9:30 The complete system

Transcript

Auto-generated by YouTube.

0:03 It will confidently tell you that customers get a refund within 60 days. Sounds right. Except your actual policy is a pro-rated refund minus 2 months. The LLM doesn't know your data. It just guesses. So you add RAG. You embed your docs and search them. But basic RAG retrieves your pricing page instead of the refund policy. The user gets told what annual plans cost per year. Real information, wrong answer. Production RAG routes the question to the right database, retrieves the exact policy clause, and generates the correct answer.

0:41 Before we improve RAG, let's see how basic RAG works. You start with your documents: a refund policy, a product guide, whatever. You break each document into smaller pieces called chunks. Think of it like cutting a book into paragraphs. Each chunk gets converted into a list of numbers called an embedding. These numbers capture the meaning of the text. Similar ideas get similar numbers. All those embeddings get stored in a vector database like Pinecone or Chroma. Now a user asks a question: what is the refund policy for annual plans? That question also gets turned into an embedding using the same model. The database compares the question embedding against all stored chunks and finds the closest matches. The top results, usually three to five chunks, get pulled out. Those chunks go into the LLM as context along with the original question. The LLM reads the chunks and writes a coherent answer based on what it found. That's basic RAG. Document in, chunks out, embeddings stored, question matched, answer generated. The six modules we cover next are all about making each of those steps work better.

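For reference, here is a minimal sketch of the basic RAG loop in Python, assuming Chroma with its default embedding model; the chunk text and the ask_llm placeholder are illustrative, not from the video.

```python
# pip install chromadb
import chromadb

client = chromadb.Client()
collection = client.create_collection("policies")

# 1. Chunk and store: each paragraph becomes one embedded chunk
#    (Chroma embeds with its default model here).
chunks = [
    "Refunds on annual plans are pro-rated, minus two months of service.",
    "Annual plans are billed once per year at a discounted rate.",
]
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# 2. Retrieve: the question is embedded with the same model and matched
#    against every stored chunk; the closest ones come back.
question = "What is the refund policy for annual plans?"
results = collection.query(query_texts=[question], n_results=2)
context = "\n".join(results["documents"][0])

# 3. Generate: the retrieved chunks go into the prompt as context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = ask_llm(prompt)  # placeholder for whichever LLM client you use
```
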
2:03 Most RAG tutorials show you the happy path. Take some docs, embed them, retrieve, done. But production RAG is a different animal. Your retrieval misses the right docs. Your chunks are the wrong size. Your users ask vague questions that don't match anything. This video covers the full production RAG architecture. Six modules that take you from "it works in a demo" to "it works at scale."

2:30 Here's the full picture. A question comes in on the left. Before it hits any data source, it goes through routing and query translation. Then it gets sent to one or more data sources: vector stores, SQL databases, graph databases. The retrieved documents get ranked, refined, and sometimes re-retrieved. Then the generation step produces an answer and can loop back to re-retrieve if the quality isn't good enough. Six modules, each solving a specific failure mode.

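As a rough sketch of how those stages could compose, here is a hypothetical skeleton where every callable (route, translate, retrieve, rerank, generate, is_grounded) stands in for one module you would implement yourself:

```python
from typing import Callable, Sequence

def answer_question(
    question: str,
    route: Callable[[str], str],                          # pick SQL / vector / graph source
    translate: Callable[[str], list[str]],                # rewrite or expand the query
    retrieve: Callable[[str, Sequence[str]], list[str]],  # query the chosen source
    rerank: Callable[[str, list[str]], list[str]],        # reorder by actual relevance
    generate: Callable[[str, list[str]], str],            # draft an answer from context
    is_grounded: Callable[[str, list[str]], bool],        # evaluation step
    max_attempts: int = 2,
) -> str:
    source = route(question)
    queries = translate(question)
    draft = ""
    for _ in range(max_attempts):
        docs = rerank(question, retrieve(source, queries))
        draft = generate(question, docs)
        if is_grounded(draft, docs):   # good enough: stop
            return draft
        queries = translate(draft)     # otherwise loop back and re-retrieve
    return draft
```
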
3:03 Not all your data lives in a vector store. Take this example. A user asks, "What was Q3 revenue for the enterprise segment?" That question needs a SQL query, not a similarity search. The LLM parses the intent, extracts the metric, the time period, and the segment. Then it generates the SQL: select the sum of amount from sales where quarter equals Q3 2024 and segment equals enterprise. The query hits your SQL database. The matching rows come back. Total: 2.4 million. That's query construction. Turning natural language into the right query for the right system. Other approaches include text-to-Cypher for graph databases and self-query retrievers for automatic metadata filtering.

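A minimal sketch of the text-to-SQL flow, assuming a local SQLite sales table and a placeholder ask_llm call; the schema and file name are illustrative assumptions:

```python
import sqlite3

SCHEMA = "sales(quarter TEXT, segment TEXT, amount REAL)"

def text_to_sql(question: str) -> str:
    # The LLM sees the schema plus the question and writes the query.
    prompt = (
        f"Schema: {SCHEMA}\n"
        f"Question: {question}\n"
        "Return a single SQLite SELECT statement and nothing else."
    )
    return ask_llm(prompt)  # placeholder for whichever LLM client you use

conn = sqlite3.connect("analytics.db")  # illustrative database file
question = "What was Q3 revenue for the enterprise segment?"
sql = text_to_sql(question)
# e.g. SELECT SUM(amount) FROM sales WHERE quarter = 'Q3 2024' AND segment = 'enterprise'
total = conn.execute(sql).fetchone()[0]
```
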
4:00 Real users ask vague questions. "How does authentication work?" That could mean several things. Multi-query fixes this. The LLM takes that vague question and generates three specific versions: What authentication protocols does the API support? How do users log in and get access tokens? What is the OAuth 2.0 flow for third-party apps? Each version matches different documents: the API reference, the login guide, the OAuth docs. You combine all three result sets. Now you have coverage that no single query could have achieved. Other techniques include RAG-Fusion for combining ranked results, decomposition for breaking complex questions into sub-questions, step-back prompting for asking more general versions first, and HyDE for searching with hypothetical answers.

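A short multi-query sketch; ask_llm and search are placeholders for your own LLM client and retriever:

```python
def multi_query_retrieve(question: str, n_variants: int = 3) -> list[str]:
    # Ask the LLM for more specific rewrites of the vague question.
    prompt = (
        f"Rewrite this question as {n_variants} more specific search queries, "
        f"one per line:\n{question}"
    )
    variants = [q.strip() for q in ask_llm(prompt).splitlines() if q.strip()]

    # Retrieve for each variant, then merge and de-duplicate the results.
    seen, merged = set(), []
    for query in variants:
        for doc in search(query, k=4):   # search() is your vector-store lookup
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```
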
4:56 If you want to learn how to do this hands-on, I run free live sessions every Friday at noon Eastern. Scan the QR code on screen to join. Would love to see you there.

5:10 Different questions need different data sources. "What was last quarter's revenue?" That's financial data. It lives in a SQL database. The LLM router sees the intent, classifies it as financial data, and routes it to the sales database. Now take a different question: "Explain the refund policy." That's a policy question. The router classifies it as policy text and sends it to the vector store instead. Two questions, two completely different data sources. The router makes sure each one lands in the right place. Semantic routing is another approach. Instead of the LLM classifying, you embed the question and compare it against predefined prompt templates.

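A minimal semantic-routing sketch using sentence-transformers; the route descriptions are illustrative assumptions:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# One short description per destination; the closest embedding wins.
routes = {
    "sql_database": "financial metrics, revenue, quarterly numbers, sales figures",
    "vector_store": "policies, guides, documentation, how-to questions",
}
route_names = list(routes)
route_embeddings = model.encode(list(routes.values()))

def route(question: str) -> str:
    scores = util.cos_sim(model.encode(question), route_embeddings)[0]
    return route_names[int(scores.argmax())]

route("What was last quarter's revenue?")  # expected: "sql_database"
route("Explain the refund policy.")        # expected: "vector_store"
```
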
5:59 Chunk size makes or breaks your retrieval. Take a refund policy page. If you embed the whole page as one chunk, that's 2,000 tokens. When someone searches "annual plan refund", the chunk matches loosely. Similarity score: 0.61. The chunk has too much unrelated content diluting the match. Now split the same document into paragraph-level chunks, 300 tokens each. The same search now hits section 4.2 directly. Similarity score: 0.94. Same match, same document, same query. The only difference is chunk size. Other indexing strategies include multi-representation indexing, which stores both a summary and the full document, specialized embeddings fine-tuned for your domain, and RAPTOR, which builds a tree of summaries at different abstraction levels.

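A small sketch of the chunk-size comparison, assuming sentence-transformers for the embeddings; the file name is illustrative and the scores you get will differ from the 0.61 and 0.94 in the video:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "annual plan refund"

whole_page = open("refund_policy.txt").read()                     # one big chunk
paragraphs = [p for p in whole_page.split("\n\n") if p.strip()]   # paragraph-level chunks

query_emb = model.encode(query)
page_score = util.cos_sim(query_emb, model.encode(whole_page))[0][0]
best_paragraph = util.cos_sim(query_emb, model.encode(paragraphs))[0].max()

print(f"whole page:     {page_score:.2f}")      # loose match, diluted by unrelated text
print(f"best paragraph: {best_paragraph:.2f}")  # the refund clause should score higher
```
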
6:57 Vector similarity is not the same as relevance. A user asks how to cancel an annual subscription. Vector search returns five documents: pricing tiers, an FAQ, a support guide, the cancellation policy, and onboarding docs. All scored by cosine similarity. The cancellation policy is ranked fourth at 0.83. It should be first. A re-ranker fixes this. It takes those five results and reorders them by actual relevance to the question. The cancellation policy jumps from number four to number one. The pricing page drops to four. Same results, better ordering. That's the difference between finding a result and finding the right result. Other techniques include CRAG for evaluating retrieval quality and active retrieval for going back to search again when the results are not good enough.

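A short re-ranking sketch using a cross-encoder from sentence-transformers; the candidate documents stand in for whatever your vector search returned:

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

query = "How do I cancel an annual subscription?"
candidates = [  # top five from vector search, in cosine-similarity order
    "Pricing tiers overview ...",
    "Frequently asked questions ...",
    "Support guide ...",
    "Cancellation policy for annual subscriptions ...",
    "Onboarding guide ...",
]

# The cross-encoder scores each (query, document) pair on actual relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
# The cancellation policy should now sit at the top of the list.
```
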
7:56 The LLM generates an answer, but how do you know it's grounded in the actual documents? Self-RAG adds a verification loop. Here's how it works. A user asks, "What's the cancellation fee for annual plans?" The LLM generates, "There is typically a 20% cancellation fee." Sounds reasonable, but the evaluation step checks that claim against the retrieved source documents. The source says pro-rated refund, not cancellation fee. The answer is not grounded. So the system retrieves again. This time it pulls the refund clause from the cancellation policy. The LLM generates again: a pro-rated refund minus the two months already used. The evaluation step checks again. This time it matches the source. It's safe to return the answer. Another approach is Rewrite-Retrieve-Read, which rewrites the query, retrieves again, and reads the new results in a similar loop.

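A rough sketch of a Self-RAG-style verification loop; ask_llm and retrieve are placeholders for your own LLM client and retriever, and the grading prompt is a simplification, not the exact Self-RAG recipe:

```python
def grounded_answer(question: str, max_attempts: int = 2) -> str:
    query = question
    answer = ""
    for _ in range(max_attempts):
        docs = retrieve(query)                 # pull candidate chunks
        context = "\n".join(docs)
        answer = ask_llm(
            f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        )
        # Evaluation step: check the draft against the retrieved sources.
        verdict = ask_llm(
            f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
            "Is every claim in the answer supported by the context? Reply yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            return answer                      # grounded: safe to return
        # Not grounded: rewrite the query and retrieve again.
        query = ask_llm(f"Rewrite this as a better search query: {question}")
    return answer
```
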
8:59 Here's the complete production RAG architecture. A question comes in, routing sends it to the right place. Query translation rewrites the question. The right data sources get searched. Results get ranked and refined. The LLM generates an answer. And if the answer isn't good enough, the system loops back. Six modules, each independently improvable. Build it incrementally. That's the full picture. If you want to go deeper, join my free live session every Friday at noon Eastern on Maven. I walk through this hands-on, answer questions, and show you how to build it yourself. Scan the QR code to join.
