Learn RAG from Scratch

RAG Chunking Strategies Explained in 5 Minutes

Video 2 of 9 · 7:29

Chapters

  • 0:00 Why chunking matters
  • 0:40 What chunking is
  • 1:20 Chunk size comparison
  • 2:30 Overlap strategy
  • 3:10 Fixed vs recursive vs semantic chunking

Transcript

Auto-generated by YouTube.

0:03 You search your docs and get garbage. The LLM hallucinates. People blame the model, but the problem happened before any search ran: you chunked your documents wrong. Too big, and the embedding is diluted. Too small, and you lose the context that makes the answer useful. This video shows you exactly how to chunk your documents so retrieval actually works.

0:29 First, the basic idea. You have a document: a refund policy, a spec, whatever. You can't embed the whole thing as one vector. The embedding model has a token limit, typically 512 or 8,192 depending on the model. Even if the document fits, one embedding for 10 pages captures nothing specific. So you split the document into smaller pieces. Each piece is a chunk, and each chunk gets its own embedding. Now, when someone searches for "annual plan refund", the search returns the specific chunk about annual plan refunds, not a vague summary of the whole document. That's chunking. The question is: how do you decide where to split? Let's see why size matters.

1:19 Take a refund policy document. If you embed the whole page as one block, roughly 2,000 tokens, then a search for "annual plan refund" matches only loosely. Similarity: 0.61. Too much unrelated content dilutes the embedding. Now go to the other extreme: chunk by single sentence, 20 tokens each. The search finds "annual plans may be refunded." Similarity: 0.89. But the chunk has zero context. Refunded how? Under what conditions? The LLM can't answer the actual question. The sweet spot is paragraph level, 200 to 500 tokens. The search hits section 4.2, the paragraph about annual plan refunds. Similarity: 0.94. Enough detail to answer the question, enough context for the LLM to generate a useful response.

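The similarity scores quoted here come from cosine similarity between the query embedding and each chunk embedding. A minimal sketch of that comparison, with made-up vectors standing in for real model outputs (the numbers below are illustrative, not the 0.61/0.89/0.94 from the video):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of a query and two chunks.
query = [0.9, 0.1, 0.3]
focused_chunk = [0.8, 0.2, 0.4]   # on-topic paragraph: points the same way
diluted_chunk = [0.3, 0.9, 0.9]   # whole-page embedding: pulled off-axis

focused_score = cosine_similarity(query, focused_chunk)
diluted_score = cosine_similarity(query, diluted_chunk)
```

The focused chunk scores higher than the diluted one, which is the whole argument for right-sizing chunks.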
2:17 There's another problem. When you split a document into chunks, you create hard boundaries, and information that spans two chunks gets cut in half. Take this example. One paragraph says, "Annual plans are eligible for refunds." The next paragraph says, "Subject to a usage deduction." If your chunk boundary falls between those two, neither chunk has the full answer. The fix is overlap. You set a chunk size of 400 tokens and an overlap of 50 tokens. The last 50 tokens of chunk one repeat as the first 50 tokens of chunk two. Now both chunks contain "eligible for refunds, subject to a usage deduction." The search can find the complete answer regardless of which chunk it hits. Typical overlap is 10 to 20% of chunk size.

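The overlap scheme above can be sketched as a sliding window over tokens. For simplicity, "tokens" here are whitespace-split words rather than the embedding model's real tokenizer, and the demo uses a tiny window so the shared region is visible:

```python
def chunk_with_overlap(tokens: list[str], size: int = 400,
                       overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping size - overlap each time,
    so the last `overlap` tokens of one chunk reopen the next.
    Assumes size > overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

words = "annual plans are eligible for refunds subject to a usage deduction".split()
chunks = chunk_with_overlap(words, size=8, overlap=2)
# chunks[0][-2:] == chunks[1][:2] -- the 2-token overlap means the
# boundary phrase survives intact in at least one chunk.
```

With the video's 400/50 settings the same code steps 350 tokens at a time, so every boundary phrase up to 50 tokens long appears whole in some chunk.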
3:08 If you want to learn how to do this yourself, I run live sessions every Friday at noon Eastern. Scan the QR code on screen to join. Would love to see you there.

3:22 The simplest approach is fixed-size chunking. You pick a number, say 500 tokens, and split the document every 500 tokens. That's it. No logic about paragraphs, sentences, or meaning. Just count to 500 and cut. The advantage is predictability: every chunk is the same size. Easy to implement, easy to debug, easy to reason about for storage costs. The downside is obvious: you might cut in the middle of a sentence. "Annual plans are eligible" lands in one chunk, "refunds subject to" in the next, and neither chunk makes complete sense on its own. Fixed size works fine for homogeneous content like log files or structured data. For natural language documents, you want something smarter.

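A fixed-size splitter really is just slicing every N units. The sketch below slices by character count to make the mid-sentence-cut problem concrete:

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut the text every `size` characters, with no regard for boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

policy = "Annual plans are eligible for refunds subject to a usage deduction."
chunks = fixed_size_chunks(policy, size=40)
# The cut lands wherever character 40 happens to fall -- here, mid-word,
# splitting "subject" across two chunks.
```

No chunk is meaningful on its own, but the chunks concatenate back to the original text, which is why this approach is easy to reason about for logs and other homogeneous data.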
4:13 This is the technique you'll use 80% of the time: recursive character chunking. Here's how it works. You give it a list of separators in priority order: double new line, single new line, period followed by space, and finally space. The algorithm tries to split on double new lines first, since those are paragraph boundaries. If a resulting chunk is still too big, it falls back to single new lines. Still too big? Split on sentences. Last resort: split on spaces. Take our refund policy. The double new line split gives us four paragraphs. Section one is 280 tokens; fits perfectly. Section two fits as well. Section three is 620 tokens; too big. So the algorithm falls back to sentence splitting on that section only, and section three becomes two chunks: 3A at 340 tokens and 3B at 280 tokens. Every chunk respects natural boundaries, with no mid-sentence cuts. This is the default in LangChain, LlamaIndex, and most RAG frameworks.

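The fallback logic can be sketched in a few lines of plain Python. Production splitters such as LangChain's `RecursiveCharacterTextSplitter` also re-attach separators, merge small adjacent pieces, and apply overlap; this sketch skips all of that to show only the recursion:

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the highest-priority separator; recurse on any piece
    that is still longer than max_len using the remaining separators."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return [c for c in chunks if c.strip()]

doc = ("Section 1: short paragraph.\n\n"
       "Section 2: this paragraph is long. It has two sentences.")
chunks = recursive_split(doc, max_len=40)
# Paragraph 1 fits whole; paragraph 2 is too big, so it falls back
# to the ". " separator and becomes two sentence-level chunks.
```

Note that `str.split` drops the separator itself, which is one of the details real splitters handle properly.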
5:26 Semantic chunking takes a completely different approach. Instead of splitting by character count or punctuation, it splits by meaning. Here's the idea. You take every sentence and compute its embedding. Then you compare adjacent sentences: if two sentences have high cosine similarity, they belong in the same chunk; when the similarity drops below a threshold, you place a new chunk boundary. Take a spec document. Sentences about pricing cluster together; sentences about the next topic cluster together. The algorithm detects the semantic shift and splits there. The result is chunks that are semantically coherent: every chunk is about one thing. The downside is cost. You're calling the embedding model for every sentence, not just every chunk. For most use cases, recursive character chunking gets you 90% of the benefit at a fraction of the cost.

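A minimal sketch of that idea, using a bag-of-words counter as a stand-in for a real embedding model (the threshold and sentences are illustrative, not tuned):

```python
import math
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent-sentence similarity drops
    below the threshold. Assumes at least one sentence."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])          # semantic shift: new boundary
        else:
            chunks[-1].append(cur)        # same topic: extend current chunk
    return chunks

sentences = [
    "the plan price is ten dollars",
    "the annual price is ninety dollars",
    "refunds require a support ticket",
]
chunks = semantic_chunks(sentences)
# The two pricing sentences share words, so they land in one chunk;
# the refund sentence shares none and opens a new chunk.
```

The cost point is visible in the structure: `toy_embed` runs once per sentence, whereas the other strategies embed only once per finished chunk.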
6:23 Here's how to choose. Start with recursive character chunking: chunk size a few hundred tokens, up to 600, with an overlap of 50 to 100. That's your baseline. If your documents are rigidly structured, like code files, log files, or CSV data, fixed-size chunking is fine. If your retrieval still isn't good enough after recursive chunking, try semantic chunking on the problem documents. And measure: run the same 20 test queries against each strategy and compare the similarity scores. The right chunking strategy is the one that puts the right information in front of the LLM. That's the full picture. If you want to go deeper, join my free live session every Friday at noon Eastern on Maven. I walk through this hands-on, answer questions, and show you how to build it yourself. Scan the QR code to join.

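The "measure" step can be sketched as a small harness: the same queries run against every strategy's chunks, keeping the best-chunk score per query. The scorer below is plain word overlap standing in for embedding similarity, so it shows the shape of the comparison only, not real dilution effects:

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score standing in for embedding similarity:
    the fraction of query words that appear in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def evaluate(strategies: dict[str, list[str]], queries: list[str]) -> dict[str, float]:
    """For each chunking strategy, average the best-chunk score over all queries."""
    results = {}
    for name, chunks in strategies.items():
        best = [max(overlap_score(q, c) for c in chunks) for q in queries]
        results[name] = sum(best) / len(best)
    return results

# Two hypothetical chunkings of the same toy document.
strategies = {
    "one_big_chunk": ["annual plans may be refunded subject to a usage deduction "
                      "shipping takes five days support is open on weekdays"],
    "paragraphs": ["annual plans may be refunded subject to a usage deduction",
                   "shipping takes five days",
                   "support is open on weekdays"],
}
queries = ["annual plan refund", "shipping time"]
scores = evaluate(strategies, queries)
```

With a real embedding model in place of `overlap_score`, the per-strategy averages are exactly the 0.61-versus-0.94 style comparison the video describes.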
