Learn RAG from Scratch

RAG Chunking Strategies Explained in 5 Minutes

Video 2 of 9 · 7:29

Chapters

  • 0:00 Why chunking matters
  • 0:40 What chunking is
  • 1:20 Chunk size comparison
  • 2:30 Overlap strategy
  • 3:10 Fixed vs recursive vs semantic chunking

Transcript

Auto-generated by YouTube.

0:03 You search your docs and get garbage. The LLM hallucinates. People blame the model, but the problem happened before any search ran: you chunked your documents wrong. Too big, and the embedding is diluted. Too small, and you lose the context that makes the answer useful. This video shows you exactly how to chunk your documents so retrieval actually works.

0:29 First, the basic idea. You have a document: a refund policy, a spec, whatever. You can't embed the whole thing as one vector. The embedding model has a token limit, typically 512 or 8,192 depending on the model. Even if the document fits, one embedding for 10 pages captures nothing specific. So you split the document into smaller pieces. Each piece is a chunk, and each chunk gets its own embedding. Now, when someone searches for "annual plan refund", the search returns the specific chunk about annual plan refunds, not a vague summary of the whole document. That's chunking. The question is: how do you decide where to split? Let's see why size matters.

1:19 Take a refund policy document. If you embed the whole page as one block, roughly 2,000 tokens, then a search for "annual plan refund" matches only loosely. Similarity: 0.61. Too much unrelated content dilutes the embedding. Now go to the other extreme: chunk by single sentence, 20 tokens each. The search finds "annual plans may be refunded." Similarity: 0.89. But the chunk has zero context. Refunded how? Under what conditions? The LLM can't answer the actual question. The sweet spot is paragraph level, 200 to 500 tokens. The search hits section 4.2, the paragraph about annual plan refunds. Similarity: 0.94. Enough detail to answer the question, enough context for the LLM to generate a useful response.

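The similarity scores quoted here come from cosine similarity between the query embedding and each chunk embedding. A minimal sketch of that comparison, with made-up vectors standing in for real model outputs (the numbers below are illustrative, not the 0.61/0.89/0.94 from the video):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of a query and two chunks.
query = [0.9, 0.1, 0.3]
focused_chunk = [0.8, 0.2, 0.4]   # on-topic paragraph: points the same way
diluted_chunk = [0.3, 0.9, 0.9]   # whole-page embedding: pulled off-axis

focused_score = cosine_similarity(query, focused_chunk)
diluted_score = cosine_similarity(query, diluted_chunk)
```

The focused chunk scores higher than the diluted one, which is the whole argument for right-sizing chunks.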
2:17 There's another problem. When you split a document into chunks, you create hard boundaries, and information that spans two chunks gets cut in half. Take this example. One paragraph says, "Annual plans are eligible for refunds." The next paragraph says, "Subject to a usage deduction." If your chunk boundary falls between those two, neither chunk has the full answer. The fix is overlap. You set a chunk size of 400 tokens and an overlap of 50 tokens. The last 50 tokens of chunk one repeat as the first 50 tokens of chunk two. Now both chunks contain "eligible for refunds, subject to a usage deduction." The search can find the complete answer regardless of which chunk it hits. Typical overlap is 10 to 20% of chunk size.

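The overlap scheme above can be sketched as a sliding window over tokens. For simplicity, "tokens" here are whitespace-split words rather than the embedding model's real tokenizer, and the demo uses a tiny window so the shared region is visible:

```python
def chunk_with_overlap(tokens: list[str], size: int = 400,
                       overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping size - overlap each time,
    so the last `overlap` tokens of one chunk reopen the next.
    Assumes size > overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

words = "annual plans are eligible for refunds subject to a usage deduction".split()
chunks = chunk_with_overlap(words, size=8, overlap=2)
# chunks[0][-2:] == chunks[1][:2] -- the 2-token overlap means the
# boundary phrase survives intact in at least one chunk.
```

With the video's 400/50 settings the same code steps 350 tokens at a time, so every boundary phrase up to 50 tokens long appears whole in some chunk.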
3:08 If you want to learn how to do this yourself, I run live sessions every Friday at noon Eastern. Scan the QR code on screen to join. Would love to see you there.

3:22 The simplest approach is fixed-size chunking. You pick a number, say 500 tokens, and split the document every 500 tokens. That's it. No logic about paragraphs, sentences, or meaning. Just count to 500 and cut. The advantage is predictability: every chunk is the same size. Easy to implement, easy to debug, easy to reason about for storage costs. The downside is obvious: you might cut in the middle of a sentence. "Annual plans are eligible" lands in one chunk, "refunds subject to" in the next, and neither chunk makes complete sense on its own. Fixed size works fine for homogeneous content like log files or structured data. For natural language documents, you want something smarter.

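A fixed-size splitter really is just slicing every N units. The sketch below slices by character count to make the mid-sentence-cut problem concrete:

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut the text every `size` characters, with no regard for boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

policy = "Annual plans are eligible for refunds subject to a usage deduction."
chunks = fixed_size_chunks(policy, size=40)
# The cut lands wherever character 40 happens to fall -- here, mid-word,
# splitting "subject" across two chunks.
```

No chunk is meaningful on its own, but the chunks concatenate back to the original text, which is why this approach is easy to reason about for logs and other homogeneous data.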
4:13 This is the technique you'll use 80% of the time: recursive character chunking. Here's how it works. You give it a list of separators in priority order: double new line, single new line, period followed by space, and finally space. The algorithm tries to split on double new lines first, since those are paragraph boundaries. If a resulting chunk is still too big, it falls back to single new lines. Still too big? Split on sentences. Last resort: split on spaces. Take our refund policy. The double new line split gives us four paragraphs. Section one is 280 tokens; fits perfectly. Section two fits as well. Section three is 620 tokens; too big. So the algorithm falls back to sentence splitting on that section only, and section three becomes two chunks: 3A at 340 tokens and 3B at 280 tokens. Every chunk respects natural boundaries, with no mid-sentence cuts. This is the default in LangChain, LlamaIndex, and most RAG frameworks.

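The fallback logic can be sketched in a few lines of plain Python. Production splitters such as LangChain's `RecursiveCharacterTextSplitter` also re-attach separators, merge small adjacent pieces, and apply overlap; this sketch skips all of that to show only the recursion:

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the highest-priority separator; recurse on any piece
    that is still longer than max_len using the remaining separators."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return [c for c in chunks if c.strip()]

doc = ("Section 1: short paragraph.\n\n"
       "Section 2: this paragraph is long. It has two sentences.")
chunks = recursive_split(doc, max_len=40)
# Paragraph 1 fits whole; paragraph 2 is too big, so it falls back
# to the ". " separator and becomes two sentence-level chunks.
```

Note that `str.split` drops the separator itself, which is one of the details real splitters handle properly.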
5:26 Semantic chunking takes a completely different approach. Instead of splitting by character count or punctuation, it splits by meaning. Here's the idea. You take every sentence and compute its embedding. Then you compare adjacent sentences: if two sentences have high cosine similarity, they belong in the same chunk; when the similarity drops below a threshold, you place a new chunk boundary. Take a spec document. Sentences about pricing cluster together; sentences about the next topic cluster together. The algorithm detects the semantic shift and splits there. The result is chunks that are semantically coherent: every chunk is about one thing. The downside is cost. You're calling the embedding model for every sentence, not just every chunk. For most use cases, recursive character chunking gets you 90% of the benefit at a fraction of the cost.

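A minimal sketch of that idea, using a bag-of-words counter as a stand-in for a real embedding model (the threshold and sentences are illustrative, not tuned):

```python
import math
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk whenever adjacent-sentence similarity drops
    below the threshold. Assumes at least one sentence."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])          # semantic shift: new boundary
        else:
            chunks[-1].append(cur)        # same topic: extend current chunk
    return chunks

sentences = [
    "the plan price is ten dollars",
    "the annual price is ninety dollars",
    "refunds require a support ticket",
]
chunks = semantic_chunks(sentences)
# The two pricing sentences share words, so they land in one chunk;
# the refund sentence shares none and opens a new chunk.
```

The cost point is visible in the structure: `toy_embed` runs once per sentence, whereas the other strategies embed only once per finished chunk.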
6:23 Here's how to choose. Start with recursive character chunking: chunk size a few hundred tokens, up to 600, with an overlap of 50 to 100. That's your baseline. If your documents are rigidly structured, like code files, log files, or CSV data, fixed-size chunking is fine. If your retrieval still isn't good enough after recursive chunking, try semantic chunking on the problem documents. And measure: run the same 20 test queries against each strategy and compare the similarity scores. The right chunking strategy is the one that puts the right information in front of the LLM. That's the full picture. If you want to go deeper, join my free live session every Friday at noon Eastern on Maven. I walk through this hands-on, answer questions, and show you how to build it yourself. Scan the QR code to join.

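The "measure" step can be sketched as a small harness: the same queries run against every strategy's chunks, keeping the best-chunk score per query. The scorer below is plain word overlap standing in for embedding similarity, so it shows the shape of the comparison only, not real dilution effects:

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score standing in for embedding similarity:
    the fraction of query words that appear in the chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def evaluate(strategies: dict[str, list[str]], queries: list[str]) -> dict[str, float]:
    """For each chunking strategy, average the best-chunk score over all queries."""
    results = {}
    for name, chunks in strategies.items():
        best = [max(overlap_score(q, c) for c in chunks) for q in queries]
        results[name] = sum(best) / len(best)
    return results

# Two hypothetical chunkings of the same toy document.
strategies = {
    "one_big_chunk": ["annual plans may be refunded subject to a usage deduction "
                      "shipping takes five days support is open on weekdays"],
    "paragraphs": ["annual plans may be refunded subject to a usage deduction",
                   "shipping takes five days",
                   "support is open on weekdays"],
}
queries = ["annual plan refund", "shipping time"]
scores = evaluate(strategies, queries)
```

With a real embedding model in place of `overlap_score`, the per-strategy averages are exactly the 0.61-versus-0.94 style comparison the video describes.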
