RAG Re-ranking: Bi-Encoder vs Cross-Encoder Explained
Video 4 of 9 · 6:52
Chapters
- 0:00 The ranking problem
- 0:50 How bi-encoders work
- 1:45 Cross-encoders: reading together
- 2:45 Retrieve then re-rank pattern
Transcript
Auto generated by YouTube. Click any timestamp to jump to that moment.
- 0:03 A user asks about the refund policy for annual plans. Vector search returns five results. The pricing page is number one. The refund policy is buried further down.
- 0:14 The embeddings are close, but close is not the same as correct. Vector search measures distance in embedding space. It does not measure whether a document actually answers the question.
- 0:26 This is the gap that kills RAG in production. The results look relevant, the scores are high, but the answer is not on top. To understand the fix, you need to understand the architecture.
- 0:40 Most RAG systems use bi-encoders for retrieval. A bi-encoder encodes the query and each document separately. The query goes through one encoder. Each document goes through the encoder independently. Both come out as vectors. Then you compare them with cosine similarity.
- 0:58 This is fast. You can precompute all your document embeddings once and store them. When a query comes in, you encode it and compare it against everything in the index.
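A minimal sketch of that flow, assuming the sentence-transformers library and an illustrative model, neither of which the video names:

```python
# Bi-encoder sketch: encode query and documents independently,
# then compare with cosine similarity. Model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder works

documents = [
    "Pricing page: annual plans start at $99 per year.",
    "Refund policy: annual plans can be refunded within 30 days.",
]

# Precompute document embeddings once; in production these live in a vector DB.
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# At query time, encode only the query and compare against everything.
query_embedding = model.encode(
    "What is the refund policy for annual plans?", convert_to_tensor=True
)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```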
- 1:10 The trade-off: the encoder never sees the query and the document together. It cannot reason about how they relate. It just compares two points in vector space.
- 1:20 That is why a pricing page about annual plans scores high for a question about the refund policy for annual plans. The words overlap.
- 1:32 A cross-encoder takes a completely different approach. Instead of encoding the query and document separately, it reads them together as one input. The query and document get concatenated and go through a single transformer as one sequence.
- 1:48 The model sees every word in the query next to every word in the document. It can reason about their relationship directly. The output is a relevance score, not a vector: not an embedding, but a direct prediction of how relevant this document is to this query.
- 2:06 This is much more accurate. A cross-encoder understands that "refund policy for annual plans" is asking about refunds, not pricing, even though both documents mention annual plans. But it is slow. You cannot precompute anything. Every query-document pair requires a full forward pass through the model.
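A sketch of the same comparison with a cross-encoder, here one of the MS MARCO models the video mentions later as the standard starting point:

```python
# Cross-encoder sketch: each (query, document) pair goes through the model
# together, producing a direct relevance score instead of an embedding.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the refund policy for annual plans?"
documents = [
    "Pricing page: annual plans start at $99 per year.",
    "Refund policy: annual plans can be refunded within 30 days.",
]

# One forward pass per pair: accurate, but nothing can be precomputed.
scores = model.predict([(query, doc) for doc in documents])
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```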
- 2:27 Here is the technique that fixes this: cross-encoder reranking. It combines the speed of bi-encoders with the accuracy of cross-encoders in a two-stage pipeline.
- 2:40 Stage one: your bi-encoder retrieves the top 20 candidates. This is fast, milliseconds. You get a broad set of potentially relevant documents.
- 2:51 Stage two: the cross-encoder rescores those 20 candidates. It reads each one paired with the original query, so it can reason about actual relevance. That is 20 forward passes instead of one for every document in your entire database.
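Putting the two stages together, a hedged sketch that combines the two models above; the toy corpus and the cutoffs are illustrative:

```python
# Two-stage pipeline sketch: retrieve broadly with the bi-encoder,
# then rerank the small candidate set precisely with the cross-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Pricing page: annual plans start at $99 per year.",
    "Refund policy: annual plans can be refunded within 30 days.",
    "Support hours: weekdays 9am to 5pm Eastern.",
]
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query: str, retrieve_k: int = 20, final_k: int = 5):
    # Stage 1: fast vector search over the whole corpus (milliseconds).
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings,
                                top_k=retrieve_k)[0]
    candidates = [corpus[hit["corpus_id"]] for hit in hits]

    # Stage 2: one forward pass per (query, candidate) pair.
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores),
                      key=lambda pair: pair[1], reverse=True)
    return reranked[:final_k]

for doc, score in search("What is the refund policy for annual plans?"):
    print(f"{score:.2f}  {doc}")
```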
- 3:08 The rankings change dramatically. Watch the refund policy example. Before reranking, the pricing page sits at number one, and the refund policy is stuck lower down with a similarity score of 0.58.
- 3:25 After the cross-encoder reranks, the refund policy jumps to number one with a score of 0.94. The pricing page drops to number four with 0.31. Same documents, same query, completely different ordering. The cross-encoder understood what the user was actually asking for.
- 3:45 If you want to learn how to build this yourself, I run free live sessions every Friday at noon Eastern. Scan the QR code on screen to join. I would love to see you there.
- 4:01 Why not just use cross-encoders for everything? Math. If you have 100,000 documents and use a cross-encoder on every one, that is 100,000 forward passes per query. At 50 milliseconds each, that is 83 minutes per search. Completely unusable.
- 4:20 A bi-encoder precomputes all the embeddings once. At query time, you encode the query and do a vector lookup. The entire search takes around 100 milliseconds.
- 4:34 Reranking is the middle ground. The bi-encoder narrows 100,000 documents to 20 in milliseconds. The cross-encoder reranks those 20 documents in about 1 second. Total latency: just over a second. That is the sweet spot. You get 95% of the accuracy of a full cross-encoder search at a fraction of the cost.
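The arithmetic, spelled out; the 50 ms per forward pass and the ~100 ms retrieval figures are the video's own estimates:

```python
# Back-of-the-envelope latency from the video's numbers.
docs, pass_ms = 100_000, 50

cross_only_min = docs * pass_ms / 1000 / 60    # score every document
print(f"{cross_only_min:.0f} min/query")       # ~83 minutes: unusable

pipeline_ms = 100 + 20 * pass_ms               # vector search + 20 passes
print(f"{pipeline_ms} ms/query")               # 1100 ms: just over a second
```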
- 4:56 You do not have to build a cross-encoder from scratch. Cohere Rerank is the most popular hosted option. You send a query and a list of documents, and it returns them reordered by relevance, with scores. Three lines of code.
- 5:12 Jina Reranker is another option, open source. Voyage AI focuses on domain-specific reranking. And if you want to self-host, the cross-encoder models on Hugging Face work well. The MS MARCO models are the standard starting point.
- 5:33 Pick the one that fits your stack. The pattern is the same across all of them: retrieve broadly, then rerank precisely.
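For the hosted route, a sketch of Cohere's rerank call; the model name is illustrative and the SDK surface may have changed, so treat this as the shape of the call rather than a definitive implementation:

```python
# Hosted reranking sketch with Cohere's Python SDK (API details may vary).
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

results = co.rerank(
    model="rerank-english-v3.0",  # model name is illustrative
    query="What is the refund policy for annual plans?",
    documents=[
        "Pricing page: annual plans start at $99 per year.",
        "Refund policy: annual plans can be refunded within 30 days.",
    ],
    top_n=2,
)
for hit in results.results:
    print(f"{hit.relevance_score:.2f}  document index {hit.index}")
```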
- 5:42 Let's zoom out. Without reranking, your pipeline looks like this: the user asks a question, the bi-encoder retrieves the nearest embeddings, and the results go straight to the LLM. The LLM works with whatever it gets, even if the best document is buried at position 4.
- 5:58 With reranking, you add one step. The cross-encoder scores each candidate paired with the query and reorders them. Now the LLM sees the most relevant documents first. Better context in, better answers out.
- 6:15 If your RAG app returns technically related but not quite right answers, reranking is probably the fix. Retrieve broadly, rerank precisely. That is how users find the right result.
- 6:25 That's the big picture. If you want to go deeper, join my free live session this Friday at noon Eastern on Maven. I walk through this hands-on, answer questions, and show you how to build it yourself. Scan the QR code to join.