RAG Re-ranking: Bi-Encoder vs Cross-Encoder Explained
Video 4 of 9 · 6:52
Chapters
- 0:00 The ranking problem
- 0:50 How bi-encoders work
- 1:45 Cross-encoders: reading together
- 2:45 Retrieve then re-rank pattern
Transcript
Auto generated by YouTube. Click any timestamp to jump to that moment.
- 0:03 A user asks about the refund policy for annual plans. Vector search returns five results. The pricing page is number one. The refund policy is buried further down.
- 0:14 The embeddings are close, but close is not the same as correct. Vector search measures distance in embedding space. It does not measure whether a document actually answers the question.
- 0:26 This is the gap that kills RAG in production. The results look relevant, the scores are high, but the answer is not on top. To understand the fix, you need to understand the architecture.
- 0:40 Most RAG systems use bi-encoders for retrieval. A bi-encoder encodes the query and each document separately. The query goes through one encoder. Each document goes through the encoder independently. Both come out as vectors. Then you compare them with cosine similarity.
- 0:58 This is fast. You can precompute all your document embeddings once and store them. When a query comes in, you encode it and compare it against everything in the index.
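A minimal sketch of that flow, assuming the sentence-transformers library and an illustrative model, neither of which the video names:

```python
# Bi-encoder sketch: encode query and documents independently,
# then compare with cosine similarity. Model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder works

documents = [
    "Pricing page: annual plans start at $99 per year.",
    "Refund policy: annual plans can be refunded within 30 days.",
]

# Precompute document embeddings once; in production these live in a vector DB.
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# At query time, encode only the query and compare against everything.
query_embedding = model.encode(
    "What is the refund policy for annual plans?", convert_to_tensor=True
)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```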
- 1:10 The trade-off: the encoder never sees the query and the document together. It cannot reason about how they relate. It just compares two points in vector space.
- 1:20 That is why a pricing page about annual plans scores high for a question about the refund policy for annual plans. The words overlap.
- 1:32 A cross-encoder takes a completely different approach. Instead of encoding the query and document separately, it reads them together as one input. The query and document get concatenated and go through a single transformer as one sequence.
- 1:48 The model sees every word in the query next to every word in the document. It can reason about their relationship directly. The output is a relevance score, not a vector: not an embedding, but a direct prediction of how relevant this document is to this query.
- 2:06 This is much more accurate. A cross-encoder understands that "refund policy for annual plans" is asking about refunds, not pricing, even though both documents mention annual plans. But it is slow. You cannot precompute anything. Every query-document pair requires a full forward pass through the model.
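A sketch of the same comparison with a cross-encoder, here one of the MS MARCO models the video mentions later as the standard starting point:

```python
# Cross-encoder sketch: each (query, document) pair goes through the model
# together, producing a direct relevance score instead of an embedding.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the refund policy for annual plans?"
documents = [
    "Pricing page: annual plans start at $99 per year.",
    "Refund policy: annual plans can be refunded within 30 days.",
]

# One forward pass per pair: accurate, but nothing can be precomputed.
scores = model.predict([(query, doc) for doc in documents])
for doc, score in zip(documents, scores):
    print(f"{score:.2f}  {doc}")
```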
- 2:27 Here is the technique that fixes this: cross-encoder reranking. It combines the speed of bi-encoders with the accuracy of cross-encoders in a two-stage pipeline.
- 2:40 Stage one: your bi-encoder retrieves the top 20 candidates. This is fast, milliseconds. You get a broad set of potentially relevant documents.
- 2:51 Stage two: the cross-encoder rescores those 20 candidates. It reads each one paired with the original query, so it can reason about actual relevance. That is 20 forward passes instead of one for every document in your entire database.
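Putting the two stages together, a hedged sketch that combines the two models above; the toy corpus and the cutoffs are illustrative:

```python
# Two-stage pipeline sketch: retrieve broadly with the bi-encoder,
# then rerank the small candidate set precisely with the cross-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Pricing page: annual plans start at $99 per year.",
    "Refund policy: annual plans can be refunded within 30 days.",
    "Support hours: weekdays 9am to 5pm Eastern.",
]
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query: str, retrieve_k: int = 20, final_k: int = 5):
    # Stage 1: fast vector search over the whole corpus (milliseconds).
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings,
                                top_k=retrieve_k)[0]
    candidates = [corpus[hit["corpus_id"]] for hit in hits]

    # Stage 2: one forward pass per (query, candidate) pair.
    scores = cross_encoder.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores),
                      key=lambda pair: pair[1], reverse=True)
    return reranked[:final_k]

for doc, score in search("What is the refund policy for annual plans?"):
    print(f"{score:.2f}  {doc}")
```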
- 3:08 The rankings change dramatically. Watch the refund policy example. Before reranking, the pricing page sits at number one, and the refund policy is stuck lower down with a similarity score of 0.58.
- 3:25 After the cross-encoder reranks, the refund policy jumps to number one with a score of 0.94. The pricing page drops to number four with 0.31. Same documents, same query, completely different ordering. The cross-encoder understood what the user was actually asking for.
- 3:45 If you want to learn how to build this yourself, I run free live sessions every Friday at noon Eastern. Scan the QR code on screen to join. I would love to see you there.
- 4:01 Why not just use cross-encoders for everything? Math. If you have 100,000 documents and use a cross-encoder on every one, that is 100,000 forward passes per query. At 50 milliseconds each, that is 83 minutes per search. Completely unusable.
- 4:20 A bi-encoder precomputes all the embeddings once. At query time, you encode the query and do a vector lookup. The entire search takes around 100 milliseconds.
- 4:34 Reranking is the middle ground. The bi-encoder narrows 100,000 documents to 20 in milliseconds. The cross-encoder reranks those 20 documents in about 1 second. Total latency: just over a second. That is the sweet spot. You get 95% of the accuracy of a full cross-encoder search at a fraction of the cost.
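The arithmetic, spelled out; the 50 ms per forward pass and the ~100 ms retrieval figures are the video's own estimates:

```python
# Back-of-the-envelope latency from the video's numbers.
docs, pass_ms = 100_000, 50

cross_only_min = docs * pass_ms / 1000 / 60    # score every document
print(f"{cross_only_min:.0f} min/query")       # ~83 minutes: unusable

pipeline_ms = 100 + 20 * pass_ms               # vector search + 20 passes
print(f"{pipeline_ms} ms/query")               # 1100 ms: just over a second
```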
- 4:56 You do not have to build a cross-encoder from scratch. Cohere Rerank is the most popular hosted option. You send a query and a list of documents, and it returns them reordered by relevance, with scores. Three lines of code.
- 5:12 Jina Reranker is another option, open source. Voyage AI focuses on domain-specific reranking. And if you want to self-host, the cross-encoder models on Hugging Face work well. The MS MARCO models are the standard starting point.
- 5:33 Pick the one that fits your stack. The pattern is the same across all of them: retrieve broadly, then rerank precisely.
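For the hosted route, a sketch of Cohere's rerank call; the model name is illustrative and the SDK surface may have changed, so treat this as the shape of the call rather than a definitive implementation:

```python
# Hosted reranking sketch with Cohere's Python SDK (API details may vary).
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

results = co.rerank(
    model="rerank-english-v3.0",  # model name is illustrative
    query="What is the refund policy for annual plans?",
    documents=[
        "Pricing page: annual plans start at $99 per year.",
        "Refund policy: annual plans can be refunded within 30 days.",
    ],
    top_n=2,
)
for hit in results.results:
    print(f"{hit.relevance_score:.2f}  document index {hit.index}")
```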
- 5:42 Let's zoom out. Without reranking, your pipeline looks like this: the user asks a question, the bi-encoder retrieves the nearest embeddings, and the results go straight to the LLM. The LLM works with whatever it gets, even if the best document is buried at position 4.
- 5:58 With reranking, you add one step. The cross-encoder scores each candidate paired with the query and reorders them. Now the LLM sees the most relevant documents first. Better context in, better answers out.
- 6:15 If your RAG app returns technically related but not quite right answers, reranking is probably the fix. Retrieve broadly, rerank precisely. That is how users find the right result.
- 6:25 That's the big picture. If you want to go deeper, join my free live session this Friday at noon Eastern on Maven. I walk through this hands-on, answer questions, and show you how to build it yourself. Scan the QR code to join.