Learn RAG from Scratch
RAG Explained in 10 Mins
Video 1 of 9 · 9:51
Chapters
- 0:00 Why RAG matters
- 0:40 How basic RAG works
- 1:50 The problem with basic RAG
- 2:30 The full architecture
- 3:15 Query Construction (Text-to-SQL)
- 4:30 Query Translation (Multi-Query)
- 5:45 Routing
- 6:30 Indexing (Chunk Optimization)
- 7:30 Retrieval (Re-ranking)
- 8:30 Generation (Self-RAG)
- 9:30 The complete system
Transcript
Auto generated by YouTube. Click any timestamp to jump to that moment.
- 0:03 Ask an LLM about your refund policy and it will confidently tell you something like "full refund within 60 days." Sounds right. Except your actual policy is a pro-rated refund minus two months. The LLM doesn't know your data; it just guesses. So you add RAG: you embed your docs and search them. But basic RAG retrieves your pricing page instead of the refund policy, and the user gets the annual plan price per year. Real information, wrong answer. Production RAG routes the question to the right database, retrieves the exact policy clause, and generates the correct answer.
- 0:41 Before we improve RAG, let's see how basic RAG works. You start with your documents: a refund policy, a product guide, whatever. You break each document into smaller pieces called chunks. Think of it like cutting a book into paragraphs. Each chunk gets converted into a list of numbers called an embedding. These numbers capture the meaning of the text; similar ideas get similar numbers. All those embeddings get stored in a vector database like Pinecone or Chroma. Now a user asks a question: "What is the refund policy for annual plans?" That question also gets converted into an embedding using the same model. The database compares the question embedding against all stored embeddings and finds the closest matches. The top results, usually three to five chunks, get pulled out. Those chunks go into the LLM as context along with the original question. The LLM reads the chunks and writes a coherent answer based on what it found. That's basic RAG: document in, chunks out, embeddings stored, question matched, answer generated. The six modules we cover next are all about making each of those steps work better.
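In code, that whole loop is only a few steps. A minimal sketch, with placeholder embed() and generate() functions standing in for a real embedding model and LLM client (both names are illustrative, not a specific library's API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: deterministic random vector per text.
    # A real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def generate(question: str, context: list) -> str:
    # Placeholder for the LLM call that writes the answer from the chunks.
    return f"Answer to {question!r}, grounded in {len(context)} chunks."

# 1. Chunk the documents (here, each string is already one chunk).
chunks = [
    "Refunds on annual plans are pro-rated minus two months of usage.",
    "Annual plans are billed once per year.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Embed the question with the same model and find the closest chunks.
question = "What is the refund policy for annual plans?"
q = embed(question)
top = sorted(index, key=lambda item: float(q @ item[1]), reverse=True)[:3]

# 3. Send the top chunks to the LLM as context, along with the question.
print(generate(question, [chunk for chunk, _ in top]))
```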
- 2:03 Most RAG tutorials show you the happy path: take some docs, embed them, retrieve, done. But production RAG is a different animal. Your retrieval misses relevant docs. Your chunks are the wrong size. Your users ask vague questions that don't match anything. This video covers the full production RAG architecture: six modules that take you from "it works in a demo" to "it works at scale."
- 2:30 Here's the full picture. A question comes in on the left. Before it hits any data source, it goes through routing and query translation. Then it gets sent to one or more data sources: vector stores, relational databases, graph databases. The retrieved documents get ranked, refined, and sometimes re-retrieved. Then the generation step produces an answer, and it can loop back to re-retrieve if the quality isn't good enough. Six modules, each solving a specific failure mode.
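As a skeleton, that pipeline is a loop over those stages. A rough sketch where every function is a stub standing in for one of the six modules covered below; the control flow, not the stubs, is the point:

```python
def route(question): return "vector_store"            # routing
def translate(question): return [question]            # query translation
def search(source, queries): return ["chunk-1"]       # data sources
def rank(question, docs): return docs                 # re-ranking / refinement
def generate(question, docs): return "draft answer"   # generation
def good_enough(answer, docs): return True            # grounding check

def answer(question, max_loops=2):
    draft = ""
    for _ in range(max_loops):
        source = route(question)
        queries = translate(question)
        docs = rank(question, search(source, queries))
        draft = generate(question, docs)
        if good_enough(draft, docs):
            return draft               # grounded: return it
    return draft                       # best effort once the loop budget is spent

print(answer("What is the refund policy for annual plans?"))
```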
- 3:03 Not all your data lives in a vector store. Take this example. A user asks, "What was Q3 revenue for the enterprise segment?" That question needs a SQL query, not a similarity search. The LLM parses the intent, extracts the metric, the time period, and the segment. Then it generates the SQL: SELECT SUM(amount) FROM sales WHERE quarter = 'Q3 2024' AND segment = 'enterprise'. The query hits your SQL database, the matching rows come back, and the total is 2.4 million. That's query construction: turning natural language into the right query for the right data source. Other approaches include text-to-Cypher for graph databases and self-query retrievers for automatic metadata filtering.
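A minimal sketch of that flow, using SQLite and a stubbed llm() call in place of a real model. The schema, prompt, and row values are assumptions for illustration:

```python
import sqlite3

def llm(prompt: str) -> str:
    # Placeholder: a real model would generate this SQL from the prompt.
    return ("SELECT SUM(amount) FROM sales "
            "WHERE quarter = 'Q3 2024' AND segment = 'enterprise'")

# Toy database standing in for your real sales warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL, quarter TEXT, segment TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1_200_000, "Q3 2024", "enterprise"),
    (1_200_000, "Q3 2024", "enterprise"),
    (500_000,   "Q3 2024", "smb"),
])

question = "What was Q3 revenue for the enterprise segment?"
sql = llm(f"Schema: sales(amount, quarter, segment)\nQuestion: {question}\nSQL:")
total = conn.execute(sql).fetchone()[0]
print(f"Total: {total / 1e6:.1f} million")  # Total: 2.4 million
```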
- 4:00 Real users ask vague questions. "How does auth work?" That could mean several different things. Multi-query fixes this. The LLM takes that vague question and generates three specific versions: "What authentication protocols does the API support?" "How do users log in and get session tokens?" "What is the OAuth 2.0 flow for third-party apps?" Each version matches different documents: the API reference, the login guide, the OAuth setup docs. You combine all three result sets, and now you have coverage that no single query could have achieved. Other techniques include RAG-Fusion for merging ranked results, decomposition for breaking complex questions into sub-questions, step-back prompting for asking more general versions first, and HyDE for retrieving with hypothetical answers. If you want to learn how to do this hands-on, I run free live sessions every Friday at noon Eastern. Scan the QR code on screen to join. Would love to see you there.
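A minimal sketch of the multi-query pattern: expand the vague question into specific variants, retrieve for each, and union the results. Both expand() and retrieve() are placeholders for an LLM call and a vector search, and the document names are made up:

```python
def expand(question: str) -> list:
    # Placeholder: a real LLM would generate these specific variants.
    return [
        "What authentication protocols does the API support?",
        "How do users log in and get session tokens?",
        "What is the OAuth 2.0 flow for third-party apps?",
    ]

def retrieve(query: str) -> list:
    # Placeholder vector search over a fake index of document names.
    fake_index = {
        "authentication protocols": ["api-reference"],
        "log in": ["login-guide"],
        "oauth": ["oauth-setup-docs"],
    }
    q = query.lower()
    return [doc for key, docs in fake_index.items() if key in q for doc in docs]

question = "How does auth work?"
combined = []
for variant in expand(question):
    for doc in retrieve(variant):
        if doc not in combined:   # union the result sets, de-duplicated
            combined.append(doc)
print(combined)  # coverage no single query would have achieved
```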
- 5:10 Different questions need different data sources. "What was last quarter's revenue?" That's financial data; it lives in a SQL database. The LLM sees the intent, classifies it as financial data, and routes it to the sales database. Now take a different question: "Explain the refund policy." That's a policy question. The router classifies it as unstructured text and sends it to the vector store instead. Two questions, two completely different data sources. The router makes sure each one lands in the right place. Semantic routing is another approach: instead of the LLM classifying, you embed the question and match it against predefined prompt templates.
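A minimal sketch of the semantic variant: embed the question, embed a short description of each route, and pick the closest. The embed() stub stands in for a real embedding model, so the printed routes are only meaningful with real embeddings:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding (deterministic random); use a real model in practice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

# One short description per destination; these get embedded once, up front.
routes = {
    "sales_db":     "revenue, quarterly financials, structured sales metrics",
    "vector_store": "policies, documentation, unstructured support text",
}
route_vectors = {name: embed(desc) for name, desc in routes.items()}

def route(question: str) -> str:
    q = embed(question)
    return max(route_vectors, key=lambda name: float(q @ route_vectors[name]))

# With real embeddings, the first lands in sales_db, the second in vector_store.
print(route("What was last quarter's revenue?"))
print(route("Explain the refund policy."))
```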
- 5:59 Chunk size makes or breaks your retrieval. Take a refund policy page. If you embed the whole page as one chunk, that's 2,000 tokens. When someone searches "annual plan refund," the embedding matches loosely: similarity score 0.61. The chunk has too much unrelated content, diluting the match. Now split the same document into paragraph-level chunks, 300 tokens each. The same search now hits section 4.2 directly: similarity score 0.94. Same match, same document, same query. The only difference is chunk size. Other indexing strategies include multi-representation indexing, which stores both a summary and the full document, specialized embeddings tuned for your domain, and RAPTOR, which builds a tree of summaries at different abstraction levels.
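A minimal sketch of paragraph-level chunking with a token budget. The 300-token target comes from the example above; the whitespace split is a crude stand-in for a real tokenizer, and the page text is invented:

```python
def chunk_by_paragraph(document: str, max_tokens: int = 300) -> list:
    """Greedily pack paragraphs into chunks of at most ~max_tokens."""
    chunks, current, count = [], [], 0
    for para in document.split("\n\n"):
        tokens = len(para.split())  # whitespace count as a rough token proxy
        if current and count + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

page = ("4.1 Pricing tiers and billing cycles ...\n\n"
        "4.2 Annual plan refunds are pro-rated minus two months of usage.\n\n"
        "4.3 Enterprise support terms ...")
for i, chunk in enumerate(chunk_by_paragraph(page)):
    print(i, chunk[:50])
```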
- 6:57 Vector similarity is not the same as relevance. A user asks how to cancel an annual subscription. Vector search returns five documents: pricing tiers, billing FAQ, support guide, cancellation policy, and onboarding docs, all scored by cosine similarity. The cancellation policy is ranked fourth at 0.83. It should be first. A re-ranker fixes this. It takes those five results and re-scores them by actual relevance to the question. The cancellation policy jumps from number four to number one. The pricing page drops to four. Same results, better ordering. That's the difference between finding a result and finding the right result. Other techniques include CRAG for evaluating retrieval quality and active retrieval for going back to search again when the results are not good enough.
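A minimal sketch of that two-stage pattern: take the vector-search candidates in their original order, then re-sort them with a second relevance score. The relevance() stub stands in for a cross-encoder, and its scores are made up for illustration:

```python
def relevance(question: str, doc: str) -> float:
    # Placeholder for a cross-encoder that scores (question, document)
    # pairs jointly; these scores are invented for the example.
    made_up_scores = {
        "cancellation policy": 0.97,
        "billing FAQ": 0.71,
        "support guide": 0.66,
        "pricing tiers": 0.41,
        "onboarding docs": 0.22,
    }
    return made_up_scores.get(doc, 0.0)

question = "How do I cancel an annual subscription?"
# Candidates in vector-search order: cancellation policy ranked fourth.
candidates = ["pricing tiers", "billing FAQ", "support guide",
              "cancellation policy", "onboarding docs"]

reranked = sorted(candidates, key=lambda d: relevance(question, d), reverse=True)
print(reranked)  # cancellation policy first: same results, better ordering
```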
- 7:56 The LLM generates an answer, but how do you know it's grounded in the actual documents? Self-RAG adds a verification loop. Here's how it works. A user asks, "What's the cancellation fee for annual plans?" The LLM generates, "There is typically a 20% cancellation fee." Sounds reasonable, but the evaluation step checks that claim against the retrieved source documents. The source says pro-rated refund, not cancellation fee. The answer is not grounded. So the system re-retrieves. This time it pulls the exact clause from the cancellation policy. The LLM generates again: pro-rated refund minus two months of usage. The evaluation step checks again. This time it matches the source, so we return the answer. Another technique is Rewrite-Retrieve-Read (RRR), which rewrites the query, retrieves again, and reads the new results in a similar loop.
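A minimal sketch of that generate-evaluate-re-retrieve loop. All three helpers are placeholders wired to reproduce the example above; a real system would call a retriever, an LLM, and a grounding evaluator:

```python
def retrieve(question: str, attempt: int) -> list:
    # Placeholder: the second attempt pulls the exact cancellation clause.
    if attempt == 0:
        return ["Annual plans are billed once per year."]
    return ["Refunds on annual plans are pro-rated minus two months of usage."]

def generate(question: str, sources: list) -> str:
    # Placeholder LLM: hallucinates unless the right clause is in context.
    if any("pro-rated" in s for s in sources):
        return "You get a pro-rated refund minus two months of usage."
    return "There is typically a 20% cancellation fee."

def grounded(answer: str, sources: list) -> bool:
    # Placeholder evaluator: is the key claim supported by a source?
    return "pro-rated" in answer and any("pro-rated" in s for s in sources)

question = "What's the cancellation fee for annual plans?"
answer = ""
for attempt in range(3):  # cap the loop so it can't spin forever
    sources = retrieve(question, attempt)
    answer = generate(question, sources)
    if grounded(answer, sources):
        break  # the answer matches the source: return it
print(answer)
```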
- 8:59 Here's the complete production RAG architecture. A question comes in, and routing sends it to the right place. Query translation refines the question. The right data sources get searched. Results get ranked and refined. The LLM generates an answer. And if the answer isn't good enough, the system loops back. Six modules, each independently improvable. Build it incrementally.
- 9:26 That's the full picture. If you want to go deeper, join my free live session every Friday at noon Eastern on Maven. I walk through this hands-on, answer questions, and show you how to build it yourself. Scan the QR code to join.
Want the next one in your inbox?
Join 1,000+ Product Managers getting one deep dive every Friday.