Learn RAG from Scratch
RAG Explained in 10 Mins
Video 1 of 9 · 9:51
Chapters
- 0:00 Why RAG matters
- 0:40 How basic RAG works
- 1:50 The problem with basic RAG
- 2:30 The full architecture
- 3:15 Query Construction (Text-to-SQL)
- 4:30 Query Translation (Multi-Query)
- 5:45 Routing
- 6:30 Indexing (Chunk Optimization)
- 7:30 Retrieval (Re-ranking)
- 8:30 Generation (Self-RAG)
- 9:30 The complete system
Transcript
Auto generated by YouTube. Click any timestamp to jump to that moment.
- 0:03 Ask an LLM about your refund policy and it will confidently tell you something like "full refund within 60 days." Sounds right. Except your actual policy is a pro-rated refund minus two months. The LLM doesn't know your data; it just guesses. So you add RAG: you embed your docs and search them. But basic RAG retrieves your pricing page instead of the refund policy, and the user gets the annual plan price per year. Real information, wrong answer. Production RAG routes the question to the right database, retrieves the exact policy clause, and generates the correct answer.
- 0:41 Before we improve RAG, let's see how basic RAG works. You start with your documents: a refund policy, a product guide, whatever. You break each document into smaller pieces called chunks. Think of it like cutting a book into paragraphs. Each chunk gets converted into a list of numbers called an embedding. These numbers capture the meaning of the text; similar ideas get similar numbers. All those embeddings get stored in a vector database like Pinecone or Chroma. Now a user asks a question: "What is the refund policy for annual plans?" That question also gets converted into an embedding using the same model. The database compares the question embedding against all stored embeddings and finds the closest matches. The top results, usually three to five chunks, get pulled out. Those chunks go into the LLM as context along with the original question. The LLM reads the chunks and writes a coherent answer based on what it found. That's basic RAG: document in, chunks out, embeddings stored, question matched, answer generated. The six modules we cover next are all about making each of those steps work better.
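In code, that whole loop is only a few steps. A minimal sketch, with placeholder embed() and generate() functions standing in for a real embedding model and LLM client (both names are illustrative, not a specific library's API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: deterministic random vector per text.
    # A real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def generate(question: str, context: list) -> str:
    # Placeholder for the LLM call that writes the answer from the chunks.
    return f"Answer to {question!r}, grounded in {len(context)} chunks."

# 1. Chunk the documents (here, each string is already one chunk).
chunks = [
    "Refunds on annual plans are pro-rated minus two months of usage.",
    "Annual plans are billed once per year.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Embed the question with the same model and find the closest chunks.
question = "What is the refund policy for annual plans?"
q = embed(question)
top = sorted(index, key=lambda item: float(q @ item[1]), reverse=True)[:3]

# 3. Send the top chunks to the LLM as context, along with the question.
print(generate(question, [chunk for chunk, _ in top]))
```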
- 2:03 Most RAG tutorials show you the happy path: take some docs, embed them, retrieve, done. But production RAG is a different animal. Your retrieval misses relevant docs. Your chunks are the wrong size. Your users ask vague questions that don't match anything. This video covers the full production RAG architecture: six modules that take you from "it works in a demo" to "it works at scale."
- 2:30 Here's the full picture. A question comes in on the left. Before it hits any data source, it goes through routing and query translation. Then it gets sent to one or more data sources: vector stores, relational databases, graph databases. The retrieved documents get ranked, refined, and sometimes re-retrieved. Then the generation step produces an answer, and it can loop back to re-retrieve if the quality isn't good enough. Six modules, each solving a specific failure mode.
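As a skeleton, that pipeline is a loop over those stages. A rough sketch where every function is a stub standing in for one of the six modules covered below; the control flow, not the stubs, is the point:

```python
def route(question): return "vector_store"            # routing
def translate(question): return [question]            # query translation
def search(source, queries): return ["chunk-1"]       # data sources
def rank(question, docs): return docs                 # re-ranking / refinement
def generate(question, docs): return "draft answer"   # generation
def good_enough(answer, docs): return True            # grounding check

def answer(question, max_loops=2):
    draft = ""
    for _ in range(max_loops):
        source = route(question)
        queries = translate(question)
        docs = rank(question, search(source, queries))
        draft = generate(question, docs)
        if good_enough(draft, docs):
            return draft               # grounded: return it
    return draft                       # best effort once the loop budget is spent

print(answer("What is the refund policy for annual plans?"))
```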
- 3:03 Not all your data lives in a vector store. Take this example. A user asks, "What was Q3 revenue for the enterprise segment?" That question needs a SQL query, not a similarity search. The LLM parses the intent, extracts the metric, the time period, and the segment. Then it generates the SQL: SELECT SUM(amount) FROM sales WHERE quarter = 'Q3 2024' AND segment = 'enterprise'. The query hits your SQL database, the matching rows come back, and the total is 2.4 million. That's query construction: turning natural language into the right query for the right data source. Other approaches include text-to-Cypher for graph databases and self-query retrievers for automatic metadata filtering.
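A minimal sketch of that flow, using SQLite and a stubbed llm() call in place of a real model. The schema, prompt, and row values are assumptions for illustration:

```python
import sqlite3

def llm(prompt: str) -> str:
    # Placeholder: a real model would generate this SQL from the prompt.
    return ("SELECT SUM(amount) FROM sales "
            "WHERE quarter = 'Q3 2024' AND segment = 'enterprise'")

# Toy database standing in for your real sales warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL, quarter TEXT, segment TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1_200_000, "Q3 2024", "enterprise"),
    (1_200_000, "Q3 2024", "enterprise"),
    (500_000,   "Q3 2024", "smb"),
])

question = "What was Q3 revenue for the enterprise segment?"
sql = llm(f"Schema: sales(amount, quarter, segment)\nQuestion: {question}\nSQL:")
total = conn.execute(sql).fetchone()[0]
print(f"Total: {total / 1e6:.1f} million")  # Total: 2.4 million
```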
- 4:00 Real users ask vague questions. "How does auth work?" That could mean several different things. Multi-query fixes this. The LLM takes that vague question and generates three specific versions: "What authentication protocols does the API support?" "How do users log in and get session tokens?" "What is the OAuth 2.0 flow for third-party apps?" Each version matches different documents: the API reference, the login guide, the OAuth setup docs. You combine all three result sets, and now you have coverage that no single query could have achieved. Other techniques include RAG-Fusion for merging ranked results, decomposition for breaking complex questions into sub-questions, step-back prompting for asking more general versions first, and HyDE for retrieving with hypothetical answers. If you want to learn how to do this hands-on, I run free live sessions every Friday at noon Eastern. Scan the QR code on screen to join. Would love to see you there.
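A minimal sketch of the multi-query pattern: expand the vague question into specific variants, retrieve for each, and union the results. Both expand() and retrieve() are placeholders for an LLM call and a vector search, and the document names are made up:

```python
def expand(question: str) -> list:
    # Placeholder: a real LLM would generate these specific variants.
    return [
        "What authentication protocols does the API support?",
        "How do users log in and get session tokens?",
        "What is the OAuth 2.0 flow for third-party apps?",
    ]

def retrieve(query: str) -> list:
    # Placeholder vector search over a fake index of document names.
    fake_index = {
        "authentication protocols": ["api-reference"],
        "log in": ["login-guide"],
        "oauth": ["oauth-setup-docs"],
    }
    q = query.lower()
    return [doc for key, docs in fake_index.items() if key in q for doc in docs]

question = "How does auth work?"
combined = []
for variant in expand(question):
    for doc in retrieve(variant):
        if doc not in combined:   # union the result sets, de-duplicated
            combined.append(doc)
print(combined)  # coverage no single query would have achieved
```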
- 5:10 Different questions need different data sources. "What was last quarter's revenue?" That's financial data; it lives in a SQL database. The LLM sees the intent, classifies it as financial data, and routes it to the sales database. Now take a different question: "Explain the refund policy." That's a policy question. The router classifies it as unstructured text and sends it to the vector store instead. Two questions, two completely different data sources. The router makes sure each one lands in the right place. Semantic routing is another approach: instead of the LLM classifying, you embed the question and match it against predefined prompt templates.
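A minimal sketch of the semantic variant: embed the question, embed a short description of each route, and pick the closest. The embed() stub stands in for a real embedding model, so the printed routes are only meaningful with real embeddings:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding (deterministic random); use a real model in practice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

# One short description per destination; these get embedded once, up front.
routes = {
    "sales_db":     "revenue, quarterly financials, structured sales metrics",
    "vector_store": "policies, documentation, unstructured support text",
}
route_vectors = {name: embed(desc) for name, desc in routes.items()}

def route(question: str) -> str:
    q = embed(question)
    return max(route_vectors, key=lambda name: float(q @ route_vectors[name]))

# With real embeddings, the first lands in sales_db, the second in vector_store.
print(route("What was last quarter's revenue?"))
print(route("Explain the refund policy."))
```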
- 5:59 Chunk size makes or breaks your retrieval. Take a refund policy page. If you embed the whole page as one chunk, that's 2,000 tokens. When someone searches "annual plan refund," the embedding matches loosely: similarity score 0.61. The chunk has too much unrelated content, diluting the match. Now split the same document into paragraph-level chunks, 300 tokens each. The same search now hits section 4.2 directly: similarity score 0.94. Same match, same document, same query. The only difference is chunk size. Other indexing strategies include multi-representation indexing, which stores both a summary and the full document, specialized embeddings tuned for your domain, and RAPTOR, which builds a tree of summaries at different abstraction levels.
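A minimal sketch of paragraph-level chunking with a token budget. The 300-token target comes from the example above; the whitespace split is a crude stand-in for a real tokenizer, and the page text is invented:

```python
def chunk_by_paragraph(document: str, max_tokens: int = 300) -> list:
    """Greedily pack paragraphs into chunks of at most ~max_tokens."""
    chunks, current, count = [], [], 0
    for para in document.split("\n\n"):
        tokens = len(para.split())  # whitespace count as a rough token proxy
        if current and count + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

page = ("4.1 Pricing tiers and billing cycles ...\n\n"
        "4.2 Annual plan refunds are pro-rated minus two months of usage.\n\n"
        "4.3 Enterprise support terms ...")
for i, chunk in enumerate(chunk_by_paragraph(page)):
    print(i, chunk[:50])
```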
- 6:57 Vector similarity is not the same as relevance. A user asks how to cancel an annual subscription. Vector search returns five documents: pricing tiers, billing FAQ, support guide, cancellation policy, and onboarding docs, all scored by cosine similarity. The cancellation policy is ranked fourth at 0.83. It should be first. A re-ranker fixes this. It takes those five results and re-scores them by actual relevance to the question. The cancellation policy jumps from number four to number one. The pricing page drops to four. Same results, better ordering. That's the difference between finding a result and finding the right result. Other techniques include CRAG for evaluating retrieval quality and active retrieval for going back to search again when the results are not good enough.
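A minimal sketch of that two-stage pattern: take the vector-search candidates in their original order, then re-sort them with a second relevance score. The relevance() stub stands in for a cross-encoder, and its scores are made up for illustration:

```python
def relevance(question: str, doc: str) -> float:
    # Placeholder for a cross-encoder that scores (question, document)
    # pairs jointly; these scores are invented for the example.
    made_up_scores = {
        "cancellation policy": 0.97,
        "billing FAQ": 0.71,
        "support guide": 0.66,
        "pricing tiers": 0.41,
        "onboarding docs": 0.22,
    }
    return made_up_scores.get(doc, 0.0)

question = "How do I cancel an annual subscription?"
# Candidates in vector-search order: cancellation policy ranked fourth.
candidates = ["pricing tiers", "billing FAQ", "support guide",
              "cancellation policy", "onboarding docs"]

reranked = sorted(candidates, key=lambda d: relevance(question, d), reverse=True)
print(reranked)  # cancellation policy first: same results, better ordering
```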
- 7:56 The LLM generates an answer, but how do you know it's grounded in the actual documents? Self-RAG adds a verification loop. Here's how it works. A user asks, "What's the cancellation fee for annual plans?" The LLM generates, "There is typically a 20% cancellation fee." Sounds reasonable, but the evaluation step checks that claim against the retrieved source documents. The source says pro-rated refund, not cancellation fee. The answer is not grounded. So the system re-retrieves. This time it pulls the exact clause from the cancellation policy. The LLM generates again: pro-rated refund minus two months of usage. The evaluation step checks again. This time it matches the source, so we return the answer. Another technique is Rewrite-Retrieve-Read (RRR), which rewrites the query, retrieves again, and reads the new results in a similar loop.
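A minimal sketch of that generate-evaluate-re-retrieve loop. All three helpers are placeholders wired to reproduce the example above; a real system would call a retriever, an LLM, and a grounding evaluator:

```python
def retrieve(question: str, attempt: int) -> list:
    # Placeholder: the second attempt pulls the exact cancellation clause.
    if attempt == 0:
        return ["Annual plans are billed once per year."]
    return ["Refunds on annual plans are pro-rated minus two months of usage."]

def generate(question: str, sources: list) -> str:
    # Placeholder LLM: hallucinates unless the right clause is in context.
    if any("pro-rated" in s for s in sources):
        return "You get a pro-rated refund minus two months of usage."
    return "There is typically a 20% cancellation fee."

def grounded(answer: str, sources: list) -> bool:
    # Placeholder evaluator: is the key claim supported by a source?
    return "pro-rated" in answer and any("pro-rated" in s for s in sources)

question = "What's the cancellation fee for annual plans?"
answer = ""
for attempt in range(3):  # cap the loop so it can't spin forever
    sources = retrieve(question, attempt)
    answer = generate(question, sources)
    if grounded(answer, sources):
        break  # the answer matches the source: return it
print(answer)
```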
- 8:59 Here's the complete production RAG architecture. A question comes in, and routing sends it to the right place. Query translation refines the question. The right data sources get searched. Results get ranked and refined. The LLM generates an answer. And if the answer isn't good enough, the system loops back. Six modules, each independently improvable. Build it incrementally.
- 9:26 That's the full picture. If you want to go deeper, join my free live session every Friday at noon Eastern on Maven. I walk through this hands-on, answer questions, and show you how to build it yourself. Scan the QR code to join.
Want the next one in your inbox?
Join 1,000+ Product Managers getting one deep dive every Friday.