
Part of the AI Engineering course I’m doing covered RAFT (Retrieval-Augmented Fine-Tuning), but I really didn’t understand it.
What if your LLM has access to the right documents, but it still gives lousy answers? Maybe it gets distracted by irrelevant information. Maybe it can’t tell which document actually has the answer. Maybe it just makes stuff up anyway.
RAFT (Retrieval-Augmented Fine-Tuning) is a training technique that helps fix this. Instead of just memorizing facts through fine-tuning, you teach the model to be a better reader of retrieved context.
The abstract from the original paper is a good starting point:
Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally incorporate new information into the pretrained model either through RAG-based-prompting, or finetuning. However, the best methodology to incorporate information remains an open question. In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe which improves the model’s ability to answer questions in “open-book” in-domain settings. In training RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don’t help in answering the question, which we call, distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document to help answer the question. This coupled with RAFT’s chain-of-thought-style response helps improve the model’s ability to reason. In domain specific RAG, RAFT consistently improves the model’s performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs to in-domain RAG.
The Core Idea
Standard fine-tuning updates model weights based on training data. You’re essentially burning knowledge into the model’s neurons. RAFT does something different: it trains the core LLM to navigate and extract information from documents that get retrieved at runtime.
Think of it this way:
- Standard fine-tuning: Teaching someone facts from a textbook. Better knowledge.
- RAFT: Teaching someone how to skim through a library and find the right answer. Better reading of retrieved context.
What Makes RAFT Different
The key insight is in how you structure the training data. For each training example, you include:
- A question
- The document with the answer (the relevant one)
- Several distractor documents (irrelevant ones)
- The expected answer with a citation
This forces the model to learn discrimination. Not every document that looks relevant actually helps. The model needs to figure out which one has the goods. Here’s what that looks like in a prompt:
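Something like this (a sketch with made-up wiki content; the ##begin_quote## / ##end_quote## citation style follows the RAFT paper, but the exact wording of the instructions is up to you):

```python
# Illustrative sketch of one RAFT training example. The question and documents
# are made up; Doc 1 is the relevant ("oracle") document, Docs 2 and 3 are distractors.
prompt = """Answer the question using the documents below. Quote the passage you relied on.

[Doc 1] Billing runbook: API keys are rotated in the ops console under Settings > Credentials.
[Doc 2] Billing architecture overview: the service is split into ingestion and invoicing.
[Doc 3] On-call policy: page the secondary engineer after 15 minutes.

Question: How do I rotate the API keys for the billing service?
Answer:"""

# Target output: chain-of-thought reasoning that cites the relevant document
# verbatim and ignores the two distractors.
target = (
    "The runbook covers key rotation: ##begin_quote##API keys are rotated in the "
    "ops console under Settings > Credentials##end_quote##. "
    "So open the ops console, go to Settings > Credentials, and rotate the key there."
)
```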
Build Time vs Runtime
RAFT happens at build time. You:
- Create training data with questions, relevant docs, and distractors (the unhelpful ones)
- Fine-tune a model on this data. This is where I kept getting confused: you’re not retraining your main general-purpose LLM; you’re fine-tuning a dedicated model whose job is to read the retrieved documents and answer from them.
- Save the fine-tuned model
At runtime, you just use the resulting model in your normal RAG pipeline. No extra overhead.
Good in theory, but in our class we needed to code something up. This is not tested and not supposed to be anything like a production setup: it’s a learning example.
A Practical Example
Let’s say you have a wiki knowledge base with markdown articles. Here’s how you’d use RAFT with open source tools:
First, build your vector DB from your data. In this case, that’s all of the wiki articles.
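Here’s a minimal sketch of what I mean, using ChromaDB. The paths, collection name, and file layout are just placeholders from my toy setup, and I lean on ChromaDB’s default embedding function to keep it short:

```python
# Minimal sketch: index wiki markdown articles in ChromaDB.
# Whole articles are indexed here; chunking comes up in a minute.
from pathlib import Path
import chromadb

client = chromadb.PersistentClient(path="./wiki_vectordb")
collection = client.get_or_create_collection("wiki_articles")

for md_file in Path("./wiki").glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    collection.add(
        ids=[md_file.stem],                 # e.g. "billing-runbook"
        documents=[text],
        metadatas=[{"title": md_file.stem}],
    )
```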
Creating Training Data
The critical piece: you need question-answer pairs that map to specific articles. You have options:
Option 1: Manual curation (highest quality, most work)
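Something like this, with made-up questions. The field names (including correct_article_id, which I’ll come back to) are just what I used in my example:

```python
# Hand-curated QA pairs. Each pair points at the article (later: the chunk)
# that contains the answer via correct_article_id.
qa_pairs = [
    {
        "question": "How do I rotate the API keys for the billing service?",
        "answer": "Open the ops console, go to Settings > Credentials, and rotate the key.",
        "correct_article_id": "billing-runbook",
    },
    {
        "question": "What is the on-call escalation time for the payments team?",
        "answer": "Page the secondary engineer after 15 minutes.",
        "correct_article_id": "on-call-policy",
    },
]
```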
This bit really confused me.
What is the correct_article_id there?
Because as far as I know, wiki articles are long and would need to be split into smaller chunks for better context understanding and to avoid exceeding token limits.
The answer was obvious in retrospect: you do actually have to use the chunked article ID in training rather than some pointer to the original article.
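For example, a naive character-window chunker that bakes the chunk index into the ID (real chunking, by headings or token counts, is its own design decision):

```python
# Sketch: split an article into overlapping chunks and give each chunk its own ID,
# so correct_article_id in the QA pairs can point at a chunk, not a whole article.
def chunk_article(article_id: str, text: str, chunk_size: int = 1000, overlap: int = 200):
    chunks = []
    step = chunk_size - overlap
    for i, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "chunk_id": f"{article_id}#chunk{i}",   # e.g. "billing-runbook#chunk3"
            "text": text[start:start + chunk_size],
        })
    return chunks
```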
This led me down a helpful rabbit hole of what you choose to put into the vector DB, because metadata like original_content_source, publish_date, etc., is used for filtering, ranking, and discriminating content. But vector DB design is a topic for a different day 😅
Anyway, back to options for creating training data for the RAG model.
Option 2: Synthetic generation (faster, decent quality)
Notice the specific structure the generator is asked to produce in the sketch below? That’s one of the key insights of the RAFT methodology: chain-of-thought reasoning plus a verbatim quote from the source document.
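Here’s roughly how the generation step could look. I’m assuming the OpenAI Python client and a placeholder model name; any instruction-following model will do. The important part is the prompt: it asks for reasoning, a verbatim quote wrapped in ##begin_quote## / ##end_quote##, and then a short final answer.

```python
# Sketch of synthetic QA generation from wiki chunks.
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = """Read this wiki excerpt and write one question a user might ask,
plus an answer that (1) reasons step by step, (2) quotes the exact supporting
sentence between ##begin_quote## and ##end_quote##, and (3) ends with a short
final answer.

Excerpt:
{chunk}

Return JSON with keys "question" and "answer"."""

def generate_qa(chunk_text: str) -> str:
    # Placeholder model name; swap in whatever you have access to.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": GEN_PROMPT.format(chunk=chunk_text)}],
    )
    return resp.choices[0].message.content
```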
Option 3: Real user queries (if you have logs)
Use actual questions from users and label which article has the answer. If users give a 👍 or 👎 to an answer and that gets logged, you’ve got instant labelled data for your RAFT training.
Building the Training Dataset
Once you have QA pairs, format them for RAFT:
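A sketch of the formatting step, assuming a chunks_by_id dict (chunk ID → text) and the qa_pairs from earlier, with the IDs updated to chunk IDs. Each row pairs the correct chunk with a few distractors and shuffles them so the model can’t learn that the answer is always in the first document:

```python
# Sketch: turn QA pairs into RAFT training rows (prompt + completion).
import json
import random

NUM_DISTRACTORS = 3

def build_raft_rows(qa_pairs, chunks_by_id):
    rows = []
    all_ids = list(chunks_by_id)
    for pair in qa_pairs:
        oracle_id = pair["correct_article_id"]
        distractor_ids = random.sample(
            [c for c in all_ids if c != oracle_id], NUM_DISTRACTORS
        )
        doc_ids = [oracle_id] + distractor_ids
        random.shuffle(doc_ids)  # don't let "the answer is always Doc 1" leak in
        docs = "\n\n".join(
            f"[Doc {i + 1}] {chunks_by_id[d]}" for i, d in enumerate(doc_ids)
        )
        prompt = (
            f"Answer the question using the documents below.\n\n{docs}\n\n"
            f"Question: {pair['question']}\nAnswer:"
        )
        rows.append({"prompt": prompt, "completion": " " + pair["answer"]})
    return rows

with open("raft_train.jsonl", "w", encoding="utf-8") as f:
    for row in build_raft_rows(qa_pairs, chunks_by_id):
        f.write(json.dumps(row) + "\n")
```

The RAFT paper also mixes in a fraction of examples with no oracle document at all, which pushes the model to rely on what is actually in the context; I’ve left that out to keep the sketch short.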
Fine-Tuning
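A minimal fine-tuning sketch using Hugging Face’s TRL library. The base model and hyperparameters are placeholders, and the exact SFTTrainer / SFTConfig arguments vary between trl versions, so treat this as the shape of the step rather than a recipe:

```python
# Fine-tune a small causal LM on the RAFT-formatted JSONL.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="raft_train.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",       # placeholder base model
    train_dataset=dataset,                     # prompt/completion columns from the JSONL
    args=SFTConfig(output_dir="./raft-finetuned-model", num_train_epochs=3),
)
trainer.train()
trainer.save_model("./raft-finetuned-model")   # weights + config + tokenizer files
```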
Using It At Runtime
Whew. Now we have a fine-tuned model for our RAG data. Let’s use it! As you’ll see, this is an educational example rather than production-ready: we load the saved model from disk and use it to improve the answer to a question.
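A runtime sketch that reuses the toy ChromaDB collection from earlier: retrieve a handful of chunks, drop them into the same prompt format used in training, and let the fine-tuned model do the reading:

```python
# Runtime sketch: retrieve from ChromaDB, build the RAFT-style prompt,
# and generate with the fine-tuned model loaded from disk.
import chromadb
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./raft-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("./raft-finetuned-model")

client = chromadb.PersistentClient(path="./wiki_vectordb")
collection = client.get_or_create_collection("wiki_articles")

question = "How do I rotate the API keys for the billing service?"
results = collection.query(query_texts=[question], n_results=4)

docs = "\n\n".join(
    f"[Doc {i + 1}] {doc}" for i, doc in enumerate(results["documents"][0])
)
prompt = (
    f"Answer the question using the documents below.\n\n{docs}\n\n"
    f"Question: {question}\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```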
What Lives Where
This confused me at first, so here’s the breakdown:
Vector database (ChromaDB, Pinecone, etc.):
- Your wiki articles as embeddings
- Metadata (titles, IDs)
- Lives wherever you run your database
Fine-tuned model (saved to disk):
- Modified weights (the “smart reader”)
- Lives at ./raft-finetuned-model/ (well, in my educational example at least)
- Contains pytorch_model.bin, config.json, tokenizer files
At runtime: Vector DB retrieves articles → Fine-tuned model reads them → Answer comes out
Key Insights
Distractors matter: The whole point is teaching the model to ignore irrelevant information. Your distractors should be semantically similar to the question (that’s why they got retrieved) but shouldn’t contain the answer.
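One way to get harder distractors than random sampling is to mine them with the retriever itself: take the chunks that score closest to the question and drop the one that actually contains the answer. A quick sketch against the toy collection from earlier:

```python
# Sketch: mine "hard" distractors, i.e. chunks semantically close to the question
# that are not the chunk containing the answer.
def hard_distractors(question, oracle_id, collection, k=3):
    results = collection.query(query_texts=[question], n_results=k + 1)
    pairs = zip(results["ids"][0], results["documents"][0])
    return [doc for chunk_id, doc in pairs if chunk_id != oracle_id][:k]
```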
This is different from RAG alone: Regular RAG just stuffs documents into context and hopes for the best. RAFT trains the model to handle that context intelligently.
Well, maybe ‘hopes for the best’ is a bit harsh. Good vector DB design and query preparation play a crucial role in ensuring that the model receives relevant and accurate information. It’s not all ‘hope’! It’s a combination of smart design and careful preparation.
You still need good QA pairs: The quality of your training data matters. Start with 50-100 manually curated examples for critical articles, then scale with synthetic generation. Again - that’s for my educational example. If you’ve got user data then you probably have a lot more to work with.
Recommendation
I was having a deep conversation with Claude about this. Deep! When should I fine-tune the original LLM, and when should I use RAFT to fine-tune the RAG setup?
Here’s why RAFT could make sense for something like a wiki:
- Wiki content changes frequently → RAG updates easily without retraining
- Attribution matters → You can cite which article the answer came from
- Scalability → Adding new articles is trivial because you don’t have to retrain (adjust the weights of) the main LLM.