RAFT – A Fine-Tuning and RAG Approach to Domain-Specific Question Answering

As the applications of large language models expand into specialized domains, the need for efficient and effective adaptation techniques becomes increasingly crucial. Enter RAFT (Retrieval Augmented Fine Tuning), a novel approach that combines the strengths of retrieval-augmented generation (RAG) and fine-tuning, tailored specifically for domain-specific question answering tasks.

The Challenge of Domain Adaptation

While LLMs are pre-trained on vast amounts of data, their ability to perform well in specialized domains, such as medical research, legal documentation, or enterprise-specific knowledge bases, is often limited. This limitation arises because the pre-training data may not adequately represent the nuances and intricacies of these specialized domains. To address this challenge, researchers have traditionally employed two main techniques: retrieval-augmented generation (RAG) and fine-tuning.

Retrieval-Augmented Generation (RAG)

RAG is a technique that enables LLMs to access and utilize external knowledge sources during inference.

It achieves this by integrating real-time data retrieval into the generative process, making the model’s outputs more accurate and up-to-date. RAG consists of three core steps: retrieval, where relevant documents are gathered; augmentation, where the retrieved content is combined with the user’s query into an enriched prompt; and generation, where the model produces an output conditioned on that prompt.

The retrieval process in RAG starts with a user’s query. A retriever analyzes the query and fetches pertinent information from external sources, producing a pool of documents from which the model can draw to formulate its response. The augmentation step then places these documents alongside the query in the model’s context. Finally, the generation phase synthesizes this input into a coherent narrative or answer.
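The three steps above can be sketched in a few lines of Python. This is a toy illustration only: the word-overlap scorer stands in for a real retriever, and the `generate` function is a placeholder for an actual LLM call.

```python
# Toy sketch of the retrieve -> augment -> generate loop in RAG.
# The corpus, overlap-based scoring, and generate() stub are
# illustrative stand-ins, not a real retriever or language model.

def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query; return top k."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query, docs):
    """Build a prompt that places retrieved context before the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Placeholder for an LLM call at inference time."""
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"

corpus = [
    "RAFT combines retrieval augmentation with fine-tuning.",
    "Distractor documents do not contain the answer.",
    "Paris is the capital of France.",
]
docs = retrieve("What does RAFT combine?", corpus)
print(generate(augment("What does RAFT combine?", docs)))
```

Swapping the overlap scorer for a dense embedding model and the stub for a chat-completion API call turns this skeleton into a working RAG pipeline.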

RAG models can be evaluated using a variety of metrics, such as answer accuracy, retrieval recall, and faithfulness of the output to the retrieved context, assessing their ability to provide accurate, relevant, and up-to-date information.
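One common retrieval-side metric is recall@k: the fraction of queries for which the gold document appears among the top-k retrieved results. A minimal sketch, with illustrative placeholder rankings:

```python
# Sketch of recall@k, a common retrieval-side metric for RAG systems.
# The ranked-ID lists below are illustrative placeholders.

def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold document appears in the top k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, gold_id) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)

runs = [
    (["d3", "d1", "d7"], "d1"),  # gold document ranked 2nd -> hit at k=2
    (["d5", "d2", "d9"], "d9"),  # gold document ranked 3rd -> miss at k=2
]
print(mean_recall_at_k(runs, k=2))  # 0.5
```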

Fine-Tuning

Fine-tuning, on the other hand, involves adapting a pre-trained LLM to a specific task or domain by further training it on a smaller, task-specific dataset. This approach allows the model to learn patterns and align its outputs with the desired task or domain. While fine-tuning can improve the model’s performance, it often fails to effectively incorporate external knowledge sources or account for retrieval imperfections during inference.
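In supervised fine-tuning, each training example is typically a prompt paired with a target answer, and the loss is computed only on the answer tokens. The sketch below shows this layout with a whitespace "tokenizer" standing in for a real subword tokenizer; the `-100` ignore value mirrors the convention used by common training frameworks.

```python
# Minimal sketch of how a supervised fine-tuning example is often laid out:
# prompt tokens are masked out of the loss so only answer tokens are trained.
# Whitespace splitting is a stand-in for a real subword tokenizer.

IGNORE_INDEX = -100  # label value conventionally skipped by the loss function

def build_example(prompt, answer):
    prompt_tokens = prompt.split()
    answer_tokens = answer.split()
    input_tokens = prompt_tokens + answer_tokens
    # Supervise only the answer span; ignore the prompt span.
    labels = [IGNORE_INDEX] * len(prompt_tokens) + answer_tokens
    return input_tokens, labels

tokens, labels = build_example(
    "Question: what is the capital of France ? Answer:",
    "Paris",
)
print(labels)
```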

The RAFT Approach

RAFT, short for Retrieval Augmented Fine-Tuning, is an innovative training method tailored for language models to enhance their performance in domain-specific tasks, particularly in open-book exam settings. RAFT diverges from standard fine-tuning by preparing training data that pairs questions with a mix of relevant and non-relevant documents, along with chain-of-thought styled answers derived from the relevant texts. This method aims to improve models’ abilities not only to recall information but also to reason over and derive answers from the provided content.

In essence, RAFT fine-tunes language models to be more proficient in tasks that involve reading comprehension and knowledge extraction from a set of documents. By training with both “oracle” documents (which contain the answer) and “distractor” documents (which do not), the model learns to discern and utilize relevant information more effectively.

Training Data Preparation

Under RAFT, a proportion of the training examples pair each question with an oracle document that directly relates to the answer (plus distractors), while the remaining examples contain only distractor documents. This fine-tuning encourages the model to learn when to rely on its internal knowledge (akin to memorization) and when to extract information from the context provided.
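The oracle/distractor mixing described above can be sketched as follows. The fraction `p`, the number of distractors, and the field names are illustrative choices, not the paper's exact schema.

```python
# Sketch of RAFT-style training-data construction: a fraction p of examples
# keep the oracle document alongside distractors; the rest contain only
# distractors, pushing the model toward its memorized knowledge when the
# answer is absent from context. Parameter values and field names are
# illustrative assumptions.
import random

def build_raft_dataset(qa_pairs, distractor_pool, p=0.8, n_distract=3, seed=0):
    rng = random.Random(seed)
    dataset = []
    for question, answer, oracle_doc in qa_pairs:
        distractors = rng.sample(distractor_pool, n_distract)
        if rng.random() < p:
            context = [oracle_doc] + distractors  # oracle present
        else:
            context = list(distractors)           # distractors only
        rng.shuffle(context)  # avoid positional shortcuts
        dataset.append({"question": question, "context": context, "answer": answer})
    return dataset

qa = [("Q1", "A1", "oracle-1"), ("Q2", "A2", "oracle-2")]
pool = ["dist-a", "dist-b", "dist-c", "dist-d"]
data = build_raft_dataset(qa, pool)
hits = sum(any(c.startswith("oracle") for c in d["context"]) for d in data)
print(f"{hits} of {len(data)} examples contain the oracle document")
```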

RAFT’s training regimen also emphasizes the generation of reasoning processes, which not only help in forming the answer but also cite sources, similar to how a human would justify their response by referencing material they have read. This approach not only prepares the model for a RAG (Retrieval Augmented Generation) setting where it has to consider top-k retrieved documents but also ensures the model’s training is independent of the retriever used, allowing for flexible application across different retrieval systems.

This approach serves multiple purposes:

  1. It trains the model to identify and utilize relevant information from the provided context, mimicking the open-book exam setting.
  2. It enhances the model’s ability to disregard irrelevant information, a critical skill for effective RAG.
  3. It exposes the model to scenarios where the answer is not present in the context, encouraging it to rely on its own knowledge when necessary.

Another key aspect of RAFT is the incorporation of chain-of-thought reasoning into the training process. Instead of simply providing the question and answer pairs, RAFT generates detailed reasoning explanations that include verbatim citations from the relevant documents. These explanations, presented in a chain-of-thought format, guide the model through the logical steps required to arrive at the correct answer.

By training the model on these reasoning chains, RAFT encourages the development of strong reasoning abilities and enhances the model’s understanding of how to effectively leverage external knowledge sources.
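A chain-of-thought training target of this kind might be assembled as below. The quote markers and template are illustrative assumptions, not the paper's exact formatting.

```python
# Sketch of a chain-of-thought training target in the spirit of RAFT:
# the reasoning quotes the oracle document verbatim before stating the
# final answer. The template and markers are illustrative, not the
# paper's exact format.

def format_cot_target(citation, reasoning, answer):
    return (
        f"The context states: \"{citation}\"\n"
        f"Reasoning: {reasoning}\n"
        f"Final answer: {answer}"
    )

target = format_cot_target(
    citation="RAFT trains with both oracle and distractor documents.",
    reasoning="The quoted sentence directly describes the training mix, "
              "so the answer can be read off from it.",
    answer="oracle and distractor documents",
)
print(target)
```

Training on targets like this teaches the model to justify its answers with verbatim evidence rather than emitting the answer alone.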

Evaluation and Results

The authors of the RAFT paper conducted extensive evaluations on various datasets, including PubMed (biomedical research), HotpotQA (open-domain question answering), and the Gorilla APIBench (code generation). Their results demonstrated that RAFT consistently outperformed baselines, such as domain-specific fine-tuning with and without RAG, as well as larger models like GPT-3.5 with RAG.

RAFT improves RAG performance

For instance, on the HuggingFace subset of Gorilla APIBench, RAFT achieved an accuracy of 74%, an improvement of 31.41 percentage points over domain-specific fine-tuning (DSF) and 44.92 points over GPT-3.5 with RAG. Similarly, on the HotpotQA dataset, RAFT exhibited a 28.9-point accuracy gain over DSF.

One of the key advantages of RAFT is its robustness to retrieval imperfections. By training the model with a mix of relevant and irrelevant documents, RAFT enhances the model’s ability to discern and prioritize relevant information, even when the retrieval module returns suboptimal results.

The authors demonstrated that fine-tuning with only the oracle documents often leads to inferior performance compared to configurations that include distractor documents. This finding underscores the importance of exposing the model to varying retrieval scenarios during training, ensuring its preparedness for real-world applications.

Practical Applications and Future Directions

The RAFT technique has significant implications for a wide range of practical applications, including:

  1. Question Answering Systems: RAFT can be employed to build highly accurate and domain-specific question answering systems, leveraging both the model’s learned knowledge and external knowledge sources.
  2. Enterprise Knowledge Management: Organizations with large knowledge bases can leverage RAFT to develop customized question answering systems, enabling employees to quickly access and utilize relevant information.
  3. Medical and Scientific Research: RAFT can be particularly valuable in domains such as biomedical research, where access to the latest findings and literature is crucial for advancing scientific understanding.
  4. Legal and Financial Services: RAFT can assist professionals in these fields by providing accurate and context-aware responses based on relevant legal documents or financial reports.

As research in this area continues, we can expect further advancements and refinements to the RAFT technique. Potential future directions include:

  1. Exploration of more efficient and effective retrieval modules, tailored for specific domains or document structures.
  2. Integration of multi-modal information, such as images or tables, into the RAFT framework for enhanced context understanding.
  3. Development of specialized reasoning architectures that can better leverage the chain-of-thought explanations generated during training.
  4. Adaptation of RAFT to other natural language tasks beyond question answering, such as summarization, translation, or dialogue systems.

Conclusion

RAFT represents a significant leap forward in the field of domain-specific question answering with language models. By harmoniously blending the strengths of retrieval-augmented generation and fine-tuning, RAFT equips LLMs with the ability to effectively leverage external knowledge sources while also aligning their outputs with domain-specific patterns and preferences.

Through its innovative training data curation, incorporation of chain-of-thought reasoning, and robustness to retrieval imperfections, RAFT offers a powerful solution for organizations and researchers seeking to unlock the full potential of LLMs in specialized domains.

As the demand for domain-specific natural language processing capabilities continues to grow, techniques like RAFT will play a pivotal role in enabling more accurate, context-aware, and adaptive language models, paving the way for a future where human-machine communication becomes truly seamless and domain-agnostic.