Exploring key concepts of one of the most popular methods in generative AI solutions.
Today we will discuss:
- An introduction to our new series about RAG.
- A deep dive into what many consider the first RAG paper.
💡 ML Concept of the Day: A New Series About Retrieval-Augmented Generation (RAG)
As our first new series of 2025, we would like to cover one of the simplest but most active areas in generative AI. We are talking about retrieval-augmented generation or, as we often refer to it, RAG.
Conceptually, RAG is an architectural framework that enhances the functionality of large language models (LLMs) by incorporating external data retrieval mechanisms. This integration gives LLMs access to current, relevant information, addressing a key limitation of traditional generative models that rely solely on static training data. By retrieving pertinent documents or data points in response to a specific query, RAG grounds the generated output in source material, which makes responses more contextually appropriate and reduces, though does not eliminate, outdated or erroneous information. This capability is particularly beneficial in applications such as customer support and knowledge management, where timely and precise responses are critical.
The primary methods employed in RAG involve a two-stage process: first, retrieving relevant information from a curated set of external sources, and second, utilizing this information to inform the generation of responses. This dual approach allows RAG to dynamically augment the generative capabilities of LLMs with up-to-date context, enhancing their performance across various tasks. Techniques such as vector-based retrieval and query expansion are commonly used to improve the relevance and accuracy of the retrieved information. Furthermore, RAG systems can be designed to include mechanisms for citation and source attribution, enabling users to verify the accuracy of the generated content and fostering trust in AI outputs.
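To make the two-stage process concrete, here is a minimal sketch in plain Python. It is illustrative only: the bag-of-words `embed` function stands in for a learned embedding model, the three-document corpus is made up, and `retrieve` and `build_prompt` are hypothetical helper names rather than part of any library. The second stage stops at building the prompt, since the choice of LLM is up to you.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Stage 1: rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Stage 2 input: numbered passages so the generated answer can cite its sources."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n"
        f"Question: {query}"
    )


# Made-up corpus for illustration.
corpus = [
    "Refunds are processed within 5 business days of the return being received.",
    "Our support team is available on weekdays from 9am to 5pm.",
    "Premium plans include priority support and a dedicated account manager.",
]
query = "How long do refunds take?"
passages = retrieve(query, corpus, k=1)
prompt = build_prompt(query, passages)
print(prompt)  # this string would be sent to the LLM of your choice
```

Numbering the passages in the prompt is one simple way to support the citation and source attribution mentioned above: the model can refer to `[1]`, and the application can map that number back to the original document.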
Despite its advantages, implementing RAG poses several challenges that organizations must navigate. One significant hurdle is the complexity of integrating retrieval systems with generative models, which requires specialized knowledge in both natural language processing and information retrieval. Additionally, the effectiveness of a RAG system is heavily dependent on the quality and reliability of the external data sources it utilizes; poor-quality data can lead to misleading outputs or propagate inaccuracies. Latency issues can also arise during retrieval operations, particularly when accessing large datasets or multiple sources simultaneously, potentially impacting user experience in time-sensitive applications.
Throughout this series, we will be exploring the core RAG methods as well as relevant research in the space.
🔎 ML Research You Should Know About: The Original RAG Paper
- In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, researchers from Facebook AI Research (FAIR), University College London, and New York University introduced the original concepts for RAG applications.
- Why is it so important? This paper inspired the RAG movement popularized with foundation models.
This research paper presents a novel approach called Retrieval-Augmented Generation (RAG) for improving the performance of pre-trained language models on knowledge-intensive tasks. RAG models combine the strengths of parametric memory (knowledge stored in model parameters) with non-parametric memory (knowledge retrieved from an external source like Wikipedia) to generate more factual, specific, and diverse responses. The core of the RAG model consists of a pre-trained seq2seq model like BART for parametric memory and a dense vector index of Wikipedia accessed with a pre-trained neural retriever like DPR for non-parametric memory. The retriever identifies relevant documents based on the input, and the seq2seq model generates the output based on both the input and the retrieved documents.
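The retrieval step can be sketched in a few lines. In the paper, a DPR-style retriever scores each document by the inner product between a query vector and a document vector, and the retrieval probability is proportional to the exponential of that score. The snippet below is a toy version under loud assumptions: the three-dimensional vectors are invented rather than produced by BERT encoders, the search is a brute-force scan rather than an approximate nearest-neighbor index, and the probabilities are renormalized over the top-k documents only.

```python
import math


def inner_product(u, v):
    """Dense similarity score between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def retrieval_distribution(query_vec, doc_vecs, k):
    """Return {doc_id: probability} for the top-k documents by inner product.

    Probabilities are a softmax over the top-k scores, i.e. p(z|x) is taken
    to be proportional to exp(score) and renormalized over the retrieved set.
    """
    scores = {doc_id: inner_product(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    max_score = max(scores[d] for d in top_k)  # subtract the max for numerical stability
    weights = {d: math.exp(scores[d] - max_score) for d in top_k}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


# Invented vectors standing in for encoder outputs.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "paris": [0.9, 0.1, 0.8],
    "berlin": [0.2, 0.9, 0.1],
    "france": [0.7, 0.0, 0.9],
}
dist = retrieval_distribution(query_vec, doc_vecs, k=2)
for doc_id, p in dist.items():
    print(f"{doc_id}: {p:.3f}")
```

The resulting distribution over documents is what the generator later weighs its outputs by, which is how the retriever receives a training signal without any labels saying which document is correct.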
The paper explores two formulations of RAG: RAG-Sequence, where the same retrieved passage is used for the whole generated sequence, and RAG-Token, where a different passage can be used for each token. In both cases the retrieved documents are treated as a latent variable: the model weights the generator's predictions by each document's retrieval probability and sums over the top-k documents, which lets it combine knowledge from both memory types. The retriever's query encoder and the generator are fine-tuned jointly, end to end, without any direct supervision on which documents to retrieve, while the document encoder and its index are kept fixed.
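The difference between the two formulations is easiest to see with numbers. RAG-Sequence computes the probability of the whole output under each document and then averages, p(y|x) ≈ Σ_z p(z|x) Π_i p(y_i | x, z, y_<i), while RAG-Token averages over documents at every token, p(y|x) ≈ Π_i Σ_z p(z|x) p(y_i | x, z, y_<i). The sketch below evaluates both on invented probabilities for two documents and a two-token output; the function names and the numbers are ours, chosen so that each document supports a different token.

```python
import math


def rag_sequence(doc_prior, token_probs):
    """Sum over documents of p(z|x) times the product of per-token probabilities."""
    return sum(doc_prior[z] * math.prod(token_probs[z]) for z in doc_prior)


def rag_token(doc_prior, token_probs):
    """Product over tokens of the document-weighted per-token probability."""
    n_tokens = len(next(iter(token_probs.values())))
    return math.prod(
        sum(doc_prior[z] * token_probs[z][i] for z in doc_prior)
        for i in range(n_tokens)
    )


# Invented numbers: p(z|x) for two retrieved documents ...
doc_prior = {"doc_a": 0.6, "doc_b": 0.4}
# ... and p(y_i | x, z, y_<i) for the two target tokens under each document.
token_probs = {"doc_a": [0.9, 0.2], "doc_b": [0.1, 0.8]}

seq = rag_sequence(doc_prior, token_probs)
tok = rag_token(doc_prior, token_probs)
print(f"RAG-Sequence: {seq:.4f}")  # 0.6*0.9*0.2 + 0.4*0.1*0.8 = 0.1400
print(f"RAG-Token:    {tok:.4f}")  # (0.54 + 0.04) * (0.12 + 0.32) = 0.2552
```

In this toy case RAG-Token assigns the output a higher probability because it can lean on `doc_a` for the first token and `doc_b` for the second, whereas RAG-Sequence has to commit to one document for the entire sequence before averaging.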
A key contribution of this research is the demonstration of RAG’s effectiveness on a wide range of knowledge-intensive NLP tasks. On open-domain question answering, RAG models set the state of the art at the time of publication on three benchmarks, outperforming both parametric-only seq2seq models and task-specific retrieve-and-extract architectures. For abstractive question answering, RAG generates more factual responses with fewer hallucinations than a BART baseline. On the challenging task of Jeopardy question generation, RAG outperforms BART in both automatic and human evaluations, with human evaluators rating its output as more factual and more specific.
The paper also highlights the advantages of RAG’s non-parametric memory. For instance, the model’s knowledge can be updated by replacing the retrieval index without retraining, which the authors demonstrate by swapping between Wikipedia dumps from different dates and checking the model's answers to questions about world leaders. The authors also explore how the number of retrieved documents affects performance, showing that RAG-Token performance peaks with a limited number of documents while RAG-Sequence continues to improve as more are retrieved.
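The index hot-swapping idea can be illustrated with a deliberately simple sketch. Everything here is hypothetical: `FrozenReader` is a stand-in for a trained model whose parameters never change, retrieval is plain word overlap, and the country, names, and index years are invented. The point is only that the same unchanged component gives different answers when the index it reads from is replaced.

```python
import re


def tokens(text):
    """Lowercased word set used for a crude overlap-based retrieval."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class FrozenReader:
    """Stand-in for a trained model: nothing inside it changes between calls."""

    def answer(self, question, index):
        q = tokens(question)
        # Retrieve the passage that shares the most words with the question.
        return max(index, key=lambda passage: len(q & tokens(passage)))


# Two snapshots of a made-up knowledge source.
index_2016 = [
    "The president of Examplestan is Alice Ardent.",
    "Examplestan exports mostly copper.",
]
index_2018 = [
    "The president of Examplestan is Bob Brisk.",
    "Examplestan exports mostly copper.",
]

reader = FrozenReader()
question = "Who is the president of Examplestan?"
old_answer = reader.answer(question, index_2016)
new_answer = reader.answer(question, index_2018)
print(old_answer)
print(new_answer)
```

A purely parametric model would need retraining or fine-tuning to reflect the change, while here the update is a data operation on the index.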
Overall, this research demonstrates the potential of combining parametric and non-parametric memory in language models for tackling knowledge-intensive NLP tasks. RAG models show promise in various applications, including question answering, text generation, and fact verification. The ability to update the model’s knowledge without retraining, the generation of more factual and specific responses, and the flexibility of using different retrieval strategies for different tasks contribute to the significance of this research.