
RAG: Retrieval Augmented Generation, Explained

With the explosion of LLMs and Chat Assistants, researchers and users of these models quickly bump into a limiting factor:

What happens if my model has not been trained on the specific topic or dataset I'm interested in prompting it with?

If we do nothing then we're likely to get unhelpful responses at best, or factually incorrect hallucinations at worst.

However, there are some solutions to this problem:

  • Retrain or fine-tune our model with an extended dataset. Unfortunately, this can be difficult, slow, and expensive.
  • Add all the relevant information to our prompt along with our question so that the model can use this context to create an answer. This is not ideal either: it somewhat defeats the point if we have to manually assemble a lot of context every time we prompt the model.

What is RAG? How does it help?

Retrieval Augmented Generation (RAG) is a technique which automates the retrieval of relevant information from datastores connected to a language model, with the aim of improving the model's output. Ideally, the RAG technique eliminates:

  • The need for expensive fine tuning.
  • The need to add significant manual context to a prompt.

RAG has multiple stages. Many of these are partially or fully implemented in libraries and products surrounding emerging LLM solutions, though a naive RAG approach can be quite easy to implement yourself. Because of its simplicity and fast time to reasonable results, RAG is a fundamental technique in most LLM-based solutions emerging today.

In theory, RAG allows us to quickly pull in relevant context and produce more reliable answers. It opens up various enterprise datastores to immediate interactive querying, such as:

  • CRM systems
  • Logging platforms
  • Authentication directories
  • Internal wikis
  • Document stores

Almost any internal datastore could become part of an interactive knowledge base with RAG as a foundation. Indeed, this is what hyperscalers are beginning to present to their users and customers as the fundamentals of an AI-based enterprise. RAG also sidesteps the issue of training data going out of date by giving the model much easier access to up-to-date sources.

(Related reading: LLM security with the OWASP Top 10 threats to LLMs.)

How RAG works

RAG approaches the problem of adding the necessary additional context in much the same way a person does when asked a question they don't already know the answer to:

  1. Break the problem down and find relevant sources or sections of sources which are necessary to create an answer.
  2. Collect all the relevant information together.
  3. Produce an answer using both trained/inherent knowledge and the additional context retrieved.

In its simplest form, a RAG system has three components which map to those steps:

Step 1. Indexing

Indexing is the process of taking the set of documents or the datastore you would like your model to access, cleaning it, splitting it into appropriate chunks, and then embedding those chunks to form an index. This makes it easier for your model to find the most relevant parts of your datastore.

Indexes are often implemented as a vector database: paired with an appropriate embedding model, a vector database makes similarity search over text chunks very efficient.

(Related reading: data normalization.)
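To make this concrete, here is a minimal sketch of the indexing step in plain Python. The embed() function, the 384-dimension vectors, and the chunk sizes are illustrative placeholders, not a specific product's API; in practice you would call a real embedding model and store the vectors in a proper vector database.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding model. A real system would call an embedding
    model (via an API or a local library) and return its vector instead."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping, character-based chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Build a tiny in-memory "vector index": each entry pairs a chunk of text
# with its embedding. A production system would use a vector database here.
documents = ["...text of document one...", "...text of document two..."]
index = [(piece, embed(piece)) for doc in documents for piece in chunk(doc)]
```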

Step 2. Retrieval

Retrieval is the process of taking a user input query and using the index to find chunks of text in your datastore which are relevant to creating an answer.

This is achieved by transforming your input query into a vector using the same embedding model used at indexing time. That vector can then be used to find similar chunks of text in the indexed vector database, for example by gathering the top-k most similar chunks from your index.
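A minimal sketch of the retrieval step, continuing the illustrative index above (a list of chunk and vector pairs, plus the same placeholder embed() function):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, index: list[tuple[str, np.ndarray]], embed, top_k: int = 3) -> list[str]:
    """Embed the query with the same model used at indexing time and
    return the top_k most similar chunks from the index."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```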

Step 3. Generation

Lastly, generation takes the original query or prompt and combines it with the relevant chunks of information gathered via retrieval to produce a final prompt for the LLM or assistant.

In the best case this prompt contains all of the information needed for the LLM to produce an appropriate response.
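And a minimal sketch of the generation step. Here call_llm() is a stand-in for whichever model API you actually use, and the prompt template is just one illustrative way to combine the pieces:

```python
def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine the user's question with the retrieved chunks into one prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Putting the three steps together (embed, index and retrieve are the
# illustrative pieces sketched above; call_llm is a stand-in for your model):
# chunks = retrieve(user_query, index, embed, top_k=3)
# answer = call_llm(build_prompt(user_query, chunks))
```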

Benefits of RAG

RAG is beneficial in a number of LLM applications as it automates the retrieval of relevant information, which might otherwise not be available to the pre-trained model, for answering queries. It can be used to pull in proprietary or context-specific information without the need for expensive and slow model fine-tuning or re-training.

This allows organizations to use third-party models to answer questions on relevant data without the need to create their own models from scratch. (This is important in enabling more people, teams, and organizations to experiment with LLMs.)

It also potentially reduces the rate of hallucinations or unhelpful responses.

Drawbacks of RAG

Whilst the theory presented above is relatively simple to understand, the devil, as usual, is in the detail. There are a number of areas where RAG can be difficult to implement or can struggle to produce the best answers.

For example, depending upon the methods used to index and embed a datastore, the retrieval step may struggle to find either the most relevant chunks or all appropriate context needed to find an answer.

The retrieval and generation steps are also quite sensitive to the size of the chunks used. For instance:

  • Smaller chunks may not contain enough of the original document to synthesize an answer.
  • Larger chunks may provide so much extraneous information as to confuse or hide important context.

RAG also does not prevent hallucinations in responses, so it can still be difficult to fully trust model outputs.

(Related reading: principles for trustworthy AI.)

RAG and the AI explosion

OpenAI's CEO Sam Altman has noted in interviews that he was surprised by how quickly ChatGPT was adopted and grew. The expectation was that many enterprises would want to fine-tune models, and that creating fine-tuned models would therefore be a limiting factor for adoption.

Part of this unexpected explosion came from users realizing that answers could be obtained by augmenting queries with contextual information directly within prompts, the concept of prompt engineering.

RAG automates this process. Therefore, RAG is likely to be a fundamental component of almost all LLM systems used in enterprise situations, except where large effort is expended to specifically fine-tune or train models for specific tasks.

Trends & RAG variations

It has become clear to those using RAG and LLM systems that producing reliable and consistent answers remains challenging even with RAG. Today, there is a plethora of modified RAG approaches which attempt to improve on the naive approach in a number of ways.

Advanced RAG

Advanced RAG adds pre- and post-processing steps for data and prompts, adjusting them to better fit the data structures and models used in a specific case, thus improving answer accuracy and value.
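As an illustration, two common pre- and post-processing steps are query rewriting and re-ranking. The sketch below is purely schematic: llm and relevance_score are stand-ins for whichever models you use for those jobs, not a specific library's API.

```python
def rewrite_query(query: str, llm) -> str:
    """Pre-processing: ask an LLM to rephrase the user's question into a more
    specific, self-contained form that matches the indexed documents better."""
    return llm(f"Rewrite this question so it is specific and self-contained: {query}")

def rerank(query: str, chunks: list[str], relevance_score) -> list[str]:
    """Post-processing: re-order retrieved chunks using a finer-grained
    relevance model (for example a cross-encoder) before building the prompt."""
    return sorted(chunks, key=lambda c: relevance_score(query, c), reverse=True)
```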

Modular RAG

Modular RAG uses additional modules to manipulate inputs, outputs, and responses in various ways. This may involve adding additional contextual interfaces via:

  • Search engines
  • Other knowledge sources
  • APIs

Modular RAG may also involve creating new pipelines which allow responses to be iteratively refined or fused together using multiple prompts or models. It may further allow the integration of feedback mechanisms to tune prompts and retrieval over time.

RAG variations

There are tens or hundreds of variations of RAG that have been or could be created using these approaches and others, each of which has specific advantages and disadvantages. As the technology behind LLMs and assistants progresses, we will begin to get a sense of which approaches work for different applications.

In the meantime, many of these naive, advanced, and modular approaches are implemented in libraries such as LangChain and LlamaIndex, making it easier to start experimenting with LLM-based systems that integrate other datastores.
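For example, a naive RAG pipeline in LlamaIndex can be only a few lines. The sketch below follows the library's documented quickstart pattern, though import paths vary between versions and it assumes an embedding model and LLM backend (for example an OpenAI API key) are already configured:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # load the documents to index
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, and index them
query_engine = index.as_query_engine()                 # wraps retrieval + generation

response = query_engine.query("What does our internal wiki say about onboarding?")
print(response)
```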

Additional resources

At Splunk, keep an eye on the work of Huaibo Zhao and Philipp Drieger, as well as future developments in the Splunk App for Data Science and Deep Learning, as we further expand its capabilities to include LLM integrations and ways to implement RAG on your own data in Splunk.

Splunk's own AI assistants rely on RAG as a core component, and you can read more about that in this Technical Review of Splunk AI Assistant for SPL.

At the time of writing (August 2024), this review paper has a great summary of the state of the art with respect to RAG, so please take a look to dive into the details.