Fine Tuning vs. RAG (Retrieval-Augmented Generation)

There are two main approaches to using large language models (LLMs) with your data:

  1. Fine-tuning

  2. RAG (Retrieval-Augmented Generation)

Fine-tuning involves further training an LLM on your data for a specific task, such as question answering or summarization. RAG, on the other hand, supplies your data to the LLM at query time: the relevant business data is retrieved and passed to the model along with the user's question. The choice between the two depends on factors such as how much data you have, the specific task you want the LLM to perform, and your budget.

Fine-tuning is a more time-consuming and expensive approach, but it can lead to better results if you have a lot of data available and a specific task in mind. RAG, on the other hand, is a faster and cheaper approach, but it may not be as accurate as fine-tuning. Ultimately, the best approach for you will depend on your specific needs and resources.

When to use Fine-Tuning?

Fine-tuning is suitable when a large amount of task-specific labeled data is available. In fine-tuning, we adjust a pre-trained model's parameters to fit a new task by feeding it labeled examples, i.e., data annotated with the correct output for the task.

For example, if you want to train a model to classify images of cats and dogs, you could start with a pre-trained model that has been trained on a large dataset of images. You could then fine-tune this model on a dataset of images of cats and dogs that have been labeled with the correct species. This would allow the model to learn the specific features that are associated with cats and dogs, and improve its performance on the task of image classification.
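
As a rough illustration, here is a minimal PyTorch/torchvision sketch of that workflow. The folder path and dataset layout are hypothetical, and the training loop is trimmed to a single epoch: a ResNet pre-trained on ImageNet gets a new two-class head that is then trained on the labeled cat/dog images.

```python
# Minimal sketch: fine-tuning a pre-trained image classifier on cats vs. dogs.
# The folder layout data/cats_dogs/{cat,dog}/*.jpg is an assumption.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_data = datasets.ImageFolder("data/cats_dogs", transform=transform)
loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # new head: cat vs. dog

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                      # one epoch, for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```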

Fine-tuning is also helpful for tasks like summarization because it allows the model to learn from a large amount of data that is relevant to the task at hand. For example, let's say you want to use a model to summarize text. You could fine-tune it on a dataset of text summaries. This would allow the model to learn the patterns that are common in summaries, and it would be better able to summarize new text.

Here is an example of how fine-tuning could be used to improve the performance of a model on a summarization task. The model is first pre-trained on a large corpus of text. This allows the model to learn the general patterns of language. The model is then fine-tuned on a dataset of text summaries. This allows the model to learn the specific patterns that are common in summaries.
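
A minimal sketch of that second stage, assuming the Hugging Face transformers and datasets libraries, a small T5 checkpoint, and the public CNN/DailyMail summarization dataset (all of these are illustrative choices, not requirements):

```python
# Minimal sketch: fine-tuning a small seq2seq model on a summarization dataset.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
from datasets import load_dataset

model_name = "t5-small"                                   # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

def preprocess(batch):
    inputs = tokenizer(["summarize: " + a for a in batch["article"]],
                       max_length=512, truncation=True)
    labels = tokenizer(batch["highlights"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="./summarizer",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```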

Challenges with Fine-Tuning

The primary challenges with fine-tuning are that it is computationally expensive, requires gathering large datasets, and is resource-intensive. Training a large neural network on a large dataset takes considerable time and costly hardware. Fine-tuning also needs many labeled examples for the model to learn the task, and such data can be hard to obtain for rare or specialized tasks. Finally, the memory and processing power required can put fine-tuning out of reach for users who don't have access to powerful machines or who don't want to spend a lot of money training their models. Methods like Parameter-Efficient Fine-Tuning (PEFT), quantization, and pruning can help mitigate these challenges.

Parameter Efficient Fine-Tuning (PEFT)

Parameter-Efficient Fine-Tuning (PEFT) is a family of methods for fine-tuning large language models (LLMs) on a specific task without updating all of their weights. The pre-trained model is kept mostly frozen, and only a small number of additional or selected parameters are trained, for example the low-rank adapter matrices used by LoRA. Because only a tiny fraction of the parameters changes, the model can be adapted to the task far more quickly and cheaply than with full fine-tuning.

PEFT has been shown to be effective on a variety of tasks, including natural language inference, question answering, and summarization. It is a promising approach for fine-tuning LLMs on specific tasks, and it has the potential to improve the performance of LLMs on a variety of applications.
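
LoRA is one popular PEFT method. The sketch below uses the Hugging Face peft library, with GPT-2 standing in as an arbitrary base model, and mainly shows how small the trainable portion of the model becomes:

```python
# Minimal sketch: wrapping a base model with LoRA adapters via the peft library.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # assumed base model
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8,
                         lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(base_model, lora_config)

# The frozen base weights stay untouched; only the small adapter matrices train.
model.print_trainable_parameters()   # typically well under 1% of all parameters
# `model` can now be passed to a normal Trainer or training loop.
```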

PEFT does come with its share of drawbacks. Firstly, selecting the appropriate hyperparameters for PEFT can be a challenging task. Secondly, fine-tuning LLMs can be problematic when dealing with tasks that lack sufficient representation in the training data.

Quantization

Quantization is the process of representing a model's weights (and sometimes activations) with lower-precision data types, such as 8-bit integers instead of 32-bit floating-point numbers. This reduces the size of the model and can make it faster to run, especially on devices with limited memory. The trade-off is that lower precision introduces small numerical errors and can degrade accuracy, so a quantized model should be evaluated carefully; choosing a higher precision for the quantized model reduces this loss. Quantization can also be combined with other techniques, such as pruning, to shrink the model further.
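
As one concrete example, PyTorch's dynamic quantization converts the linear layers of an already-trained model to 8-bit integers for inference. The model choice below is illustrative; any model with nn.Linear layers would work the same way:

```python
# Minimal sketch: post-training dynamic quantization of a model's linear layers.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # assumed model

# Weights of nn.Linear layers are stored as int8; activations are quantized on the fly.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# quantized_model is smaller in memory and can run faster on CPU,
# usually at the cost of a small drop in accuracy.
```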

Pruning

Pruning is a technique used to reduce the size of a language model. It works by removing weights (connections between neurons) from the model, which makes it less computationally expensive to run and can keep it more focused on the task at hand. Pruning is typically done after the model has been fine-tuned on a specific task: the model is evaluated on a separate set of data, and the connections that contribute least to the task are removed. This process can be repeated several times to shrink the model further. Pruning can be a very effective way to reduce the size of an LLM without sacrificing much performance, but it can also make the model more brittle, meaning it may be more likely to make mistakes on new data.
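
A minimal sketch of magnitude pruning with PyTorch's built-in utilities; a single linear layer stands in for one layer of a larger model:

```python
# Minimal sketch: L1 (magnitude) pruning of a layer's weights with PyTorch.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(768, 768)            # stand-in for one layer of an LLM

prune.l1_unstructured(layer, name="weight", amount=0.3)   # zero the 30% smallest weights
print(float((layer.weight == 0).float().mean()))          # ~0.3 of the weights are now zero

prune.remove(layer, "weight")                # make the pruning permanent
```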

RAG (Retrieval-Augmented Generation)

RAG (Retrieval-Augmented Generation) is a technique that combines the strengths of both retrieval and generation models. Retrieval models are good at finding relevant information from a large corpus of text, while generation models are good at creating new text. RAG models first retrieve a set of relevant documents from a corpus, and then use those documents to help generate the final output. This can lead to more accurate and informative results than either retrieval or generation models alone.

A RAG model first retrieves a set of relevant documents from a corpus. The retrieved documents are then combined with the query to form the context that is passed to the generator, and the generator uses that context to produce the final output.

RAG is effective for a variety of tasks, including question-answering, summarization, and translation. It is a promising technique that can be used to improve the performance of many natural language processing tasks. RAG demonstrates improved resource usage efficiency and offers quicker results, making it a good fit for applications that have limited computational resources, real-time demands, or low latency requirements.

RAG Implementation

Let's now try to understand the RAG implementation. As explained before, RAG combines retrieval and generation models. Retrieval is used to find the passages from a large corpus of text that are most relevant to a given query. This is done with semantic search: a language model is used to capture the meaning of the query, and that representation is used to find the most relevant passages in the data corpus.

Now let's say we want to retrieve only a specific piece of information from the corpus. It would be very inefficient if the LLM had to go through the entire dataset every time we wanted to retrieve something. There is also a limit on the number of tokens we can send in the LLM's context window, so feeding the entire corpus into the context window isn't possible either. The practical solution is to break the corpus into smaller chunks of text that fit into the context window, embed them, and store them in a vector DB as embedded text.
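
A minimal LangChain sketch of this chunk-and-store step, assuming a local text file and an OpenAI embedding model (both are placeholders, and the exact imports can vary with the LangChain version):

```python
# Minimal sketch: split a document into chunks and store them in a Chroma vector DB.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

docs = TextLoader("data/company_docs.txt").load()          # hypothetical source file
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100                     # chunk sizes are tunable
).split_documents(docs)

vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings(),
                                 persist_directory="./chroma_db")
```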

Text embedding is a technique for representing text as a vector of numbers. An embedding model maps words, sentences, or whole chunks of text to points in a high-dimensional vector space, where pieces of text with similar meaning end up close together. This numeric representation is what allows a computer to compare and search text efficiently.
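
For instance, embedding a single piece of text yields a fixed-length list of floats. OpenAI embeddings via LangChain are assumed here, but any embedding model works the same way:

```python
# Minimal sketch: converting text into an embedding vector.
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("How do I reset my password?")
print(len(vector))   # a fixed-length vector of floats, e.g. 1536 dimensions
```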

When the user makes a query, the query is converted into a vector using the same embedding model. That query vector is compared with the vectors stored in the database, and the most similar chunks are returned as a ranked list, ordered by how closely they match the query.
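
Continuing the Chroma sketch from above, a similarity search returns the chunks closest to the query along with their scores (the query text is illustrative):

```python
# Minimal sketch: retrieving the chunks most similar to a user query.
results = vectordb.similarity_search_with_score("What is the refund policy?", k=3)
for doc, score in results:
    print(round(score, 3), doc.page_content[:80])   # ranked, most similar first
```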

After retrieval comes the generation step: the retrieved documents are passed to the LLM along with the user query, and the model composes the answer using the information in those documents.
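
Tying the pieces together, a LangChain RetrievalQA chain over the vector store from the earlier sketch might look like this (the model name and query are illustrative):

```python
# Minimal sketch: answering a query with an LLM grounded in the retrieved chunks.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
)
answer = qa_chain.run("What is the refund policy?")
print(answer)
```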

I have written several posts on retrieval and querying vector DBs. In those posts, I also discuss the internal workings of retrievers and implement them using Langchain. Below are the links to those posts:

Deep Dive into the Internals of Langchain Vector Store Retriever

Installing Chroma DB Locally & Querying Personal Data

Langchain: Question Answering over Personal Data

Building LLM ChatBot using Langchain

I hope you found this comparison post between fine-tuning and RAG useful. I regularly create similar content on Langchain, LLM, and AI topics. If you'd like to receive more articles like this, consider subscribing to my blog.

If you're in the Langchain space or LLM domain, let's connect on LinkedIn! I'd love to stay connected and continue the conversation. Reach me at: linkedin.com/in/ritobrotoseth
