Building RAG in 2024 with Langchain, Groq, Llama3, and Qdrant

In this blog, we will go from the basics to advanced concepts of building RAG with Langchain. We will use Llama3 as our LLM, served through the Groq API.

What is Groq?

Groq is an AI solutions company known for its cutting-edge technology, particularly the Language Processing Unit (LPU) Inference Engine, designed to enhance Large Language Models (LLMs) with ultra-low latency inference capabilities. Groq APIs enable developers to integrate state-of-the-art LLMs like Llama3 and Mixtral 8x7B into applications requiring real-time AI processing.

To use the Groq APIs, we will need to generate an API key, which can be created from the Groq console.

Next, we will be integrating the Groq chat model with Langchain.

Integrating with Groq chat model

The ChatGroq class wraps the Groq chat LLM APIs. To instantiate the class we need to provide the model name and the Groq API key. We are using the Llama3-8B model in our example.

from langchain_groq import ChatGroq
from google.colab import userdata

chat_model = ChatGroq(temperature=0,
                      model_name="llama3-8b-8192",
                      api_key=userdata.get("GROQ_API_KEY"))

To run inference with the chat model, we use the invoke method.

chat_model.invoke("Why is the sky blue in color?")
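The invoke call returns a chat message object rather than a plain string; a minimal sketch of reading its text:

response = chat_model.invoke("Why is the sky blue in color?")
# The answer text is stored in the message's content attribute.
print(response.content)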

Summarising Long Text

In this example, we are using a summarization instruction prompt to summarize long content.

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_groq import ChatGroq
from google.colab import userdata

chat_model = ChatGroq(temperature=0,
                      model_name="llama3-8b-8192",
                      api_key=userdata.get("GROQ_API_KEY"))

def generate_summary(content: str):
    prompt_template = """
        Provide a concise, high-level summary of the key points and most important information from the given content. \
        Highlight the main topics, key takeaways, and critical details. Structure your response in a clear, \
        well-organized manner using bullet points or short paragraphs. Avoid extraneous details and focus on distilling \
        the core substance. The summary should be informative, insightful, and easy to digest for the reader.

        Content:
        {content}
        --------------------------

        Summary:
    """

    prompt = PromptTemplate(template=prompt_template, input_variables=["content"])
    output_parser = StrOutputParser()
    chain = prompt | chat_model | output_parser
    return chain.invoke({"content": content})

When creating a chain we require a prompt, an LLM model, and an output parser. We have used LCEL (LangChain Expression Language) to combine these different components into a single chain.

chain = prompt | chat_model | output_parser

The | symbol is similar to a Unix pipe operator, which chains together the different components, feeding the output from one component as input into the next component. In this chain the user input is passed to the prompt template, then the prompt template output is passed to the model, then the model output is passed to the output parser.

What is LCEL?

LangChain Expression Language (LCEL) is a declarative framework for composing chains, enabling a seamless transition from prototyping to production.

LCEL offers several advantages:

  1. First-class streaming support for optimal time-to-first-token.

  2. Asynchronous API compatibility for both synchronous and asynchronous calls.

  3. Optimized parallel execution for minimal latency.

  4. Access to intermediate results for debugging and user feedback.

  5. Configurable retries and fallbacks for enhanced reliability.

This comprehensive set of features makes LCEL a powerful tool for developers looking to efficiently build applications with Large Language Models (LLMs) while maintaining flexibility and reliability throughout the development process.
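As an illustration of the streaming and async points above, every LCEL chain is a Runnable and therefore exposes stream, ainvoke, and batch alongside invoke. A minimal sketch, assuming the chat_model instance created earlier (the one-line prompt here is just a toy example):

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# A tiny illustrative chain; chat_model is the ChatGroq instance created earlier.
prompt = PromptTemplate(template="Summarize in one sentence:\n{content}",
                        input_variables=["content"])
chain = prompt | chat_model | StrOutputParser()

# Streaming: print the summary token by token for a better time-to-first-token.
for chunk in chain.stream({"content": "LCEL chains expose streaming and async APIs out of the box."}):
    print(chunk, end="", flush=True)

# Async: the same chain can be awaited, e.g. inside an async web handler.
# summary = await chain.ainvoke({"content": "..."})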

To run our summarisation example, we call the generate_summary function.

text_content = """
A vector database indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, horizontal scaling, and serverless.

We’re in the midst of the AI revolution. It’s upending any industry it touches, promising great innovations - but it also introduces new challenges. Efficient data processing has become more crucial than ever for applications that involve large language models, generative AI, and semantic search.

All of these new applications rely on vector embeddings, a type of vector data representation that carries within it semantic information that’s critical for the AI to gain understanding and maintain a long-term memory they can draw upon when executing complex tasks.

Embeddings are generated by AI models (such as Large Language Models) and have many attributes or features, making their representation challenging to manage. In the context of AI and machine learning, these features represent different dimensions of the data that are essential for understanding patterns, relationships, and underlying structures.

That is why we need a specialized database designed specifically for handling this data type. Vector databases like Pinecone fulfill this requirement by offering optimized storage and querying capabilities for embeddings. Vector databases have the capabilities of a traditional database that are absent in standalone vector indexes and the specialization of dealing with vector embeddings, which traditional scalar-based databases lack.

The challenge of working with vector data is that traditional scalar-based databases can’t keep up with the complexity and scale of such data, making it difficult to extract insights and perform real-time analysis. That’s where vector databases come into play – they are intentionally designed to handle this type of data and offer the performance, scalability, and flexibility you need to make the most out of your data.

We are seeing the next generation of vector databases introduce more sophisticated architectures to handle the efficient cost and scaling of intelligence. This ability is handled by serverless vector databases, that can separate the cost of storage and compute to enable low-cost knowledge support for AI.

With a vector database, we can add knowledge to our AIs, like semantic information retrieval, long-term memory, and more. The diagram below gives us a better understanding of the role of vector databases in this type of application:
"""
print(generate_summary(text_content))

Building Simple RAG

RAG, or Retrieval Augmented Generation, is a method that combines the benefits of retrieval-based models and generative models. It involves retrieving relevant information from a large database and using that information to generate a response or solution. RAG helps in generating accurate and contextually appropriate responses by leveraging the vast amount of knowledge available in the database.

from langchain_community.vectorstores import Qdrant
import qdrant_client
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_groq import ChatGroq
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from google.colab import userdata

chat_model = ChatGroq(temperature=0,
                      model_name="llama3-8b-8192",
                      api_key=userdata.get("GROQ_API_KEY"))

embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")


def get_qdrant_retriever():
    qdrantClient = qdrant_client.QdrantClient(
        url=userdata.get("QDRANT_URL"),
        prefer_grpc=True,
        api_key=userdata.get("QDRANT_API_KEY"))
    qdrant = Qdrant(qdrantClient, "indian_food_info", embed_model)
    return qdrant.as_retriever(search_kwargs={"k": 6})


def retrieve_docs_from_vector_store(user_question):
    retriever = get_qdrant_retriever()
    print("User question: ", user_question)
    retrieved_docs = retriever.invoke(user_question)
    print("Number of documents found: ", len(retrieved_docs))
    return retrieved_docs


def answer_questions(user_question):
    retrieved_docs = retrieve_docs_from_vector_store(user_question)

    template = """
    You are a question answering bot. You will be given a QUESTION and a set of paragraphs in the CONTENT section. 
    You need to answer the question using the text present in the CONTENT section. 
    If the answer is not present in the CONTENT text then reply `I don't have answer to the question`

    CONTENT: {document}
    QUESTION: {question}
    """

    prompt = PromptTemplate(
        input_variables=["document", "question"], template=template
    )

    output_parser = StrOutputParser()
    chain = prompt | chat_model | output_parser
    llm_answer = chain.invoke({"document": retrieved_docs, "question": user_question})
    return llm_answer

In the above code, we have defined two functions, get_qdrant_retriever and retrieve_docs_from_vector_store, for vector store retrieval. The get_qdrant_retriever function creates the Qdrant retriever, which connects to the Qdrant vector store. The retrieve_docs_from_vector_store function uses the retriever to fetch documents from the vector store based on the user query.

In our RAG, we are using FastEmbedEmbeddings to convert the user's query to embeddings. These embeddings are then used by the vector store to search for the vectors closest to the user's query.
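The examples here assume that a Qdrant collection named indian_food_info already exists and contains embedded documents. If you still need to create and populate it, a minimal sketch using the same embedding model could look like this (the sample texts are placeholders):

from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from google.colab import userdata

embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# Embed the texts and upload them into the "indian_food_info" collection,
# creating the collection on the Qdrant server if it does not exist yet.
Qdrant.from_texts(
    texts=["Butter chicken is a curry of chicken in a spiced tomato and butter sauce...",
           "Masala dosa is a South Indian crepe filled with spiced potatoes..."],
    embedding=embed_model,
    url=userdata.get("QDRANT_URL"),
    api_key=userdata.get("QDRANT_API_KEY"),
    prefer_grpc=True,
    collection_name="indian_food_info",
)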

In the answer_questions function we first call the retrieve_docs_from_vector_store function to fetch the documents from the vector store. These documents are then passed to the LLM along with the QA prompt to extract answers from them.

To get an answer from the QA RAG we call the answer_questions function with the question, as shown below.
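A minimal usage sketch; the question below is a hypothetical example for the indian_food_info collection:

print(answer_questions("What are the main ingredients used in butter chicken?"))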

Chat with Memory

A memory allows a Large Language Model (LLM) to remember previous interactions with the user. By default, LLMs are stateless, meaning each incoming query is processed independently of other interactions. The only thing that exists for a stateless agent is the current input, nothing else.

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_groq import ChatGroq
from google.colab import userdata

chat_model = ChatGroq(temperature=0,
                      model_name="llama3-8b-8192",
                      api_key=userdata.get("GROQ_API_KEY"))

store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


def llm_chat(user_msg: str):
  system_message = "You're a helpful AI assistant who helps users with answering questions."
  prompt = ChatPromptTemplate.from_messages(
      [
          (
              "system",
              system_message,
          ),
          MessagesPlaceholder(variable_name="history"),
          ("human", "{input}"),
      ]
  )

  output_parser = StrOutputParser()
  chain = prompt | chat_model | output_parser

  chain_with_message_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
  )

  response = chain_with_message_history.invoke(
      {
          "input": user_msg
      },
      config={"configurable": {
            "session_id": "abc199"
          }
      },
  )

  return response

To add conversation memory or message history to our chain we will use the RunnableWithMessageHistory wrapper on top of our chain. RunnableWithMessageHistory is responsible for reading and updating the chat message history.

RunnableWithMessageHistory must always be called with a config that contains the appropriate parameters for the chat message history. By default, the Runnable is expected to take a single configuration parameter called session_id which is a string. This parameter is used to create a new or look up an existing chat message history that matches the given session_id.

To chat with memory we call the llm_chat function with the user message, as shown below.
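A short usage sketch with two consecutive calls; both run under the hard-coded session_id "abc199", so the second question can be resolved against the history of the first (the questions themselves are hypothetical):

print(llm_chat("Who wrote the book The Hobbit?"))
# The follow-up relies on the message history stored for session "abc199".
print(llm_chat("What other books did the same author write?"))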

QA RAG with Memory

When building a QA RAG with memory it is important to take the historical messages together with the latest user question and reformulate the question if it references any information in the chat history. We do this in the get_history_aware_retriever function, using a prompt that includes a MessagesPlaceholder variable under the name chat_history. A list of messages is passed to the prompt via the "chat_history" input key, and these messages are inserted after the system message and before the human message containing the latest question.

We use the create_history_aware_retriever helper function at this step. This function creates a chain that takes the conversation history and returns documents. If there is no chat_history, the input is passed directly to the retriever. If there is chat_history, the prompt and LLM are used to generate a search query, which is then passed to the retriever.

The output of this function is an LCEL Runnable. The Runnable takes the latest question under the input key and, if present, the conversation history under chat_history. Its output is a list of Documents.

from langchain.chains import create_history_aware_retriever
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
import qdrant_client
from langchain_community.vectorstores import Qdrant
from langchain_groq import ChatGroq
from google.colab import userdata
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings


embed_model = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

chat_model = ChatGroq(temperature=0,
                      model_name="llama3-8b-8192",
                      api_key=userdata.get("GROQ_API_KEY"))


def get_qdrant_retriever(qdrant_collection):
    qdrantClient = qdrant_client.QdrantClient(
        url=userdata.get("QDRANT_URL"),
        prefer_grpc=True,
        api_key=userdata.get("QDRANT_API_KEY"))
    qdrant = Qdrant(qdrantClient, qdrant_collection, embed_model)
    return qdrant.as_retriever(search_kwargs={"k": 6})


retriever = get_qdrant_retriever("indian_food_info")
store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


def get_history_aware_retriever():
    contextualize_q_system_prompt = """Given a chat history and the latest user question \
  which might reference context in the chat history, formulate a standalone question \
  which can be understood without the chat history. Do NOT answer the question, \
  just reformulate it if needed and otherwise return it as is."""

    contextualize_q_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", contextualize_q_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )

    history_aware_retriever = create_history_aware_retriever(
        chat_model, retriever, contextualize_q_prompt
    )

    return history_aware_retriever


def qa_with_memory(user_question: str):
    qa_system_prompt = """You are an assistant for question-answering tasks. \
  Use the following pieces of retrieved context to answer the question. \
  If you don't know the answer, just say that you don't know. \
  Use three sentences maximum and keep the answer concise.\

  {context}"""

    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", qa_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    question_answer_chain = create_stuff_documents_chain(chat_model, qa_prompt)

    history_aware_retriever = get_history_aware_retriever()
    rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

    conversational_rag_chain = RunnableWithMessageHistory(
        rag_chain,
        get_session_history,
        input_messages_key="input",
        history_messages_key="chat_history",
        output_messages_key="answer",
    )

    return conversational_rag_chain.invoke(
        {"input": user_question},
        config={
            "configurable": {"session_id": "abc125"}
        },
    )["answer"]

In the qa_with_memory function we have created the conversational QA RAG chain. The langchain function create_stuff_documents_chain is used to generate the question_answer_chain; its inputs are the LLM and a prompt template that must contain the variable context (the context variable is used for passing in the formatted documents).

Another langchain function, create_retrieval_chain, is used to build the rag_chain. This chain applies the history_aware_retriever and the question_answer_chain in sequence, retaining intermediate outputs such as the retrieved context for convenience. In short, create_retrieval_chain retrieves documents from the vector store and passes them on; its output is an LCEL Runnable that returns a dictionary containing at the very least the context and answer keys.

To get an answer from the QA RAG with memory, we call the qa_with_memory function with the question, as shown below.
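A short usage sketch with two turns on the same hard-coded session_id "abc125"; the questions are hypothetical examples for the indian_food_info collection:

print(qa_with_memory("What is chicken tikka masala?"))
# The follow-up is first rewritten into a standalone question using the chat
# history, then sent to the retriever.
print(qa_with_memory("Which region of India does it come from?"))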

In this blog we went through the basics of Langchain: we learned about the LangChain Expression Language (LCEL) and invoked chains to get responses from an LLM. We learned to build a simple RAG, add memory to an LLM, and also covered the more advanced concept of building RAG with memory. I hope you found these concepts useful and can easily use them as boilerplate code for your next application. If you have any questions regarding the topic, please don't hesitate to ask in the comment section. I will be more than happy to address them.

I regularly create similar content on LangChain, LLMs, and AI-related topics. If you'd like to read more articles like this, consider subscribing to my blog.

If you're in the Langchain space or LLM domain, let's connect on Linkedin! I'd love to stay connected and continue the conversation. Reach me at: linkedin.com/in/ritobrotoseth
