Installing Chroma DB Locally & Querying Personal Data

In this blog, we will learn how to run a vector DB on the local machine with Docker. We will run a Chroma container, connect to it using the Chroma HTTP client, create a collection, insert documents into the DB, and query the DB to get results.

What is Vector DB?

A vector database is a specialized database designed to store and retrieve data represented as numerical vectors efficiently. Unlike traditional databases that work with structured data (like tables), vector databases excel at handling unstructured data such as text, images, and audio.

A vector DB provides the capabilities required to scale, optimize, manage, and secure high-dimensional vector data. Some examples of vector DBs are:

  1. Chroma

  2. Pinecone

  3. Weaviate

  4. Qdrant, etc

What is Chroma?

Chroma is an open-source embedding database that can be used to store embeddings and their metadata, embed documents and queries, and search embeddings.

Chroma provides its own Python as well as JavaScript/TypeScript client SDK which can be used to connect to the DB.

There are two ways to use Chroma:

  1. In-memory DB

  2. Running in Docker as a DB server
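As a quick aside, the in-memory mode needs no server at all. A minimal sketch (assuming the chromadb package is installed; Chroma's default embedding function downloads a small local model on first use, which differs from the OpenAI setup used later in this blog):

```python
import chromadb

# Ephemeral, in-process client: no Docker container or server required
client = chromadb.Client()
collection = client.create_collection(name="demo")

# Chroma embeds these documents with its default embedding function
collection.add(
    ids=["1", "2"],
    documents=["Chroma is a vector database", "Docker runs containers"],
)

results = collection.query(query_texts=["what is a vector DB?"], n_results=1)
print(results["documents"][0][0])
```

The rest of this blog uses the second option, running Chroma as a DB server in Docker.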

Running Chroma in a container

The first step is to clone the repository from GitHub.

git clone https://github.com/chroma-core/chroma

Then move into the chroma directory and start the Docker container using this command:

docker-compose up -d --build

This command builds the image first and then starts the container in detached mode. Once the container is up and running, we should see the message below.

Creating chroma_server_1 ... done

I have attached a screenshot of the same.

The DB server is exposed on port 8000 of the host machine and is accessible at localhost:8000. To verify that the server is up and functional, issue a GET request to the endpoint http://localhost:8000/api/v1/heartbeat
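For example, with curl (the response contains a nanosecond timestamp; the exact payload can vary between Chroma versions):

```shell
# Quick liveness check against the running container
curl http://localhost:8000/api/v1/heartbeat
```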


Connecting ChromaDB with Python SDK

To connect to the Chroma DB server, we need to install chromadb-client.

pip install chromadb-client

This is a lightweight HTTP client for connecting to the DB server. Once the client is installed, the next step is to connect to the DB.

Viewing Chroma Collections

import chromadb
chroma_client = chromadb.HttpClient(host="localhost", port=8000)
print(chroma_client.list_collections())

To connect to the Chroma DB server, we create an HTTP client object, specifying the host and port. The code above connects to the DB server and displays the list of collections. Since we don't have any collections in our DB yet, it returns an empty list.

To create a new collection, the create_collection() function is used.

collection = chroma_client.create_collection(name="my_test_collection")

After creating the my_test_collection collection, printing the collection list with list_collections() returns the newly created collection.

[Collection(name=my_test_collection)]

Viewing Collection Data

To view the first few records of the collection (10 by default), we can use the peek() function.

collection = chroma_client.get_collection("my_test_collection")
collection.peek()

There are 4 attributes of the persisted record: id, embedding, metadata, and document.

  1. id: It is the unique identifier of the record.

  2. embedding: It contains the embedding data of the document.

  3. metadata: It contains the metadata that we want to persist along with the document.

  4. document: The data chunk that we want to store in the collection.

{'ids': [], 'embeddings': [], 'metadatas': [], 'documents': []}

Since our newly created collection is empty, all the attributes come back as empty lists.

Inserting data into the Collection

Documents are turned into embeddings before being stored in the vector DB; we use OpenAIEmbeddingFunction to transform the text. When instantiating the collection object, we specify the collection name and the embedding function: in the example below, my_test_collection is the collection name and the OpenAI embedding model is the function.

We use RecursiveCharacterTextSplitter to split our data into smaller chunks of size 200 with an overlap of 30. This function returns a list of docs that are to be inserted into the collection.
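To see how chunk size and overlap interact, here is an illustrative (and deliberately naive) character chunker; the real RecursiveCharacterTextSplitter is smarter, preferring to split on the given separators rather than at fixed offsets:

```python
# Illustrative only: fixed-width chunking with overlap.
# Each chunk starts (chunk_size - chunk_overlap) characters after the previous one,
# so consecutive chunks share chunk_overlap characters.
def chunk(text, chunk_size=200, chunk_overlap=30):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

pieces = chunk("a" * 450, chunk_size=200, chunk_overlap=30)
print([len(p) for p in pieces])  # → [200, 200, 110]
```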

To insert documents into the collection, we use the collection.add function. In the example below, we pass two parameters to the add function: ids and documents.

import chromadb
import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter
from chromadb.utils import embedding_functions

OPENAI_API_KEY = "sk-..."  # your OpenAI API key

def insert(content):
    client = chromadb.HttpClient(host="localhost", port=8000)
    # Embedding function used to convert each chunk into a vector
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=OPENAI_API_KEY,
                model_name="text-embedding-ada-002"
            )
    collection = client.get_collection(name="my_test_collection", embedding_function=openai_ef)

    # Split the content into ~200-character chunks with a 30-character overlap
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n"], chunk_size=200, chunk_overlap=30)
    docs = text_splitter.create_documents([content])

    for doc in docs:
        uuid_val = uuid.uuid1()
        print("Inserted document", uuid_val)
        collection.add(ids=[str(uuid_val)], documents=[doc.page_content])

Now let's use the insert method to insert data into the DB. I assigned a paragraph to the content variable and passed it to the insert function; the records were then inserted into the collection.
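A hypothetical run (assuming the server is up and a valid OpenAI key is configured) looks like this:

```python
# The paragraph below is a shortened stand-in for the content used in the blog
content = (
    "The story centres on Alice, a young girl who falls asleep in a meadow "
    "and dreams that she follows the White Rabbit down a rabbit hole."
)

insert(content)
# Prints one "Inserted document <uuid>" line per chunk
```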

By running the collection.peek() method, we can get more insight into our inserted data. Below is the output of the peek() function.

{'ids': ['ca0ede6a-328c-11ee-9349-4cd577cb7c58',
  'cac5ab41-328c-11ee-bb75-4cd577cb7c58'],
 'embeddings': [[0.0011678459122776985,
   -0.008723867125809193,
   -0.01643379032611847,
   ...
   -0.018491532653570175,
   0.024459557607769966,
   ...]],
 'metadatas': [None, None],
 'documents': ['The story centres on Alice, a young girl who falls asleep in a meadow and dreams that she follows the White Rabbit down a rabbit hole. She has many wondrous, often bizarre adventures with thoroughly illogical and very strange creatures, often changing size unexpectedly (she grows as tall as a house and shrinks to 3 inches [7 cm]). She encounters the hookah-smoking Caterpillar, the Duchess (with a baby that becomes a pig), and the Cheshire Cat, and she attends a strange endless tea party with the Mad Hatter and the March Hare. She plays a game of croquet with an unmanageable flamingo for a croquet mallet and uncooperative hedgehogs for croquet balls while the Queen calls for the execution of almost everyone present. Later, at the Queen’s behest, the Gryphon takes Alice to meet the sobbing Mock Turtle, who describes his education in such subjects as Ambition, Distraction, Uglification, and Derision. ',
  '\nAlice is then called as a witness in the trial of the Knave of Hearts, who is accused of having stolen the Queen’s tarts. However, when the Queen demands that Alice be beheaded, Alice realizes that the characters are only a pack of cards, and she then awakens from her dream.']}

I have trimmed the embedding data to keep the output short and readable. In the output above, there are two entries each for ids, documents, and embeddings, but no data is present in the metadata field; that's because we didn't pass any metadata information with the documents.
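If we did want metadata on each record, we could pass a matching metadatas list to add. A hypothetical sketch (the keys and values here are made up for illustration):

```python
import uuid
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
# In practice, pass the same embedding_function used at insert time
# so new documents are embedded consistently with the existing ones
collection = client.get_collection(name="my_test_collection")

# One metadata dict per document, aligned with ids and documents
collection.add(
    ids=[str(uuid.uuid1())],
    documents=["Alice follows the White Rabbit down a rabbit hole."],
    metadatas=[{"source": "alice-summary", "chapter": 1}],
)
```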

Below I have attached a high-level diagram that gives an overview of the entire flow.

Querying Database

In the example above, we already used Langchain to split our data; now we will use Langchain to query the collection.

To use Chroma DB with Langchain, we need to install the full chromadb package.

pip install chromadb

Let's see an example where we will retrieve data from the DB by using the similarity_search function.

import chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings


def queryDB(query):
    client = chromadb.HttpClient(host="localhost", port=8000)
    # OpenAIEmbeddings reads the OPENAI_API_KEY environment variable
    embedding_function = OpenAIEmbeddings()
    db4 = Chroma(client=client, collection_name="my_test_collection",
                 embedding_function=embedding_function)
    # Embeds the query and returns the most similar documents
    docs = db4.similarity_search(query)
    return docs

We use the Langchain OpenAIEmbeddings function to embed the query, and then the Langchain similarity_search function to get the matching results. Below is an illustration of how the query fetches results from the DB.

Now let's try to query our DB using the queryDB function. Below is the screenshot where we queried the DB with the sentence "Alice the young girl", and it returned the closest match to this text.
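In code, that query would look something like this (a hypothetical run, assuming the server is up and OPENAI_API_KEY is set):

```python
docs = queryDB("Alice the young girl")

# Results are ordered by similarity; the best match comes first
for doc in docs:
    print(doc.page_content)
```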

So far, we have learned how to run Chroma DB locally, connect to it, store documents, and query the records. Now let's do a small activity where we insert personal data into Chroma and then query it.

First, we will write a scrape function that will be used to scrape data from a website.

from langchain.document_loaders import SeleniumURLLoader
from bs4 import BeautifulSoup

def scrape(url):
    urls = [url]
    # SeleniumURLLoader renders the page in a headless browser,
    # so JavaScript-generated content is included
    loader = SeleniumURLLoader(urls=urls)
    data = loader.load()

    if data is not None and len(data) > 0:
        # Strip the markup and keep only the visible text
        soup = BeautifulSoup(data[0].page_content, "html.parser")
        text = soup.get_text()
        return text

    return ''

We will be scraping the following site, which lists 10 AI tools that every content creator must have: https://tech.hindustantimes.com/tech/news/10-ai-tools-that-all-content-creator-must-have-in-their-arsenal-71687251644210.html

Above I have pasted the screenshot of the website.

Next, we write a function addDataToDB whose job is to break down the website data into chunks and add the documents to the DB.

import chromadb
import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter
from chromadb.utils import embedding_functions

OPENAI_API_KEY = "sk-..."  # your OpenAI API key

def addDataToDB(content):
    client = chromadb.HttpClient(host="localhost", port=8000)
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=OPENAI_API_KEY,
                model_name="text-embedding-ada-002"
            )
    # Assumes the my_collection collection already exists;
    # use get_or_create_collection() otherwise
    collection = client.get_collection(name="my_collection", embedding_function=openai_ef)

    # Larger chunks this time: 500 characters with a 50-character overlap
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n"], chunk_size=500, chunk_overlap=50)
    docs = text_splitter.create_documents([content])

    for doc in docs:
        uuid_val = uuid.uuid1()
        print("Inserted document", uuid_val)
        collection.add(ids=[str(uuid_val)], documents=[doc.page_content])

Lastly, we have our queryDB function which we will use to fetch records from the DB.

import chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings


def queryDB(query):
    client = chromadb.HttpClient(host="localhost", port=8000)
    # OpenAIEmbeddings reads the OPENAI_API_KEY environment variable
    embedding_function = OpenAIEmbeddings()
    db4 = Chroma(client=client, collection_name="my_collection",
                 embedding_function=embedding_function)
    docs = db4.similarity_search(query)
    return docs

Below I have attached a screenshot of the results that we got from the queryDB function.

The results are very close to what we were looking for: based on our query, it performed a similarity search over the embeddings and returned the closest matches.
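Putting the pieces together, the entire flow can be sketched as follows (a hypothetical run, assuming the Chroma server is running and OPENAI_API_KEY is configured):

```python
url = "https://tech.hindustantimes.com/tech/news/10-ai-tools-that-all-content-creator-must-have-in-their-arsenal-71687251644210.html"

content = scrape(url)   # pull the raw text from the page
addDataToDB(content)    # chunk, embed, and store it in Chroma

# The query string here is made up for illustration
docs = queryDB("AI tools for content creators")
for doc in docs:
    print(doc.page_content)
```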

I hope you found this blog useful for setting up Chroma DB locally and learning the process of inserting and querying data. I regularly create similar content on Langchain, LLM, and AI topics. If you'd like to receive more articles like this, consider subscribing to my blog.

If you're in the Langchain space or LLM domain, let's connect on LinkedIn! I'd love to stay connected and continue the conversation. Reach me at: linkedin.com/in/ritobrotoseth
