In this blog, we will learn how to run a vector DB on the local machine with Docker. We will run a Chroma container, connect to it using the Chroma HTTP client, create a collection, insert documents into the DB, and query the DB for results.
What is a Vector DB?
A vector database is a specialized database designed to store and retrieve data represented as numerical vectors efficiently. Unlike traditional databases that work with structured data (like tables), vector databases excel at handling unstructured data such as text, images, and audio.
A vector DB provides the capabilities required to scale, optimize, manage, and secure high-dimensional vector data. Some examples of Vector DB are:
Chroma
Pinecone
Weaviate
Qdrant, etc.
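The core operation behind all of these databases is nearest-neighbour search over embedding vectors. The sketch below illustrates the idea with cosine similarity; the function name and the toy 3-dimensional vectors are purely illustrative (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    # Similarity of two equal-length vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for real ones.
query = [1.0, 0.0, 1.0]
docs = {"doc_a": [0.9, 0.1, 0.8], "doc_b": [-1.0, 0.5, 0.0]}

# The vector DB's job: return the stored vector most similar to the query.
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
```

A production vector DB does this with approximate-nearest-neighbour indexes rather than a linear scan, but the ranking principle is the same.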
What is Chroma?
Chroma is an open-source embedding database that can be used to store embeddings and their metadata, embed documents and queries, and search embeddings.
Chroma provides its own Python as well as JavaScript/TypeScript client SDK which can be used to connect to the DB.
There are two ways to use Chroma:
In-memory DB
Running in Docker as a DB server
Running Chroma in a container
The first step is to clone the repository from GitHub.
git clone https://github.com/chroma-core/chroma
Then move into the chroma directory and start the Docker container using this command:
docker-compose up -d --build
This command will build the image first and then will start the container in detached mode. Once the container is up and running we should see the below message.
Creating chroma_server_1 ... done
I have attached a screenshot of the same output below.
The DB server is exposed on port 8000 of the host machine and is accessible at localhost:8000. To verify that the server is up and working, issue a GET request to the endpoint http://localhost:8000/api/v1/heartbeat
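The heartbeat check can also be scripted with nothing but the standard library. The helper names below are my own, and the exact shape of the JSON payload may vary across Chroma versions; with the container running, a healthy server responds with a small JSON object (a nanosecond timestamp).

```python
import json
from urllib.request import urlopen

def heartbeat_url(host="localhost", port=8000):
    # Chroma's v1 heartbeat endpoint.
    return f"http://{host}:{port}/api/v1/heartbeat"

def check_server(host="localhost", port=8000):
    # Issues the GET request; raises URLError if the container is not running.
    with urlopen(heartbeat_url(host, port)) as resp:
        return json.loads(resp.read())
```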
Connecting ChromaDB with Python SDK
To connect to the Chroma DB server we need to install the chromadb-client package.
pip install chromadb-client
This is a lightweight HTTP client for connecting to the DB server. Once the client is installed the next step is to connect to the DB.
Viewing Chroma Collections
import chromadb
chroma_client = chromadb.HttpClient(host="localhost", port=8000)
print(chroma_client.list_collections())
To connect to the Chroma DB server we create an HTTP client object by specifying the host and port details. The code above connects to the DB server and displays the list of collections. Since we don't have any collections in our DB yet, it will return an empty array.
To create a new collection, the create_collection() function is used.
collection = chroma_client.create_collection(name="my_test_collection")
After creating the my_test_collection collection, if we now print the collection list using the list_collections() function, it will return the newly created collection.
[Collection(name=my_test_collection)]
Viewing Collection Data
To view the first 5 records of the collection we can use the peek() function.
collection = chroma_client.get_collection("my_test_collection")
collection.peek()
There are 4 attributes of the persisted record: id, embedding, metadata, and document.
id: The unique identifier of the record.
embedding: The embedding vector of the document.
metadata: Any metadata we want to persist along with the document.
document: The data chunk that we want to store in the collection.
{'ids': [], 'embeddings': [], 'metadatas': [], 'documents': []}
Since our newly created collection is empty we are getting an empty array for all the attributes.
Inserting data into the Collection
Documents are turned into embeddings before being stored in the vector DB; we have used OpenAIEmbeddingFunction to transform the text. When instantiating the collection object we specify the collection name and the embedding function: in the example below, my_test_collection is the collection name and OpenAIEmbeddingFunction is the embedding function.
We have used RecursiveCharacterTextSplitter to split our data into smaller chunks of size 200 with an overlap of 30. This function returns an array of docs that are to be inserted into the collection.
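To build intuition for what chunk_size and chunk_overlap mean, here is a deliberately simplified sliding-window chunker. This is not LangChain's actual algorithm (RecursiveCharacterTextSplitter first splits on the given separators and then merges pieces back up to the size limit), but it shows how consecutive chunks share overlapping context.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=30):
    # Simplified sliding window: each chunk starts (chunk_size - chunk_overlap)
    # characters after the previous one, so consecutive chunks share
    # chunk_overlap characters of context.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 500-character input yields chunks starting at offsets 0, 170, and 340.
chunks = chunk_text("x" * 500, chunk_size=200, chunk_overlap=30)
```

The overlap matters for retrieval quality: a sentence that straddles a chunk boundary still appears whole in at least one chunk.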
To insert documents into the collection we use the collection.add function. In the example below, we pass two parameters to the add function: ids and documents.
import chromadb
import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter
from chromadb.utils import embedding_functions
def insert(content):
    client = chromadb.HttpClient(host="localhost", port=8000)
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=OPENAI_API_KEY,  # your OpenAI API key
        model_name="text-embedding-ada-002"
    )
    collection = client.get_collection(name="my_test_collection", embedding_function=openai_ef)
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n"], chunk_size=200, chunk_overlap=30)
    docs = text_splitter.create_documents([content])
    for doc in docs:
        uuid_val = uuid.uuid1()
        print("Inserted documents for ", uuid_val)
        collection.add(ids=[str(uuid_val)], documents=[doc.page_content])
Now let's use the insert method to insert data into the DB. In the content variable, I added a paragraph and then passed it to the insert function. In the screenshot below, we can see that records were inserted into this collection.
By running the collection.peek() method we can get more insight into our inserted data. Below is the output of the peek() function.
{'ids': ['ca0ede6a-328c-11ee-9349-4cd577cb7c58',
'cac5ab41-328c-11ee-bb75-4cd577cb7c58'],
'embeddings': [[0.0011678459122776985,
-0.008723867125809193,
-0.01643379032611847,
...
-0.018491532653570175,
0.024459557607769966,
...]],
'metadatas': [None, None],
'documents': ['The story centres on Alice, a young girl who falls asleep in a meadow and dreams that she follows the White Rabbit down a rabbit hole. She has many wondrous, often bizarre adventures with thoroughly illogical and very strange creatures, often changing size unexpectedly (she grows as tall as a house and shrinks to 3 inches [7 cm]). She encounters the hookah-smoking Caterpillar, the Duchess (with a baby that becomes a pig), and the Cheshire Cat, and she attends a strange endless tea party with the Mad Hatter and the March Hare. She plays a game of croquet with an unmanageable flamingo for a croquet mallet and uncooperative hedgehogs for croquet balls while the Queen calls for the execution of almost everyone present. Later, at the Queen’s behest, the Gryphon takes Alice to meet the sobbing Mock Turtle, who describes his education in such subjects as Ambition, Distraction, Uglification, and Derision. ',
'\nAlice is then called as a witness in the trial of the Knave of Hearts, who is accused of having stolen the Queen’s tarts. However, when the Queen demands that Alice be beheaded, Alice realizes that the characters are only a pack of cards, and she then awakens from her dream.']}
I have trimmed the embedding data to keep the output short and readable. In the above output there are 2 entries each for ids, documents, and embeddings, but no data is present in the metadatas field; that's because we didn't pass any metadata along with the documents.
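If we do want metadata, collection.add also accepts a metadatas list that runs parallel to ids and documents. A sketch of how the three lists line up (the helper name and the "source"/"chunk" keys are my own invention for illustration):

```python
import uuid

def prepare_records(chunks, source):
    # Build the three parallel lists that collection.add() expects:
    # one id, one document, and one metadata dict per chunk.
    ids = [str(uuid.uuid1()) for _ in chunks]
    documents = list(chunks)
    metadatas = [{"source": source, "chunk": i} for i in range(len(chunks))]
    return ids, documents, metadatas

ids, documents, metadatas = prepare_records(["chunk one", "chunk two"], source="alice.txt")
# Then, against a running server:
# collection.add(ids=ids, documents=documents, metadatas=metadatas)
```

Persisting a source field like this also lets you filter later queries to documents from a particular origin.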
Below I have attached a high-level diagram that gives an overview of the entire flow.
Querying Database
In the above example we have already used Langchain to split our data; now we will use Langchain to query the collection.
To use Chroma DB with Langchain we need to install the chromadb package.
pip install chromadb
Let's see an example where we retrieve data from the DB using the similarity_search function.
import chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
def queryDB(query):
    client = chromadb.HttpClient(host="localhost", port=8000)
    embedding_function = OpenAIEmbeddings()
    db4 = Chroma(client=client, collection_name="my_test_collection", embedding_function=embedding_function)
    docs = db4.similarity_search(query)
    return docs
We use the Langchain OpenAIEmbeddings function to embed the query, and the Langchain similarity_search function to get the matching results. Below is an illustration of how the query fetches results from the DB.
Now let's try to query our DB using the queryDB function. Below is the screenshot where we queried the DB with the sentence Alice the young girl, and it returned the closest match to this text.
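Conceptually, similarity_search embeds the query with the same embedding function used at insert time and returns the stored documents whose vectors are closest. A brute-force sketch of that retrieval step (toy 2-dimensional vectors, not real OpenAI embeddings):

```python
import math

def nearest_documents(query_vec, store, k=4):
    # store maps document text -> embedding vector.
    # Rank by cosine distance (lower = closer), then keep the top k,
    # which is what a similarity search returns.
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)
    ranked = sorted(store, key=lambda doc: cosine_distance(query_vec, store[doc]))
    return ranked[:k]

store = {
    "Alice follows the White Rabbit": [0.9, 0.1],
    "The trial of the Knave of Hearts": [0.1, 0.9],
}
top = nearest_documents([1.0, 0.0], store, k=1)
```

The real DB replaces the sorted() scan with an index, but the returned documents are the nearest matches in exactly this sense.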
So far, we have learned how to run Chroma DB locally, connect to it, store documents, and query the records. Now let's do a small activity where we insert personal data into Chroma and then query it.
First, we will write a scrape function that will be used to scrape data from a website.
from langchain.document_loaders import SeleniumURLLoader
from bs4 import BeautifulSoup
def scrape(url):
    urls = [url]
    loader = SeleniumURLLoader(urls=urls)
    data = loader.load()
    if data is not None and len(data) > 0:
        soup = BeautifulSoup(data[0].page_content, "html.parser")
        text = soup.get_text()
        return text
    return ''
We will be scraping the following site, which lists 10 AI tools that every content creator must have: https://tech.hindustantimes.com/tech/news/10-ai-tools-that-all-content-creator-must-have-in-their-arsenal-71687251644210.html
Above I have pasted the screenshot of the website.
Next, we have written a function addDataToDB whose job is to break the website data into chunks and add the documents to the DB.
import chromadb
import uuid
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from chromadb.utils import embedding_functions
def addDataToDB(content):
    client = chromadb.HttpClient(host="localhost", port=8000)
    openai_ef = embedding_functions.OpenAIEmbeddingFunction(
        api_key=OPENAI_API_KEY,  # your OpenAI API key
        model_name="text-embedding-ada-002"
    )
    collection = client.get_collection(name="my_collection", embedding_function=openai_ef)
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n"], chunk_size=500, chunk_overlap=50)
    docs = text_splitter.create_documents([content])
    for doc in docs:
        uuid_val = uuid.uuid1()
        print("Inserted documents for ", uuid_val)
        collection.add(ids=[str(uuid_val)], documents=[doc.page_content])
Lastly, we have our queryDB function, which we will use to fetch records from the DB; note that it now points at the my_collection collection.
import chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
def queryDB(query):
    client = chromadb.HttpClient(host="localhost", port=8000)
    embedding_function = OpenAIEmbeddings()
    db4 = Chroma(client=client, collection_name="my_collection", embedding_function=embedding_function)
    docs = db4.similarity_search(query)
    return docs
Below I have attached a screenshot of the results that we got from the queryDB function.
The results are very close to what we were looking for: based on our query, Chroma performed a similarity search over the embeddings and returned the closest matches.
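The three pieces above fit together as scrape, then chunk-and-insert, then query. A small orchestration sketch, with the collaborating functions passed in as parameters so the flow is clear (the real scrape, addDataToDB, and queryDB from this blog would be plugged in):

```python
def build_and_query(url, query, scrape_fn, add_fn, query_fn):
    # Scrape the page, load its text into the collection, then run the query.
    text = scrape_fn(url)
    if not text:
        # Nothing scraped: skip insertion and return no results.
        return []
    add_fn(text)
    return query_fn(query)

# With the real functions and a running Chroma container:
# build_and_query(url, "best AI video tool", scrape, addDataToDB, queryDB)
```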
I hope you found this blog useful for setting up Chroma DB locally and learning the process of inserting and querying data. I regularly create similar content on Langchain, LLM, and AI topics. If you'd like to receive more articles like this, consider subscribing to my blog.
If you're in the Langchain space or LLM domain, let's connect on Linkedin! I'd love to stay connected and continue the conversation. Reach me at: linkedin.com/in/ritobrotoseth