Exploring Vector Databases: An Introduction to Chroma DB

In our previous blog post, we explored the basics of vector databases, highlighted their importance, and demonstrated a practical example using Pinecone. If you haven’t checked it out yet, you can read it here. Today, we’ll take a deeper dive into Chroma DB, an open-source vector database, and its unique features. We’ll also discuss the differences between Pinecone and Chroma, along with a practical example of building an AI-powered application with Chroma DB.

What Is Chroma DB?

Chroma DB is an open-source vector database designed for developers looking for cost-effective and customizable solutions for managing high-dimensional vectors. Its flexibility and transparency make it a popular choice for smaller teams and those wanting more control over their infrastructure.

Key Features of Chroma DB

  • Open Source: Fully open source, allowing you to inspect, modify, and deploy the code as per your requirements.
  • Easy Integration: Works seamlessly with popular embedding models and machine learning pipelines.
  • Efficient Similarity Search: Provides fast and accurate similarity search across high-dimensional vectors (see the sketch after this list).
  • Community-Driven: Backed by an active community that regularly contributes improvements and new features.
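
To make the similarity-search feature concrete, here is a minimal sketch using Chroma's native Python client. The collection name and sample documents are illustrative, not part of this tutorial:

import chromadb

# In-memory client; use chromadb.PersistentClient(path="db") to persist to disk
client = chromadb.Client()
collection = client.create_collection(name="demo")

# Chroma embeds these documents with its built-in default embedding model
collection.add(
    documents=["Vector databases store embeddings.", "Chroma DB is open source."],
    ids=["doc1", "doc2"],
)

# Query by text; Chroma returns the nearest stored documents
results = collection.query(query_texts=["What is Chroma DB?"], n_results=1)
print(results["documents"])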

Pinecone vs. Chroma: Which One to Choose?

| Feature | Pinecone | Chroma DB |
| --- | --- | --- |
| Open Source | No | Yes |
| Managed Service | Yes, fully managed cloud offering | No, requires self-hosting |
| Ease of Setup | Simple with managed hosting | Requires manual setup |
| Cost | Subscription-based | Free (self-hosted) |
| Customizability | Limited | High |

Pros of Chroma DB

  • Free and open source.
  • High customizability for niche applications.
  • No dependency on external services or subscription costs.

Cons of Chroma DB

  • Requires manual setup and maintenance.
  • Lacks the managed infrastructure convenience offered by Pinecone.

A Practical Example: Implementing a LangChain-based Document Retrieval System with ChromaDB and OpenAI

Prerequisites

Before we start, make sure to install the necessary dependencies:

!pip -q install openai langchain langchain-community langchain-openai tiktoken
!pip install -U langchain-chroma
!pip -q install chromadb
!pip show chromadb
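
The code below passes the OpenAI API key explicitly for clarity. If you prefer to keep the key out of your source code, one common alternative (a sketch, not required by this tutorial) is to load it into an environment variable at runtime:

import os
from getpass import getpass

# Prompt for the key once and expose it to the OpenAI-backed components
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")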

Step 1: Download and Extract Articles

In our case, we are using a ZIP file that contains articles in .txt format. The first step is to download this file from Dropbox and extract its contents.

import requests

url = "https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip?dl=1"
response = requests.get(url)

with open("new_articles.zip", "wb") as file:
    file.write(response.content)

print("Download complete!")

After the file is downloaded, we extract the ZIP file using the zipfile module:

import zipfile
import os

zip_file_path = "new_articles.zip"  # Path to the zip file
extract_to = "new_articles"        # Directory where files will be extracted

# Create the target directory if it doesn't exist
os.makedirs(extract_to, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

print(f"Files extracted to '{extract_to}'")
 

Step 2: Load and Split Documents

We use LangChain's DirectoryLoader (from the langchain-community package) to load the text files from the extracted folder, with the TextLoader class reading each .txt file.

from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("new_articles", glob="./*.txt", loader_cls=TextLoader)
data = loader.load()


Once the documents are loaded, we split them into smaller chunks. Smaller chunks keep each embedding focused on a single passage, which improves retrieval precision, and overlapping chunks preserve context that would otherwise be lost at chunk boundaries. We use LangChain's RecursiveCharacterTextSplitter to split the documents into chunks of at most 1,000 characters with a 200-character overlap.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
text_chunks = text_splitter.split_documents(data)
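
As an optional sanity check (not part of the original pipeline), you can inspect how many chunks were produced and preview the first one:

# Quick look at the split output
print(f"Total chunks: {len(text_chunks)}")
print(text_chunks[0].page_content[:200])  # first 200 characters of the first chunk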


Step 3: Create Document Embeddings with OpenAI

To convert our text chunks into numerical representations that can be used for retrieval, we use OpenAI embeddings. The embedding model maps each chunk to a high-dimensional vector, so that semantically similar text ends up close together in that space.

from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(openai_api_key="YOUR_OPENAI_API_KEY")
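
If you want to see what one of these vectors looks like, embed_query returns the raw embedding for a piece of text (an optional check, not needed for the pipeline):

vector = embedding.embed_query("hello world")
print(len(vector))  # embedding dimensionality, e.g. 1536 for text-embedding-ada-002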

Now we can store these embeddings in the Chroma vector database, persisting them to a local directory (db in this case) so the index can be reloaded later without re-embedding the documents.

from langchain_chroma import Chroma

persist_directory = 'db'
vectordb = Chroma.from_documents(documents=text_chunks, embedding=embedding, persist_directory=persist_directory)

# Reinitialize Chroma to load the persisted database
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
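
Before wiring the store into a retriever, you can query it directly as a quick check that the index is populated. similarity_search_with_score is part of LangChain's Chroma wrapper and returns each matching document with its distance score; the query string here is only an illustration:

docs_and_scores = vectordb.similarity_search_with_score("Amazon funding", k=2)
for doc, score in docs_and_scores:
    print(score, doc.page_content[:100])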

Step 4: Create a Retriever for Querying

With the embeddings stored, we can now create a retriever that will help us query the knowledge base for relevant documents based on user input.

retriever = vectordb.as_retriever()

# Example query to retrieve information
result = retriever.invoke("How much money did Amazon raise?")
print(result[0].page_content)
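
By default, the retriever returns the top four matches. If you need a different number of documents per query, as_retriever accepts search_kwargs; the value of 3 below is illustrative:

retriever = vectordb.as_retriever(search_kwargs={"k": 3})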

Step 5: Set Up a Question-Answering Chain

Next, we set up a question-answering chain using LangChain's RetrievalQA class, which passes the retrieved documents to an OpenAI model to answer the user's query. The "stuff" chain type simply inserts all retrieved documents into a single prompt.

from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

llm = OpenAI(openai_api_key="YOUR_OPENAI_API_KEY")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)


We also define a helper function to process the model's response and display the result along with the source documents.

def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

query = "How much money did Microsoft raise?"
llm_response = qa_chain.invoke(query)
process_llm_response(llm_response)


Step 6: Clean Up

After you’re done testing, you can delete the ChromaDB collection to free up space:

vectordb.delete_collection()

This implementation demonstrates how to build a document retrieval system using LangChain, OpenAI, and ChromaDB. By following these steps, you can create a system that loads, splits, and indexes a set of documents, then uses powerful language models to answer queries based on that data.


Conclusion

Chroma DB is a powerful, open-source alternative for developers who want to explore vector databases without incurring additional costs. Its flexibility and customizability make it an excellent choice for applications where control and cost-efficiency are priorities. Whether you choose Pinecone for its managed infrastructure or Chroma DB for its open-source nature, the possibilities of vector databases are endless.


"Innovation happens when technology and creativity meet the freedom to explore."

 

Previous Post Next Post