Vector databases have emerged as a powerful tool for similarity search and recommendation systems. They allow you to store and search high-dimensional vectors efficiently, enabling applications like image search, document similarity, and more. In this article, we will explore how to build a vector database using Django and pgvector, a PostgreSQL extension for vector similarity search.

Why pgvector?

When it comes to choosing a vector store, there are several options available, such as the dedicated databases Pinecone, Weaviate, and Milvus, or a library like FAISS. However, if you are already using PostgreSQL as your primary database, pgvector offers a seamless integration experience. With pgvector, you can store vectors directly in your PostgreSQL database and perform similarity searches using SQL queries.

pgvector supports both exact and approximate nearest neighbor search and is easy to set up and use with Django. It supports several distance metrics, including L2 distance, inner product, and cosine distance, which is commonly used for measuring similarity between embeddings.

Setting up pgvector in Django

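To get started with pgvector in your Django project, you need to install the pgvector Python package. This package only provides the client-side integration; the pgvector extension itself must also be available on your PostgreSQL server.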


# Install the package into your virtual environment
pip install pgvector

Next, you need to enable the pgvector extension in your PostgreSQL database. You can do this by creating a new migration file and adding the VectorExtension operation:


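from django.db import migrations
from pgvector.django import VectorExtension


class Migration(migrations.Migration):
    # Keep the dependencies list that makemigrations generated for your app.
    operations = [
        # Runs CREATE EXTENSION IF NOT EXISTS vector
        VectorExtension(),
    ]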

To create an empty migration file, you can use the following Django management command:


# Create an empty migration that you can edit by hand
python manage.py makemigrations [app_name] --name [migration_name] --empty

Replace [app_name] with the name of your Django app and [migration_name] with a descriptive name for the migration.
After creating the migration file, run the migrations to apply the changes to your database:


# Apply the migrations
python manage.py migrate

Defining a VectorField in your model

To store vectors in your Django model, you need to define a VectorField. Here’s an example model that includes a vector field for storing embeddings:


from django.db import models
from pgvector.django import VectorField

class File(models.Model):
    # ... other fields ...
    embedding_clip_vit_l_14 = VectorField(
        dimensions=768,
        help_text="Vector embeddings (clip-vit-large-patch14) of the file content",
        null=True,
        blank=True,
    )

In this example, the File model represents a file object with various fields, including the embedding_clip_vit_l_14 field, which is a VectorField. The dimensions parameter specifies the dimensionality of the vectors you want to store. In this case, it is set to 768, which corresponds to the dimensionality of the CLIP ViT-L/14 embeddings.
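Because the field has a fixed number of dimensions, it can help to check an embedding before assigning it to the model, so that a mismatch is reported in your own code instead of as a database error. The helper below is a hypothetical sketch in plain Python; validate_embedding and EMBEDDING_DIMENSIONS are names invented for this example and are not part of pgvector.

```python
import math
from typing import Iterable, List

EMBEDDING_DIMENSIONS = 768  # assumed to match VectorField(dimensions=768)


def validate_embedding(
    values: Iterable[float], dimensions: int = EMBEDDING_DIMENSIONS
) -> List[float]:
    """Return the embedding as a list of floats, or raise ValueError."""
    vector = [float(v) for v in values]
    if len(vector) != dimensions:
        raise ValueError(f"expected {dimensions} dimensions, got {len(vector)}")
    if not all(math.isfinite(v) for v in vector):
        raise ValueError("embedding contains NaN or infinite values")
    return vector


# Hypothetical usage inside the Django project:
#   file.embedding_clip_vit_l_14 = validate_embedding(model_output)
#   file.save()
checked = validate_embedding([0.5] * 768)
```

The helper accepts any iterable of numbers, so the output of an embedding model can usually be passed in directly or after converting it to a list.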

Creating an index for efficient similarity search

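Without an index, pgvector performs an exact search by scanning every row. To make similarity searches on the vector field efficient on larger tables, you can create an index, which trades a small amount of accuracy for speed.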

Similarity Search Indexes

Indexes help to improve the performance of similarity search.

Figure: an example of an IVF-Flat index, to illustrate the concept.
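To make the IVF-Flat idea concrete, here is a toy sketch in plain Python. It is not how pgvector is implemented: the centroids are passed in by hand (a real index would derive them with k-means), and the function names are made up for illustration. The idea it shows is that vectors are grouped into lists around centroids, and a query only scans the lists whose centroids are closest to it. In pgvector the corresponding knobs are the lists index parameter and the ivfflat.probes setting.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroids(query: Vector, centroids: List[Vector], count: int) -> List[int]:
    """Return the indices of the `count` centroids closest to `query`."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    return order[:count]


def build_lists(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Assign every vector (by position) to the list of its nearest centroid."""
    lists: Dict[int, List[int]] = {i: [] for i in range(len(centroids))}
    for position, vector in enumerate(vectors):
        lists[nearest_centroids(vector, centroids, 1)[0]].append(position)
    return lists


def search(
    query: Vector,
    vectors: List[Vector],
    centroids: List[Vector],
    lists: Dict[int, List[int]],
    probes: int = 1,
    limit: int = 3,
) -> List[Tuple[float, int]]:
    """Scan only the `probes` nearest lists; return (distance, position) pairs."""
    candidates = [
        position
        for list_id in nearest_centroids(query, centroids, probes)
        for position in lists[list_id]
    ]
    scored = sorted((math.dist(query, vectors[p]), p) for p in candidates)
    return scored[:limit]


# Toy data: two obvious clusters, with hand-picked centroids (an assumption).
vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.3, 0.1)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
lists = build_lists(vectors, centroids)
results = search((0.2, 0.1), vectors, centroids, lists, probes=1, limit=2)
```

With probes=1 the query never looks at the second cluster, which is where the speed-up comes from. It is also why the result is approximate: a true neighbor sitting in an unprobed list would be missed.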

pgvector's Django integration provides the HnswIndex and IvfflatIndex classes for creating these indexes. The pgvector documentation describes the different index types in more detail. We will use HnswIndex in this tutorial. You can define the index in the Meta class of your model:


from django.db import models
from pgvector.django import HnswIndex

class File(models.Model):
    # ... other fields ...
    
    class Meta:
        indexes = [
            HnswIndex(
                name="clip_l14_vectors_index",
                fields=["embedding_clip_vit_l_14"],
                m=16,
                ef_construction=64,
                opclasses=["vector_cosine_ops"],
            )
        ]

The HnswIndex is a special type of index that uses the Hierarchical Navigable Small World (HNSW) algorithm for efficient nearest neighbor search. The name parameter specifies the name of the index, and the fields parameter indicates the vector field to be indexed.

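The m and ef_construction parameters control the index’s accuracy, build time, and memory usage. The m parameter sets the maximum number of connections per graph layer (16 by default in pgvector), and ef_construction sets the size of the candidate list used while building the graph (64 by default). Higher values generally improve search accuracy but require more memory and slow down index builds. You can adjust these values based on your specific requirements.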

The opclasses parameter specifies the operator class, which determines the distance metric the index supports. In this example, it is set to ["vector_cosine_ops"], which uses cosine distance. Queries need to use the same metric for the index to be used.
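As a reminder of what cosine distance measures, here is a small plain-Python sketch of the formula, 1 minus the cosine similarity. It is only an illustration and not pgvector's code; raising an error for zero vectors is a choice made for this sketch.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Sketch-only behaviour: the angle to a zero vector is undefined.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # 2
```

The distance depends only on the angle between the vectors, not on their length, so values range from 0 (same direction) to 2 (opposite direction).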

Performing similarity search

With the vector field and index set up, you can now perform similarity searches using Django’s ORM. pgvector provides the CosineDistance function for this purpose. Here’s an example of how to search for similar files based on a given vector embedding:


from pgvector.django import CosineDistance

def search_image_embedding(self, embedding):
    user_files = File.objects.filter(connector__user=self.user)
    
    files_with_distance = user_files.annotate(
        distance=CosineDistance("embedding_clip_vit_l_14", embedding)
    ).order_by("distance")[:12]
    
    # Process and return the search results
    # ...

In this example, the search_image_embedding method takes an embedding vector as input. It retrieves all the files associated with the user using the File.objects.filter() query.

Then, it uses the annotate() method to calculate the cosine distance between the input embedding and the embedding_clip_vit_l_14 field of each file. The CosineDistance function computes the distance, and the results are ordered by the distance in ascending order. The [:12] slicing retrieves the top 12 most similar files.

You can further process the search results, retrieve additional file details, and return them as needed.
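As one possible way to process the results, the sketch below turns rows of (file id, distance) into similarity scores and drops matches that are too far away. The function name, the max_distance threshold of 0.5, and the output shape are all assumptions made for this example; pick values that suit your data.

```python
from typing import Dict, Iterable, List, Tuple, Union


def to_search_results(
    rows: Iterable[Tuple[int, float]], max_distance: float = 0.5
) -> List[Dict[str, Union[int, float]]]:
    """Turn (file_id, cosine_distance) rows into similarity-scored results.

    Rows are expected in ascending distance order, as the query returns them.
    """
    results: List[Dict[str, Union[int, float]]] = []
    for file_id, distance in rows:
        if distance > max_distance:
            break  # rows are sorted, so every later row is even further away
        results.append({"id": file_id, "similarity": round(1.0 - distance, 4)})
    return results


# In the search method, the rows could come from the annotated queryset, e.g.:
#   rows = ((f.id, f.distance) for f in files_with_distance)
rows = [(7, 0.12), (3, 0.31), (9, 0.74)]
results = to_search_results(rows)
```

Reporting 1 minus the distance gives a score where higher means more similar, which is often easier to present to users than a raw distance.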

Handling a custom PostgreSQL image in CI/CD

If you are using a custom PostgreSQL image with pgvector in your CI/CD pipeline, you may need to update your configuration to use the custom image. For example, in GitLab CI/CD, you can specify the custom image using the services section:


services:
  - name: ankane/pgvector:latest
    alias: postgres

In this example, the ankane/pgvector:latest image is used as the PostgreSQL service, and it is aliased as postgres so that other parts of the CI/CD environment can access it.
If you have built your own custom PostgreSQL image with pgvector, you can replace ankane/pgvector:latest with the path to your custom image.
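On the Django side, the database host in the CI environment then has to match the service alias. The sketch below shows one way to build the DATABASES setting from environment variables. POSTGRES_DB, POSTGRES_USER and POSTGRES_PASSWORD follow the variable names used by the standard PostgreSQL Docker image; POSTGRES_HOST, POSTGRES_PORT and the build_databases helper are assumptions for this example.

```python
import os
from typing import Dict, Mapping


def build_databases(env: Mapping[str, str] = os.environ) -> Dict[str, Dict[str, str]]:
    """Build Django's DATABASES setting from environment variables."""
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("POSTGRES_DB", "postgres"),
            "USER": env.get("POSTGRES_USER", "postgres"),
            "PASSWORD": env.get("POSTGRES_PASSWORD", ""),
            # "postgres" is the service alias from the CI configuration
            "HOST": env.get("POSTGRES_HOST", "postgres"),
            "PORT": env.get("POSTGRES_PORT", "5432"),
        }
    }


# In settings.py you would typically write: DATABASES = build_databases()
# Here a plain dict stands in for the CI environment.
DATABASES = build_databases({"POSTGRES_PASSWORD": "ci-only-password"})
```

Defaulting the host to the alias keeps the CI configuration small, while local development can override POSTGRES_HOST, for example with localhost.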

Conclusion

Building a vector database with Django and pgvector provides a powerful and efficient solution for similarity search and recommendation systems. By leveraging the pgvector extension, you can seamlessly integrate vector storage and search capabilities into your existing PostgreSQL database.

With the steps outlined in this article, you can set up pgvector in your Django project, define vector fields in your models, create indexes for efficient search, and perform similarity searches using the Django ORM.

Remember to adjust the index parameters based on your specific requirements and test the performance of your vector database with different configurations to find the optimal balance between search accuracy and resource usage.

By combining the power of Django and pgvector, you can build scalable and high-performance applications that leverage the capabilities of vector similarity search.

Last Update: 28/05/2024