I recently came across an intriguing new PostgreSQL extension called pgvectorscale that claims impressive performance compared to specialized vector databases like Pinecone (source).

As a fan and frequent user of the pgvector extension for similarity search in my AI applications, I was immediately curious.

While I don’t currently use Pinecone in production, making the performance comparison less directly relevant, I was still keen to investigate whether pgvectorscale could meaningfully improve upon the already strong capabilities of pgvector. After all, if pgvectorscale delivers superior speed or accuracy, it would be well worth adapting my applications to leverage its advantages wherever feasible.

So, I spent a few hours conducting a head-to-head experiment: pgvector vs. pgvectorscale on a dataset of 5 million 768-dimensional vectors. An initial attempt to build an index over 15 million vectors failed on my server; it seems that beyond a certain threshold, constructing an index on CPU-only hardware becomes impractical. This highlights some limitations of these methods, but that’s a topic for another article.

In this article, I’ll share the results of that benchmark along with the methodology and motivations behind it.

But before we dive into those details, it’s worth taking a step back to consider the overarching question: why use PostgreSQL for AI applications at all?

It’s a decision I thoughtfully made for myself some months ago, won over by pgvector’s impressive performance, seamless integrability into my core applications, and lack of reliance on external APIs. Cost-effectiveness is always a priority in my work, and the potentially steep expenses associated with many cloud platforms raise concerns that PostgreSQL handily alleviates.

The rich ecosystem developing around PostgreSQL for AI applications is another significant draw. Helpful integrations like pgvector’s Django support further streamline AI implementation in web applications and APIs. The team behind pgvectorscale appears to share a similar vision for PostgreSQL’s role as a powerhouse database for AI workloads, as evidenced by their recent talk at the AI Engineer World’s Fair 2024 (the video of their presentation starts at 2:06:08).

In short, PostgreSQL has earned my conviction as an excellent foundation for AI applications, and I’m clearly not alone in that view. Stay tuned as we explore whether the pgvectorscale extension marks another leap forward on that already promising trajectory. Before analyzing its performance implications, we’ll first take a closer look at what exactly pgvectorscale brings to the table and how it builds upon the strong foundation laid by pgvector.

Why use PostgreSQL for AI tasks at all?

PostgreSQL has emerged as a compelling choice for building AI applications due to its unique combination of reliability, flexibility, and performance. As an open-source, general-purpose database with a rich ecosystem of extensions, PostgreSQL provides a solid foundation for managing and processing data in AI workloads.

PostgreSQL for everything

Source: https://www.timescale.com/blog/postgres-the-birdhorse-of-databases/

One of the key reasons developers are turning to PostgreSQL for AI is its support for vector data types and similarity search through extensions like pgvector. This allows for efficient storage and retrieval of high-dimensional vector embeddings, which are central to many AI use cases like semantic search, recommendation systems, and more. The ability to perform these crucial operations within the database itself simplifies application architectures and reduces reliance on external specialized tools.

Moreover, PostgreSQL boasts an impressive array of features that make it well-suited for the demands of AI applications:

  1. Mature and reliable: With a development history spanning over 30 years, PostgreSQL is battle-tested and trusted for production deployments across industries.
  2. Highly scalable: PostgreSQL can efficiently handle terabytes of data and scale horizontally using techniques like sharding and replication.
  3. ACID compliant: Full support for ACID (Atomicity, Consistency, Isolation, Durability) properties ensures data integrity and transactional reliability.
  4. Extensible: A well-defined extension framework allows developers to enhance PostgreSQL’s capabilities to suit their specific needs, as exemplified by AI-focused extensions like pgvector, pgvectorscale, and pgAI.
  5. Vibrant community: PostgreSQL benefits from a large, active open-source community continuously driving innovation and improvements.

With extensions like pgvector and pgvectorscale pushing the boundaries of what’s possible, PostgreSQL is well-positioned to become the go-to choice for developers building the next generation of AI applications.

In my experience, the decision to use PostgreSQL for AI has been reaffirmed time and again. The ease of integration with web frameworks like Django, the cost-effectiveness compared to cloud-based alternatives, and the growing ecosystem of AI-focused tools built atop PostgreSQL make it an increasingly appealing platform.

Just Use Postgres

Source: https://www.amazingcto.com/postgres-for-everything/

As we explore the capabilities of cutting-edge extensions like pgvectorscale, it’s clear that PostgreSQL is not just a viable option but a uniquely powerful one for AI workloads. Its combination of stability, scalability, and extensibility is unmatched, empowering developers to build production-ready AI applications with confidence.

In the following sections, we’ll dive deeper into pgvectorscale’s enhancements and benchmark its performance to see just how far PostgreSQL can take us on the path to smarter, more sophisticated AI solutions.

What is pgvectorscale?

Pgvectorscale is a new extension built on top of pgvector, the popular open-source vector data extension for PostgreSQL. Developed by Timescale, pgvectorscale aims to enhance pgvector’s capabilities by delivering higher performance embedding search and cost-efficient storage solutions specifically tailored for AI applications.

While pgvector is written in C, pgvectorscale takes a different approach, leveraging the power and safety of the Rust programming language and the PGRX framework.

The key innovations introduced by pgvectorscale are:

  1. StreamingDiskANN: A novel index type inspired by Microsoft’s DiskANN algorithm, which enables efficient approximate nearest neighbor (ANN) search on disk-resident data. This approach allows for cost-effective scaling of vector search to massive datasets that exceed available RAM.
  2. Statistical Binary Quantization (SBQ): A compression technique developed by Timescale researchers (info) that improves upon standard binary quantization methods. SBQ achieves better accuracy-performance trade-offs, enabling more efficient storage and faster search.

It’s important to note that pgvectorscale is still in its early stages, and while the initial results are promising, it may require further testing and optimization before being considered production-ready for all use cases.

Comparison with Pinecone

To showcase the potential of pgvectorscale, Timescale conducted a benchmark comparing PostgreSQL (with pgvector and pgvectorscale) against Pinecone, a popular vector database.

The benchmark utilized a dataset of 50 million Cohere embeddings, each with 768 dimensions.

The results were impressive. PostgreSQL with pgvector and pgvectorscale achieved a remarkable 28x lower p95 latency and 16x higher query throughput compared to Pinecone’s storage-optimized (s1) index for approximate nearest neighbor queries at 99% recall.

Moreover, the combination of pgvector and pgvectorscale provided this performance at 75% lower cost when self-hosted on AWS EC2.

PostgreSQL with pgvector and pgvectorscale extensions outperformed Pinecone’s s1 pod-based index type, offering 28x lower p95 latency.

Source: https://www.timescale.com/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/

In addition to outperforming Pinecone’s storage-optimized index, pgvector and pgvectorscale also demonstrated superior performance compared to Pinecone’s performance-optimized (p2) index.

With pgvectorscale, PostgreSQL achieved 1.4x lower p95 latency and 1.5x higher query throughput for ANN queries at 90% recall, while maintaining a 79% cost advantage.

PostgreSQL with pgvector and pgvectorscale extensions outperformed Pinecone’s p2 pod-based index type, offering 1.4x lower p95 latency.

Source: https://www.timescale.com/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/

These benchmarks highlight the potential for pgvectorscale to deliver excellent vector search performance and cost-efficiency, making it an attractive option for AI applications that require fast, scalable, and economical embedding search.

How to use pgvectorscale in AI applications

Integrating pgvectorscale into your AI applications is straightforward, especially if you’re already familiar with PostgreSQL and pgvector.

Here’s a brief overview of the standard process once the extension is installed (more on my benchmark setup in the next section).

1. Create a table with an embedding column:


CREATE TABLE IF NOT EXISTS document_embedding (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(1536)
);

2. Populate the table with your data, including the vector embeddings. You can use the same clients and methods as pgvector for this step.
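For illustration, a minimal (hypothetical) insert might look like the sketch below; pgvector expects the embedding as a bracketed string literal containing exactly as many values as the column's declared dimension. The metadata and contents values here are made up.

```sql
-- Hypothetical row; the literal is truncated for readability, and a real
-- VECTOR(1536) column requires all 1536 comma-separated values.
INSERT INTO document_embedding (metadata, contents, embedding)
VALUES (
    '{"source": "doc-1", "chunk": 0}',
    'An example chunk of document text.',
    '[0.12, -0.03, 0.58, ...]'
);
```

Client libraries that support pgvector (for Python, Django, and others) handle this conversion for you, so you typically bind a plain array of floats rather than building the literal by hand.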

3. Create a StreamingDiskANN index on the embedding column:


CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);

4. Query the table to find the closest embeddings using the index:


SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;

Note that pgvectorscale currently supports cosine distance (<=>) queries. If you require additional distance types, you can open an issue on the pgvectorscale GitHub repository, as the maintainers suggest.

Pgvectorscale also offers tuning options to optimize performance for your specific workload. You can adjust index build-time parameters, such as num_neighbors and max_alpha, to fine-tune the graph construction process.

Additionally, query-time parameters like diskann.query_rescore allow you to control the accuracy-performance trade-off on a per-query basis. This leaves plenty of room for fine-tuning.

For example, to set the num_neighbors parameter when creating the index:


CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding) WITH(num_neighbors=50);

And to adjust the diskann.query_rescore parameter for a specific query:


SET diskann.query_rescore = 400;
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;

In the next section, I’ll walk through my complete installation and usage experience for this experiment.

My benchmarks

Server configuration

For this experiment, I used a server with the following specifications:

  • CPU: 16 virtual cores from AMD EPYC-Milan Processor
  • RAM: 64GB
  • Storage: 400GB NVMe

This configuration provides a good environment for testing the performance and scalability of pgvector and pgvectorscale. It was a clean installation: Ubuntu 22.04, pgvector 0.7.2, pgvectorscale 0.2.0.

server configuration

htop on my test server

Dataset

I used a subset of the LAION-5B dataset, consisting of 5 million vectors, each with 768 dimensions. The dataset also includes metadata such as image paths, captions, NSFW flags, similarity scores, licenses, and more.

Although LAION-5B is an extremely large dataset, I focused on the first 5 million vectors for this benchmark. In a future experiment I may test with the first 100 million embeddings, but that will take some time (index construction is a heavy operation).

Installation troubles

First, I encountered the following error while following the instructions from the pgvectorscale repository.

Compiling pgrx v0.11.4
Compiling vectorscale v0.2.0 (/tmp/pgvectorscale/pgvectorscale)
error: On x86, the AVX2 feature must be enabled. Set RUSTFLAGS="-C target-feature=+avx2,+fma"
--> src/access_method/distance.rs:4:1

On x86, the AVX2 feature must be enabled for pgvectorscale

On x86, the AVX2 feature must be enabled for installing pgvectorscale

The fix was obvious: the error message indicates that the Rust compiler cannot build the pgvectorscale extension because it requires specific CPU features (AVX2 and FMA) to be enabled. These are advanced vector instruction sets that can significantly speed up certain computations, especially those involved in vector operations.


export RUSTFLAGS="-C target-feature=+avx2,+fma"

It is worth checking that your CPU supports these instructions:


lscpu | grep -E 'avx2|fma'

Then I hit another error:

error: linking with `cc` failed: exit status: 1

which I solved by installing libopenblas-dev:


apt-get install libopenblas-dev

In short, depending on your OS and initial configuration, you may run into a few errors.

But Rust’s error messages are clear enough that it’s easy to understand what’s wrong. In my case: a missing OpenBLAS library and missing compiler flags.

pgvector installation

pgvectorscale successfully installed

PostgreSQL Tuning with timescaledb-tune

To optimize PostgreSQL’s performance, I used the timescaledb-tune tool provided by Timescale. This tool analyzes the server’s hardware and recommends settings for various PostgreSQL parameters. It is not required for such an experiment, but it semi-automatically adapts some of the parameters in postgresql.conf. To be honest, it was the first time I used it, and I’m happy that it saved me some time in finding good values for my Postgres 16 installation.

Timescaledb tune

Timescaledb tune example

After a fresh installation, the tool made the following changes to the PostgreSQL configuration:

  1. Increased shared_buffers from 128MB to 15696MB
  2. Set effective_cache_size to 47089MB
  3. Increased maintenance_work_mem from 64MB to 2047MB (I will change that later on manually)
  4. Increased work_mem from 4MB to 10045kB
  5. Set timescaledb.max_background_workers to 16
  6. Increased max_worker_processes from 8 to 35
  7. Increased max_parallel_workers_per_gather from 2 to 8
  8. Increased max_parallel_workers from 8 to 16
  9. Set wal_buffers to 16MB
  10. Increased min_wal_size from 80MB to 512MB
  11. Set random_page_cost to 1.1
  12. Increased max_locks_per_transaction from 64 to 1024
  13. Increased autovacuum_max_workers from 3 to 10
  14. Increased autovacuum_naptime from 1min to 10s
  15. Set effective_io_concurrency to 256

Additionally, I manually set max_parallel_maintenance_workers to 12 and commented out the shared_preload_libraries line. To accommodate the increased memory requirements for HNSW graph construction (the pgvector index type I chose), I set maintenance_work_mem to 12GB.
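The manual adjustments described above can be applied with ALTER SYSTEM; a sketch mirroring the values I used (a configuration reload, or a restart for some settings, is needed for them to take effect):

```sql
-- Manual overrides on top of timescaledb-tune's recommendations.
ALTER SYSTEM SET maintenance_work_mem = '12GB';          -- room for HNSW graph construction
ALTER SYSTEM SET max_parallel_maintenance_workers = 12;  -- more workers for parallel index builds
SELECT pg_reload_conf();                                 -- apply reloadable settings
```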

Data ingestion

I created two tables, pgvector_benchmark and pgvectorscale_benchmark, with the following schema:


CREATE TABLE pgvector_benchmark (
    id SERIAL PRIMARY KEY,
    vector vector(768),
    image_path TEXT,
    caption TEXT,
    NSFW TEXT,
    similarity DOUBLE PRECISION,
    LICENSE TEXT,
    url TEXT,
    key TEXT,
    status TEXT,
    error_message TEXT,
    width BIGINT,
    height BIGINT, 
    original_width BIGINT,
    original_height BIGINT,
    exif TEXT,
    md5 TEXT
);

I could have used JSONB for some of the text data, but this part of the LAION-5B input is messy, and I didn’t want to spend extra time debugging JSON import issues. The TEXT vs. JSONB choice is not relevant to this example or to the queries used later.

I used the COPY command to insert the data, one of the most performant ways to ingest this amount of information.


-- Ingest the data from a CSV file I prepared
COPY pgvector_benchmark(id, vector, image_path, caption, NSFW, similarity, LICENSE, url, key, status, error_message, width, height, original_width, original_height, exif, md5) FROM '/home/processed_data.csv' WITH (FORMAT CSV, DELIMITER ',');

Index creation

For pgvector, I created an HNSW index with ef_construction set to 64. This is the size of the dynamic candidate list used when constructing the graph. It is basically the default value, but I specified it explicitly:

CREATE INDEX ON pgvector_benchmark USING hnsw (vector vector_cosine_ops) WITH (ef_construction = 64);

For pgvectorscale, I created a diskann index with search_list_size set to 64 and num_neighbors set to 40:


CREATE INDEX vector_cosine_ops ON pgvectorscale_benchmark USING diskann (vector) WITH(search_list_size=64, num_neighbors=40);

Higher values of search_list_size improve graph quality at the cost of slower index builds. num_neighbors sets the maximum number of neighbors per node; higher values increase accuracy but make graph traversal slower.

My idea was to create similar conditions for the two indexes, so I could compare apples with apples.

Pgvectorscale index construction

Pgvectorscale index construction start

I monitored the index creation progress using the pg_stat_progress_create_index view.

Checking current index status with pg_stat_progress_create_index

Checking current index creation status with pg_stat_progress_create_index
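The check shown above boils down to a query along these lines (a sketch using the standard columns of the pg_stat_progress_create_index view):

```sql
-- Rough progress of a running CREATE INDEX, as a percentage of blocks scanned.
SELECT phase,
       blocks_done,
       blocks_total,
       round(100.0 * blocks_done / NULLIF(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;
```

Note that for graph-based indexes the block-scan percentage is only an approximation of overall progress, since the build also spends time in graph-construction phases.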

Queries and results

I ran several queries to compare the performance and result quality of pgvector and pgvectorscale. It was possible to create some additional indexes (on weight and height columns, for example). However, since I experimented under the same conditions for pgvector and pgvectorscale, this was not crucial for the goal of this test. That could be done quickly in future experiments where superior performance is requested.

I picked 3 vectors as my test set and ran each query twice. For each query, I searched for the top 10 nearest neighbors.
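To capture per-query latencies, one option is to wrap each benchmark query in EXPLAIN ANALYZE; a sketch for a simple ANN query, with '[input_vector_here]' standing in for one of the test vectors:

```sql
-- Reports planning and execution time, plus buffer usage, for one ANN query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, image_path, caption
FROM pgvector_benchmark
ORDER BY vector <=> '[input_vector_here]'
LIMIT 10;
```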

Query 1: Simple nearest neighbor search


SELECT id, image_path, caption
FROM pgvector_benchmark
ORDER BY vector <=> '[input_vector_here]'
LIMIT 10;

Pgvector latencies (ms): 48, 39, 121, 39, 147, 40

Pgvectorscale latencies (ms): 41, 41, 48, 42, 42, 45

pgvectorscale performance

pgvectorscale performance on simple query

Overall: better speed with similar results.

Query 2: Nearest neighbor search with text filtering


SELECT id, image_path, caption
FROM pgvector_benchmark
WHERE caption ILIKE '%room%'
ORDER BY vector <=> '[input_vector_here]'
LIMIT 10;

Pgvector latencies (ms): 45, 49, 38, 35, 38, 47 (no results, needs bigger hnsw.ef_search)

Pgvectorscale latencies (ms): 154, 180, 55, 50, 90, 91 (great results)

Pgvector results

Pgvector results

Pgvectorscale results

Pgvectorscale results

Overall: slightly worse performance (still under 200 ms), but a significantly better result list. This pattern repeats through the rest of the experiment.

Query 3: Nearest neighbor search with numeric filtering


SELECT id, image_path, caption, width
FROM pgvector_benchmark
WHERE width > 300  
ORDER BY vector <=> '[input_vector_here]'
LIMIT 10;

Pgvector latencies (ms): 43, 37, 38, 45, 45, 45 (no results, needs bigger hnsw.ef_search)

Pgvectorscale latencies (ms): 157, 163, 144, 157, 42, 51 (great results)

Query 4: Nearest neighbor search with multiple filters


SELECT id, image_path, caption, width, height
FROM pgvector_benchmark  
WHERE caption ILIKE '%room%' AND width > 300 AND height > 300
ORDER BY vector <=> '[input_vector_here]'
LIMIT 10;

Pgvector latencies (ms): 37, 47, 45, 46, 38, 38 (no results, needs bigger hnsw.ef_search)

Pgvectorscale latencies (ms): 162, 158, 189, 280, 52, 52 (great results)

Pgvector results with complex filters

Pgvector results with complex filters

Pgvectorscale results with complex filters

Pgvectorscale results with complex filters

To compare the result quality, I used the following query to calculate the percentage of matching IDs in the top 5 results:

WITH pgvector_results AS (
    SELECT id 
    FROM pgvector_benchmark
    ORDER BY vector <=> '[input_vector_here]'
    LIMIT 5
),
pgvectorscale_results AS (
    SELECT id
    FROM pgvectorscale_benchmark
    ORDER BY vector <=> '[input_vector_here]'  
    LIMIT 5
),
comparison AS (
    SELECT 
        CASE
            WHEN pv.id IS NOT NULL AND pvs.id IS NOT NULL THEN 1
            ELSE 0
        END AS match
    FROM pgvector_results pv
    FULL OUTER JOIN pgvectorscale_results pvs ON pv.id = pvs.id  
)
SELECT
    (SUM(match)::float / COUNT(*)) * 100 AS match_percentage
FROM comparison;

The quality checks showed that the top 5 results were similar for simple queries, confirming good result quality for simple ANN search. In more complex scenarios (with filters), a 100% match score will most likely not be achieved due to how the two types of indexes work internally.

Run the same query in pgvector and pgvectorscale

Comparing results (top 5) from pgvector and pgvectorscale in a simple ANN

Observations and results

One of the key observations from my benchmarking experiment was the difference in result quality between pgvector and pgvectorscale, particularly in queries involving filtering.

In some cases, pgvector returned no results at all, while pgvectorscale returned accurate results, albeit with slightly higher latency (which is expected: a query that returns 0 rows does less work). This can be attributed to the different approaches these extensions take when handling filtering.

In pgvector I used the HNSW (Hierarchical Navigable Small World) index, which suffers from a limitation known as the “ef_search” cutoff. As illustrated in the first image below, the HNSW index first retrieves a fixed number of records determined by the hnsw.ef_search parameter before applying secondary filters. If not enough items in this initial set match the filters, the query may return incomplete or even empty results.

The problem with two-stage post-filtering
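One mitigation on the pgvector side (not applied in this benchmark, to keep conditions comparable) is to enlarge the candidate set at query time, at the cost of extra latency. A sketch, again with '[input_vector_here]' standing in for a test vector:

```sql
-- Raise hnsw.ef_search (default 40) so more candidates survive post-filtering.
SET hnsw.ef_search = 400;

SELECT id, image_path, caption
FROM pgvector_benchmark
WHERE caption ILIKE '%room%'
ORDER BY vector <=> '[input_vector_here]'
LIMIT 10;
```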

In contrast, pgvectorscale’s StreamingDiskANN index employs a streaming model that allows for continuous retrieval of the next closest item without any arbitrary cutoffs. The PostgreSQL execution system keeps requesting the next closest item until it has gathered enough results that satisfy the filters. This approach ensures accurate results without loss in recall, even when dealing with complex filtering conditions.

Streaming filtering with pgvectorscale

In contrast, the StreamingDiskANN index has no “ef_search” type cutoff. Instead, as shown in the image, it uses a streaming model that allows the index to continuously retrieve the “next closest” item for a given query, potentially even traversing the entire graph. Source: https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/

In my benchmarking setup, I tried to create a fair comparison between pgvector and pgvectorscale by setting similar parameters for their respective indexes. For pgvector, I created an HNSW index with ef_construction set to 64, which represents the size of the dynamic candidate list for graph construction. For pgvectorscale, I created a diskann index with search_list_size set to 64 and num_neighbors set to 40.

The search_list_size parameter in pgvectorscale is analogous to ef_construction in pgvector, as it influences the graph quality during index construction. Higher values improve quality but slow down the build process. The num_neighbors parameter determines the maximum number of neighbors per node, with higher values increasing accuracy at the cost of slower graph traversal.

By setting these parameters to similar values, I aimed to create comparable conditions for the two indexes. This allowed me to evaluate their performance and result quality on a more equal footing.

It’s important to note that while pgvectorscale’s diskann index took slightly longer to construct than pgvector’s HNSW index, it consumed significantly fewer CPU resources. This suggests there may be room for further optimization in the index build configuration, which could improve both the build time and the overall performance of pgvectorscale. Also, and perhaps most importantly, it should make it possible to index more data on a server with limited resources. My next experiment should be 100 million 768-dimensional vectors.

As a conclusion of my observations, pgvectorscale’s ability to handle metadata filtering without sacrificing result quality is an advantage over pgvector. The StreamingDiskANN index’s design allows for accurate and complete results, even in the presence of complex filtering conditions. This, combined with the promising out-of-the-box performance and the potential for further tuning, makes pgvectorscale an attractive choice for developers working with vector similarity search in PostgreSQL.

Further work

After conducting this benchmark and seeing the promising results of pgvectorscale, I’m excited to explore more aspects of this extension and its ecosystem in the future.

One area I’m particularly interested in is the pgAI extension, also developed by Timescale.

PgAI aims to bring more AI workflows to PostgreSQL, such as embedding creation and model completion. I’m curious to see how it integrates with pgvectorscale and whether it can further streamline the development of AI applications using PostgreSQL.

Another task on my list is to investigate the feasibility and value of writing a Django extension for pgvectorscale. As a frequent user of Django for web applications, having a seamless integration between pgvectorscale and Django could greatly enhance my productivity and the performance of my AI-powered applications. I plan to assess the effort required to develop such an extension and weigh it against the potential benefits.

To gain a broader perspective on the vector database landscape, I also intend to compare pgvectorscale with other notable solutions like Weaviate. By understanding the strengths and weaknesses of each system, I can make more informed decisions when choosing the right tool for specific use cases. This comparison will involve factors such as performance, scalability, ease of use, and community support.

Lastly, I’m keen on further experimenting with the search parameters exposed by pgvectorscale, as stated in the previous section. The flexibility provided by these parameters allows for fine-tuning the search performance based on specific workload requirements. By carefully adjusting settings like search_list_size, num_neighbors, and other query-time parameters, I hope to find the optimal balance between speed and accuracy for various scenarios.

Through these further explorations, I aim to deepen my understanding of pgvectorscale and its surrounding ecosystem. By sharing my findings and experiences, I hope to contribute to the growing community of developers and researchers leveraging PostgreSQL for AI applications.

Conclusion

In this article, I benchmarked pgvectorscale, a new extension that enhances pgvector’s performance and scalability for vector similarity search in PostgreSQL. Using a 5 million vector dataset from LAION-5B, I compared pgvectorscale against pgvector in terms of speed and result quality with out-of-the-box settings.

The benchmarks showed that pgvectorscale delivered comparable speed to pgvector while providing better results, especially in queries with filtering. The diskann index, based on Microsoft’s DiskANN algorithm (Jayaram Subramanya, Suhas, et al. “Diskann: Fast accurate billion-point nearest neighbor search on a single node.” Advances in Neural Information Processing Systems 32 (2019).), efficiently handled the large dataset, even with limited CPU resources / memory during index creation.

Pgvectorscale’s support for different types of filtering (as a PostgreSQL extension) greatly enhances its usefulness in real-world AI applications. The query-time parameters offer flexibility to fine-tune search performance based on specific needs.

Despite being a relatively new extension, pgvectorscale’s impressive performance and result quality make it a compelling choice for building AI applications on PostgreSQL. While more comprehensive benchmarks and real-world use cases are needed to fully understand pgvectorscale’s capabilities and limitations, the initial results are highly encouraging. As the extension evolves, it has the potential to significantly contribute to the adoption of PostgreSQL as a foundation for AI applications.

pgvectorscale vs pgvector