As a developer passionate about efficient document retrieval systems, I’m excited to introduce LitePali, a lightweight wrapper I’ve created for the ColPali model.
LitePali is designed to streamline document image processing and retrieval, making it easier for developers to integrate state-of-the-art vision-language models into their projects.
Introduction to LitePali
LitePali is built on the foundation of the ColPali architecture, which leverages Vision Language Models (VLMs) for efficient document retrieval.
The primary goal of LitePali is to provide a user-friendly interface for working with document images, allowing developers to focus on retrieval tasks without getting bogged down in the complexities of PDF parsing and image processing.
Key features of LitePali
- Minimal dependencies: LitePali is designed to be lightweight, requiring fewer dependencies than similar libraries.
- Direct image processing: Unlike some other libraries (e.g. byaldi, see next section), LitePali works exclusively with images, separating PDF parsing from the retrieval process. This way, only the heavy tasks run on GPU (💸), while the rest (e.g. PDF-to-image conversion) can run on CPU.
- Deterministic file processing: This ensures consistent results across different runs.
- Batch processing: Efficient handling of multiple files for improved performance.
- Cloud-optimized: LitePali is tailored for deployment in cloud environments, such as Replicate.com and Hugging Face Inference Endpoints.
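The deterministic-ordering and batch-processing ideas above can be sketched in a few lines. This is a generic illustration, not LitePali's internal code, and the `docs/images` path is hypothetical:

```python
from pathlib import Path

def batched(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Deterministic ordering: sort paths before processing so repeated
# runs over the same directory always index files in the same order.
paths = sorted(Path("docs/images").glob("*.jpg"))
for batch in batched(paths, 4):
    pass  # e.g., encode this batch of images on the GPU
```

Sorting before batching means the batch boundaries, and therefore the index contents, are identical across runs.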
Inspiration and differentiation
While developing LitePali, I drew inspiration from the byaldi library. However, I made several key design decisions to differentiate LitePali and address specific needs:
- Focus on images: By working exclusively with images, LitePali allows for more flexible deployment options. PDF processing can be handled separately, potentially on CPU-only environments.
- Simplified dependencies: LitePali doesn’t require Poppler or other PDF-related dependencies, making it easier to set up and maintain.
- Updated engine: LitePali utilizes colpali-engine >=0.3.0 for improved performance.
- Customized functionality: I’ve tailored the library for specific document retrieval needs while building upon the solid foundation laid by byaldi.
These differences make LitePali a more streamlined and focused tool for image-based document retrieval, offering flexibility in deployment and integration with existing PDF processing pipelines.
Under the hood: ColPali architecture
At the core of LitePali is the ColPali architecture, which employs Vision Language Models for efficient document retrieval.
Let’s dive into some of the key components that make ColPali, and by extension LitePali, so powerful:
- Late interaction mechanism: This allows for efficient query matching while maintaining context, crucial for understanding complex document structures.
- Multi-vector representations: ColPali generates fine-grained representations of both text and images, enabling more accurate retrieval.
- Visual and textual understanding: By processing document images directly, ColPali can understand both the content and layout of documents.
- Efficient indexing: Compared to traditional PDF parsing methods, ColPali offers faster corpus indexing.
These features combine to create a robust system capable of handling complex document retrieval tasks with high accuracy and efficiency.
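To make the late-interaction idea concrete, here is a minimal NumPy sketch of MaxSim scoring: each query-token vector is matched against every image-patch vector, its best similarity is kept, and the per-token maxima are summed. This only illustrates the mechanism; ColPali's actual scoring runs on GPU tensors produced by the model, and the dimensions below are made up:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) scoring: for each query token vector,
    take its maximum similarity over all document patch vectors, then sum."""
    # Normalize rows so dot products are cosine similarities
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T  # shape: (num_query_tokens, num_doc_patches)
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))    # 8 query-token embeddings (illustrative)
page = rng.normal(size=(1030, 128))  # patch embeddings for one page (illustrative)
score = maxsim_score(query, page)
```

Because each query token is matched independently, a page scores well if it contains good matches for every part of the query, which is what preserves context across complex layouts.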
Getting started with LitePali
To start using LitePali in your projects, you can install it via pip:
pip install litepali
Once installed, you can import and use LitePali in your Python scripts.
Here’s a basic example of how to use LitePali:
from litepali import LitePali, ImageFile

# Initialize LitePali
litepali = LitePali()

# Add some images with metadata and page information
litepali.add(ImageFile(
    path="path/to/image1.jpg",
    document_id=1,
    page_id=1,
    metadata={"title": "Introduction", "author": "John Doe"}
))
litepali.add(ImageFile(
    path="path/to/image2.png",
    document_id=1,
    page_id=2,
    metadata={"title": "Results", "author": "John Doe"}
))

# Process the added images
litepali.process(batch=4)

# Perform a search
results = litepali.search("Your query here", k=5)

# Print results
for result in results:
    print(f"Image: {result['image'].path}, Score: {result['score']}")

# Save the index
litepali.save_index("path/to/save/index")

# Later, load the index
new_litepali = LitePali()
new_litepali.load_index("path/to/save/index")
This example demonstrates the basic workflow of adding images, processing them, performing a search, and saving/loading the index.
Real-world example: Processing research papers
To showcase the capabilities of LitePali, let’s walk through a real-world example of processing and searching through research papers. We’ll use three papers from arXiv:
- “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training”
- “Learning Transferable Visual Models From Natural Language Supervision”
- “ColPali: Efficient Document Retrieval with Vision Language Models”
If you want to skip this part and go directly to the code, here is a link to the notebook.
Here’s a step-by-step breakdown if you want to run it yourself:
Step 1: Setting up the environment
First, we’ll import the necessary libraries and define our PDF URLs:
import os
import requests
import PyPDF2
from pdf2image import convert_from_bytes
from litepali import LitePali, ImageFile

# Define PDF URLs and metadata
pdf_metadata = {
    "https://arxiv.org/pdf/2403.09611.pdf": "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training",
    "https://arxiv.org/pdf/2103.00020.pdf": "Learning Transferable Visual Models From Natural Language Supervision",
    "https://arxiv.org/pdf/2407.01449.pdf": "ColPali: Efficient Document Retrieval with Vision Language Models"
}

# Create base directory
base_dir = os.path.join(os.getcwd(), "litepali_data")
os.makedirs(base_dir, exist_ok=True)
Step 2: Downloading and processing PDFs
Next, we’ll download the PDFs and convert them to images:
def download_pdf(url, save_dir):
    response = requests.get(url)
    filename = url.split('/')[-1]
    save_path = os.path.join(save_dir, filename)
    with open(save_path, 'wb') as f:
        f.write(response.content)
    return save_path

def parse_pdf(pdf_path):
    images = []
    metadata = {}
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        metadata = reader.metadata
        num_pages = len(reader.pages)
    pdf_images = convert_from_bytes(open(pdf_path, 'rb').read())
    pdf_dir = os.path.dirname(pdf_path)
    for i, img in enumerate(pdf_images):
        img_path = os.path.join(pdf_dir, f"page_{i+1}.jpg")
        img.save(img_path, 'JPEG')
        images.append(img_path)
    return images, metadata, num_pages

# Download and process PDFs
downloaded_paths = []
for url in pdf_metadata.keys():
    url_dir = os.path.join(base_dir, url.split('/')[-1].replace('.pdf', ''))
    os.makedirs(url_dir, exist_ok=True)
    downloaded_paths.append(download_pdf(url, url_dir))

parsed_pdfs = []
for pdf_path in downloaded_paths:
    images, metadata, num_pages = parse_pdf(pdf_path)
    parsed_pdfs.append({
        'pdf_path': pdf_path,
        'images': images,
        'metadata': metadata,
        'num_pages': num_pages
    })
Step 3: Creating the LitePali index
Now we’ll create and process the LitePali index, which you can reuse later:
litepali = LitePali()

for pdf in parsed_pdfs:
    pdf_url = next(url for url in pdf_metadata.keys() if url.endswith(os.path.basename(pdf['pdf_path'])))
    custom_title = pdf_metadata[pdf_url]
    for i, img_path in enumerate(pdf['images']):
        litepali.add(ImageFile(
            path=img_path,
            document_id=os.path.basename(pdf['pdf_path']),
            page_id=i+1,
            metadata={
                'title': custom_title,
                'author': pdf['metadata'].get('/Author', ''),
                'num_pages': pdf['num_pages'],
                'url': pdf_url
            }
        ))

litepali.process()
Step 4: Saving and loading the index
To demonstrate how to save and load the index:
index_path = os.path.join(base_dir, "litepali_index")
litepali.save_index(index_path)
# Load the index (simulating a new session)
new_litepali = LitePali()
new_litepali.load_index(index_path)
Step 5: Performing searches
Finally, let’s perform some searches:
queries = [
    "What is ColPali?",
    "Explain the concept of vision language models",
    "How does MM1 compare to other multimodal models?",
    "What is CLIP and where can I use it?"
]

for query in queries:
    print(f"\nQuery: {query}")
    results = new_litepali.search(query, k=3)
    for result in results:
        print(f"Document: {result['image'].document_id}")
        print(f"Title: {result['image'].metadata['title']}")
        print(f"Page: {result['image'].page_id}")
        print(f"Score: {result['score']}")
        print(f"URL: {result['image'].metadata['url']}")
        print("---")
This example demonstrates the full workflow of downloading PDFs, converting them to images, creating a LitePali index, and performing searches.
The search results include the document ID, page number, relevance score, and the original URL of the paper.
Performance and scalability
One of the key advantages of LitePali is its ability to handle large-scale document collections efficiently.
In the example above, we processed three research papers totaling 109 pages. Here are some performance metrics from the index creation process:
litepali.index_stats()
# Output:
# {'total_images': 109,
# 'processed_images': 109,
# 'unique_documents': 3,
# 'image_extensions': ['.jpg']}
This shows that LitePali successfully processed all 109 images from the three documents. The processing was done in batches of 4 images at a time on a T4 GPU (on Lightning.ai infrastructure), which helps manage memory usage and improves overall performance.
Future plans for LitePali
As the creator of LitePali, I have several plans for future improvements and features:
- Enhanced index storage: I’m working on implementing storage of base64-encoded versions of images within the index. This will allow for quick retrieval and display of images without needing to access the original files. However, it will significantly increase the index size, so it should be optional.
- Performance optimizations: I plan to test LitePali with flash-attention, which is expected to significantly speed up processing times, especially for large batches of images. In the experiment above, indexing the 109 pages took roughly 5 minutes on a T4 GPU.
- Quantization support: Adding support for lower precision (e.g., int8, int4) will help reduce memory footprint and increase inference speed.
- API enhancements: I’m developing a more comprehensive API for advanced querying and filtering options.
- Documentation expansion: Creating more detailed documentation, including advanced usage examples and best practices, is a priority to help developers make the most of LitePali.
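LitePali does not support quantization yet, but the idea behind the planned int8 support can be sketched with plain NumPy: store embedding matrices as int8 plus a scale factor, cutting storage to a quarter of float32 at the cost of a small reconstruction error. This is an illustrative sketch, not LitePali's planned implementation:

```python
import numpy as np

def quantize_int8(vecs):
    """Symmetric per-matrix int8 quantization of float embeddings."""
    scale = np.abs(vecs).max() / 127.0
    q = np.round(vecs / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 embeddings from int8 + scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.normal(size=(1030, 128)).astype(np.float32)  # one page's embeddings (illustrative shape)
q, scale = quantize_int8(emb)
approx = dequantize(q, scale)
# int8 storage is 4x smaller than float32; rounding error is bounded by scale/2
err = np.abs(emb - approx).max()
```

A real implementation would likely quantize per vector (or per block) rather than per matrix, and int4 would halve the footprint again in exchange for more error.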
Conclusion
LitePali represents a significant step forward in making advanced document retrieval technologies accessible to a wider range of developers. By focusing on image processing and providing a clean separation between PDF parsing and retrieval tasks, LitePali offers flexibility and efficiency that can be particularly valuable in cloud environments.
As I continue to develop and improve LitePali, I welcome contributions from the community. Whether you’re interested in using LitePali in your projects or contributing to its development, you can find more information and resources at:
- GitHub repository: https://github.com/s-emanuilov/litepali
- Official website: https://litepali.com/
The full example notebook, which demonstrates the entire workflow from PDF download to search, is available at: https://github.com/s-emanuilov/litepali/blob/main/examples/rag_example.ipynb
By releasing LitePali, I hope to empower developers to build more sophisticated document retrieval systems and contribute to the advancement of this exciting field. I look forward to seeing how the community will use and extend LitePali in the future.