Ollama, the open-source project for running large language models locally, has released version 0.2.0, quickly followed by a small 0.2.1 bug-fix release. This update brings significant improvements, particularly in concurrency and model management, making it a game-changer for local LLM enthusiasts.
Concurrency: The heart of the update
The star feature of Ollama 0.2 is concurrency, which is now enabled by default. This unlocks two major capabilities: parallel requests and multiple model support.
Parallel requests
Ollama can now serve multiple requests simultaneously, using only a small amount of additional memory for each. This enhancement enables users to handle multiple chat sessions at once, host code completion LLMs for teams, process different parts of a document simultaneously, and run multiple agents concurrently.
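As a rough illustration, the sketch below fans out several prompts to a single model over the REST API. It assumes a local server on the default port 11434 and an already-pulled llama3 model; both names are examples rather than anything mandated by the release.

```python
# Minimal sketch: sending several requests to a local Ollama server at once.
# Assumes Ollama is running on the default port (11434) and that the "llama3"
# model has already been pulled; adjust names to your own setup.
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str) -> str:
    """Send a single non-streaming generate request and return the response text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [
    "Summarize the plot of Hamlet in one sentence.",
    "Explain what a mutex is.",
    "Write a haiku about GPUs.",
    "List three uses of embeddings.",
]

# With concurrency enabled (the 0.2 default), the server can interleave these
# requests on one loaded model instead of handling them strictly one by one.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(generate, prompts):
        print(answer)
```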
Multiple model support
The new version allows users to load different models at the same time. This improvement dramatically enhances use cases such as Retrieval Augmented Generation (RAG), where both embedding and text completion models can coexist in memory. It also allows multiple agents to run simultaneously and enables the side-by-side operation of large and small models.
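A minimal RAG-flavored sketch of what this looks like in practice follows. The model names (nomic-embed-text for embeddings, llama3 for generation) and the single hard-coded document are assumptions for illustration; the point is simply that both models can stay resident and serve requests side by side.

```python
# Hedged sketch: an embedding model and a generation model used side by side.
# Model names and the hard-coded document are illustrative assumptions.
import requests

BASE = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Uses the embedding model, which stays loaded between calls.
    resp = requests.post(
        f"{BASE}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def answer(question: str, context: str) -> str:
    # Uses a second, larger model for text completion at the same time.
    resp = requests.post(
        f"{BASE}/api/generate",
        json={
            "model": "llama3",
            "prompt": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

doc = "Ollama 0.2 enables concurrency by default, allowing parallel requests."
vector = embed(doc)                                  # embedding model resident
print(len(vector), "dimensions")
print(answer("What changed in Ollama 0.2?", doc))    # generation model resident too
```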
Ollama intelligently manages GPU memory, automatically loading and unloading models based on requests and available resources. To monitor loaded models, users can now use the new `ollama ps` command, which displays the model name, ID, size, processor usage, and time until unload.
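For programmatic monitoring, the same information is also available over HTTP. The sketch below assumes the server exposes a GET /api/ps endpoint mirroring the CLI output; exact field names can vary between versions.

```python
# Hedged sketch: listing which models are currently loaded.
# Assumes GET /api/ps (the HTTP counterpart of `ollama ps`) is available;
# field names may differ between Ollama versions.
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    # Print whatever metadata the server reports for each loaded model.
    print(model.get("name"), model.get("size"), model.get("expires_at"))
```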
New models and improvements
Ollama 0.2 introduces support for several new models:
- GLM-4: A powerful multi-lingual model competitive with Llama 3
- CodeGeeX4: A versatile model for AI software development scenarios, including code completion
- Gemma 2: Offering improved output quality and updated base text generation models
The update also brings various improvements and fixes, including enhanced Gemma 2 performance, resolution of token generation issues, better error messages for unsupported architectures, improved Modelfile handling, and the addition of memory requirement checks on Linux systems.
Check out our article about the fine-tuning speed-up from UnslothAI for Gemma 2.
Advanced concurrency management
Ollama 0.2 introduces concurrency management options, giving users fine-grained control over how the server handles multiple requests and models (a short launch sketch follows the list):
- `OLLAMA_MAX_LOADED_MODELS`: Controls the maximum number of concurrently loaded models, defaulting to 3 times the number of GPUs (or 3 for CPU inference).
- `OLLAMA_NUM_PARALLEL`: Sets the maximum number of parallel requests each model can process, with a default that auto-selects between 4 and 1 based on available memory.
- `OLLAMA_MAX_QUEUE`: Determines the maximum number of requests Ollama will queue when busy before rejecting additional ones, defaulting to 512.
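As one way to apply these settings, the sketch below starts the server from Python with all three variables set. The particular values are arbitrary examples, not recommendations, and it assumes the ollama binary is on your PATH.

```python
# Minimal sketch: launching the Ollama server with explicit concurrency limits.
# The chosen values are illustrative examples only.
import os
import subprocess

env = os.environ.copy()
env.update({
    "OLLAMA_MAX_LOADED_MODELS": "2",   # at most two models resident at once
    "OLLAMA_NUM_PARALLEL": "4",        # up to four parallel requests per model
    "OLLAMA_MAX_QUEUE": "256",         # queue at most 256 pending requests
})

# Start the server with these settings (assumes `ollama` is on PATH).
# This call blocks for as long as the server is running.
subprocess.run(["ollama", "serve"], env=env, check=True)
```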
These settings allow users to optimize Ollama's performance for their specific hardware and use cases. When memory is insufficient to load a new model, the system queues incoming requests and unloads idle models to make room as needed.
It’s worth noting that parallel request processing increases the context size proportionally. For instance, a 2K context with 4 parallel requests results in an 8K context, requiring additional memory allocation.
Quick Fix in 0.2.1
Shortly after 0.2.0 shipped, version 0.2.1 was released to address an issue where setting `OLLAMA_NUM_PARALLEL` caused models to reload after each request, ensuring smoother operation for users leveraging parallel processing.
Conclusion
Ollama 0.2 represents a significant leap forward in local LLM management. The introduction of concurrency, improved model handling, and new model support make it an even more powerful tool for developers and AI enthusiasts. These updates open up new possibilities for complex, multi-model workflows and efficient resource utilization, solidifying Ollama’s position as a leading solution for running cutting-edge language models locally.