In the current landscape of AI applications, running LLMs locally on CPU has become an attractive option for many developers and organizations. This approach isn’t about accessing the most powerful models, but rather about finding a balance between capability and practicality.

Local inference allows users to leverage good-enough models that can handle a wide range of tasks effectively while maintaining control over data privacy and reducing dependency on cloud services.

Running LLMs locally on CPU offers several advantages:

  1. Data privacy: Keeping sensitive information within your own infrastructure.
  2. Cost efficiency: Eliminating ongoing cloud computation expenses.
  3. Offline capability: Ensuring AI functionality without internet connectivity.
  4. Customization: Easier fine-tuning and adaptation of models to specific use cases.
  5. Latency reduction: Potentially lower response times for certain applications.

While these locally-run models may not match the raw power of the largest cloud-based LLMs (e.g. Claude 3.5 Sonnet), they often provide sufficient capability for many real-world applications, from content generation and analysis to specialized domain tasks. This approach democratizes AI technology, making it accessible to a broader range of users and use cases.

In this article, we’ll explore running LLMs on local CPUs using Ollama, covering optimization techniques, model selection, and deployment considerations, with a focus on Google’s Gemma 2—one of the best models for its size.

Introduction to Ollama

Ollama is an open-source tool that simplifies the process of running LLMs on local machines. It provides an easy-to-use interface for downloading, running, and managing various language models, including popular ones like Llama 3, Qwen2, Phi-3, and the recently released Gemma 2.

Key features of Ollama include:

  1. Simple command-line interface
  2. Support for multiple models
  3. Easy model switching and usage
  4. Customizable prompts
  5. Integration with popular AI/ML frameworks

Ollama 0.2: A game-changing update

The recent release of Ollama 0.2.0 brings significant improvements, particularly in concurrency and model management. Let’s briefly explore the key enhancements:

Concurrency

The star feature of Ollama 0.2 is concurrency, which is now enabled by default. This unlocks two major capabilities:

  1. Parallel requests: Ollama can now serve multiple requests simultaneously, using only a small amount of additional memory for each. This enables users to:
    • Handle multiple chat sessions at once
    • Host code completion LLMs for teams
    • Process different parts of a document simultaneously
    • Run multiple agents concurrently
  2. Multiple model support: Users can now load different models at the same time. This dramatically enhances use cases such as:
    • Retrieval Augmented Generation (RAG), where both embedding and text completion models can coexist in memory
    • Running multiple agents simultaneously
    • Side-by-side operation of large and small models

While our focus is on CPU deployment, it’s worth noting that Ollama 0.2 also introduces smart memory management for GPU users, automatically handling model loading and unloading based on resource availability. This feature, along with the new ollama ps command for monitoring loaded models, enhances Ollama’s versatility across different hardware configurations.
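
To see parallel requests in action, here is a minimal Python sketch that sends several prompts to a locally running Ollama 0.2 server at the same time. It assumes the server is on its default endpoint (http://localhost:11434), that the requests package is installed, and that the gemma2 model has already been pulled:

import requests
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def ask(prompt: str) -> str:
    # Each call is an independent HTTP request; Ollama 0.2 serves them concurrently
    payload = {"model": "gemma2", "prompt": prompt, "stream": False}
    return requests.post(OLLAMA_URL, json=payload).json()["response"]

prompts = [
    "Summarize the benefits of local LLM inference in one sentence.",
    "Write a haiku about CPUs.",
    "Explain quantization to a beginner in two sentences.",
]

# Fire the prompts in parallel threads and print the answers as they complete
with ThreadPoolExecutor(max_workers=3) as pool:
    for answer in pool.map(ask, prompts):
        print(answer, "\n---")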

Advanced concurrency management

Ollama 0.2 introduces several environment variables for fine-grained control over concurrency:

  • OLLAMA_MAX_LOADED_MODELS: Controls the maximum number of concurrently loaded models (default: 3 times the number of GPUs, or 3 for CPU inference)
  • OLLAMA_NUM_PARALLEL: Sets the maximum number of parallel requests each model can process (default: auto-selects between 4 or 1 based on available memory)
  • OLLAMA_MAX_QUEUE: Determines the maximum number of requests Ollama will queue when busy (default: 512)

These settings allow users to optimize Ollama’s performance for their specific hardware and use cases.
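
If you manage the server from Python, one way to apply these variables is to set them on the environment before launching ollama serve. The values below are purely illustrative; the shell equivalent is simply exporting the variables before starting the server:

import os
import subprocess

env = os.environ.copy()
env["OLLAMA_MAX_LOADED_MODELS"] = "2"  # keep at most two models in memory
env["OLLAMA_NUM_PARALLEL"] = "4"       # up to four parallel requests per model
env["OLLAMA_MAX_QUEUE"] = "256"        # queue at most 256 pending requests

# Start the Ollama server with these settings (this call blocks while the server runs)
subprocess.run(["ollama", "serve"], env=env)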

Setting up Ollama

To get started with Ollama, follow these steps:

  1. Visit the official Ollama website: https://ollama.ai/
  2. Download the appropriate version for your operating system (Windows, macOS, or Linux)
  3. Install Ollama following the instructions provided for your OS
  4. Open a terminal or command prompt to interact with Ollama
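
Once installed, a quick way to confirm that the Ollama server is up is to query its local API from Python. This is a minimal sketch assuming the default endpoint and that the requests package is installed:

import requests

# The Ollama server listens on port 11434 by default
response = requests.get("http://localhost:11434/api/tags")
models = response.json().get("models", [])

print(f"Ollama is running; {len(models)} model(s) downloaded locally:")
for model in models:
    print(" -", model["name"])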

Before we go further into running models, let’s take a closer look at quantization – a key technique that makes local LLM execution possible on standard hardware.

Quantization

Quantization is a crucial technique for running large language models on consumer hardware, especially when using CPUs. This process converts the model’s weights from higher precision (e.g., 32-bit floating-point) to lower precision formats (e.g., 8-bit integers).

What is quantization?

The benefits of quantization include:

  1. Reduced memory usage: Quantized models require significantly less RAM, making it feasible to run larger models on devices with limited memory.
  2. Faster inference: Lower precision calculations can be performed more quickly, especially on CPUs.
  3. Smaller storage footprint: Quantized models take up less disk space, which is beneficial for deployment and distribution.

For example, a 27B parameter model that might require over 100GB of RAM in full precision could potentially run on a system with 32GB or less after quantization. Tools like transformers from Hugging Face or specialized quantization libraries can help you perform this optimization.
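
To make the idea concrete, here is a small sketch that applies PyTorch’s dynamic INT8 quantization to a compact Hugging Face model on CPU and compares the serialized checkpoint sizes. This is only an illustration of the principle; Ollama itself relies on GGUF-quantized weights rather than this mechanism, and facebook/opt-125m is just a convenient small example model:

import io

import torch
from transformers import AutoModelForCausalLM

def serialized_size_mb(model) -> float:
    # Size of the saved weights, a rough proxy for the memory footprint
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1024**2

# Small FP32 model, loaded on CPU
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Convert the weights of the Linear layers to INT8; activations stay in floating point
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(f"FP32 checkpoint:         ~{serialized_size_mb(model):.0f} MB")
print(f"Dynamic INT8 checkpoint: ~{serialized_size_mb(quantized):.0f} MB")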

How much memory do I need?

The memory required to run an LLM is closely tied to its number of parameters. As a general rule of thumb, you need about 4 bytes of memory per parameter for full precision (FP32) models. This means a 27B parameter model would theoretically require around 108GB of RAM in its full form.

However, through techniques like quantization (reducing precision to INT8 or even INT4), memory requirements can be reduced significantly, often by a factor of 2-4 or more. Even with these optimizations, running a 27B parameter model on, say, an 8GB Mac M1 is practically impossible.
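
The arithmetic behind these numbers is easy to sanity-check yourself. Here is a back-of-the-envelope estimate that counts weights only (the KV cache, activations, and runtime overhead come on top of this):

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # One billion parameters at N bytes each is roughly N GB of weights
    return params_billion * bytes_per_param

for label, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"27B model @ {label}: ~{weight_memory_gb(27, nbytes):.0f} GB")

# Prints roughly: 108 GB (FP32), 54 GB (FP16), 27 GB (INT8), 14 GB (INT4)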

Typically, consumer devices with 8GB of RAM are better suited for models in the 1B-7B parameter range, depending on the level of quantization and optimization techniques applied. For larger models like the 27B version, you’d generally need more RAM (16GB+) or need to employ even more advanced techniques like disk swapping, which can severely impact performance.

For a precise estimate of memory requirements for different model sizes, check out this awesome calculator provided by Hugging Face: https://huggingface.co/spaces/hf-accelerate/model-memory-usage

Model selection

Choosing the right model for local CPU execution is crucial for balancing performance and resource utilization. Consider the following factors:

  1. Model size: Smaller models (1B-7B parameters) are more suitable for most consumer CPUs.
  2. Task specialization: Choose models fine-tuned for your specific task.
  3. Language support: For non-English tasks, consider models trained on the target language.
  4. Quantization readiness: Look for models that have quantized versions available or are easily quantizable. Many models on Hugging Face now offer GGUF (GPT-Generated Unified Format) versions, which are optimized for CPU inference.
  5. Community support: Models with active community support often have more resources and optimizations available for local deployment.
💡 Remember, the largest or most recent model isn’t always the best choice for local CPU execution. Often, a well-optimized smaller model can provide better real-world performance on limited hardware.

For this article, we’ll focus on Google’s Gemma 2, but it’s worth noting that there are several other excellent options for local usage, including Microsoft’s Phi-3 and Meta’s Llama 3.

Running Gemma 2 on Ollama

Google’s Gemma 2 is a state-of-the-art language model available in two sizes: 9B and 27B parameters. The paper is available here, if you are interested in learning more about the training data, architecture, etc.

Gemma 2 evaluation against a comprehensive set of metrics

Let’s explore how to run these models using Ollama. According to many practitioners, Gemma 2 is one of the best models for its size.

Gemma 2: 9B parameters

To run the 9B parameter version of Gemma 2, use the following command:

ollama run gemma2

This command will download the model if it’s not already present on your system and start an interactive session.
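
Beyond the interactive prompt, you can also call the model programmatically through Ollama’s local REST API. Here is a minimal sketch, assuming the server is running on the default port and gemma2 has been pulled:

import requests

payload = {
    "model": "gemma2",
    "prompt": "Explain in two sentences why quantization helps CPU inference.",
    "stream": False,  # return the full completion in a single response
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])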

Gemma 2: 27B parameters

For the larger 27B parameter version, use:

ollama run gemma2:27b

The 27B version offers superior performance but requires more computational resources.

Performance and efficiency

Gemma 2 is designed for class-leading performance and efficiency. Let’s look at how it compares to other popular models across various benchmarks:

Benchmark | Metric | Gemma 2 9B | Gemma 2 27B | Llama 3 8B | Llama 3 70B | Grok-1 314B
General | MMLU (5-shot, top-1) | 71.3 | 75.2 | 66.6 | 79.5 | 73.0
Reasoning | BBH (3-shot, CoT) | 68.2 | 74.9 | 61.1 | 81.3 | –
Reasoning | HellaSwag (10-shot) | 81.9 | 86.4 | 82 | – | –
Math | GSM8K (5-shot, maj@1) | 68.6 | 74.0 | 45.7 | – | 62.9 (8-shot)
Math | MATH (4-shot) | 36.6 | 42.3 | – | – | 23.9
Code | HumanEval (pass@1) | 40.2 | 51.8 | – | – | 63.2 (0-shot)

(– indicates a score that was not reported for that model.)

This table demonstrates Gemma 2’s efficiency, particularly the 9B version, which competes well with larger models in these benchmarks.

To explore fine-tuning techniques for Gemma 2, including significant speed and memory improvements, check out this article.

Practical applications

Gemma 2, like other open LLMs, has a wide range of applications across various domains. Here are some key areas where you can utilize this model:

  1. Content creation and communication: Automated writing, email drafting, and chatbot development
  2. Research and education: Literature summarization, question answering, and hypothesis generation
  3. Software development: Code completion, documentation generation, and bug detection
  4. Data analysis: Text classification, sentiment analysis, and report generation (see the sketch after this list)
  5. Creative tasks: Story plotting, poetry creation, and language translation
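
As a taste of the data-analysis item above, here is a small sentiment-classification helper built purely with prompting against the local Ollama endpoint. The prompt wording and labels are illustrative, and the usual assumption applies that the server is running locally with gemma2 available:

import requests

def classify_sentiment(text: str) -> str:
    # Ask the local model for a single-word label
    prompt = (
        "Classify the sentiment of the following text as positive, "
        f"negative, or neutral. Reply with one word only.\n\nText: {text}"
    )
    payload = {"model": "gemma2", "prompt": prompt, "stream": False}
    response = requests.post("http://localhost:11434/api/generate", json=payload)
    return response.json()["response"].strip().lower()

print(classify_sentiment("The setup was painless and the model runs great."))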

Integrating Gemma 2 with popular ML frameworks

Ollama makes it easy to integrate Gemma 2 with popular machine learning frameworks.

Here are examples using LangChain and LlamaIndex:

LangChain integration


from langchain_community.llms import Ollama

llm = Ollama(model="gemma2")
response = llm.invoke("Why is the sky blue?")
print(response)

LlamaIndex integration


from llama_index.llms.ollama import Ollama

llm = Ollama(model="gemma2")
response = llm.complete("Why is the sky blue?")
print(response)

Considerations for local LLM usage

When setting up LLMs to run locally on CPU, several practical factors come into play:

  1. Operating system compatibility: Ensure that the tools and libraries you’re using are compatible with your OS. For example, some Docker settings may need adjustment between Linux, macOS, and Windows.
  2. CPU architecture: Different CPU architectures (e.g., x86, ARM) may require specific builds or optimizations of your chosen model or framework.
  3. Cooling and power: Running LLMs can be computationally intensive. Ensure your system has adequate cooling, especially for prolonged use. On laptops, be aware of increased power consumption.
  4. Storage speed: Fast storage (like SSDs) can improve model loading times and performance when swapping is necessary.
  5. Background processes: Close unnecessary applications to free up CPU and memory resources for the LLM.
  6. Update frequency: Keep your libraries and frameworks updated, as newer versions often include performance improvements and bug fixes crucial for local LLM execution.

Optimizing CPU performance

Running LLMs on CPU requires careful optimization to achieve the best possible performance. The first consideration is selecting an appropriate model size for your hardware. Smaller models, typically in the 1B to 7B parameter range, are more suitable for most consumer CPUs. These models strike a balance between capability and resource requirements, allowing for reasonable inference speeds on standard hardware.

Memory management is crucial when running LLMs on CPU. Increasing your system’s RAM can significantly improve performance, as it allows more of the model to be held in memory, reducing the need for disk swapping. However, even with limited RAM, techniques like memory mapping and efficient data loading can help manage larger models.

Efficient prompt engineering is another key factor in optimizing CPU performance. Crafting clear, concise prompts that directly address the task at hand can reduce unnecessary computation. This not only improves response time but also helps maintain context within the model’s limited context window.

When dealing with multiple queries or large datasets, consider implementing batch processing. By grouping similar queries together, you can leverage the CPU’s ability to perform parallel computations more effectively, potentially increasing throughput.

Limiting concurrent tasks on your system while running the LLM can also boost performance. Background processes compete for CPU resources, so closing unnecessary applications and services ensures more processing power is available for the model.

Temperature management is often overlooked but is critical for sustained performance, especially during long inference sessions. Ensure your system has adequate cooling, as CPUs throttle their performance when they overheat, leading to slower inference times.

Quantization, as mentioned earlier, is a powerful technique for CPU optimization. By reducing the precision of the model’s weights, quantization not only decreases memory usage but can also speed up computations on CPU architectures optimized for lower-precision operations.

Lastly, consider the impact of your chosen software stack. Some frameworks and libraries are better optimized for CPU inference than others. For instance, ONNX Runtime or llama.cpp-based backends (the latter is what Ollama builds on) can provide significant speed improvements over standard PyTorch or TensorFlow implementations for certain models.
By applying these optimization strategies, you can significantly enhance the performance of LLMs running on CPU, making local AI more accessible and efficient for a wide range of applications.
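
Several of these knobs can also be applied per request through the options field of Ollama’s API. The following is a hedged sketch; the values are illustrative and parameter support can vary between Ollama versions, so check the documentation for your release:

import requests

payload = {
    "model": "gemma2",
    "prompt": "List three tips for efficient CPU inference.",
    "stream": False,
    "options": {
        "num_thread": 8,     # match your physical core count
        "num_ctx": 2048,     # smaller context window means less memory pressure
        "num_predict": 200,  # cap the number of generated tokens
        "temperature": 0.7,
    },
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])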

GUI options for interacting with local LLMs

While command-line interfaces are powerful, graphical user interfaces (GUIs) can make interacting with your local LLM more intuitive and user-friendly. Here are some popular options:

  1. Chatbot Ollama: A ChatGPT-like interface that works well with Ollama-wrapped models. It provides a familiar chat experience and is relatively easy to set up.
  2. Open WebUI: Another web-based interface designed specifically for Ollama models, offering a clean and responsive design.
  3. Streamlit: A Python library that allows you to create custom web interfaces quickly. It’s highly customizable and can be tailored to your specific use case.
  4. Gradio: Similar to Streamlit, Gradio allows for rapid development of web interfaces for machine learning models, including LLMs.
  5. LM Studio: A desktop application that provides a user-friendly interface for downloading, running, and interacting with various open-source LLMs.
Gradio playground

When choosing a GUI, consider factors like ease of setup, customization options, resource usage, and compatibility with your chosen model and deployment method. Some interfaces may require additional setup steps or dependencies, so factor this into your deployment plan.
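
To make the Gradio option concrete, here is a minimal text-in/text-out interface wrapping the local Ollama endpoint. It is a sketch that assumes gradio and requests are installed and the gemma2 model is available on the default port:

import gradio as gr
import requests

def chat(prompt: str) -> str:
    # Forward the user's prompt to the locally running gemma2 model
    payload = {"model": "gemma2", "prompt": prompt, "stream": False}
    response = requests.post("http://localhost:11434/api/generate", json=payload)
    return response.json()["response"]

# A simple web UI served locally in the browser
demo = gr.Interface(fn=chat, inputs="text", outputs="text", title="Local Gemma 2")
demo.launch()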

Alternatives to Ollama

While Ollama provides an excellent solution for running LLMs locally, there are other alternatives worth considering:

1. Hugging Face Transformers

Hugging Face offers a comprehensive library for working with transformer-based models, including tools for running models locally.

Advantages:

  • Extensive model library
  • Flexible API
  • Strong community support

Example usage:


from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_text = "Hello, how are you?"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids, max_length=50)

print(tokenizer.decode(output[0], skip_special_tokens=True))

2. GGUF Format and llama.cpp

GGUF (GPT-Generated Unified Format) is a file format designed for efficient storage and loading of large language models. It’s particularly useful for running models on consumer hardware.

Advantages:

  • Highly optimized for CPU inference
  • Supports quantization for reduced memory usage
  • Cross-platform compatibility

Example usage with llama.cpp:


# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build the project
make

# Download a GGUF model (e.g., Gemma 2B)
wget https://huggingface.co/mlabonne/gemma-2b-GGUF/resolve/main/gemma-2b.Q2_K.gguf

# Run the model
./llama-cli -m gemma-2b.Q2_K.gguf -n 128 -p "Hello, how are you?"

3. Local Deployment on Mac with Apple Silicon

For users with Mac devices featuring Apple Silicon chips (M1, M2, M3), there are optimized solutions for running LLMs locally:

MLX framework

Apple’s MLX framework is designed specifically for machine learning on Apple Silicon.

Advantages:

  • Optimized for Apple Silicon
  • Supports various model architectures
  • Integrates well with the Apple ecosystem

Example usage:

# Uses the mlx-lm helper package built on top of MLX (pip install mlx-lm)
from mlx_lm import load, generate

# Load an MLX-converted, quantized checkpoint
# (the model ID below is an example from the mlx-community hub; substitute your own)
model, tokenizer = load("mlx-community/gemma-2-9b-it-4bit")

# Generate text
response = generate(model, tokenizer, prompt="Hello, how are you?", max_tokens=50)
print(response)


Conclusion

Running LLMs locally on CPU using tools like Ollama and its alternatives opens up a world of possibilities for developers, researchers, and enthusiasts.

The efficiency of models like Gemma 2, combined with the ease of use provided by these tools, makes it feasible to experiment with and deploy state-of-the-art language models on standard hardware.

As you explore the capabilities of locally run LLMs, remember to:

  • Choose the appropriate tool and model for your specific use case
  • Optimize your workflows for CPU inference
  • Consider the trade-offs between model size, performance, and resource requirements
  • Stay updated with the latest developments in local LLM deployment, as this field is rapidly evolving

With these tools and models at your disposal, you can leverage the power of advanced language models to enhance your applications, research, and development processes, all while maintaining control over your data and reducing dependency on cloud services.