Torchchat is a library that enables seamless, high-performance execution of large language models like Llama 3 and 3.1 across a wide range of devices, from laptops and desktops to mobile phones. Built on the robust foundation of PyTorch, it significantly expands on previous work to provide a comprehensive solution for local LLM inference, addressing the growing demand for on-device AI capabilities. The GitHub repository is available at https://github.com/pytorch/torchchat.

Key features

Torchchat offers a rich set of features designed to make working with LLMs more accessible and efficient:

  • Support for popular LLMs: Run state-of-the-art models including Llama 3, Llama 2, Mistral, and more, right on your local device.
  • Multiple execution modes: Choose between Python (Eager, Compile) and Native (AOT Inductor, ExecuTorch) modes to best suit your use case and performance requirements.
  • Cross-platform compatibility: Runs smoothly on Linux (x86), macOS (M1/M2/M3), Android, and iOS, ensuring broad accessibility.
  • Advanced quantization support: Reduce memory footprint and accelerate inference through various quantization techniques.
  • Flexible export capabilities: Easily prepare models for deployment on desktop and mobile platforms.
  • Robust evaluation framework: Assess model accuracy and performance with built-in evaluation tools.

These features combine to create a powerful toolkit for developers and researchers looking to leverage LLMs in resource-constrained environments or privacy-sensitive applications.

💡
For those interested in exploring alternatives to PyTorch-based solutions, our comprehensive guide on running LLMs locally using Ollama offers insights into CPU-based deployment, covering model selection, quantization, and optimization techniques for a range of hardware configurations.

Architecture overview

Torchchat’s architecture is designed to provide flexibility and performance across different platforms. It is organized into three main areas:

  1. Python: At its core, Torchchat provides a REST API that can be accessed via a command-line interface (CLI) or through a browser. This allows for easy integration into existing Python workflows and rapid prototyping; a sketch of querying this API appears after the architecture diagram below.
  2. C++: For desktop environments, Torchchat can produce highly optimized binaries using PyTorch’s AOT Inductor technology. This enables near-native performance for inference tasks.
  3. Mobile: Leveraging ExecuTorch, Torchchat can export models in a format optimized for on-device inference on mobile platforms. This opens up possibilities for AI-powered mobile applications with low latency and high privacy.
Torchchat architecture overview. Source: https://pytorch.org/blog/torchchat-local-llm-inference/
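As a concrete sketch of the Python entry point, recent versions of Torchchat expose the REST API through a server subcommand with an OpenAI-style chat completions endpoint. The server subcommand, port, and payload fields below are assumptions and may differ in your version:

# Start the REST API server for a downloaded model (assumes the `server` subcommand exists)
python3 torchchat.py server llama3.1

# In a second terminal, query the OpenAI-style endpoint (port and JSON fields are assumptions)
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'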

The high-level architecture of Torchchat includes several key components:

  • Entry points: A Python CLI that provides access to various functions such as chat, generate, browser UI, and model export.
  • Model manager: A central component responsible for loading, unloading, and managing different LLM models efficiently.
  • LLM runners: Specialized components like the AOTI native runner and ExecuTorch runner, optimized for different target platforms.
  • Export functionality: Converts models into deployable artifacts, such as AOT Inductor binaries for desktop and ExecuTorch .pte programs for mobile.
  • Inference engine: Executes models on target devices, handling input processing and output generation.

This modular architecture allows Torchchat to be easily extended and customized for specific use cases while maintaining high performance across diverse hardware.

Getting started

Getting up and running with Torchchat is straightforward. Here’s how to get started:

Installation

First, clone the repository and set up a virtual environment:

git clone https://github.com/pytorch/torchchat.git
cd torchchat

python3 -m venv .venv
source .venv/bin/activate

./install_requirements.sh

This process will install all necessary dependencies and prepare your environment for using Torchchat.
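Most Llama weights are gated on Hugging Face, so before chatting you typically need to authenticate and download a model. A minimal sketch, assuming the standard huggingface-cli login flow and Torchchat's download subcommand:

# Authenticate with Hugging Face (needed for gated models such as Llama 3 / 3.1)
huggingface-cli login

# Fetch the model weights and tokenizer into Torchchat's local model store
python3 torchchat.py download llama3.1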

Basic usage

Torchchat provides several easy-to-use commands for interacting with LLMs. Here are some basic examples:

Chat with a model interactively:

python3 torchchat.py chat llama3.1

This command starts an interactive chat session with the Llama 3.1 model, allowing you to have a conversation with the AI.

Generate text based on a prompt:

python3 torchchat.py generate llama3.1 --prompt "Write a story about a boy and his bear"

This command generates a story based on the given prompt using the Llama 3.1 model.

Run Torchchat in a browser-based interface:

streamlit run torchchat.py -- browser llama3.1

This launches a web-based UI for interacting with the Llama 3.1 model, providing a more visual and user-friendly experience.

These basic commands demonstrate the flexibility of Torchchat, allowing users to interact with LLMs in various ways depending on their needs and preferences.
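The built-in evaluation framework mentioned among the key features is driven from the same CLI. A hedged sketch, since the exact subcommand name and flags can vary between versions:

# Evaluate the model on a small slice of a standard task (subcommand and flags are assumptions)
python3 torchchat.py eval llama3.1 --tasks wikitext --limit 10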

Performance benchmarks

One of Torchchat’s key strengths is its ability to deliver impressive performance across a range of devices. Here are some benchmark results that showcase its capabilities:

Llama 3 8B instruct on Apple M1 Max

The following table shows the performance of Llama 3 8B Instruct model on an Apple MacBook Pro with M1 Max chip and 64GB of RAM:

Mode          Dtype     Tokens/Sec
Arm Compile   float16   5.84
Arm Compile   int8      1.63
Arm Compile   int4      3.99
Arm AOTI      float16   4.05
Arm AOTI      int8      1.05
Arm AOTI      int4      3.28
MPS Eager     float16   12.63
MPS Eager     int8      16.9
MPS Eager     int4      17.15

These results demonstrate the impact of different execution modes and quantization levels on performance. Notably, the MPS (Metal Performance Shaders) Eager mode shows significant speedups, especially with reduced precision.
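As a rough illustration of how these rows map onto the CLI (the model alias, dtype flag, and prompt are chosen for illustration), the Compile and MPS Eager configurations correspond to runs along these lines:

# "Arm Compile" rows: CPU execution with JIT compilation enabled
python3 torchchat.py generate llama3 --device cpu --dtype fp16 --compile --prompt "Hello"

# "MPS Eager" rows: eager execution on the Apple Silicon GPU via Metal Performance Shaders
python3 torchchat.py generate llama3 --device mps --dtype fp16 --prompt "Hello"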

Llama 3 8B Instruct on Linux x86 with CUDA (A100)

For high-performance server environments, here are the results on a Linux system with an Intel Xeon Platinum 8339HC CPU, 180GB RAM, and an NVIDIA A100 (80GB) GPU:

Mode           Dtype      Tokens/Sec
x86 Compile    bfloat16   2.76
x86 Compile    int8       3.15
x86 Compile    int4       5.33
CUDA Compile   bfloat16   83.23
CUDA Compile   int8       118.17
CUDA Compile   int4       135.16

These benchmarks highlight the massive performance gains achievable with GPU acceleration, particularly when combined with quantization techniques.

Mobile performance

Torchchat’s performance on mobile devices is equally impressive. On both the Samsung Galaxy S23 and iPhone, it achieves over 8 tokens per second using 4-bit GPTQ quantization via ExecuTorch. This level of performance enables responsive, on-device AI experiences on modern smartphones.

These benchmarks demonstrate Torchchat’s ability to efficiently utilize available hardware resources across a diverse range of devices, from high-end servers to mobile phones. This flexibility makes it an excellent choice for developers looking to deploy LLMs in various environments without sacrificing performance.

Advanced usage

Torchchat offers a range of advanced features and customization options for users who need more control over model execution and deployment. Let’s explore these in more detail:

Model customization

Torchchat provides several options to fine-tune model execution for specific use cases:

Device selection

Specify the target device for model execution:

python3 torchchat.py chat --device [cpu|cuda|mps] ...

This allows you to leverage specific hardware accelerators like CUDA GPUs or Apple’s Metal Performance Shaders.

JIT compilation

Enable JIT compilation for improved performance:

python3 torchchat.py chat --compile --compile_prefill ...

The --compile flag enables JIT compilation of the model, while --compile_prefill additionally compiles the prefill operation, trading longer startup time for faster inference.

💡
To learn more about compilation, check out: https://pytorch.org/get-started/pytorch-2.0/

Precision control

Control the numerical precision used in model computations:

python3 torchchat.py chat --dtype [fast|fast16|bf16|fp16|fp32] ...

This allows you to balance between precision and performance based on your specific requirements.

Quantization

Apply quantization to reduce model size and improve inference speed:

python3 torchchat.py chat --quantize quant_config.json ...

Quantization can significantly reduce memory usage and computation time, especially on resource-constrained devices.

These customization options provide fine-grained control over how models are executed, allowing users to optimize for their specific hardware and performance requirements.
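These options can also be combined in a single invocation. A hedged concrete example (the model alias, dtype, and config file name are chosen for illustration):

# Combine device selection, JIT compilation, precision control, and quantization
python3 torchchat.py generate llama3.1 \
  --device cuda \
  --compile \
  --dtype bf16 \
  --quantize quant_config.json \
  --prompt "Summarize the benefits of on-device LLM inference."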

Quantization

Quantization is a powerful technique for reducing model size and improving inference performance. Torchchat supports various quantization schemes:

  • Linear quantization (asymmetric): Supports 4-8 bit quantization with group sizes ranging from 32 to 256.
  • Linear with dynamic activations (symmetric): Implements the a8w4dq scheme for efficient quantization.
  • Embedding quantization: Allows 4-8 bit quantization of embedding layers with group sizes from 32 to 256.

Here’s an example of a quantization configuration:

{
  "linear:a8w4dq": {"groupsize": 256},
  "embedding": {"bitwidth": 4, "groupsize": 32}
}

This configuration applies a8w4dq quantization to linear layers with a group size of 256, and 4-bit quantization to embedding layers with a group size of 32.

Quantization can be particularly beneficial for mobile deployments, where model size and inference speed are critical factors.
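To apply a custom configuration like the one above, save it to a file and pass it via --quantize, mirroring the export example in the next section (the output file name is arbitrary):

# Save the JSON shown above as quant_config.json, then export with those settings
python3 torchchat.py export llama3 --quantize quant_config.json --output-pte-path llama3_custom.pte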

Export for Mobile

Torchchat simplifies the process of deploying models on mobile devices. Here’s how to export and use a model for mobile:

1. Export the model:

python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte

This command exports the Llama 3 model with mobile-optimized quantization settings.

2. Use the exported model:

python3 torchchat.py generate llama3 --pte-path llama3.pte --prompt "Hello my name is"

This generates text using the exported model, simulating how it would run on a mobile device.

The exported .pte file can be integrated into mobile applications using ExecuTorch, enabling on-device inference with high performance and low latency.
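The same export entry point can also target desktop via AOT Inductor, producing a shared library instead of a .pte file. A hedged sketch; the --output-dso-path and --dso-path flag names are assumptions based on recent Torchchat versions and may differ in your checkout:

# Export an AOT Inductor shared library for desktop inference (flag names are assumptions)
python3 torchchat.py export llama3 --output-dso-path llama3.so

# Generate text with the compiled artifact
python3 torchchat.py generate llama3 --dso-path llama3.so --prompt "Hello my name is"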

GGUF model support

Torchchat extends its compatibility by supporting GGUF files, the model format used throughout the GGML/llama.cpp ecosystem. This support includes parsing GGUF files with the following tensor types:

  • F16 (16-bit floating point)
  • F32 (32-bit floating point)
  • Q4_0 (4-bit quantization)
  • Q6_K (6-bit quantization)

Here’s an example of how to use a GGUF model with Torchchat:

python3 torchchat.py generate --gguf-path model.gguf --tokenizer-path tokenizer.model --prompt "Once upon a time"

This feature allows users to leverage models from the broader GGML ecosystem within the Torchchat framework, expanding the range of available models and use cases.

Supported models

Torchchat boasts support for a wide range of state-of-the-art language models, catering to various use cases and performance requirements. Here’s a comprehensive list of supported models:

  • Llama 3 and 3.1: The latest iterations of Meta’s powerful language model, available in various sizes.
  • Llama 2: Available in 7B, 13B, and 70B parameter versions, including chat-tuned variants.
  • Mistral: Supports Mistral 7B v0.1, as well as the Instruct-tuned versions v0.1 and v0.2.
  • CodeLlama: Specialized for code generation, available in 7B and 34B parameter versions.
  • TinyLlamas: Ultra-lightweight models (15M, 42M, 110M parameters) for resource-constrained environments.
  • Open Llama: An open-source alternative to Meta’s Llama, available in a 7B parameter version.
Torchchat supported models

This diverse model support ensures that users can choose the most appropriate model for their specific use case, whether it’s general-purpose text generation, code completion, or specialized tasks requiring domain-specific knowledge.
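To check which model aliases your installation recognizes and where downloaded weights are stored, Torchchat ships list and where subcommands; a brief sketch (output format varies by version):

# Show all supported model aliases and their download status
python3 torchchat.py list

# Print the on-disk location of a downloaded model
python3 torchchat.py where llama3.1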


Conclusion

Torchchat represents a significant leap forward in making local LLM inference accessible and performant across a wide range of devices. Its flexible architecture, support for multiple execution modes, and advanced features like quantization and mobile export make it a powerful tool for developers working with large language models in diverse environments.

By providing a unified framework for desktop and mobile deployment, Torchchat opens up new possibilities for edge AI and privacy-preserving applications. It addresses the growing demand for on-device AI capabilities, allowing developers to create responsive and secure AI-powered applications that don’t rely on constant network connectivity or remote servers.

As the project continues to evolve, several exciting directions for future development emerge:

  1. Expanded model support: Integrating support for emerging LLM architectures and specialized models for specific domains or tasks.
  2. Enhanced quantization techniques: Exploring novel quantization methods to further improve the balance between model size, inference speed, and accuracy.
  3. Optimizations for new hardware: Adapting to emerging AI accelerators and specialized chips to maximize performance across an even wider range of devices.
  4. Improved developer tools: Creating more robust debugging, profiling, and optimization tools to help developers fine-tune their LLM deployments.
  5. Integration with other AI frameworks: Exploring interoperability with other popular AI and machine learning frameworks to create a more cohesive ecosystem.

I encourage you to clone the Torchchat repository, experiment with different models and configurations, and contribute to this exciting open-source project. Whether you’re interested in optimizing performance, adding support for new hardware targets, or implementing novel quantization schemes, there are many opportunities to get involved and help shape the future of local LLM inference.

As the field of AI continues to advance at a rapid pace, tools like Torchchat will play a crucial role in democratizing access to powerful language models and enabling innovative applications across a wide range of industries and use cases.