Torchchat is a library that enables seamless, high-performance execution of large language models such as Llama 3 and 3.1 across a wide range of devices, from laptops and desktops to mobile phones. Built on PyTorch, it expands significantly on previous work to provide a comprehensive solution for local LLM inference, addressing the growing demand for on-device AI capabilities. The GitHub repository is at https://github.com/pytorch/torchchat.
Key features
Torchchat offers a rich set of features designed to make working with LLMs more accessible and efficient:
- Support for popular LLMs: Run state-of-the-art models including Llama 3, Llama 2, Mistral, and more, right on your local device.
- Multiple execution modes: Choose between Python (Eager, Compile) and Native (AOT Inductor, ExecuTorch) modes to best suit your use case and performance requirements.
- Cross-platform compatibility: Runs smoothly on Linux (x86), macOS (M1/M2/M3), Android, and iOS, ensuring broad accessibility.
- Advanced quantization support: Reduce memory footprint and accelerate inference through various quantization techniques.
- Flexible export capabilities: Easily prepare models for deployment on desktop and mobile platforms.
- Robust evaluation framework: Assess model accuracy and performance with built-in evaluation tools.
These features combine to create a powerful toolkit for developers and researchers looking to leverage LLMs in resource-constrained environments or privacy-sensitive applications.
Architecture overview
Torchchat’s architecture is designed to provide flexibility and performance across different platforms. It is organized into three main areas:
- Python: At its core, Torchchat provides a REST API that can be accessed via a command-line interface (CLI) or through a browser. This allows for easy integration into existing Python workflows and rapid prototyping.
- C++: For desktop environments, Torchchat can produce highly optimized binaries using PyTorch’s AOT Inductor technology. This enables near-native performance for inference tasks.
- Mobile: Leveraging ExecuTorch, Torchchat can export models in a format optimized for on-device inference on mobile platforms. This opens up possibilities for AI-powered mobile applications with low latency and high privacy.
The high-level architecture of Torchchat includes several key components:
- Entry points: A Python CLI that provides access to various functions such as chat, generate, browser UI, and model export.
- Model manager: A central component responsible for loading, unloading, and managing different LLM models efficiently.
- LLM runners: Specialized components like the AOTI native runner and ExecuTorch runner, optimized for different target platforms.
- Export functionality: Converts models into deployable artifacts, such as AOT Inductor binaries for desktop and ExecuTorch .pte files for mobile, enabling deployment across various environments.
- Inference engine: Executes models on target devices, handling input processing and output generation.
This modular architecture allows Torchchat to be easily extended and customized for specific use cases while maintaining high performance across diverse hardware.
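To make the Python entry point concrete, here is a minimal sketch of calling a locally running Torchchat REST API from Python with the requests library. The route, port, and payload shape below follow the common OpenAI-style chat completions convention and are assumptions rather than confirmed Torchchat details, so check the repository documentation for the exact server interface.

# Minimal client sketch for a locally running Torchchat REST API.
# Assumptions (not confirmed here): the server listens on localhost:5000 and
# exposes an OpenAI-style /v1/chat/completions route.
import requests

def chat_once(prompt: str, model: str = "llama3.1") -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    # POST the chat request to the assumed endpoint and return the reply text.
    resp = requests.post("http://localhost:5000/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat_once("Summarize what Torchchat does in one sentence."))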
Getting started
Getting up and running with Torchchat is straightforward. Here’s how to get started:
Installation
First, clone the repository and set up a virtual environment:
git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install_requirements.sh
This process will install all necessary dependencies and prepare your environment for using Torchchat.
Basic usage
Torchchat provides several easy-to-use commands for interacting with LLMs. Here are some basic examples:
Chat with a model interactively:
python3 torchchat.py chat llama3.1
This command starts an interactive chat session with the Llama 3.1 model, allowing you to have a conversation with the AI.
Generate text based on a prompt:
python3 torchchat.py generate llama3.1 --prompt "Write a story about a boy and his bear"
This command generates a story based on the given prompt using the Llama 3.1 model.
Run Torchchat in a browser-based interface:
streamlit run torchchat.py -- browser llama3.1
This launches a web-based UI for interacting with the Llama 3.1 model, providing a more visual and user-friendly experience.
These basic commands demonstrate the flexibility of Torchchat, allowing users to interact with LLMs in various ways depending on their needs and preferences.
Performance benchmarks
One of Torchchat’s key strengths is its ability to deliver impressive performance across a range of devices. Here are some benchmark results that showcase its capabilities:
Llama 3 8B instruct on Apple M1 Max
The following table shows the performance of Llama 3 8B Instruct model on an Apple MacBook Pro with M1 Max chip and 64GB of RAM:
| Mode | Dtype | Tokens/Sec |
|---|---|---|
| Arm Compile | float16 | 5.84 |
| Arm Compile | int8 | 1.63 |
| Arm Compile | int4 | 3.99 |
| Arm AOTI | float16 | 4.05 |
| Arm AOTI | int8 | 1.05 |
| Arm AOTI | int4 | 3.28 |
| MPS Eager | float16 | 12.63 |
| MPS Eager | int8 | 16.9 |
| MPS Eager | int4 | 17.15 |
These results demonstrate the impact of different execution modes and quantization levels on performance. Notably, the MPS (Metal Performance Shaders) Eager mode shows significant speedups, especially with reduced precision.
Llama 3 8B Instruct on Linux x86 with CUDA (A100)
For high-performance server environments, here are the results on a Linux system with an Intel Xeon Platinum 8339HC CPU, 180GB RAM, and an NVIDIA A100 (80GB) GPU:
| Mode | Dtype | Tokens/Sec |
|---|---|---|
| x86 Compile | bfloat16 | 2.76 |
| x86 Compile | int8 | 3.15 |
| x86 Compile | int4 | 5.33 |
| CUDA Compile | bfloat16 | 83.23 |
| CUDA Compile | int8 | 118.17 |
| CUDA Compile | int4 | 135.16 |
These benchmarks highlight the massive performance gains achievable with GPU acceleration, particularly when combined with quantization techniques.
Mobile performance
Torchchat’s performance on mobile devices is equally impressive. On both the Samsung Galaxy S23 and iPhone, it achieves over 8 tokens per second using 4-bit GPTQ quantization via ExecuTorch. This level of performance enables responsive, on-device AI experiences on modern smartphones.
These benchmarks demonstrate Torchchat’s ability to efficiently utilize available hardware resources across a diverse range of devices, from high-end servers to mobile phones. This flexibility makes it an excellent choice for developers looking to deploy LLMs in various environments without sacrificing performance.
Advanced usage
Torchchat offers a range of advanced features and customization options for users who need more control over model execution and deployment. Let’s explore these in more detail:
Model customization
Torchchat provides several options to fine-tune model execution for specific use cases:
Device selection
Specify the target device for model execution:
python3 torchchat.py chat --device [cpu|cuda|mps] ...
This allows you to leverage specific hardware accelerators like CUDA GPUs or Apple’s Metal Performance Shaders.
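If you are unsure which value to pass, a quick check with plain PyTorch (not Torchchat-specific code) shows which accelerators the current environment can see:

import torch

# Report which backends PyTorch detects; pick the matching --device value.
print("cuda available:", torch.cuda.is_available())
print("mps available:", torch.backends.mps.is_available())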
JIT compilation
Enable JIT compilation for improved performance:
python3 torchchat.py chat --compile --compile_prefill ...
The --compile flag enables JIT compilation of the model, while --compile_prefill additionally compiles the prefill operation, trading longer startup time for faster inference.
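To illustrate what these flags do under the hood, here is a tiny stand-alone torch.compile example in plain PyTorch (this is not Torchchat's internal code): the first call pays a one-time compilation cost, and subsequent calls reuse the optimized kernels.

import torch

# A toy module standing in for an LLM forward pass.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 64))
compiled_model = torch.compile(model)  # JIT-compile the forward pass

x = torch.randn(1, 64)
_ = compiled_model(x)    # first call: triggers compilation (slow, one time)
out = compiled_model(x)  # subsequent calls: run the optimized code
print(out.shape)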
Precision control
Control the numerical precision used in model computations:
python3 torchchat.py chat --dtype [fast|fast16|bf16|fp16|fp32] ...
This allows you to balance between precision and performance based on your specific requirements.
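As a reference point, the dtype options map onto standard PyTorch floating-point types; the short snippet below (plain PyTorch, not Torchchat code) shows the memory saving of bf16 over fp32 for a single weight matrix.

import torch

w32 = torch.randn(4096, 4096, dtype=torch.float32)
w16 = w32.to(torch.bfloat16)  # 2 bytes per element instead of 4, with a shorter mantissa

# element_size() returns bytes per element: 4 for fp32, 2 for bf16.
print("fp32 MB:", w32.numel() * w32.element_size() / 2**20)
print("bf16 MB:", w16.numel() * w16.element_size() / 2**20)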
Quantization
Apply quantization to reduce model size and improve inference speed:
python3 torchchat.py chat --quantize quant_config.json ...
Quantization can significantly reduce memory usage and computation time, especially on resource-constrained devices.
These customization options provide fine-grained control over how models are executed, allowing users to optimize for their specific hardware and performance requirements.
Quantization
Quantization is a powerful technique for reducing model size and improving inference performance. Torchchat supports various quantization schemes:
- Linear quantization (asymmetric): Supports 4-8 bit quantization with group sizes ranging from 32 to 256.
- Linear with dynamic activations (symmetric): Implements the a8w4dq scheme for efficient quantization.
- Embedding quantization: Allows 4-8 bit quantization of embedding layers with group sizes from 32 to 256.
Here’s an example of a quantization configuration:
{
  "linear:a8w4dq": {"groupsize": 256},
  "embedding": {"bitwidth": 4, "groupsize": 32}
}
This configuration applies a8w4dq quantization to linear layers with a group size of 256, and 4-bit quantization to embedding layers with a group size of 32.
Quantization can be particularly beneficial for mobile deployments, where model size and inference speed are critical factors.
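To make group-wise asymmetric quantization concrete, here is a small, self-contained PyTorch sketch of 4-bit asymmetric quantization with a group size of 32 applied to one weight matrix. It illustrates the general technique only and is not Torchchat's internal implementation.

import torch

def quantize_asym_4bit(w: torch.Tensor, group_size: int = 32):
    # Quantize each row in groups of `group_size` columns to unsigned 4-bit codes.
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 4 bits -> 16 levels (0..15)
    q = ((g - w_min) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    # Reverse the mapping to recover an approximation of the original weights.
    return (q.float() * scale + w_min).reshape(q.shape[0], -1)

w = torch.randn(8, 64)
q, scale, zero = quantize_asym_4bit(w)
print("max abs error:", (w - dequantize(q, scale, zero)).abs().max().item())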
See the quantization guide for examples and more details.
Export for Mobile
Torchchat simplifies the process of deploying models on mobile devices. Here’s how to export and use a model for mobile:
1. Export the model:
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
This command exports the Llama 3 model with mobile-optimized quantization settings.
2. Use the exported model:
python3 torchchat.py generate llama3 --pte-path llama3.pte --prompt "Hello my name is"
This generates text using the exported model, simulating how it would run on a mobile device.
The exported .pte file can be integrated into mobile applications using ExecuTorch, enabling on-device inference with high performance and low latency.
GGUF model support
Torchchat extends its compatibility by supporting GGUF files, the model format widely used across the GGML/llama.cpp ecosystem. This support includes parsing GGUF files with the following tensor types:
- F16 (16-bit floating point)
- F32 (32-bit floating point)
- Q4_0 (4-bit quantization)
- Q6_K (6-bit quantization)
Here’s an example of how to use a GGUF model with Torchchat:
python3 torchchat.py generate --gguf-path model.gguf --tokenizer-path tokenizer.model --prompt "Once upon a time"
This feature allows users to leverage models from the broader GGML ecosystem within the Torchchat framework, expanding the range of available models and use cases.
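As a side note, a GGUF file can be inspected without any framework by reading its fixed header. The sketch below parses the magic string, version, and tensor count; the field layout follows the published GGUF specification (version 2 and later) and is an assumption about the file you point it at, not Torchchat functionality.

import struct

def read_gguf_header(path: str):
    # GGUF layout (per the public spec, v2+): 4-byte magic "GGUF", uint32 version,
    # uint64 tensor_count, uint64 metadata_kv_count, all little-endian.
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

print(read_gguf_header("model.gguf"))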
Supported models
Torchchat boasts support for a wide range of state-of-the-art language models, catering to various use cases and performance requirements. Here’s a comprehensive list of supported models:
- Llama 3 and 3.1: The latest iterations of Meta’s powerful language model, available in various sizes.
- Llama 2: Available in 7B, 13B, and 70B parameter versions, including chat-tuned variants.
- Mistral: Supports Mistral 7B v0.1, as well as the Instruct-tuned versions v0.1 and v0.2.
- CodeLlama: Specialized for code generation, available in 7B and 34B parameter versions.
- TinyLlamas: Ultra-lightweight models (15M, 42M, 110M parameters) for resource-constrained environments.
- Open Llama: An open-source alternative to Meta’s Llama, available in a 7B parameter version.
See the documentation on GGUF to learn how to use GGUF files.
This diverse model support ensures that users can choose the most appropriate model for their specific use case, whether it’s general-purpose text generation, code completion, or specialized tasks requiring domain-specific knowledge.
Conclusion
Torchchat represents a significant leap forward in making local LLM inference accessible and performant across a wide range of devices. Its flexible architecture, support for multiple execution modes, and advanced features like quantization and mobile export make it a powerful tool for developers working with large language models in diverse environments.
By providing a unified framework for desktop and mobile deployment, Torchchat opens up new possibilities for edge AI and privacy-preserving applications. It addresses the growing demand for on-device AI capabilities, allowing developers to create responsive and secure AI-powered applications that don’t rely on constant network connectivity or remote servers.
As the project continues to evolve, several exciting directions for future development emerge:
- Expanded model support: Integrating support for emerging LLM architectures and specialized models for specific domains or tasks.
- Enhanced quantization techniques: Exploring novel quantization methods to further improve the balance between model size, inference speed, and accuracy.
- Optimizations for new hardware: Adapting to emerging AI accelerators and specialized chips to maximize performance across an even wider range of devices.
- Improved developer tools: Creating more robust debugging, profiling, and optimization tools to help developers fine-tune their LLM deployments.
- Integration with other AI frameworks: Exploring interoperability with other popular AI and machine learning frameworks to create a more cohesive ecosystem.
I encourage you to clone the Torchchat repository, experiment with different models and configurations, and contribute to this exciting open-source project. Whether you’re interested in optimizing performance, adding support for new hardware targets, or implementing novel quantization schemes, there are many opportunities to get involved and help shape the future of local LLM inference.
As the field of AI continues to advance at a rapid pace, tools like Torchchat will play a crucial role in democratizing access to powerful language models and enabling innovative applications across a wide range of industries and use cases.