In recent years, large language models (LLMs) have become central to many applications. However, these models are usually too large and power-hungry to run well on mobile devices. A new study introduces MobileLLM, an approach for building smaller but still capable language models that can run on phones and other mobile devices.

Link to the paper: https://arxiv.org/abs/2402.14905

GitHub: https://github.com/facebookresearch/MobileLLM

Why we need smaller language models

Large language models like GPT-4, Claude Opus, and Sonnet 3.5 are impressive, but they come with real drawbacks:

  1. They use a lot of energy. If many people used these big models all the time, the energy required would rival that of many large companies combined.
  2. They need a lot of memory. Most phones don’t have enough memory to run these big models.
  3. They drain phone batteries quickly. Running a 7-billion-parameter model on-device could exhaust a phone’s battery in roughly 2 hours of conversation.

The MobileLLM solution

The researchers created MobileLLM, a recipe for building language models that are much smaller yet still perform well. These models have fewer than 1 billion parameters, which means they can run on phones and other small devices.

Design roadmap of sub-billion-parameter transformer models

Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R. and Lai, L., 2024. MobileLLM: Optimizing sub-billion parameter language models for on-device use cases. arXiv preprint arXiv:2402.14905.

Key features of MobileLLM:

  1. Deep and thin structure:
    • MobileLLM uses a “deep and thin” design: more layers, but each layer is narrower.
    • For example, the 125M model has 30 layers instead of the usual 12 for models of this size.
    • The paper finds that at this scale, extra depth helps the model capture abstract concepts better than extra width does.
  2. SwiGLU activation:
    • MobileLLM uses the SwiGLU activation function in its feed-forward layers instead of a plain ReLU-style design.
    • This swap alone gives a consistent accuracy boost at this model size.
  3. Embedding sharing:
    • In small models, the token embedding tables (the parts that map words to vectors) take up a large share of the total parameters.
    • MobileLLM ties the input and output embeddings, reusing the same weights in both places and saving parameters with almost no loss in quality.
  4. Grouped query attention:
    • Grouped query attention lets several query heads share a single key/value head.
    • This shrinks the attention weights and the KV cache, so the model stays accurate while using less memory.
  5. Layer sharing:
    • The MobileLLM-LS variant reuses each block’s weights for two consecutive layers, increasing effective depth without adding parameters and performing almost as well as separate layers. The code sketch below illustrates this and the other ideas above.
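
To make these design choices concrete, here is a minimal PyTorch sketch of a MobileLLM-style model. It is my own illustration rather than the authors’ released code: the 125M-scale hyperparameters (30 layers, 9 query heads, 3 KV heads, embedding dimension 576) come from the paper, while the normalization choice, feed-forward hidden size, vocabulary size, and class names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down( SiLU(gate(x)) * up(x) )."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GroupedQueryAttention(nn.Module):
    """Causal self-attention where n_heads query heads share n_kv_heads KV heads."""
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.out = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = k.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves n_heads // n_kv_heads query heads (here 3).
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, -1))


class Block(nn.Module):
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)  # norm choice is an assumption
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.ffn = SwiGLU(dim, hidden=int(dim * 8 / 3))  # hidden size is an assumption

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))


class MobileLLMSketch(nn.Module):
    """Deep-and-thin decoder with tied embeddings and block-wise layer sharing."""
    def __init__(self, vocab=32000, dim=576, n_layers=30, n_heads=9, n_kv_heads=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Immediate block-wise sharing: store n_layers // 2 blocks, run each twice.
        self.blocks = nn.ModuleList(
            Block(dim, n_heads, n_kv_heads) for _ in range(n_layers // 2))
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(block(x))  # reuse the same weights for two consecutive layers
        # Embedding sharing: the output projection reuses the input embedding matrix.
        return self.norm(x) @ self.embed.weight.T


logits = MobileLLMSketch()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

Note how the layer-sharing loop runs each stored block twice: the model computes 30 layers while only storing weights for 15, and the tied embedding matrix serves as both the input lookup table and the output projection.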

Detailed architecture specifications of MobileLLM. Source: https://arxiv.org/pdf/2402.14905

Performance results

The researchers tested MobileLLM on many different tasks. Here are some of the impressive results:

  1. Zero-shot common sense reasoning:
    • MobileLLM-125M scored 46.3% on average, which is 3.7% better than the previous best model of similar size.
    • MobileLLM-350M scored 51.3%, which is 4.7% better than the previous best.
  2. Question answering (TriviaQA):
    • MobileLLM-125M scored 13.9 (F1 score) on 1-shot learning, much better than OPT-125M’s 8.7.
    • MobileLLM-350M scored 22.0, again much higher than OPT-350M’s 11.0.
  3. Reading comprehension (RACE):
    • On the RACE-middle test, MobileLLM-125M scored 39.7%, while OPT-125M scored 34.7%.
    • MobileLLM-350M achieved 45.6%, compared to OPT-350M’s 37.1%.
  4. Chat performance:
    • On the AlpacaEval chat benchmark, MobileLLM-350M won 47.08% of comparisons against the baseline responses from the much larger GPT-3 model.
    • Since 50% would mean parity with that baseline, the 350M model comes remarkably close to matching it.
  5. API calling task:
    • MobileLLM-350M performed almost as well as the much larger LLaMA-v2 7B model in correctly understanding and formatting API calls.

Performance on Trivia QA and RACE datasets. Source: https://arxiv.org/pdf/2402.14905

Practical benefits

  1. Energy efficiency:
    • A 350M-parameter model uses only about 0.035 joules per token.
    • At that rate, a phone could support conversational use all day on a single charge (see the back-of-the-envelope estimate after this list).
  2. Speed:
    • The 125M model can process about 50 tokens per second on a phone.
    • This is much faster than larger models, which might only manage 3-6 tokens per second.
  3. Memory usage:
    • MobileLLM models fit within the roughly 10% of a phone’s memory that a single app can reasonably use.
    • This makes them practical for real-world mobile applications.
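
As a sanity check on those numbers, here is a quick back-of-the-envelope calculation. The 0.035 J/token figure corresponds to roughly 0.1 J per token per billion parameters for a 0.35B model; the battery capacity and decoding speed used below are illustrative assumptions, not measurements from the paper.

```python
# Rough, illustrative arithmetic only; the two constants marked "assumed"
# are guesses for the sake of the estimate, not numbers from the paper.
ENERGY_PER_TOKEN_J = 0.035    # ~0.1 J/token per billion params * 0.35B params
PHONE_BATTERY_J = 50_000      # assumed usable battery energy (~50 kJ)
TOKENS_PER_SECOND = 10        # assumed conversational decoding rate

tokens_per_charge = PHONE_BATTERY_J / ENERGY_PER_TOKEN_J
hours_of_chat = tokens_per_charge / TOKENS_PER_SECOND / 3600
print(f"~{tokens_per_charge:,.0f} tokens per charge, ~{hours_of_chat:.0f} hours of decoding")

# The same arithmetic with a 7B model (~0.7 J/token) gives only about 2 hours,
# which matches the battery-drain concern mentioned at the start of the article.
```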

Scaling to larger models

The researchers also tested their ideas on bigger models:

  • MobileLLM-600M scored 54.3% on zero-shot common sense tasks, better than models with more parameters.
  • MobileLLM-1B achieved 57.3%, outperforming models like BLOOM-1.1B and TinyLlama-1.1B.
  • MobileLLM-1.5B reached 59.4%, even beating some 3B parameter models.

Technical details

For those interested in the more technical aspects:

  1. Training setup:
    • Used 32 A100 GPUs
    • Batch size of 32 per GPU
    • Initial learning rate of 2e-3 with cosine decay
    • Trained on 1 trillion tokens for final models
  2. Model architectures:
    • MobileLLM-125M: 30 layers, 9 heads, 3 KV-heads, 576 embedding dimension
    • MobileLLM-350M: 32 layers, 15 heads, 5 KV-heads, 960 embedding dimension
    • (Similar details provided for 600M, 1B, and 1.5B models)
  3. Quantization:
    • W8A8 quantization (8-bit weights and 8-bit activations) was tested; a minimal sketch of the idea appears after this list.
    • It caused only a 0.2-0.4% accuracy drop compared to full precision.
  4. Layer sharing:
    • Best results with sharing every two layers
    • Minimal increase in computation time
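
For readers unfamiliar with W8A8 quantization, the sketch below shows the basic idea on a single linear layer: both weights and activations are rounded to 8-bit integers with a per-tensor scale, and the matrix multiply is rescaled back to floating point. This is a generic, simplified illustration, not the specific quantization recipe evaluated in the paper.

```python
import torch

def quantize_int8(t: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns (int8 values, scale)."""
    scale = t.abs().max() / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """y = x @ W.T with 8-bit weights (W8) and 8-bit activations (A8)."""
    qx, sx = quantize_int8(x)        # quantize activations
    qw, sw = quantize_int8(weight)   # quantize weights
    # A real int8 kernel accumulates in int32; float keeps this sketch simple.
    return (qx.float() @ qw.float().T) * (sx * sw)

# Compare against the full-precision result for a random layer.
x, w = torch.randn(4, 576), torch.randn(1536, 576)
rel_err = (w8a8_linear(x, w) - x @ w.T).norm() / (x @ w.T).norm()
print(f"relative error vs. fp32: {rel_err:.2%}")
```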

Conclusion

MobileLLM represents a significant step forward in making powerful language models work well on mobile devices. By using clever design choices and new techniques, these models can perform almost as well as much larger models while using much less energy and memory. This could lead to new and exciting applications of AI on our phones and other small devices.

Future research might focus on making these models even smaller or more efficient, or on developing new applications that take advantage of having powerful language models right on our devices.

Last Update: 11/07/2024