NVIDIA has developed an impressive new large language model called Nemotron-4 15B. With 15 billion parameters, it was trained on a massive dataset of 8 trillion tokens spanning English text, text in 53 other languages, and code in 43 programming languages. The result is a highly capable general-purpose model that demonstrates strong performance across a wide range of natural language and coding tasks.

Comparison of Nemotron-4 15B across seven evaluation areas against similarly sized models.

Key performance results

Nemotron-4 15B was evaluated on several categories of downstream tasks:

  • Commonsense reasoning (0-shot): On a suite of tasks including SIQA, ARC, PIQA, Winogrande, and Hellaswag, Nemotron-4 achieved the highest average score (73.4%) among models of similar size.
  • Popular aggregated benchmarks: Nemotron-4 attained a score of 58.7% on the challenging Big-Bench Hard (BBH) benchmark, nearly 7% higher than the closest similarly-sized model. On the MMLU benchmark, it scored a highly competitive 64.2%.
  • Math: Nemotron-4 achieved a score of 46.0% on the GSM8K math word problem dataset, on par with the 7B parameter Gemma model and significantly outperforming larger models like LLaMA-2 13B/34B.
  • Code: On the HumanEval and MBPP coding benchmarks, Nemotron-4 attained scores of 31.6% and 40.6%, respectively. This is competitive with the 14B parameter QWEN model and superior to Mistral 7B and LLaMA-2 13B/34B.

Data distribution of the 43 programming languages used for pre-training.

In a comparison across 11 different programming languages, Nemotron-4 achieved the highest average score, outperforming both the code-specific 15B parameter Starcoder model and the 7B parameter Mistral model. It especially excelled on low-resource languages like Scala, Julia, and R.

  • Multilingual: Perhaps most impressively, Nemotron-4 demonstrated the best multilingual performance of any similarly-sized general purpose model. On the XCOPA multilingual classification task, it improved accuracy by nearly 12% in the 4-shot setting compared to the next best model.

On the challenging MGSM benchmark, which combines math and multilingual abilities, Nemotron-4 outscored the next best model by nearly 30%. And on the TyDiQA-GoldP multilingual question answering task, it achieved an average exact match score of 50.5%, significantly higher than both the much larger 62B parameter PaLM model and other similarly sized models.

Model architecture and training

Nemotron-4 15B utilizes a standard decoder-only transformer architecture. The key specs are listed below, with a configuration sketch following the list:

  • 15 billion total parameters (12.5B non-embedding parameters, 3.2B embedding parameters)
  • 32 transformer layers
  • Hidden dimension of 6144
  • 48 attention heads
  • Sequence length of 4096 tokens
  • Vocabulary size of 256,000 tokens
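
These specifications can be collected into a small configuration object. The sketch below is hypothetical (the `NemotronConfig` dataclass is my own naming, not an official NVIDIA API); the calculation at the end simply checks that the quoted 3.2B embedding parameters are consistent with separate, untied input and output embedding tables of size vocabulary × hidden dimension.

```python
from dataclasses import dataclass

@dataclass
class NemotronConfig:
    """Hypothetical config mirroring the published Nemotron-4 15B specs."""
    num_layers: int = 32            # transformer layers
    hidden_size: int = 6144         # hidden dimension
    num_attention_heads: int = 48   # attention heads
    seq_length: int = 4096          # maximum sequence length in tokens
    vocab_size: int = 256_000       # tokenizer vocabulary size

cfg = NemotronConfig()

# ~3.2B embedding parameters is consistent with two untied embedding tables
# (input and output), each of shape vocab_size x hidden_size.
embedding_params = 2 * cfg.vocab_size * cfg.hidden_size
print(f"{embedding_params / 1e9:.2f}B embedding parameters")  # ~3.15B
```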

Training was conducted on 384 DGX H100 nodes, with each node containing 8 H100 80GB GPUs. A combination of 8-way tensor parallelism and data parallelism was used. The degree of data parallelism was increased from 96-way to 384-way during training as batch size was ramped up.
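
As a rough sanity check on these parallelism figures, the number of GPUs in use equals the tensor-parallel degree times the data-parallel degree. The short sketch below is illustrative only; the per-stage node counts are inferred from the stated degrees rather than taken from an official configuration.

```python
# Illustrative arithmetic: GPUs in use = tensor parallelism x data parallelism.
TENSOR_PARALLEL = 8   # 8-way tensor parallelism throughout training
GPUS_PER_NODE = 8     # each DGX H100 node holds 8 x H100 80GB GPUs

for data_parallel in (96, 384):  # ramped from 96-way to 384-way
    gpus = TENSOR_PARALLEL * data_parallel
    nodes = gpus // GPUS_PER_NODE
    print(f"DP={data_parallel:>3}: {gpus} GPUs = {nodes} DGX H100 nodes")
# DP= 96:  768 GPUs =  96 nodes
# DP=384: 3072 GPUs = 384 nodes (the full cluster)
```

The same arithmetic shows why 384-way is the ceiling for data parallelism on this cluster: 3,072 GPUs divided into 8-way tensor-parallel groups leaves exactly 384 model replicas.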

In total, training was completed in approximately 13 calendar days on up to 3,072 GPUs, highlighting the immense scale and computational intensity of the process. Model FLOPs utilization (MFU) peaked at 34.3%.

Continued training for better results

After the main pre-training run on the full 8 trillion token dataset, the Nemotron-4 15B model underwent an additional “continued training” phase. In this phase, two specialized data distributions were utilized.

The first distribution re-sampled data that was seen during pre-training, but with higher weight given to text from sources deemed to be of better quality. The second distribution mixed in a small amount of data mimicking the format of benchmark task prompts. This allowed the model to better handle the types of questions seen in formal evaluations.

Importantly, a more aggressive learning rate decay schedule was used during this phase compared to pre-training. The combination of the tailored data distributions and learning rate decay allowed the model to smoothly adapt from the broad pre-training dataset to more targeted data.
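
To make the recipe concrete, here is a minimal Python sketch of the two moving parts: drawing batches from a blend of the two distributions, and decaying the learning rate more steeply than during pre-training. The 95/5 mixing ratio, the learning-rate arguments, and the cosine shape are illustrative assumptions, not values published for Nemotron-4 15B.

```python
import math
import random

def sample_batch_source(rng: random.Random) -> str:
    """Pick which continued-training distribution the next batch comes from."""
    return rng.choices(
        ["reweighted_pretraining_data", "benchmark_style_data"],
        weights=[0.95, 0.05],  # assumed split; the second set is only "a small amount"
    )[0]

def continued_training_lr(step: int, total_steps: int,
                          lr_max: float, lr_min: float) -> float:
    """Learning rate that decays more aggressively than the pre-training schedule."""
    progress = min(step / max(total_steps, 1), 1.0)
    # A cosine shape is one common choice; the exact schedule is an assumption here.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

rng = random.Random(0)
print(sample_batch_source(rng))                     # e.g. "reweighted_pretraining_data"
print(continued_training_lr(50, 100, 1e-4, 1e-6))   # value halfway through the decay
```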

Conclusion

Nemotron-4 15B is a highly capable, general-purpose language model that punches well above its size class. By training on a huge and diverse 8 trillion token dataset, it has achieved impressive results across English natural language tasks, coding tasks in many programming languages, and especially multilingual tasks.

The continued training approach, in which the data distribution and learning rate schedule were modified after the main pre-training run, resulted in even better downstream performance. This demonstrates the value of tailoring the later stages of training to the intended downstream use.

Despite its large size, Nemotron-4 15B is still practical to deploy, as it fits on a single high-end GPU such as an NVIDIA A100 or H100. This opens the door to many useful applications in natural language processing, code generation and analysis, question answering, and more.

NVIDIA’s work with Nemotron-4 15B shows that we can achieve major jumps in language model performance and capabilities not just by scaling model size, but by vastly increasing the scale and diversity of the training data. As large language models continue to advance, this is an important lesson for the field. Nemotron-4 15B is an exciting milestone on the path to ever more powerful and capable AI systems.

Last Update: 16/06/2024