NVIDIA has launched Mistral-NeMo-Minitron 8B, a compact version of the Mistral NeMo 12B model developed in collaboration with Mistral AI. It is not only highly accurate but also computationally efficient, and can run on GPU-accelerated data centers, clouds, and workstations.
Optimized through pruning and distillation techniques, this smaller model reduces computational cost while maintaining cutting-edge accuracy, and can achieve real-time performance on devices such as workstations and laptops.
The Mistral-NeMo-Minitron 8B is available as an NVIDIA NIM microservice for a variety of applications, including AI-driven chatbots, virtual assistants, and content generators.
Unlike larger models, small language models can run in real time on workstations and laptops. This makes it easier for organizations with limited resources to deploy generative AI capabilities in their infrastructure while optimizing costs, operational efficiency, and energy use. Running language models locally on edge devices also brings security advantages because data does not need to be transmitted from the edge device to the server.
Developers can get started with Mistral-NeMo-Minitron 8B packaged as an NVIDIA NIM microservice with a standard application programming interface (API), or by downloading the model from Hugging Face. A downloadable NVIDIA NIM that can be deployed in minutes on any GPU-accelerated system will be available soon.
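As a starting point, here is a minimal sketch of loading the model from Hugging Face with the transformers library. The repo ID follows NVIDIA's naming convention and is an assumption, as are the dtype and generation settings.

```python
# Minimal sketch: load Mistral-NeMo-Minitron 8B from Hugging Face and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a single workstation GPU
    device_map="auto",
)

prompt = "Small language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```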
Optimization for Mistral-NeMo-Minitron 8B
The optimization of Mistral-NeMo-Minitron 8B is achieved through two key steps:
Width Pruning:
- Purpose: Width pruning reduces the size of the model without significantly degrading its performance. It does this by trimming the number of neurons in the model, along with attention heads and embedding channels.
- Process: To prune the Mistral NeMo 12B model, the researchers computed importance scores for each attention head, embedding channel, and MLP hidden dimension, then pruned the model based on these scores. Specifically, the MLP intermediate dimension was reduced from 14,336 to 11,520 and the hidden size from 5,120 to 4,096, while the number of attention heads and layers was kept unchanged.
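To make the idea concrete, here is a conceptual sketch of importance-based width pruning for a single MLP layer. This is not NVIDIA's implementation: the scoring function (mean absolute activation over a calibration batch) and the helper `prune_mlp_width` are illustrative assumptions.

```python
# Conceptual sketch: prune an MLP's hidden dimension by activation importance.
import torch
import torch.nn as nn

def prune_mlp_width(up_proj: nn.Linear, down_proj: nn.Linear,
                    calib_acts: torch.Tensor, new_dim: int):
    """Keep the new_dim most important hidden units (e.g. 14336 -> 11520)."""
    # calib_acts: [num_tokens, intermediate_dim] activations from calibration data
    importance = calib_acts.abs().mean(dim=0)            # one score per hidden unit
    keep = importance.topk(new_dim).indices.sort().values

    pruned_up = nn.Linear(up_proj.in_features, new_dim, bias=False)
    pruned_up.weight.data = up_proj.weight.data[keep]            # drop output rows
    pruned_down = nn.Linear(new_dim, down_proj.out_features, bias=False)
    pruned_down.weight.data = down_proj.weight.data[:, keep]     # drop input columns
    return pruned_up, pruned_down
```

The same recipe extends to embedding channels and attention heads by scoring and slicing the corresponding weight dimensions.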
Knowledge Distillation:
- Purpose: Knowledge distillation transfers the knowledge of a large, complex model (the teacher) into a smaller student model, creating a more efficient model that retains most of the original's predictive power.
- Process: After pruning, the research team lightly retrained the model on a dataset of 380 billion tokens. The retraining used a peak learning rate of 1e-4, a minimum learning rate of 4.5e-7, 60 steps of linear warm-up, a cosine decay schedule, and a global batch size of 768. This distillation pass helped recover the accuracy lost during pruning.
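Below is a minimal sketch of the training ingredients described above. The learning-rate numbers come from the text; the forward-KL logit-distillation loss is a standard choice assumed here (the article does not state NVIDIA's exact loss), and `lr_at` and `distill_loss` are illustrative helpers.

```python
# Sketch: LR schedule with linear warm-up + cosine decay, and a logit-distillation loss.
import math
import torch.nn.functional as F

PEAK_LR, MIN_LR, WARMUP_STEPS = 1e-4, 4.5e-7, 60

def lr_at(step: int, total_steps: int) -> float:
    if step < WARMUP_STEPS:                                  # linear warm-up
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))        # decays 1 -> 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

def distill_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Forward KL between teacher and student token distributions, per token."""
    vocab = student_logits.size(-1)
    t = F.softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1)
    s = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```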
Through this combination of pruning and distillation, the Mistral-NeMo-Minitron 8B model significantly reduces computational costs while maintaining high predictive accuracy. This optimization strategy provides an effective framework for building smaller, more efficient AI models.
Performance
The Mistral-NeMo-Minitron 8B model performs well across multiple benchmarks. Its performance can be summarized as follows:
Leading Benchmark Scores:
- Nine popular benchmarks: Mistral-NeMo-Minitron 8B achieves excellent results on nine widely used benchmarks covering language understanding, commonsense reasoning, mathematical reasoning, summarization, code generation, and the ability to generate truthful answers.
- Comparison results: On these benchmarks, the Mistral-NeMo-Minitron 8B base model performs close to, and in some cases better than, its "big brother" Mistral NeMo 12B. It scores well on WinoGrande, ARC Challenge, MMLU, HellaSwag, GSM8K, TruthfulQA, XLSum en, MBPP, and HumanEval, and on WinoGrande and GSM8K in particular it outperforms many models of similar size.
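For readers who want to check a subset of these numbers themselves, here is a hedged sketch using EleutherAI's lm-evaluation-harness (installable via pip as lm-eval). The Hugging Face repo ID and the task names are assumptions and may not match NVIDIA's exact evaluation setup.

```python
# Sketch: evaluate the base model on a few of the listed benchmarks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nvidia/Mistral-NeMo-Minitron-8B-Base,dtype=bfloat16",
    tasks=["winogrande", "arc_challenge", "hellaswag", "gsm8k"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```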
Efficient computational cost:
- Training efficiency: Through pruning and knowledge distillation, the Mistral-NeMo-Minitron 8B model delivers performance close to the 12B model while significantly reducing demand for computing resources. Compared with training a model of the same size from scratch, pruning followed by distillation retraining saves up to 40x in compute.
Highly adaptable:
- Compact structure: With 8B parameters, Mistral-NeMo-Minitron 8B has a compact structure suited to scenarios that require efficient AI processing, such as embedded, mobile, or edge computing devices.
- Balance between accuracy and efficiency: The model greatly improves runtime efficiency while retaining high accuracy, making it suitable for applications that demand low latency and fast responses, such as real-time chatbots, virtual assistants, and content generation tools.
NVIDIA also announced Nemotron-Mini-4B-Instruct this week, another small language model optimized for low memory usage and faster response times on NVIDIA GeForce RTX AI PCs and laptops. The model is available for cloud and on-device deployment as an NVIDIA NIM microservice and is part of NVIDIA ACE, a suite of digital human technologies powered by generative AI that deliver speech, intelligence, and animation.
Experience both models as NIM microservices at ai.nvidia.com via your browser or the API.
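As a quick-start sketch, the hosted NIM endpoints are OpenAI-API-compatible, so a call might look like the following. The base URL, model ID, and the NVIDIA_API_KEY environment variable are assumptions based on NVIDIA's NIM conventions, not details confirmed by this article.

```python
# Sketch: call the hosted model through an OpenAI-compatible NIM endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NIM endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # hypothetical env var for your key
)
response = client.chat.completions.create(
    model="nvidia/mistral-nemo-minitron-8b-8k-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize what a NIM microservice is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```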
Official introduction: https://blogs.nvidia.com/blog/mistral-nemo-minitron-8b-small-language-model/