NVIDIA has launched Mistral-NeMo-Minitron 8B, a compact version of the Mistral NeMo 12B model developed in collaboration with Mistral AI. It is not only highly accurate but also computationally efficient, and can run on GPU-accelerated data centers, clouds, and workstations.
Optimized through pruning and distillation techniques, this smaller model reduces computational cost while maintaining cutting-edge accuracy, and can achieve real-time performance on devices such as workstations and laptops.
The Mistral-NeMo-Minitron 8B is available as an NVIDIA NIM microservice for a variety of applications, including AI-driven chatbots, virtual assistants, and content generators.
Unlike larger models, small language models can run in real time on workstations and laptops. This makes it easier for organizations with limited resources to deploy generative AI capabilities in their infrastructure while optimizing costs, operational efficiency, and energy use. Running language models locally on edge devices also brings security advantages because data does not need to be transmitted from the edge device to the server.
Developers can get started with Mistral-NeMo-Minitron 8B packaged as an NVIDIA NIM microservice with a standard application programming interface (API), or by downloading the model from Hugging Face. A downloadable NVIDIA NIM that can be deployed in minutes on any GPU-accelerated system will be available soon.
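As a starting point, here is a minimal sketch of loading the model from Hugging Face with the transformers library. The repo ID follows NVIDIA's naming convention and is an assumption, as are the dtype and generation settings.

```python
# Minimal sketch: load Mistral-NeMo-Minitron 8B from Hugging Face and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Mistral-NeMo-Minitron-8B-Base"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit a single workstation GPU
    device_map="auto",
)

prompt = "Small language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```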
Optimization for Mistral-NeMo-Minitron 8B
The optimization of Mistral-NeMo-Minitron 8B is achieved through two key steps:
Width Pruning:
- Purpose: Width pruning reduces the size of the model without significantly degrading its performance. It does this by trimming the number of neurons in the model, along with attention heads and embedding channels.
- Process: To prune the Mistral NeMo 12B model, the researchers computed importance scores for each attention head, embedding channel, and MLP hidden dimension, then pruned the model based on these scores. Specifically, the MLP intermediate dimension was reduced from 14,336 to 11,520 and the hidden size from 5,120 to 4,096, while the number of attention heads and layers was kept unchanged.
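To make the idea concrete, here is a conceptual sketch of importance-based width pruning for a single MLP layer. This is not NVIDIA's implementation: the scoring function (mean absolute activation over a calibration batch) and the helper `prune_mlp_width` are illustrative assumptions.

```python
# Conceptual sketch: prune an MLP's hidden dimension by activation importance.
import torch
import torch.nn as nn

def prune_mlp_width(up_proj: nn.Linear, down_proj: nn.Linear,
                    calib_acts: torch.Tensor, new_dim: int):
    """Keep the new_dim most important hidden units (e.g. 14336 -> 11520)."""
    # calib_acts: [num_tokens, intermediate_dim] activations from calibration data
    importance = calib_acts.abs().mean(dim=0)            # one score per hidden unit
    keep = importance.topk(new_dim).indices.sort().values

    pruned_up = nn.Linear(up_proj.in_features, new_dim, bias=False)
    pruned_up.weight.data = up_proj.weight.data[keep]            # drop output rows
    pruned_down = nn.Linear(new_dim, down_proj.out_features, bias=False)
    pruned_down.weight.data = down_proj.weight.data[:, keep]     # drop input columns
    return pruned_up, pruned_down
```

The same recipe extends to embedding channels and attention heads by scoring and slicing the corresponding weight dimensions.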
Knowledge Distillation:
- Purpose: Knowledge distillation transfers the knowledge of a large, complex model (the teacher) into a smaller student model, creating a more efficient model that retains most of the original's predictive power.
- Process: After pruning, the research team lightly retrained the model on a dataset of 380 billion tokens. The retraining used a peak learning rate of 1e-4, a minimum learning rate of 4.5e-7, 60 steps of linear warm-up, a cosine decay schedule, and a global batch size of 768. This distillation pass helped recover the accuracy lost during pruning.
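Below is a minimal sketch of the training ingredients described above. The learning-rate numbers come from the text; the forward-KL logit-distillation loss is a standard choice assumed here (the article does not state NVIDIA's exact loss), and `lr_at` and `distill_loss` are illustrative helpers.

```python
# Sketch: LR schedule with linear warm-up + cosine decay, and a logit-distillation loss.
import math
import torch.nn.functional as F

PEAK_LR, MIN_LR, WARMUP_STEPS = 1e-4, 4.5e-7, 60

def lr_at(step: int, total_steps: int) -> float:
    if step < WARMUP_STEPS:                                  # linear warm-up
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))        # decays 1 -> 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

def distill_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Forward KL between teacher and student token distributions, per token."""
    vocab = student_logits.size(-1)
    t = F.softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1)
    s = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```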
Through this combination of pruning and distillation, the Mistral-NeMo-Minitron 8B model significantly reduces computational costs while maintaining high predictive accuracy. This optimization strategy provides an effective framework for building smaller, more efficient AI models.
Performance
The Mistral-NeMo-Minitron 8B model performs well across multiple benchmarks. Its performance can be summarized as follows:
Leading Benchmark Scores:
- Nine popular benchmarks: Mistral-NeMo-Minitron 8B achieves excellent results on nine widely used benchmarks covering language understanding, commonsense reasoning, mathematical reasoning, summarization, code generation, and the ability to generate truthful answers.
- Comparison results: On these benchmarks, the Mistral-NeMo-Minitron 8B base model performs close to, and in some cases better than, its "big brother" Mistral NeMo 12B. It scores well on WinoGrande, ARC Challenge, MMLU, HellaSwag, GSM8K, TruthfulQA, XLSum en, MBPP, and HumanEval, and on WinoGrande and GSM8K in particular it outperforms many models of similar size.
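For readers who want to check a subset of these numbers themselves, here is a hedged sketch using EleutherAI's lm-evaluation-harness (installable via pip as lm-eval). The Hugging Face repo ID and the task names are assumptions and may not match NVIDIA's exact evaluation setup.

```python
# Sketch: evaluate the base model on a few of the listed benchmarks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nvidia/Mistral-NeMo-Minitron-8B-Base,dtype=bfloat16",
    tasks=["winogrande", "arc_challenge", "hellaswag", "gsm8k"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```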
Efficient computational cost:
- Training efficiency: Through pruning and knowledge distillation, the Mistral-NeMo-Minitron 8B model delivers performance close to the 12B model while significantly reducing demand for computing resources. Compared with training a model of the same size from scratch, pruning followed by distillation retraining saves up to 40x in compute.
Highly adaptable:
- Compact structure: With 8B parameters, Mistral-NeMo-Minitron 8B has a compact structure suited to scenarios that require efficient AI processing, such as embedded, mobile, or edge computing devices.
- Balance between accuracy and efficiency: The model greatly improves runtime efficiency while retaining high accuracy, making it suitable for applications that demand low latency and fast responses, such as real-time chatbots, virtual assistants, and content generation tools.
NVIDIA also announced Nemotron-Mini-4B-Instruct this week, another small language model optimized for low memory usage and faster response times on NVIDIA GeForce RTX AI PCs and laptops. The model is available for cloud and on-device deployment as an NVIDIA NIM microservice and is part of NVIDIA ACE, a suite of digital human technologies powered by generative AI that deliver speech, intelligence, and animation.
Experience both models as NIM microservices at ai.nvidia.com via your browser or the API.
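As a quick-start sketch, the hosted NIM endpoints are OpenAI-API-compatible, so a call might look like the following. The base URL, model ID, and the NVIDIA_API_KEY environment variable are assumptions based on NVIDIA's NIM conventions, not details confirmed by this article.

```python
# Sketch: call the hosted model through an OpenAI-compatible NIM endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NIM endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # hypothetical env var for your key
)
response = client.chat.completions.create(
    model="nvidia/mistral-nemo-minitron-8b-8k-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize what a NIM microservice is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```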
Official introduction: https://blogs.nvidia.com/blog/mistral-nemo-minitron-8b-small-language-model/