Alibaba Cloud has released the Qwen2.5 series of models. The series includes general-purpose large language models (LLMs) as well as dedicated models for programming and mathematics: Qwen2.5-Coder and Qwen2.5-Math.
The available model sizes include:
- Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B and 72B
- Qwen2.5-Coder: 1.5B, 7B and 32B (coming soon)
- Qwen2.5-Math: 1.5B, 7B and 72B
The new model has made significant progress in following instructions, generating long texts (over 8K tokens), understanding structured data (such as tables), and generating structured output (especially JSON format).
The Qwen2.5 models also adapt better to diverse system prompts, which strengthens role-play and condition-setting for chat applications.
Like Qwen2, Qwen2.5 supports a context length of up to 128K tokens, can generate up to 8K tokens, and supports more than 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic.
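For readers who want to try the release, below is a minimal sketch of running one of the open checkpoints locally with Hugging Face transformers; the checkpoint name, prompt, and generation settings are illustrative choices, not official recommendations.

```python
# Minimal sketch: chat generation with an open Qwen2.5 instruct checkpoint.
# The model name and settings here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Qwen2.5 release in three bullet points."},
]
# Build the chat-formatted prompt and generate a reply.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```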
Key Highlights of Qwen2.5 Models
Compared with the Qwen2 series, the Qwen2.5 series has the following upgrades:
- Fully open source: In addition to continuing to open-source the four Qwen2 sizes (0.5B, 1.5B, 7B, and 72B), Qwen2.5 adds two cost-effective mid-sized models, Qwen2.5-14B and Qwen2.5-32B, and a mobile-scale model, Qwen2.5-3B. These models are highly competitive with open-source models of similar size.
- Larger-scale, higher-quality pre-training datasets: The size of the pre-training dataset has been expanded from 7 trillion tokens to 18 trillion tokens.
- Knowledge enhancement: Qwen2.5 has a significantly larger knowledge base. On the MMLU benchmark, Qwen2.5-7B and 72B improved from 70.3 to 74.2 and from 84.2 to 86.1, respectively.
- Enhanced programming capability: Drawing on the technical advances behind Qwen2.5-Coder, Qwen2.5's programming capability has improved significantly. Qwen2.5-72B-Instruct scores 55.5, 75.1, and 88.2 on the LiveCodeBench, MultiPL-E, and MBPP benchmarks, respectively.
- Improved math capabilities: After incorporating Qwen2-Math techniques, the math capabilities of Qwen2.5 have improved markedly. The MATH benchmark scores of Qwen2.5-7B/72B-Instruct rose from 52.9/69.0 to 75.5/83.1.
- Better alignment with human preferences: Qwen2.5 generates responses that align more closely with human preferences. Notably, the Arena-Hard score of Qwen2.5-72B-Instruct jumped from 48.1 to 81.2, and its MT-Bench score improved from 9.12 to 9.35.
- Other core capabilities enhanced: Qwen2.5 has made significant progress in instruction following, long-text generation (from 1K up to 8K tokens), understanding structured data (such as tables), and generating structured output (especially JSON). The models also adapt more flexibly to different system prompts, strengthening role-play and condition-setting; a minimal sketch of JSON-oriented prompting follows this list.
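To make the structured-output point concrete, here is a hedged sketch of steering the model toward JSON purely through the system prompt, assuming a recent transformers version whose text-generation pipeline accepts chat-style message lists; the schema and prompt are made up for the example.

```python
# Sketch: request JSON via the system prompt and parse the reply defensively.
# The schema below is illustrative, not an official output format.
import json
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": 'Answer only with a JSON object shaped like '
                                  '{"city": str, "population_millions": float}.'},
    {"role": "user", "content": "Give me basic facts about Tokyo."},
]
reply = chat(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"]

# The instruction biases the model toward valid JSON but does not guarantee it.
try:
    print(json.loads(reply))
except json.JSONDecodeError:
    print("Non-JSON reply:", reply)
```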
Specialized model highlights:
- Qwen2.5-Coder: Focused on programming tasks. Compared with its predecessor, CodeQwen1.5, it offers stronger coding capabilities, supports many programming languages, and performs strongly on the HumanEval benchmark.
- Qwen2.5-Math: Supports Chinese and English and combines several reasoning methods, including chain-of-thought (CoT), program-of-thought (PoT), and tool-integrated reasoning (TIR), for stronger mathematical reasoning.
Performance Improvements
- The Qwen2.5 series performs well on many benchmarks; in particular, the 72B model leads comparable open-source models such as Llama-3.1-70B and Mistral-Large-V2.
- The programming model Qwen2.5-Coder performs well in coding tasks, while the mathematical reasoning ability of Qwen2.5-Math also surpasses many similar models.
- Comparison with Llama and GPT-4o: Across multiple benchmarks, Qwen2.5-72B performs well on many tasks, especially compared with models such as Llama-3 and Mistral-Large. Although Qwen2.5 trails GPT-4o and Claude-3.5-Sonnet in some areas, it remains highly competitive overall, particularly among open-source models.
- Qwen-Plus: As the API model of Qwen2.5, Qwen-Plus offers advantages in inference speed and cost. Compared with models such as Llama-3.1-405B and DeepSeek-V2.5, Qwen-Plus performs well, especially in terms of cost-effectiveness.
- Notably, although Qwen2.5-3B has a relatively small parameter count, it still performs well on many tasks, showing that as knowledge density increases, small models can rival much larger ones on some tasks. Qwen2.5-3B scores above 65 on MMLU, demonstrating its efficiency in multi-task language understanding.
Qwen2.5-Coder
Qwen2.5-Coder continues the coding line that began with CodeQwen1.5 and has been officially renamed Qwen-Coder. The main improvements in this version are a much larger code training corpus and significantly better performance on coding tasks, while retaining math and general-task capabilities.
Qwen2.5-Coder supports a context length of 128K tokens, covers 92 programming languages, and shows significant progress across code-related evaluations such as code generation, multi-language code generation, code completion, and code repair.
Notably, the open-source 7B version of Qwen2.5-Coder even surpasses larger models such as DeepSeek-Coder-V2-Lite and Codestral, making it one of the strongest base code models available. Beyond code tasks, Qwen2.5-Coder also shows solid mathematical capability on evaluations such as GSM8K and MATH, and on general tasks, evaluations on MMLU and ARC show that it retains the general performance of Qwen2.5.
Key Features of Qwen2.5-Coder:
- Model sizes: Available in 1.5B and 7B, with a 32B version to be released soon.
- Data scale: Training data expanded to 5.5 trillion tokens, covering source code, text-code grounding data, and synthetic data.
- Enhanced code capabilities: Significant improvements in code generation, multi-language programming, code completion, and code repair, with support for up to 92 programming languages; the 7B model outperforms some larger models. A minimal completion sketch follows this list.
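To show the base-model use case, here is a minimal sketch of plain prefix completion with a base Qwen2.5-Coder checkpoint; the repository name and decoding settings are illustrative.

```python
# Sketch: code completion from a prefix with a base Qwen2.5-Coder checkpoint.
# Checkpoint name and decoding settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Give the model a function signature and docstring to continue.
prefix = '''def quicksort(items):
    """Return a new list with the items sorted in ascending order."""
'''
inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```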
Qwen2.5-Coder-Instruct: Instruction-tuned model
Qwen2.5-Coder was fine-tuned on instruction data to produce Qwen2.5-Coder-Instruct. The instruction-tuned model not only further improves task performance but also shows excellent generalization across multiple benchmarks.
Qwen2.5-Coder-Instruct excels in several key areas:
- Multi-language expertise: We extended multi-language evaluation with McEval to cover more than 40 programming languages. The results show that Qwen2.5-Coder-Instruct performs very well across many programming languages, including niche ones.
- Code reasoning: We believe that code reasoning is closely related to general reasoning ability. Using CRUXEval as a benchmark, the results show that Qwen2.5-Coder-Instruct performs well in code reasoning tasks. As code reasoning ability improves, the model's ability to execute complex instructions also increases, which motivates us to further explore how to enhance general capabilities through code.
- Mathematical reasoning: Mathematics is the foundation of code, and code is a key tool for mathematics. Qwen2.5-Coder-Instruct performs well on both code and math tasks, proving itself strong in both disciplines.
In addition, we evaluated the general capabilities of Qwen2.5-Coder-Instruct, and the results show that it retains the general-ability strengths of Qwen2.5.
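As one example of the instruction-tuned workflow, here is a hedged sketch of a code-repair prompt for Qwen2.5-Coder-Instruct; the checkpoint name, buggy snippet, and prompt wording are all illustrative.

```python
# Sketch: asking Qwen2.5-Coder-Instruct to repair a deliberately buggy function.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

buggy = """def mean(values):
    return sum(values) / len(values) + 1  # off-by-one bug
"""
messages = [
    {"role": "user", "content": "Fix the bug in this function and explain the fix:\n" + buggy},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```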
The Qwen team is preparing to release the 32B version of Qwen2.5-Coder to challenge larger proprietary models. At the same time, they are also exploring more powerful models based on code reasoning to push the boundaries of code intelligence.
Qwen2.5-Math
Qwen2.5-Math is a large language model designed specifically for solving math problems. It supports solving math problems in Chinese and English through chain-of-thought (CoT) and tool-integrated reasoning (TIR). Compared with the previous Qwen2-Math series, Qwen2.5-Math performs significantly better on Chinese and English math benchmarks, with especially strong gains on complex algorithmic and symbolic computation tasks.
Key Features of Qwen2.5-Math:
- Model series: Includes base models (1.5B, 7B, and 72B parameters), instruction-tuned models (Instruct versions), and a math reward model (Qwen2.5-Math-RM-72B).
- Bilingual support: In addition to English math problems, Chinese problems are now supported, enabling excellent performance in complex math reasoning tasks.
- Performance improvement: In multiple math benchmarks (such as MATH, GSM8K, and Chinese College Entrance Examination Mathematics), the Qwen2.5-Math series models significantly outperformed the Qwen2-Math series, especially in TIR mode, with the highest score reaching 92.9.
Model Performance
- On multiple challenging math benchmarks (such as AIME 2024 and AMC 2023), Qwen2.5-Math-72B-Instruct far outperforms other open-source models and some closed-source models (such as GPT-4o and Gemini).
- Qwen2.5-Math is particularly strong on competition-level math problems, achieving scores above 80 even with its smaller models (such as the 1.5B variant).
Qwen2.5-Math: Base Model
The overall training pipeline of Qwen2-Math and Qwen2.5-Math is illustrated in a figure in the official blog. After the Qwen2-Math base model is trained, it is upgraded to Qwen2.5-Math in three main ways:
- Using the Qwen2-Math-72B-Instruct model to synthesize additional high-quality math pre-training data.
- Aggregating more high-quality math data over multiple rounds, especially Chinese data, from web resources, books, and code.
- Initializing parameters from the Qwen2.5 series base models, which bring stronger language understanding, code generation, and text reasoning.
Finally, the Qwen Math Corpus v2 dataset used for Qwen2.5-Math pre-training was constructed with a context length of 4K. Compared with the Qwen Math Corpus v1 used to train Qwen2-Math, the total token count grew from 700 billion to more than 1 trillion.
The Qwen2.5-Math base models are evaluated on three widely used English math benchmarks (GSM8K, MATH, and MMLU-STEM) and three Chinese math benchmarks (CMATH, Gaokao Math Cloze, and Gaokao Math QA). All evaluations use few-shot chain-of-thought prompting; a minimal prompt-construction sketch follows.
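As an illustration of that evaluation setup, the sketch below assembles a few-shot chain-of-thought prompt in the GSM8K style; the exemplars and the answer-extraction convention are invented for the example and are not the official evaluation harness.

```python
# Sketch: building a few-shot chain-of-thought prompt and extracting the answer.
import re

FEW_SHOT = [
    ("A pen costs 2 dollars and a notebook costs 3 dollars. How much do 2 pens "
     "and 1 notebook cost?",
     "Two pens cost 2 * 2 = 4 dollars. Adding one notebook gives 4 + 3 = 7 dollars. "
     "The answer is 7."),
    ("There are 12 apples and 4 children share them equally. How many apples does "
     "each child get?",
     "Each child gets 12 / 4 = 3 apples. The answer is 3."),
]

def build_prompt(question: str) -> str:
    """Prepend worked examples so the model imitates step-by-step reasoning."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in FEW_SHOT)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def extract_answer(completion: str) -> str | None:
    """Pull the final number after 'The answer is', the convention used above."""
    match = re.search(r"The answer is\s*(-?[\d.,]+)", completion)
    return match.group(1).rstrip(".,") if match else None

print(build_prompt("A train travels 60 km per hour for 3 hours. How far does it go?"))
```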
Qwen2.5-Math-Instruct: Instruction-tuned model
A math-specific reward model, Qwen2.5-Math-RM-72B, was trained on top of Qwen2.5-Math-72B. It is used to construct supervised fine-tuning (SFT) data via rejection sampling, and Group Relative Policy Optimization (GRPO) is applied for reinforcement learning after SFT.
In developing Qwen2.5-Math-Instruct, Qwen2.5-Math-RM-72B guided the rejection-sampling stage to further improve response quality, and Chinese and English TIR data and SFT data were introduced in subsequent training.
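To make the rejection-sampling step concrete, the schematic sketch below keeps only the highest-reward responses per prompt as SFT data. The helpers sample_responses and score_with_rm are hypothetical stand-ins for calls to the policy model and to Qwen2.5-Math-RM-72B; they are not real APIs.

```python
# Schematic sketch of reward-model-based rejection sampling for SFT data.
# `sample_responses` and `score_with_rm` are hypothetical stand-ins.
from typing import Callable

def rejection_sample(
    prompts: list[str],
    sample_responses: Callable[[str, int], list[str]],
    score_with_rm: Callable[[str, str], float],
    n_candidates: int = 8,
    keep_top: int = 1,
) -> list[dict]:
    """For each prompt, sample several responses and keep only the top-scoring ones."""
    sft_data = []
    for prompt in prompts:
        candidates = sample_responses(prompt, n_candidates)
        ranked = sorted(candidates, key=lambda r: score_with_rm(prompt, r), reverse=True)
        for response in ranked[:keep_top]:
            sft_data.append({"prompt": prompt, "response": response})
    return sft_data
```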
The Qwen2.5-Math-Instruct models perform well on both Chinese and English math benchmarks, especially on complex math-competition evaluations such as AIME and AMC. For example, on the AMC 2023 benchmark, Qwen2.5-Math-1.5B-Instruct solves 21 out of 29 questions in CoT mode with RM@256, while Qwen2.5-Math-72B-Instruct nearly achieves a perfect score in TIR mode.
A demo supporting TIR mode has been released; with Qwen-Agent, the code can be run locally to try out the tool-integrated reasoning capabilities of Qwen2.5-Math.
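For intuition about what TIR mode does, here is a schematic sketch of the reasoning-with-code loop: the model drafts Python for a calculation step, the code is executed locally, and the printed result is fed back before the model continues. The generate callable is a hypothetical stand-in for a call to Qwen2.5-Math-Instruct, and the unsandboxed exec is for demonstration only.

```python
# Schematic TIR loop: generate -> run emitted Python -> feed output back.
import io
import re
import contextlib

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_python(code: str) -> str:
    """Execute a snippet and capture what it prints (unsandboxed; demo only)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()

def tir_answer(question: str, generate, max_turns: int = 4) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        reply = generate(transcript)           # model may emit a ```python ... ``` block
        transcript += reply
        match = CODE_BLOCK.search(reply)
        if not match:                          # no more code: treat reply as the final answer
            return reply
        result = run_python(match.group(1))    # run the tool call and feed the output back
        transcript += f"\n[python output]: {result}\n"
    return transcript
```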
A multimodal math demo is available on Hugging Face and ModelScope. The WebUI uses Qwen2-VL for OCR and Qwen2-Math for math reasoning; you can input images, text, or sketches of math and arithmetic problems.
Official blog: https://qwenlm.github.io/blog/qwen2.5/
- Author: KCGOD
- URL: https://kcgod.com/Advanced-Qwen2.5-Models-by-Alibaba
- Copyright: Unless otherwise stated, all articles in this blog are licensed under CC BY-NC-SA. Please credit the source!