Microsoft has expanded its Phi model family with the launch of two new open-source models: Phi-4-multimodal and Phi-4-mini. These Small Language Models (SLMs) give developers advanced AI capabilities, handling multimodal processing across text, speech, and vision while keeping inference efficient and computational demands low.
Their advantages include high performance, low resource requirements, edge compatibility, and cost-effectiveness, making them suitable for diverse industries including finance and healthcare.
Phi-4-Multimodal: A Versatile Multimodal Model
This model is capable of concurrently processing speech, vision, and text, making it ideal for innovative applications requiring understanding and reasoning across diverse data types.
Phi-4-Mini: A Compact, High-Performance Model Focused on Textual Tasks
This model prioritizes accuracy and low resource consumption, making it well-suited for scenarios demanding efficient computation.
Phi-4-multimodal: Multimodal AI Language Model
The Phi-4-multimodal model employs a novel architecture that improves efficiency and scalability. It features an expanded vocabulary for broader language coverage, supports multilingual functionality, and integrates language reasoning with multimodal inputs.
Phi-4 Multimodal Audio and Visual Benchmarks
Core Features
- Multimodal Fusion: Simultaneously processes speech, vision, and text, eliminating the need for supplementary pipelines or separate models for different input types (a usage sketch follows this list).
- Long Context Window: Capable of processing and reasoning over large datasets, such as documents, web pages, or code.
- Efficient Inference & Low Computational Overhead: Optimized for on-device operation, making it suitable for mobile and edge computing environments.
- Enhanced Cross-Modal Learning: Leverages cross-modal learning techniques, enabling AI devices to understand context more naturally and facilitate more intelligent interactions.
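The model is published on Hugging Face as `microsoft/Phi-4-multimodal-instruct` and can be driven through the `transformers` library. The sketch below shows a minimal image-plus-text call; the placeholder tokens (`<|user|>`, `<|image_1|>`, `<|end|>`, `<|assistant|>`), the example URL, and the generation settings are assumptions drawn from the model card rather than a definitive recipe, so verify them against the official examples.

```python
# Minimal sketch: image + text inference with Phi-4-multimodal.
# Prompt format and settings are assumptions based on the model card.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"

# The model ships custom processing code, hence trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# <|image_1|> marks where the image is injected into the prompt.
prompt = "<|user|><|image_1|>Describe this chart and extract its key numbers.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)  # placeholder URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```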
Industry-Leading Performance:
- Automatic Speech Recognition (ASR) and Speech Translation (ST) Superiority: Outperforms WhisperV3 and SeamlessM4T-v2-Large.
- Hugging Face OpenASR Leaderboard Excellence: Achieves a Word Error Rate (WER, defined below) of 6.14%, surpassing the previous best of 6.5%.
- Robust Performance in Mathematical and Scientific Reasoning, OCR (Optical Character Recognition), and Document and Table Understanding: Outperforms Gemini-2.0-Flash-Lite-Preview and Claude-3.5-Sonnet.
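For reference, WER counts the word-level edits needed to turn the model's transcript into the reference transcript:

$$
\mathrm{WER} = \frac{S + D + I}{N}
$$

where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words and $N$ is the number of words in the reference. A WER of 6.14% therefore corresponds to roughly six word-level errors per 100 reference words.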
Phi-4-Multimodal Benchmark Results
Phi-4-multimodal is a 5.6 billion parameter multimodal model adept at processing speech, vision, and text concurrently. Key performance highlights from benchmark testing are as follows:
Speech-Related Tasks:
- Automatic Speech Recognition (ASR): Ranked first on the Hugging Face OpenASR leaderboard with a Word Error Rate (WER) of 6.14%, surpassing the previous leading model (6.5%, as of February 2025). It outperforms the professional ASR model WhisperV3.
- Speech Translation (ST): Exceeds the performance of specialized speech translation models such as SeamlessM4T-v2-Large.
- Speech Question Answering (Speech QA): While exhibiting strong performance, it still lags behind models like Gemini-2.0-Flash and GPT-4o-realtime-preview, primarily due to its smaller model size, which limits factual knowledge retention.
- Speech Summarization: It is the first open-source model to offer this capability, and its performance approximates that of GPT-4o (a minimal speech-input sketch follows this list).
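Speech goes through the same interface as images: an `<|audio_1|>` placeholder in the prompt and the waveform passed to the processor. The `audios=[(waveform, sample_rate)]` argument, the prompt tokens, and the `speech.wav` file below are assumptions taken from the model card, so treat this as a sketch to check against the official example.

```python
# Minimal sketch: speech transcription with Phi-4-multimodal.
# The `audios` argument and prompt tokens are assumptions based on the model card.
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

waveform, sample_rate = sf.read("speech.wav")  # placeholder audio file
prompt = "<|user|><|audio_1|>Transcribe this audio to text.<|end|><|assistant|>"

inputs = processor(
    text=prompt, audios=[(waveform, sample_rate)], return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

transcript = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(transcript)
```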
Vision-Related Tasks:
- In common multimodal tasks including mathematical and scientific reasoning, document and chart understanding, Optical Character Recognition (OCR), and visual scientific reasoning, Phi-4-multimodal performs comparably to, and in some aspects outperforms, popular models such as Gemini-2.0-Flash-Lite-Preview and Claude-3.5-Sonnet.
- In comprehensive tests involving both visual and audio inputs, Phi-4-multimodal significantly surpasses Gemini-2.0-Flash and holds an advantage when compared to InternOmni (a larger parameter open-source model specifically designed for multimodality).
Comparison with Other Models:
- Outperforms WhisperV3 and SeamlessM4T-v2-Large in speech tasks, and competes with Gemini-2.0-Flash and Claude-3.5-Sonnet in multimodal tasks, approaching the level of GPT-4o.
- Demonstrates a substantial lead over Gemini-2.0-Flash in combined visual and audio tests, highlighting its unique strengths in multimodal processing.
Overall Performance:
- Achieves an average score of 72 across multiple internal Microsoft visual benchmarks, trailing OpenAI’s GPT-4 by less than one point, while Gemini-2.0-Flash scores 74.3. This near top-tier result underscores its competitiveness in multimodal tasks.
Phi-4-mini: Efficient Text AI Model
Core Features
- Text Specialization: Excels in text-based tasks such as financial calculations, report generation, and multilingual document translation.
- Smaller Yet Powerful: A 3.8B-parameter model with a dense decoder architecture and a 200,000-token vocabulary, optimized for textual tasks.
- Efficient Computation: Supports a 128,000-token context window and is proficient in tasks like reasoning, mathematics, programming, instruction following, and function calling.
- External Knowledge Access: Through function calling, it can integrate with external tools and APIs to perform tasks such as querying databases or controlling intelligent systems (e.g., smart home control); a sketch of this pattern follows this list.
- Low Resource Requirements: Suitable for computationally constrained environments such as edge devices, with applications across industries including manufacturing, healthcare, and retail.
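One common way to wire this up, sketched below, is to describe the available tools as JSON in the system message and have the model reply with a JSON tool call that your code then executes. The tool schema, prompt wording, and `get_stock_price` helper are illustrative assumptions, not an official Phi-4-mini tool-calling format.

```python
# Illustrative function-calling loop with Phi-4-mini.
# Tool schema, prompt wording, and the helper below are assumptions for illustration.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def get_stock_price(ticker: str) -> float:
    """Hypothetical external API; replace with a real data source."""
    return 123.45

tools = [{
    "name": "get_stock_price",
    "description": "Return the latest price for a stock ticker.",
    "parameters": {"ticker": "string"},
}]

messages = [
    {"role": "system", "content": 'You may call a tool by replying with JSON of the form '
        '{"tool": ..., "arguments": {...}}. Available tools:\n' + json.dumps(tools)},
    {"role": "user", "content": "What is MSFT trading at right now?"},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128)
reply = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)

# If the model emitted a tool call, run it; otherwise print the plain-text answer.
try:
    call = json.loads(reply)
    if call.get("tool") == "get_stock_price":
        print(get_stock_price(**call["arguments"]))
    else:
        print(reply)
except (json.JSONDecodeError, AttributeError):
    print(reply)
```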
Phi-4-Mini Benchmark Results
Phi-4-mini is a 3.8 billion parameter text-dedicated model focused on efficiency and textual tasks. Key benchmark performance indicators are as follows:
Text-Based Tasks:
- Mathematics and Coding Tasks: In mathematics and coding tasks requiring complex reasoning, Phi-4-mini demonstrates significantly higher accuracy than other language models of comparable size. Specific figures are not published, but performance in these areas is described as "significantly better."
- Multilingual Support: Exhibits excellent performance in multilingual text processing (such as translation), showing marked improvements over earlier Phi family models.
- Reasoning Capabilities: Leveraging its 128,000-token context window and function calling capabilities, it efficiently handles tasks such as financial calculations and report generation (a brief usage sketch follows this list).
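As a concrete example of this kind of text task, the sketch below feeds Phi-4-mini a short financial calculation through the `transformers` text-generation pipeline; the prompt and generation settings are illustrative, and passing chat-style messages to the pipeline assumes a recent `transformers` release.

```python
# Minimal sketch: a reasoning-style financial prompt for Phi-4-mini.
# Prompt and settings are illustrative; chat-style pipeline input assumes a recent transformers version.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "A loan of $10,000 accrues 5% interest compounded annually. "
                                "What is the balance after 3 years? Show the calculation."},
]

result = generator(messages, max_new_tokens=256)
# The pipeline returns the full conversation; the last message is the model's answer.
print(result[0]["generated_text"][-1]["content"])
```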
Overall Performance:
- Microsoft states that Phi-4-mini has demonstrated capabilities exceeding models of comparable size in internal testing, particularly excelling in tasks requiring reasoning. However, direct numerical comparisons with other prominent models, such as GPT-4o-mini or the Llama series, are not provided.
Application Scenarios for Phi-4
Smartphone Integration:
- Processing voice commands, recognizing images, and understanding text to provide real-time language translation, intelligent assistants, and enhanced photo and video analysis.
Autonomous Driving & In-Vehicle Assistants:
- Recognizing driver voice commands, analyzing visual inputs (e.g., gestures, facial expressions), and providing driving safety alerts.
Finance & Business Automation:
- Conducting complex financial calculations, generating reports, translating financial documents, and optimizing global customer relations.
These models are now available on Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog.
- Author: KCGOD
- URL: https://kcgod.com/two-new-open-source-models-by-microsoft
- Copyright: All articles on this blog, unless otherwise stated, are published under the BY-NC-SA license. Please credit the source!