Microsoft released Phi-3.5-vision, a lightweight, multimodal open-source model in the Phi-3 model family. The model is designed for applications that combine text and visual input, with a focus on high-quality, reasoning-dense data. It supports a 128K-token context length and has undergone a rigorous fine-tuning and optimization process. It is intended for broad commercial and research use in environments with limited memory or computing resources and low-latency requirements.
The model has extensive capabilities such as image understanding, optical character recognition (OCR), chart and table parsing, and multi-image or video clip summarization. It is well suited for a variety of AI-driven applications and shows significant performance improvements in benchmarks related to image and video processing.
The Phi-3.5-vision model is trained on high-quality educational data, synthetic data, and strictly screened public documents to ensure data quality and privacy. Its 4.2-billion-parameter architecture integrates an image encoder, connector, projector, and the Phi-3 Mini language model.
The Phi-3.5 release includes three models:
1. Phi-3.5 Mini Instruct
Number of parameters
3.82 billion parameters.
Design goal
This is a lightweight AI model aimed at scenarios that require strong reasoning in environments with limited memory or computing resources, such as code generation, mathematical problem solving, and logic-based reasoning tasks.
Context length
Supports a 128K-token context length.
Performance
Despite its small size, the model performs well in multi-language and multi-turn dialogue tasks, surpassing similar-sized models (such as Llama-3.1-8B-instruct and Mistral-7B-instruct) on the "Long Context Code Understanding" benchmark (RepoQA).
Application scenarios
It is particularly suitable for resource-constrained scenarios, reducing compute and memory consumption while preserving reasoning capability.
2. Phi-3.5 MoE (Mixture of Experts)
Number of parameters
41.9 billion parameters in total, of which about 6.6 billion are active during inference.
Design goal
This is Microsoft's first "mixture of experts" model, which combines multiple expert sub-networks, each focusing on different tasks. This architecture enables the model to perform well in complex tasks such as multi-language understanding and code and mathematical reasoning.
Context length
Supports a 128K-token context length.
Performance
It surpassed larger models in multiple benchmarks. For example, on the Massive Multitask Language Understanding (MMLU) test, Phi-3.5 MoE performed well in 5-shot evaluations across fields such as STEM, the humanities, and the social sciences, outperforming GPT-4o mini.
Application scenarios
Suitable for applications that need to process complex AI tasks, especially in multi-language environments and complex reasoning scenarios.
3. Phi-3.5 Vision Instruct
Number of parameters
4.15 billion parameters.
Design goal
This multimodal model integrates text and image processing capabilities and is particularly suitable for tasks such as image understanding, optical character recognition (OCR), chart and table parsing, and video summarization.
Context length
Also supports a 128K-token context length.
Performance
The model performs well in multi-frame image processing and complex visual tasks and can efficiently manage demanding multimodal workloads. Its training data includes synthetic data and filtered public data, ensuring high quality and reasoning density.
Application scenarios
Mainly used in complex tasks that require comprehensive processing of visual and text data, such as multi-frame image comparison and video content summary.
Main features
Image Understanding
- Ability to understand single and multiple images in detail, identify content in images, and provide relevant descriptions and analysis.
- It can be used for general image understanding tasks, such as identifying objects, scenes, or other important elements in an image.
Optical Character Recognition (OCR)
- It can extract and recognize text content from images and is suitable for processing images containing text, such as document scans, annotations in images, etc.
Chart and table comprehension
- It can parse information in charts and tables, helping users extract useful insights from complex graphical data.
- Applicable to scenarios such as financial statement analysis and data visualization understanding.
Multi-image comparison
- It can compare and analyze multiple images to identify the similarities and differences between them.
- Suitable for comparison and summary of multiple frames or video clips, supporting complex multi-image reasoning.
Multi-image or video clip summarization
- It provides comprehensive summarization of multiple images or video clips, extracting the key content and generating a concise summary description.
- It's perfect for news reporting, video editing, or any application that requires rapid comprehension and summary of large amounts of visual content.
Efficient reasoning ability
- Emphasizes reasoning density and can provide in-depth, logically sound results when dealing with complex problems.
- Suitable for scenarios that require high-quality reasoning, such as scientific research, complex problem solving, etc.
Low latency and memory optimized
- It is optimized for environments with limited computing resources and requiring low-latency responses, enabling it to run efficiently on a variety of devices and scenarios.
- It is very suitable for real-time applications that require fast response, such as interactive AI systems, embedded systems, etc.
Model Architecture:
- Parameter count: Phi-3.5-vision has 4.2 billion parameters; its structure comprises an image encoder, connector, projector, and the Phi-3 Mini language model.
- Input: The model accepts text and images as input and works best with prompts in a conversational format (a usage sketch follows this list).
- Context length: supports context lengths up to 128K tokens.
- GPUs: 256 NVIDIA A100-80G GPUs were used for training.
- Training time: 6 days.
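For readers who want to try the conversational prompt format described above, here is a minimal sketch using Hugging Face transformers. The indexed image placeholder, the `trust_remote_code` flag, and the file name are assumptions based on the typical usage pattern for this model family, not an official example; consult the model card linked at the end of this post for the exact, current API.

```python
# A minimal sketch, not an official example: it assumes the standard
# transformers multimodal pattern and the "<|image_1|>" placeholder used by
# this model family. Verify details against the Phi-3.5-vision model card.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# trust_remote_code=True is needed because the vision pipeline ships as
# custom code inside the model repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# A single image referenced in the prompt by an indexed placeholder.
image = Image.open("chart.png")  # hypothetical local file
messages = [
    {"role": "user",
     "content": "<|image_1|>\nSummarize the trend shown in this chart."},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens before decoding so only the generated answer remains.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Swapping the question text (for example, "Transcribe all of the text in this image") exercises the OCR and chart-parsing capabilities listed above without changing anything else in the call.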
Training Data:
Data size
The model’s training data consists of 500 billion tokens (including visual and textual tokens).
Data source
- Public Documents: Contains high-quality public documents that have been strictly screened.
- Educational Data and Code: High-quality educational data and code were selected for training.
- Image-text data: Use high-quality image-text mixed data for training.
- Synthetic Data: Synthetic data created for teaching, covering mathematics, coding, common-sense reasoning, and world knowledge (such as science, daily activities, and theory of mind), as well as newly created image data (such as charts, tables, and slides) and multi-image and video data (such as short video clips and pairs of similar images).
- Human-supervised data: High-quality supervised data collected in conversational format, covering a wide range of topics and reflecting human preferences such as instruction following, truthfulness, honesty, and helpfulness.
Training methods:
- Fine-tuning process: Phi-3.5-vision underwent rigorous fine-tuning, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), to ensure that its performance across tasks meets high standards of safety and accuracy.
- Data filtering: During the data collection process, a strict filtering process was performed to ensure the high quality of the training data and avoid including any potential personal information to protect privacy.
- Model stability: This is a static model with a training data cutoff of March 15, 2024. Optimized versions may be released later to further improve performance.
Benchmark Results:
Phi-3.5-vision has demonstrated its outstanding performance in image understanding, reasoning, and text generation tasks in multiple benchmarks. Here are the specific results of some key benchmarks:
MMMU (Massive Multi-discipline Multimodal Understanding)
- Score: 43.0 (up from 40.2 in the previous version)
- Description: This benchmark evaluates the performance of the model in multimodal, multi-image understanding tasks. The improvement of Phi-3.5-vision in this test shows its enhanced ability in handling complex image understanding tasks.
MMBench (Multi-Modal Benchmark)
- Score: 81.9 (up from 80.5 in the previous version)
- Description: This test measures the overall performance of the model in multimodal tasks. The high score of Phi-3.5-vision indicates its wide applicability and strong performance in multimodal tasks.
TextVQA (Text-based Visual Question Answering)
- Score: 72.0 (up from 70.9 in the previous version)
- Description: This benchmark evaluates the model's ability to answer questions when processing images containing text. The improvements in Phi-3.5-vision show that its accuracy in visual question answering tasks has improved.
Video processing capabilities (Video-MME)
- Short videos (<2 minutes): 60.8
- Medium length videos (4-15 minutes): 47.7
- Long videos (30-60 minutes): 43.8
- Overall score: 50.8
- Description: Phi-3.5-vision handles video data well, especially short clips, and can effectively analyze and summarize video content; a frame-sampling sketch follows below.
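As a hedged illustration of how this multi-frame capability is typically exercised, the sketch below samples a handful of evenly spaced frames from a short clip with OpenCV and formats them as a multi-image prompt. The file name, frame count, and placeholder syntax are illustrative assumptions; the resulting frames and prompt are then passed to the processor and model exactly as in the single-image sketch under "Model Architecture" above.

```python
# A minimal sketch (not from the model card): sample frames from a video and
# build a multi-image prompt, one numbered placeholder per frame.
import cv2
from PIL import Image

def sample_frames(video_path: str, num_frames: int = 8) -> list[Image.Image]:
    """Grab num_frames evenly spaced RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR arrays; convert to RGB PIL images.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("clip.mp4")  # hypothetical local file
placeholders = "\n".join(f"<|image_{i + 1}|>" for i in range(len(frames)))
question = placeholders + "\nSummarize what happens across these frames."
# `frames` and `question` then go through the same processor/generate calls
# shown in the earlier single-image sketch.
```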
BLINK Benchmark
BLINK is a benchmark for evaluating multimodal large language models (multimodal LLMs) on 14 visual tasks that humans can solve quickly but that remain challenging for current models. It is designed to test a model's ability to process complex visual information across the task types listed below.
In the BLINK benchmark, the Phi-3.5-vision model performed well and achieved high scores in many tasks. For example, it performed particularly well in tasks such as artistic style recognition, forensic detection, relative depth, and spatial relationships, demonstrating its strong processing capabilities in complex visual tasks.
This benchmark provides a multi-dimensional perspective to help researchers understand and improve the performance of large multimodal language models to bring them closer to human performance on vision tasks.
Main task types:
- Art Style Recognition: Identify and distinguish the artistic styles of images.
- Counting: Accurately count the number of objects of the same type in an image.
- Forensic Detection: Identify anomalies or signs of tampering in images.
- Functional Correspondence: Detect functional relationships between objects in an image.
- IQ Test: Answer intelligence-test questions through image reasoning.
- Jigsaw Puzzle: Solve image jigsaw puzzles to reconstruct the complete image.
- Multi-View Reasoning: Reason using images from multiple perspectives.
- Object Localization: Accurately locate specific objects in an image.
- Relative Depth: Determine the relative depth of objects in an image.
- Relative Reflectance: Determine the difference in reflectance of objects in an image.
- Semantic Correspondence: Identify the semantic correspondence between objects or scenes in an image.
- Spatial Relation: Understand and judge the spatial relationships between objects in an image.
- Visual Correspondence: Determine the visual similarity or consistency between two or more images.
- Visual Similarity: Evaluate the visual similarity between different images.
Comparison with other models
Phi-3.5-vision outperforms competing models such as LLaVA-Interleave-Qwen and InternVL in multimodal and visual tasks. In particular, in tasks such as multi-frame image understanding, image comparison, and video summarization, the model not only outperforms models of the same size, but in some cases even surpasses larger models.
Specific application scenarios for performance improvement
- Multi-frame image understanding: Phi-3.5-vision has a strong ability to conduct comprehensive analysis and reasoning on multiple images, and is suitable for tasks such as complex image comparison and video clip analysis.
- Complex visual reasoning: The model performs particularly well in handling challenging visual reasoning tasks (such as scientific knowledge reasoning and mathematical reasoning), and can provide high-quality reasoning results.
- Text Generation and Visual Question Answering: When generating text or answering image-based questions, Phi-3.5-vision provides accurate, contextually relevant responses.
Model download: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
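For local experimentation, the weights can also be fetched programmatically with the huggingface_hub library; a small sketch (the download lands wherever the library's local cache resolves to):

```python
# Download the Phi-3.5-vision-instruct weights into the local Hugging Face
# cache and print the resulting path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="microsoft/Phi-3.5-vision-instruct")
print("Model files downloaded to:", local_dir)
```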