Microsoft released Phi-3.5-vision, a lightweight, multimodal open-source model in the Phi-3 model family. The model is designed for applications that combine text and visual input, with a focus on high-quality, reasoning-dense data. It supports a 128K-token context length and has undergone a rigorous fine-tuning and optimization process. It is intended for broad commercial and research use in environments with limited memory or computing resources and low-latency requirements.
The model has extensive capabilities such as image understanding, optical character recognition (OCR), chart and table parsing, and multi-image or video clip summarization. It is well suited for a variety of AI-driven applications and shows significant performance improvements in benchmarks related to image and video processing.
The Phi-3.5-vision model is trained on high-quality educational data, synthetic data, and strictly screened public documents to ensure data quality and privacy. Its 4.2-billion-parameter architecture integrates an image encoder, connector, projector, and the Phi-3 Mini language model.
The Phi-3.5 release includes three models:
1. Phi-3.5 Mini Instruct
Number of parameters
3.82 billion parameters.
Design goal
This is a lightweight AI model aimed at scenarios that require strong reasoning in environments with limited memory or computing resources, such as code generation, mathematical problem solving, and logic-based reasoning tasks.
Context length
Supports a 128K-token context length.
Performance
Despite its small size, the model performs well in multi-language and multi-turn dialogue tasks, surpassing similar-sized models (such as Llama-3.1-8B-instruct and Mistral-7B-instruct) on the "Long Context Code Understanding" benchmark (RepoQA).
Application scenarios
It is particularly suitable for resource-constrained scenarios, reducing compute and memory consumption while preserving reasoning capability.
2. Phi-3.5 MoE (Mixture of Experts)
Number of parameters
41.9 billion parameters in total, of which about 6.6 billion are active during inference.
Design goal
This is Microsoft's first "mixture of experts" model, which combines multiple expert sub-networks, each focusing on different tasks. This architecture enables the model to perform well in complex tasks such as multi-language understanding and code and mathematical reasoning.
Context length
Supports a 128K-token context length.
Performance
It surpassed larger models in multiple benchmarks. For example, on the Massive Multitask Language Understanding (MMLU) test, Phi-3.5 MoE performed well in 5-shot evaluations across fields such as STEM, the humanities, and the social sciences, outperforming GPT-4o mini.
Application scenarios
Suitable for applications that need to process complex AI tasks, especially in multi-language environments and complex reasoning scenarios.
3. Phi-3.5 Vision Instruct
Number of parameters
4.15 billion parameters.
Design goal
This multimodal model integrates text and image processing capabilities and is particularly suitable for tasks such as image understanding, optical character recognition (OCR), chart and table parsing, and video summarization.
Context length
Also supports a 128K-token context length.
Performance
The model performs well in multi-frame image processing and complex visual tasks and can efficiently manage demanding multimodal workloads. Its training data includes synthetic data and filtered public data, ensuring high quality and reasoning density.
Application scenarios
Mainly used in complex tasks that require comprehensive processing of visual and text data, such as multi-frame image comparison and video content summary.
Main features
Image Understanding
- Ability to understand single and multiple images in detail, identify content in images, and provide relevant descriptions and analysis.
- It can be used for general image understanding tasks, such as identifying objects, scenes, or other important elements in an image.
Optical Character Recognition (OCR)
- It can extract and recognize text content from images and is suitable for processing images containing text, such as document scans, annotations in images, etc.
Chart and table comprehension
- It can parse information in charts and tables, helping users extract useful insights from complex graphical data.
- Applicable to scenarios such as financial statement analysis and data visualization understanding.
Multi-image comparison
- It can compare and analyze multiple images to identify the similarities and differences between them.
- Suitable for comparison and summary of multiple frames or video clips, supporting complex multi-image reasoning.
Multi-image or video clip summarization
- It provides comprehensive summarization of multiple images or video clips, extracting the key content and generating a concise summary description.
- It's perfect for news reporting, video editing, or any application that requires rapid comprehension and summary of large amounts of visual content.
Efficient reasoning ability
- Emphasizes reasoning density and can provide in-depth, logically sound results when dealing with complex problems.
- Suitable for scenarios that require high-quality reasoning, such as scientific research, complex problem solving, etc.
Low latency and memory optimized
- It is optimized for environments with limited computing resources and requiring low-latency responses, enabling it to run efficiently on a variety of devices and scenarios.
- It is very suitable for real-time applications that require fast response, such as interactive AI systems, embedded systems, etc.
Model Architecture:
- Parameter count: Phi-3.5-vision has 4.2 billion parameters; its structure comprises an image encoder, connector, projector, and the Phi-3 Mini language model.
- Input: The model accepts text and images as input and works best with prompts in a conversational format (a usage sketch follows this list).
- Context length: supports context lengths up to 128K tokens.
- GPUs: 256 NVIDIA A100-80G GPUs were used for training.
- Training time: 6 days.
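For readers who want to try the conversational prompt format described above, here is a minimal sketch using Hugging Face transformers. The indexed image placeholder, the `trust_remote_code` flag, and the file name are assumptions based on the typical usage pattern for this model family, not an official example; consult the model card linked at the end of this post for the exact, current API.

```python
# A minimal sketch, not an official example: it assumes the standard
# transformers multimodal pattern and the "<|image_1|>" placeholder used by
# this model family. Verify details against the Phi-3.5-vision model card.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# trust_remote_code=True is needed because the vision pipeline ships as
# custom code inside the model repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# A single image referenced in the prompt by an indexed placeholder.
image = Image.open("chart.png")  # hypothetical local file
messages = [
    {"role": "user",
     "content": "<|image_1|>\nSummarize the trend shown in this chart."},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens before decoding so only the generated answer remains.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Swapping the question text (for example, "Transcribe all of the text in this image") exercises the OCR and chart-parsing capabilities listed above without changing anything else in the call.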
Training Data:
Data size
The model’s training data consists of 500 billion tokens (including visual and textual tokens).
Data source
- Public Documents: Contains high-quality public documents that have been strictly screened.
- Educational Data and Code: High-quality educational data and code were selected for training.
- Image-text data: Use high-quality image-text mixed data for training.
- Synthetic Data: Synthetic data created for teaching, covering mathematics, coding, common-sense reasoning, and world knowledge (such as science, daily activities, and theory of mind), as well as newly created image data (such as charts, tables, and slides) and multi-image and video data (such as short video clips and pairs of similar images).
- Human-supervised data: High-quality supervised data collected in conversational format, covering a wide range of topics and reflecting human preferences such as instruction following, truthfulness, honesty, and helpfulness.
Training methods:
- Fine-tuning process: Phi-3.5-vision underwent rigorous fine-tuning, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), to ensure that its performance across tasks meets high standards of safety and accuracy.
- Data filtering: During the data collection process, a strict filtering process was performed to ensure the high quality of the training data and avoid including any potential personal information to protect privacy.
- Model stability: This is a static model with a training data cutoff of March 15, 2024. Optimized versions may be released later to further improve performance.
Benchmark Results:
Phi-3.5-vision has demonstrated its outstanding performance in image understanding, reasoning, and text generation tasks in multiple benchmarks. Here are the specific results of some key benchmarks:
MMMU (Massive Multi-discipline Multimodal Understanding)
- Score: 43.0 (up from 40.2 in the previous version)
- Description: This benchmark evaluates the performance of the model in multimodal, multi-image understanding tasks. The improvement of Phi-3.5-vision in this test shows its enhanced ability in handling complex image understanding tasks.
MMBench (Multi-Modal Benchmark)
- Score: 81.9 (up from 80.5 in the previous version)
- Description: This test measures the overall performance of the model in multimodal tasks. The high score of Phi-3.5-vision indicates its wide applicability and strong performance in multimodal tasks.
TextVQA (Text-based Visual Question Answering)
- Score: 72.0 (up from 70.9 in the previous version)
- Description: This benchmark evaluates the model's ability to answer questions when processing images containing text. The improvements in Phi-3.5-vision show that its accuracy in visual question answering tasks has improved.
Video processing capabilities (Video-MME)
- Short videos (<2 minutes): 60.8
- Medium length videos (4-15 minutes): 47.7
- Long videos (30-60 minutes): 43.8
- Overall score: 50.8
- Description: Phi-3.5-vision handles video data well, especially short clips, and can effectively analyze and summarize video content; a frame-sampling sketch follows below.
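As a hedged illustration of how this multi-frame capability is typically exercised, the sketch below samples a handful of evenly spaced frames from a short clip with OpenCV and formats them as a multi-image prompt. The file name, frame count, and placeholder syntax are illustrative assumptions; the resulting frames and prompt are then passed to the processor and model exactly as in the single-image sketch under "Model Architecture" above.

```python
# A minimal sketch (not from the model card): sample frames from a video and
# build a multi-image prompt, one numbered placeholder per frame.
import cv2
from PIL import Image

def sample_frames(video_path: str, num_frames: int = 8) -> list[Image.Image]:
    """Grab num_frames evenly spaced RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV decodes to BGR arrays; convert to RGB PIL images.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("clip.mp4")  # hypothetical local file
placeholders = "\n".join(f"<|image_{i + 1}|>" for i in range(len(frames)))
question = placeholders + "\nSummarize what happens across these frames."
# `frames` and `question` then go through the same processor/generate calls
# shown in the earlier single-image sketch.
```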
BLINK Benchmark
BLINK is a benchmark for evaluating multimodal large language models (multimodal LLMs) on 14 visual tasks that humans can solve quickly but that remain challenging for current models. It is designed to test a model's ability to process complex visual information across the task types listed below.
In the BLINK benchmark, the Phi-3.5-vision model performed well and achieved high scores in many tasks. For example, it performed particularly well in tasks such as artistic style recognition, forensic detection, relative depth, and spatial relationships, demonstrating its strong processing capabilities in complex visual tasks.
This benchmark provides a multi-dimensional perspective to help researchers understand and improve the performance of large multimodal language models to bring them closer to human performance on vision tasks.
Main task types:
- Art Style Recognition: Identify and distinguish the artistic styles of images.
- Counting: Accurately count the number of objects of the same type in an image.
- Forensic Detection: Identify anomalies or signs of tampering in images.
- Functional Correspondence: Detect functional relationships between objects in an image.
- IQ Test: Answer intelligence-test questions through image reasoning.
- Jigsaw Puzzle: Solve image jigsaw puzzles to reconstruct the complete image.
- Multi-View Reasoning: Reason using images from multiple perspectives.
- Object Localization: Accurately locate specific objects in an image.
- Relative Depth: Determine the relative depth of objects in an image.
- Relative Reflectance: Determine the difference in reflectance of objects in an image.
- Semantic Correspondence: Identify the semantic correspondence between objects or scenes in an image.
- Spatial Relation: Understand and judge the spatial relationships between objects in an image.
- Visual Correspondence: Determine the visual similarity or consistency between two or more images.
- Visual Similarity: Evaluate the visual similarity between different images.
Comparison with other models
Phi-3.5-vision outperforms competing models such as LLaVA-Interleave-Qwen and InternVL in multimodal and visual tasks. In particular, in tasks such as multi-frame image understanding, image comparison, and video summarization, the model not only outperforms models of the same size, but in some cases even surpasses larger models.
Specific application scenarios for performance improvement
- Multi-frame image understanding: Phi-3.5-vision has a strong ability to conduct comprehensive analysis and reasoning on multiple images, and is suitable for tasks such as complex image comparison and video clip analysis.
- Complex visual reasoning: The model performs particularly well in handling challenging visual reasoning tasks (such as scientific knowledge reasoning and mathematical reasoning), and can provide high-quality reasoning results.
- Text Generation and Visual Question Answering: When generating text or answering image-based questions, Phi-3.5-vision provides accurate, contextually relevant responses.
Model download: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
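For local experimentation, the weights can also be fetched programmatically with the huggingface_hub library; a small sketch (the download lands wherever the library's local cache resolves to):

```python
# Download the Phi-3.5-vision-instruct weights into the local Hugging Face
# cache and print the resulting path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="microsoft/Phi-3.5-vision-instruct")
print("Model files downloaded to:", local_dir)
```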