GOT-OCR2.0 is a general model for optical character recognition (OCR) tasks, which aims to address the limitations of traditional OCR systems (OCR-1.0) and existing large vision-language models (LVLMs) in OCR tasks.
Traditional OCR systems (OCR-1.0) usually adopt a multi-module pipeline (for example: element detection, region cropping, character recognition), which is prone to local-optimum problems and is costly to maintain. GOT-OCR2.0 instead provides efficient character recognition through an end-to-end architecture and is suitable for a wide range of OCR tasks.
The model can handle a variety of complex optical character tasks, including not only ordinary text, but also complex content such as formulas, tables, and musical scores. Compared with the old OCR system, the GOT model is more intelligent, flexible, and easy to use. Experiments show that the model performs well in both Chinese and English recognition, and is particularly good at processing high-resolution and multi-page documents.
Main OCR Features of GOT-OCR2.0
1. Unified end-to-end architecture
- GOT-OCR2.0 adopts a unified end-to-end model architecture, which simplifies the complex multi-module processes in traditional OCR systems (such as text detection, area cropping, character recognition, etc.), greatly reducing the system maintenance costs.
- The model supports both global and local character recognition tasks by combining a highly compressed encoder and a long-context decoder.
2. Support for multiple OCR tasks
- Scene text recognition: Able to handle text recognition tasks in natural scenes, such as text on street signs and billboards.
- Document OCR: Processes text recognition of complete pages in documents, whether they are plain text documents or complex documents containing tables, formulas, etc.
- Formatted Text OCR: Supports direct conversion of text in optical documents into Markdown, Latex and other formats, maintaining the original layout and formatting of complex documents.
3. Fine-grained OCR
- GOT-OCR2.0 can perform fine-grained regional recognition and supports fine character recognition of specific areas in high-density text scenarios, such as specific paragraphs in a document or specific areas in an image. This feature improves the accuracy and interactivity of recognition and is suitable for application scenarios that require high-precision recognition, such as extraction of key parts of legal documents and academic papers.
- Interactive OCR: GOT-OCR2.0 has an interactive OCR function that supports region-level character recognition based on user-provided coordinates or color cues. Users can define regions of interest or mark specific parts by color instead of the entire page content, which is suitable for scenarios such as form recognition. Precise control of the recognition range improves recognition flexibility in complex scenarios.
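A minimal usage sketch of this interactive API, based on the released checkpoint's model card (https://huggingface.co/ucaslcl/GOT-OCR2_0). The `chat()` keyword names (`ocr_type`, `ocr_box`, `ocr_color`) follow the card; the exact box-string format below is an assumption, so check the card before relying on it:

```python
def make_box_prompt(x1: int, y1: int, x2: int, y2: int) -> str:
    """Serialize a pixel-coordinate box as an '[x1,y1,x2,y2]' string
    for the ocr_box argument (assumed format; see the model card)."""
    return f"[{x1},{y1},{x2},{y2}]"

if __name__ == "__main__":
    # Model loading follows the Hugging Face model card; needs a GPU.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "ucaslcl/GOT-OCR2_0", trust_remote_code=True)
    model = AutoModel.from_pretrained(
        "ucaslcl/GOT-OCR2_0", trust_remote_code=True,
        low_cpu_mem_usage=True, device_map="cuda",
        use_safetensors=True, pad_token_id=tokenizer.eos_token_id).eval()

    # Whole-page plain OCR vs. formatted (Markdown/LaTeX) output:
    print(model.chat(tokenizer, "page.png", ocr_type="ocr"))
    print(model.chat(tokenizer, "page.png", ocr_type="format"))
    # Region-level OCR restricted to a user-drawn box:
    print(model.chat(tokenizer, "page.png", ocr_type="ocr",
                     ocr_box=make_box_prompt(100, 200, 600, 400)))
    # Color-cue variant: recognize only the region marked in red.
    print(model.chat(tokenizer, "page.png", ocr_type="ocr", ocr_color="red"))
```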
4. Dynamic resolution and multi-page OCR
- Dynamic Resolution: GOT-OCR2.0 supports OCR processing of ultra-high-resolution images (such as large posters, stitched PDF pages), using dynamic resolution technology to ensure that recognition accuracy is maintained when the image is too large.
- Multi-page OCR: GOT can batch process multi-page documents, such as long PDF files or OCR tasks containing multiple images, greatly improving processing efficiency.
5. Complex characters and formats support
- Formula, table, and chart recognition: In addition to basic text recognition, GOT can also recognize and process complex structures such as mathematical formulas, chemical molecular formulas, tables, and charts in documents, and convert them into editable formats (such as LaTeX or Python dictionary formats).
- Formatted output: GOT-OCR2.0 supports generating multiple formatted outputs, including Markdown, TikZ, SMILES, LATEX, etc., and can output recognized characters in a structured manner, such as tables, mathematical formulas, molecular structures, etc. Users can directly use the OCR results for further editing and processing, especially in academic papers, scientific computing and complex document management.
- Musical notation and geometric figure recognition: The model also supports the recognition of musical notation and geometric figures and converts them into editable text output in formats similar to TikZ or Kern.
6. High performance, low training and inference costs
- Compared with large-scale visual language models, GOT-OCR2.0 has fewer parameters (about 580M), so its training and inference costs are relatively low, making it suitable for deployment on consumer-grade GPUs.
- In experiments, GOT-OCR2.0 performs well in a variety of OCR tasks, including Chinese and English document OCR, scene text recognition, formatted document processing, and fine-grained region recognition.
7. Model Scalability
- GOT-OCR2.0 supports the addition of new OCR functions through fine-tuning, making it adaptable to new demand scenarios, such as supporting character recognition in more languages or more complex visual structures.
- Multi-language support: GOT-OCR2.0 mainly supports Chinese and English character recognition, and can be expanded to more languages through further fine-tuning. This allows GOT-OCR2.0 to be applied to multi-language document processing worldwide and adapt to OCR needs in different language scenarios.
Model Architecture of GOT-OCR2.0
The GOT (General OCR Theory) model architecture of OCR-2.0 is designed around an end-to-end encoder-decoder structure. Its core goal is to handle a variety of optical character tasks through a simple and efficient architecture with strong generalization capability and low training and inference costs. The following are the main components of the GOT model architecture and their functions:
1. Encoder
- Function: The task of the encoder is to convert the optical image into a compressed feature representation, namely "image tokens".
- Architecture: GOT's encoder is based on the Vision Transformer (ViT) design and achieves a high compression rate: it compresses a 1024×1024-pixel input image into just 256 image tokens (a 256×1024 feature map, i.e. 256 tokens of dimension 1024), which greatly reduces the computational cost for the decoder.
- Input support: The encoder supports a variety of input types, including scene images and document images, and can handle different optical characters (such as text, tables, formulas, geometric figures, etc.).
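One plausible accounting of how a 1024×1024 image is reduced to 256 tokens. The 16-pixel patch size and the two stride-2 downsampling convolutions below are assumptions about a typical ViT-style encoder, not figures stated in this article:

```python
# Back-of-the-envelope token count for the encoder.
# Assumptions (not from the article): 16-pixel ViT patches and
# two stride-2 downsampling convolutions after the patch grid.
image_side = 1024            # input resolution quoted in the article
patch = 16                   # assumed ViT patch size
grid = image_side // patch   # 64 x 64 patch grid
raw_tokens = grid * grid     # 4096 patch tokens before compression

downsample = 2 * 2           # two assumed stride-2 convs halve each side twice
side_after = grid // downsample   # 16
compressed_tokens = side_after ** 2

print(raw_tokens, compressed_tokens)  # 4096 patch tokens -> 256 image tokens
```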
2. Linear Layer
- Function: The linear mapping layer is responsible for connecting the encoder and the decoder, mapping the image tokens generated by the encoder to the dimensions that the decoder can handle.
- Architecture: In the GOT model, the linear mapping layer resizes the encoder output from the dimension of 1024 × 768 to the dimension of 1024 × 1024 required by the decoder. It acts as a bridge in the entire model architecture to ensure smooth information transfer between the encoder and decoder.
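A shape-level sketch of that bridge: the linear layer simply multiplies each image token by a learned matrix so the token dimension matches the decoder's embedding size. The concrete dimensions here are illustrative placeholders, not confirmed values:

```python
import numpy as np

# Shape-level sketch of the linear bridge between encoder and decoder.
# All dimensions below are illustrative placeholders.
num_tokens = 256   # image tokens emitted by the encoder
d_enc = 768        # assumed encoder channel dimension
d_dec = 1024       # assumed decoder embedding dimension

rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((num_tokens, d_enc))
W = rng.standard_normal((d_enc, d_dec)) * 0.02  # the learned linear map

decoder_input = image_tokens @ W
print(decoder_input.shape)  # one d_dec-dim embedding per image token
```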
3. Decoder
- Function: The decoder is responsible for converting the image tokens generated by the encoder into readable OCR results, that is, outputting the final recognized text.
- Architecture: GOT's decoder is designed based on the Qwen-0.5B language model (about 500M parameters) and supports long context processing (up to 8K tokens). The decoder's task is to gradually parse the input long text or complex optical characters and generate the corresponding OCR output.
- Output format: The decoder supports the output of ordinary text, formulas, tables, charts and other complex formats. Users can generate Markdown, TikZ, SMILES and other formatted results through simple prompts as needed.
4. Multi-stage training strategy
The training process of the GOT model is divided into three main stages, aiming to improve the model's generalization ability and adaptability to multiple tasks:
Stage 1: Encoder Pre-training
- Objective: To enable the encoder to have basic character encoding capabilities through pre-training on scene text and document-level character images.
- Strategy: Use a smaller decoder (such as OPT-125M) to jointly train with the encoder, and improve the encoder's encoding ability for multiple character formats by training on natural scenes and document images.
Stage 2: Joint training of encoder and decoder
- Goal: To form a complete GOT model by connecting a more powerful decoder (such as Qwen-0.5B).
- Strategy: At this stage, more complex OCR datasets (such as musical scores, mathematical formulas, geometric figures, etc.) are introduced for training to enable the model to handle more character types.
Stage 3: Decoder Post-Training (Fine-tuning)
- Goal: To further optimize the decoder for new tasks or user-defined requirements.
- Strategy: In this stage, the decoder is fine-tuned mainly through the generated synthetic datasets (such as multi-page documents, ultra-high-resolution images) to improve the practical application capabilities of the model, such as dynamic resolution processing, multi-page OCR, etc.
5. Data Engine and Synthetic Data
In order to improve the generalization ability of the GOT model, researchers designed multiple data engines to generate a large amount of synthetic data to support multi-task joint training. These data engines include:
- Ordinary OCR data: such as scene text and document OCR data.
- Formatted data: including mathematical formulas (LATEX format), molecular structures (SMILES format), tables (generated by LATEX), etc.
- General optical character data: such as music scores, geometric figures, charts, etc.
- Fine-grained data: used for scenarios such as region-level OCR, dynamic resolution processing, and interactive OCR.
6. Dynamic resolution and multi-page OCR support
- Dynamic resolution: The GOT model supports processing large-size images through sliding window technology in ultra-high resolution scenarios to ensure accurate character recognition.
- Multi-page OCR: Supports processing of multi-page PDF files and simplifies OCR tasks of multi-page documents by processing multiple pages at once.
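The sliding-window idea above can be sketched in a few lines: tile the large image into fixed-size, overlapping crops, OCR each crop, then merge the results. The window and stride values here are illustrative, not the model's actual tiling parameters:

```python
# Minimal sliding-window tiler for ultra-high-resolution pages.
# Window/stride sizes are illustrative assumptions.
def _starts(length, window, stride):
    """Window start offsets covering [0, length); the last window is
    right-aligned so the far edge is never missed."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window, stride))
    starts.append(length - window)
    return starts

def sliding_windows(width, height, window=1024, stride=768):
    """Return crop boxes (left, top, right, bottom) tiling the image."""
    return [(l, t, min(l + window, width), min(t + window, height))
            for t in _starts(height, window, stride)
            for l in _starts(width, window, stride)]
```

Each box can then be fed to the recognizer independently, which is essentially what a "multi-crop" inference mode does.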
7. Interactive OCR function
The GOT model can perform interactive OCR processing, allowing users to specify specific areas of the image to be recognized by entering coordinates or color prompts. This function is particularly suitable for local recognition in complex images or documents, which improves the flexibility of the model.
Experimental Results of GOT-OCR2.0
The experimental results of the GOT model show its excellent performance in multiple OCR tasks, including general document OCR, scene text OCR, formatted document OCR, and more extensive character OCR tasks. The following is a detailed summary of the experimental results of the GOT model in various tasks:
1. General document OCR performance
- Task description: Test the performance of the GOT model in common document OCR tasks, mainly processing Chinese and English documents in PDF format.
- Evaluation indicators: including Edit Distance, F1-score, Precision, Recall, BLEU score and METEOR score.
- Experimental results:
- GOT-OCR2.0 performs well in both Chinese and English document OCR tasks, outperforming other large-scale models (such as InternVL-ChatV1.5, Qwen-VL-Max, etc.).
- Especially in terms of edit distance, the GOT model performs significantly better than other competing models, with an edit distance of 0.038 for Chinese and 0.035 for English.
- GOT-OCR2.0 also achieved a high accuracy of nearly 98% in F1 score and BLEU score, demonstrating its powerful text perception and recognition capabilities.
| Model | Parameters | Edit distance (zh) | F1 score (zh) | Edit distance (en) | F1 score (en) |
| --- | --- | --- | --- | --- | --- |
| GOT | 580M | 0.038 | 0.980 | 0.035 | 0.972 |
| InternVL-1.5 | 26B | 0.265 | 0.816 | 0.393 | 0.751 |
| Qwen-VL-Max | >72B | 0.091 | 0.931 | 0.057 | 0.964 |
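For reference, a minimal implementation of the kind of normalized edit distance reported in these tables: the Levenshtein distance scaled to [0, 1]. Normalizing by the longer string is one common convention; the paper's exact protocol may differ:

```python
# Levenshtein distance via the classic dynamic-programming recurrence,
# then normalized to [0, 1] (lower is better, as in the tables).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars
```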
2. Scene Text OCR Performance
- Task description: Test the text recognition performance of GOT in natural scene images. Scene images include natural images containing text, such as signs and billboards in street scenes.
- Evaluation metrics: the same metrics as above are used: edit distance, F1 score, precision, and recall.
- Experimental results:
- GOT also performs very well in scene text OCR tasks, especially in Chinese scene text. GOT's edit distance is 0.096 and F1 score is 0.928, which is much better than other models.
- This result demonstrates the robustness and adaptability of the GOT model in processing optical characters in real scenes.
| Model | Parameters | Edit distance (zh) | F1 score (zh) | Edit distance (en) | F1 score (en) |
| --- | --- | --- | --- | --- | --- |
| GOT | 580M | 0.096 | 0.928 | 0.112 | 0.926 |
| Qwen-VL-Max | >72B | 0.168 | 0.867 | 0.182 | 0.881 |
| InternVL-1.5 | 26B | 0.123 | 0.913 | 0.267 | 0.834 |
3. Formatted document OCR performance
- Task description: Test the performance of GOT in the OCR task of complex format documents, which contain formulas, tables and other contents that need to be formatted for output.
- Evaluation metrics: Use multiple evaluation criteria such as edit distance, F1 score, BLEU, and METEOR.
- Experimental results:
- GOT has already performed well at a single resolution (1024×1024), especially in formula and table OCR tasks. The performance of small text, formula and table recognition is further improved by a multi-crop method.
- In formula recognition, GOT achieves an F1 score of 0.865 and an edit distance of 0.159 in the case of multiple cropping, which is significantly better than the single cropping result, demonstrating the effectiveness of dynamic resolution.
| Type | Edit distance | F1 score | BLEU | METEOR |
| --- | --- | --- | --- | --- |
| Full document text | 0.086 | 0.953 | 0.896 | 0.903 |
| Formula | 0.159 | 0.865 | 0.628 | 0.828 |
| Table | 0.220 | 0.878 | 0.779 | 0.811 |
4. Fine-grained OCR performance
- Task Description: Test the performance of GOT in fine-grained OCR tasks, where users can identify characters in a specific area by specifying the area or color prompts.
- Evaluation indicators: mainly use edit distance and F1 score.
- Experimental results:
- GOT significantly outperforms the existing Fox model in fine-grained OCR tasks and achieves leading performance in fine-grained text recognition in both Chinese and English.
- Especially in the region-level OCR task, GOT achieves an edit distance of only 0.041 and an F1 score of up to 0.970, demonstrating its powerful interactive OCR capability.
| Model | Language | Edit distance | F1 score |
| --- | --- | --- | --- |
| GOT | English | 0.041 | 0.970 |
| GOT | Chinese | 0.033 | 0.965 |
| Fox | English | 0.059 | 0.957 |
| Fox | Chinese | 0.042 | 0.955 |
5. More general OCR features
- Task Description: Test the performance of GOT in more general OCR tasks, including musical scores, geometric figures, charts, etc.
- Evaluation metrics: Edit distance and F1 score are also used to evaluate the performance of the model.
- Experimental results:
- GOT still performs well in more complex OCR tasks such as music scores and geometric figures, with an F1 score of 0.963 for music score recognition and 0.882 for geometric figure recognition.
- In the chart OCR task, GOT even outperforms models designed specifically for charts (such as OneChart and ChartVLM), demonstrating its strong versatility.
| Type | Edit distance | F1 score |
| --- | --- | --- |
| Sheet music | 0.046 | 0.963 |
| Geometry | 0.061 | 0.882 |
Summary of GOT-OCR2.0
The GOT model has demonstrated its strong capabilities in general document OCR, text recognition OCR, formatted document OCR, and fine-grained and general OCR tasks through experimental performance in multiple OCR tasks. In particular, in core indicators such as edit distance and F1 score, the GOT model outperforms many large competitive models, demonstrating its potential in the OCR-2.0 era.
Model download: https://huggingface.co/ucaslcl/GOT-OCR2_0
Online experience: https://huggingface.co/spaces/ucaslcl/GOT_online
- Author: KCGOD
- URL: https://kcgod.com/GOT-OCR2.0
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please credit the source when reposting!