GOT-OCR2.0 is a general model for optical character recognition (OCR) tasks, which aims to address the limitations of traditional OCR systems (OCR-1.0) and existing large vision-language models (LVLMs) in OCR tasks.
Traditional OCR systems (OCR-1.0) usually adopt a multi-module pipeline (for example: element detection, region cropping, character recognition), which is prone to local-optimum problems and is costly to maintain. GOT-OCR2.0 instead provides efficient character recognition through an end-to-end architecture and is suitable for a wide range of OCR tasks.
The model can handle a variety of complex optical character tasks, including not only ordinary text, but also complex content such as formulas, tables, and musical scores. Compared with the old OCR system, the GOT model is more intelligent, flexible, and easy to use. Experiments show that the model performs well in both Chinese and English recognition, and is particularly good at processing high-resolution and multi-page documents.
Main OCR Features of GOT-OCR2.0
1. Unified end-to-end architecture
- GOT-OCR2.0 adopts a unified end-to-end model architecture, which simplifies the complex multi-module processes in traditional OCR systems (such as text detection, area cropping, character recognition, etc.), greatly reducing the system maintenance costs.
- The model supports both global and local character recognition tasks by combining a highly compressed encoder and a long-context decoder.
2. Support for multiple OCR tasks
- Scene text recognition: Able to handle text recognition tasks in natural scenes, such as text on street signs and billboards.
- Document OCR: Processes text recognition of complete pages in documents, whether they are plain text documents or complex documents containing tables, formulas, etc.
- Formatted Text OCR: Supports direct conversion of text in optical documents into Markdown, Latex and other formats, maintaining the original layout and formatting of complex documents.
3. Fine-grained OCR
- GOT-OCR2.0 can perform fine-grained regional recognition and supports fine character recognition of specific areas in high-density text scenarios, such as specific paragraphs in a document or specific areas in an image. This feature improves the accuracy and interactivity of recognition and is suitable for application scenarios that require high-precision recognition, such as extraction of key parts of legal documents and academic papers.
- Interactive OCR: GOT-OCR2.0 has an interactive OCR function that supports region-level character recognition based on user-provided coordinates or color cues. Users can define regions of interest or mark specific parts by color instead of the entire page content, which is suitable for scenarios such as form recognition. Precise control of the recognition range improves recognition flexibility in complex scenarios.
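A minimal usage sketch of this interactive API, based on the released checkpoint's model card (https://huggingface.co/ucaslcl/GOT-OCR2_0). The `chat()` keyword names (`ocr_type`, `ocr_box`, `ocr_color`) follow the card; the exact box-string format below is an assumption, so check the card before relying on it:

```python
def make_box_prompt(x1: int, y1: int, x2: int, y2: int) -> str:
    """Serialize a pixel-coordinate box as an '[x1,y1,x2,y2]' string
    for the ocr_box argument (assumed format; see the model card)."""
    return f"[{x1},{y1},{x2},{y2}]"

if __name__ == "__main__":
    # Model loading follows the Hugging Face model card; needs a GPU.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        "ucaslcl/GOT-OCR2_0", trust_remote_code=True)
    model = AutoModel.from_pretrained(
        "ucaslcl/GOT-OCR2_0", trust_remote_code=True,
        low_cpu_mem_usage=True, device_map="cuda",
        use_safetensors=True, pad_token_id=tokenizer.eos_token_id).eval()

    # Whole-page plain OCR vs. formatted (Markdown/LaTeX) output:
    print(model.chat(tokenizer, "page.png", ocr_type="ocr"))
    print(model.chat(tokenizer, "page.png", ocr_type="format"))
    # Region-level OCR restricted to a user-drawn box:
    print(model.chat(tokenizer, "page.png", ocr_type="ocr",
                     ocr_box=make_box_prompt(100, 200, 600, 400)))
    # Color-cue variant: recognize only the region marked in red.
    print(model.chat(tokenizer, "page.png", ocr_type="ocr", ocr_color="red"))
```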
4. Dynamic resolution and multi-page OCR
- Dynamic Resolution: GOT-OCR2.0 supports OCR processing of ultra-high-resolution images (such as large posters, stitched PDF pages), using dynamic resolution technology to ensure that recognition accuracy is maintained when the image is too large.
- Multi-page OCR: GOT can batch process multi-page documents, such as long PDF files or OCR tasks containing multiple images, greatly improving processing efficiency.
5. Complex characters and formats support
- Formula, table, and chart recognition: In addition to basic text recognition, GOT can also recognize and process complex structures such as mathematical formulas, chemical molecular formulas, tables, and charts in documents, and convert them into editable formats (such as LaTeX or Python dictionary formats).
- Formatted output: GOT-OCR2.0 supports generating multiple formatted outputs, including Markdown, TikZ, SMILES, LATEX, etc., and can output recognized characters in a structured manner, such as tables, mathematical formulas, molecular structures, etc. Users can directly use the OCR results for further editing and processing, especially in academic papers, scientific computing and complex document management.
- Musical notation and geometric figure recognition: The model also supports the recognition of musical notation and geometric figures and converts them into editable text output in formats similar to TikZ or Kern.
6. High performance, low training and inference costs
- Compared with large-scale visual language models, GOT-OCR2.0 has fewer parameters (about 580M), so its training and inference costs are relatively low, making it suitable for deployment on consumer-grade GPUs.
- In experiments, GOT-OCR2.0 performs well in a variety of OCR tasks, including Chinese and English document OCR, scene text recognition, formatted document processing, and fine-grained region recognition.
7. Model Scalability
- GOT-OCR2.0 supports the addition of new OCR functions through fine-tuning, making it adaptable to new demand scenarios, such as supporting character recognition in more languages or more complex visual structures.
- Multi-language support: GOT-OCR2.0 mainly supports Chinese and English character recognition, and can be expanded to more languages through further fine-tuning. This allows GOT-OCR2.0 to be applied to multi-language document processing worldwide and adapt to OCR needs in different language scenarios.
Model Architecture of GOT-OCR2.0
The GOT (General OCR Theory) model architecture of OCR-2.0 is designed around an end-to-end encoder-decoder structure. Its core goal is to handle a variety of optical character tasks through a simple and efficient architecture with strong generalization capability and low training and inference costs. The following are the main components of the GOT model architecture and their functions:
1. Encoder
- Function: The task of the encoder is to convert the optical image into a compressed feature representation, namely "image tokens".
- Architecture: GOT's encoder is based on the Vision Transformer (ViT) design and achieves a high compression rate: it compresses a 1024×1024-pixel input image into just 256 image tokens (a 256×1024 feature map, i.e. 256 tokens of dimension 1024), which greatly reduces the computational cost for the decoder.
- Input support: The encoder supports a variety of input types, including scene images and document images, and can handle different optical characters (such as text, tables, formulas, geometric figures, etc.).
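One plausible accounting of how a 1024×1024 image is reduced to 256 tokens. The 16-pixel patch size and the two stride-2 downsampling convolutions below are assumptions about a typical ViT-style encoder, not figures stated in this article:

```python
# Back-of-the-envelope token count for the encoder.
# Assumptions (not from the article): 16-pixel ViT patches and
# two stride-2 downsampling convolutions after the patch grid.
image_side = 1024            # input resolution quoted in the article
patch = 16                   # assumed ViT patch size
grid = image_side // patch   # 64 x 64 patch grid
raw_tokens = grid * grid     # 4096 patch tokens before compression

downsample = 2 * 2           # two assumed stride-2 convs halve each side twice
side_after = grid // downsample   # 16
compressed_tokens = side_after ** 2

print(raw_tokens, compressed_tokens)  # 4096 patch tokens -> 256 image tokens
```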
2. Linear Layer
- Function: The linear mapping layer is responsible for connecting the encoder and the decoder, mapping the image tokens generated by the encoder to the dimensions that the decoder can handle.
- Architecture: In the GOT model, the linear mapping layer resizes the encoder output from the dimension of 1024 × 768 to the dimension of 1024 × 1024 required by the decoder. It acts as a bridge in the entire model architecture to ensure smooth information transfer between the encoder and decoder.
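A shape-level sketch of that bridge: the linear layer simply multiplies each image token by a learned matrix so the token dimension matches the decoder's embedding size. The concrete dimensions here are illustrative placeholders, not confirmed values:

```python
import numpy as np

# Shape-level sketch of the linear bridge between encoder and decoder.
# All dimensions below are illustrative placeholders.
num_tokens = 256   # image tokens emitted by the encoder
d_enc = 768        # assumed encoder channel dimension
d_dec = 1024       # assumed decoder embedding dimension

rng = np.random.default_rng(0)
image_tokens = rng.standard_normal((num_tokens, d_enc))
W = rng.standard_normal((d_enc, d_dec)) * 0.02  # the learned linear map

decoder_input = image_tokens @ W
print(decoder_input.shape)  # one d_dec-dim embedding per image token
```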
3. Decoder
- Function: The decoder is responsible for converting the image tokens generated by the encoder into readable OCR results, that is, outputting the final recognized text.
- Architecture: GOT's decoder is designed based on the Qwen-0.5B language model (about 500M parameters) and supports long context processing (up to 8K tokens). The decoder's task is to gradually parse the input long text or complex optical characters and generate the corresponding OCR output.
- Output format: The decoder supports the output of ordinary text, formulas, tables, charts and other complex formats. Users can generate Markdown, TikZ, SMILES and other formatted results through simple prompts as needed.
4. Multi-stage training strategy
The training process of the GOT model is divided into three main stages, aiming to improve the model's generalization ability and adaptability to multiple tasks:
Stage 1: Encoder Pre-training
- Objective: To enable the encoder to have basic character encoding capabilities through pre-training on scene text and document-level character images.
- Strategy: Use a smaller decoder (such as OPT-125M) to jointly train with the encoder, and improve the encoder's encoding ability for multiple character formats by training on natural scenes and document images.
Stage 2: Joint training of encoder and decoder
- Goal: To form a complete GOT model by connecting a more powerful decoder (such as Qwen-0.5B).
- Strategy: At this stage, more complex OCR datasets (such as musical scores, mathematical formulas, geometric figures, etc.) are introduced for training to enable the model to handle more character types.
Stage 3: Decoder Post-Training (Fine-tuning)
- Goal: To further optimize the decoder for new tasks or user-defined requirements.
- Strategy: In this stage, the decoder is fine-tuned mainly through the generated synthetic datasets (such as multi-page documents, ultra-high-resolution images) to improve the practical application capabilities of the model, such as dynamic resolution processing, multi-page OCR, etc.
5. Data Engine and Synthetic Data
In order to improve the generalization ability of the GOT model, researchers designed multiple data engines to generate a large amount of synthetic data to support multi-task joint training. These data engines include:
- Ordinary OCR data: such as scene text and document OCR data.
- Formatted data: including mathematical formulas (LATEX format), molecular structures (SMILES format), tables (generated by LATEX), etc.
- General optical character data: such as music scores, geometric figures, charts, etc.
- Fine-grained data: used for scenarios such as region-level OCR, dynamic resolution processing, and interactive OCR.
6. Dynamic resolution and multi-page OCR support
- Dynamic resolution: The GOT model supports processing large-size images through sliding window technology in ultra-high resolution scenarios to ensure accurate character recognition.
- Multi-page OCR: Supports processing of multi-page PDF files and simplifies OCR tasks of multi-page documents by processing multiple pages at once.
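The sliding-window idea above can be sketched in a few lines: tile the large image into fixed-size, overlapping crops, OCR each crop, then merge the results. The window and stride values here are illustrative, not the model's actual tiling parameters:

```python
# Minimal sliding-window tiler for ultra-high-resolution pages.
# Window/stride sizes are illustrative assumptions.
def _starts(length, window, stride):
    """Window start offsets covering [0, length); the last window is
    right-aligned so the far edge is never missed."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window, stride))
    starts.append(length - window)
    return starts

def sliding_windows(width, height, window=1024, stride=768):
    """Return crop boxes (left, top, right, bottom) tiling the image."""
    return [(l, t, min(l + window, width), min(t + window, height))
            for t in _starts(height, window, stride)
            for l in _starts(width, window, stride)]
```

Each box can then be fed to the recognizer independently, which is essentially what a "multi-crop" inference mode does.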
7. Interactive OCR function
The GOT model can perform interactive OCR processing, allowing users to specify specific areas of the image to be recognized by entering coordinates or color prompts. This function is particularly suitable for local recognition in complex images or documents, which improves the flexibility of the model.
Experimental Results of GOT-OCR2.0
The experimental results of the GOT model show its excellent performance in multiple OCR tasks, including general document OCR, scene text OCR, formatted document OCR, and more extensive character OCR tasks. The following is a detailed summary of the experimental results of the GOT model in various tasks:
1. General document OCR performance
- Task description: Test the performance of the GOT model in common document OCR tasks, mainly processing Chinese and English documents in PDF format.
- Evaluation indicators: including Edit Distance, F1-score, Precision, Recall, BLEU score and METEOR score.
- Experimental results:
- GOT-OCR2.0 performs well in both Chinese and English document OCR tasks, outperforming other large-scale models (such as InternVL-ChatV1.5, Qwen-VL-Max, etc.).
- Especially in terms of edit distance, the GOT model performs significantly better than other competing models, with an edit distance of 0.038 for Chinese and 0.035 for English.
- GOT-OCR2.0 also achieved a high accuracy of nearly 98% in F1 score and BLEU score, demonstrating its powerful text perception and recognition capabilities.
| Model | Parameters | Edit distance (zh) | F1 score (zh) | Edit distance (en) | F1 score (en) |
| --- | --- | --- | --- | --- | --- |
| GOT | 580M | 0.038 | 0.980 | 0.035 | 0.972 |
| InternVL-1.5 | 26B | 0.265 | 0.816 | 0.393 | 0.751 |
| Qwen-VL-Max | >72B | 0.091 | 0.931 | 0.057 | 0.964 |
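For reference, a minimal implementation of the kind of normalized edit distance reported in these tables: the Levenshtein distance scaled to [0, 1]. Normalizing by the longer string is one common convention; the paper's exact protocol may differ:

```python
# Levenshtein distance via the classic dynamic-programming recurrence,
# then normalized to [0, 1] (lower is better, as in the tables).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars
```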
2. Scene Text OCR Performance
- Task description: Test the text recognition performance of GOT in natural scene images. Scene images include natural images containing text, such as signs and billboards in street scenes.
- Evaluation metrics: the same metrics as above are used: edit distance, F1 score, precision, and recall.
- Experimental results:
- GOT also performs very well in scene text OCR tasks, especially in Chinese scene text. GOT's edit distance is 0.096 and F1 score is 0.928, which is much better than other models.
- This result demonstrates the robustness and adaptability of the GOT model in processing optical characters in real scenes.
| Model | Parameters | Edit distance (zh) | F1 score (zh) | Edit distance (en) | F1 score (en) |
| --- | --- | --- | --- | --- | --- |
| GOT | 580M | 0.096 | 0.928 | 0.112 | 0.926 |
| Qwen-VL-Max | >72B | 0.168 | 0.867 | 0.182 | 0.881 |
| InternVL-1.5 | 26B | 0.123 | 0.913 | 0.267 | 0.834 |
3. Formatted document OCR performance
- Task description: Test the performance of GOT in the OCR task of complex format documents, which contain formulas, tables and other contents that need to be formatted for output.
- Evaluation metrics: Use multiple evaluation criteria such as edit distance, F1 score, BLEU, and METEOR.
- Experimental results:
- GOT has already performed well at a single resolution (1024×1024), especially in formula and table OCR tasks. The performance of small text, formula and table recognition is further improved by a multi-crop method.
- In formula recognition, GOT achieves an F1 score of 0.865 and an edit distance of 0.159 in the case of multiple cropping, which is significantly better than the single cropping result, demonstrating the effectiveness of dynamic resolution.
| Type | Edit distance | F1 score | BLEU | METEOR |
| --- | --- | --- | --- | --- |
| Full document text | 0.086 | 0.953 | 0.896 | 0.903 |
| Formula | 0.159 | 0.865 | 0.628 | 0.828 |
| Table | 0.220 | 0.878 | 0.779 | 0.811 |
4. Fine-grained OCR performance
- Task Description: Test the performance of GOT in fine-grained OCR tasks, where users can identify characters in a specific area by specifying the area or color prompts.
- Evaluation indicators: mainly use edit distance and F1 score.
- Experimental results:
- GOT significantly outperforms the existing Fox model in fine-grained OCR tasks and achieves leading performance in fine-grained text recognition in both Chinese and English.
- Especially in the region-level OCR task, GOT achieves an edit distance of only 0.041 and an F1 score of up to 0.970, demonstrating its powerful interactive OCR capability.
| Model | Language | Edit distance | F1 score |
| --- | --- | --- | --- |
| GOT | English | 0.041 | 0.970 |
| GOT | Chinese | 0.033 | 0.965 |
| Fox | English | 0.059 | 0.957 |
| Fox | Chinese | 0.042 | 0.955 |
5. More general OCR features
- Task Description: Test the performance of GOT in more general OCR tasks, including musical scores, geometric figures, charts, etc.
- Evaluation metrics: Edit distance and F1 score are also used to evaluate the performance of the model.
- Experimental results:
- GOT still performs well in more complex OCR tasks such as music scores and geometric figures, with an F1 score of 0.963 for music score recognition and 0.882 for geometric figure recognition.
- In the chart OCR task, GOT even outperforms models designed specifically for charts (such as OneChart and ChartVLM), demonstrating its strong versatility.
| Type | Edit distance | F1 score |
| --- | --- | --- |
| Sheet music | 0.046 | 0.963 |
| Geometry | 0.061 | 0.882 |
Summary of GOT-OCR2.0
The GOT model has demonstrated its strong capabilities in general document OCR, text recognition OCR, formatted document OCR, and fine-grained and general OCR tasks through experimental performance in multiple OCR tasks. In particular, in core indicators such as edit distance and F1 score, the GOT model outperforms many large competitive models, demonstrating its potential in the OCR-2.0 era.
Model download: https://huggingface.co/ucaslcl/GOT-OCR2_0
Online experience: https://huggingface.co/spaces/ucaslcl/GOT_online
- Author: KCGOD
- URL: https://kcgod.com/GOT-OCR2.0
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please credit the source when reposting!