Jina AI has launched two small language models that specialize in converting raw HTML into clean Markdown.
Features of Reader-LM
HTML to Markdown conversion
Reader-LM is designed to convert raw HTML content into clean, structured Markdown files, simplifying the process of extracting and cleaning data from web pages. Without complex rules or regular expressions, the model automatically strips noisy content such as ads, scripts, and navigation bars, and generates clearly structured Markdown.
Small but efficient language model
Reader-LM comes in two models: Reader-LM-0.5B and Reader-LM-1.5B. Although they have far fewer parameters, they are specifically optimized for the HTML-to-Markdown task and outperform many larger language models. Because of their compact size, they can run efficiently in resource-limited environments.
Both models are multilingual and support context lengths of up to 256K tokens. Despite their small size, they achieve state-of-the-art performance on this task, outperforming larger LLM counterparts while being roughly 1/50 of their size.
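As a quick illustration, here is a minimal sketch of running one of the models with the Hugging Face transformers library. The model IDs match the Hugging Face repositories linked at the end of this post; the chat-style prompt (raw HTML passed as the user message) and the generation settings are assumptions based on the model cards, not details given in this post.

```python
# Minimal sketch: converting raw HTML to Markdown with Reader-LM.
# Assumptions: model IDs from the Hugging Face repos linked below, and the
# chat-template prompt format (raw HTML as the user message) from the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/reader-lm-1.5b"  # or "jinaai/reader-lm-0.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

html = "<html><body><h1>Hello</h1><script>track()</script><p>World</p></body></html>"

# No instruction text is needed: the raw HTML itself is the prompt.
messages = [{"role": "user", "content": html}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
markdown = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(markdown)
```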
Multi-language support
Reader-LM supports multiple languages, enabling it to process web content from around the world. This multilingual capability makes it particularly well suited to international projects, since the model can automatically identify and handle HTML content in different languages.
Long context handling capabilities
The model can process contexts of up to 256K tokens, which means that even very large and complex HTML files can be handled efficiently. This makes it well suited to content-rich web pages and documents.
End-to-end data cleaning and extraction
Unlike traditional methods that rely on regular expressions or heuristic rules, Reader-LM is an end-to-end solution that automatically cleans HTML and extracts the key content without tedious manual configuration.
Performance of Reader-LM
Comparative performance
- The Reader-LM models were tested against larger language models such as GPT-4 and Gemini. Despite having far fewer parameters, Reader-LM outperformed some of these larger models on the HTML-to-Markdown task.
- Reader-LM-1.5B performed particularly well, with a higher ROUGE-L score (a measure of the similarity between the output and the reference) and lower Word Error Rate (WER) and Token Error Rate (TER), indicating that it generates content more accurately and with fewer errors.
Reader-LM-1.5B performs consistently well across all dimensions, especially in structure preservation and Markdown syntax usage. While it does not always outperform the Jina Reader API, its performance is competitive with larger models such as Gemini 1.5 Pro, making it an efficient alternative to much larger LLMs. Despite its smaller size, Reader-LM-0.5B still provides robust performance in terms of structure preservation.
Indicator comparison
The Reader-LM-0.5B and Reader-LM-1.5B performed as follows in the test:
- ROUGE-L: 0.56 (0.5B model), 0.72 (1.5B model), outperforming larger models such as GPT-4.
- WER (Word Error Rate): 1.87 (1.5B model), indicating more accurate output with fewer spuriously generated words.
- TER (Token Error Rate): 0.19 (1.5B model), showing the model's high accuracy when converting HTML to Markdown (a sketch of computing these metrics follows below).
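For reference, metrics of this kind could be computed along the following lines. This is a hedged sketch: the post does not say which tooling Jina AI used, so the rouge-score and jiwer packages, and the interpretation of TER as token-level edit distance computed with the model's own tokenizer, are assumptions.

```python
# Hedged sketch of the evaluation metrics mentioned above.
# pip install rouge-score jiwer transformers
from rouge_score import rouge_scorer
import jiwer
from transformers import AutoTokenizer

def evaluate(prediction: str, reference: str, tokenizer_id: str = "jinaai/reader-lm-1.5b"):
    # ROUGE-L: longest-common-subsequence overlap between output and reference.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

    # WER: word-level edit distance normalized by reference length.
    wer = jiwer.wer(reference, prediction)

    # TER (assumed here to mean token error rate): same idea, but over model
    # tokens instead of words, using the model's own tokenizer.
    tok = AutoTokenizer.from_pretrained(tokenizer_id)
    ref_tokens = " ".join(map(str, tok.encode(reference)))
    hyp_tokens = " ".join(map(str, tok.encode(prediction)))
    ter = jiwer.wer(ref_tokens, hyp_tokens)

    return {"rougeL": rouge_l, "wer": wer, "ter": ter}
```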
Efficiency and resource usage
- Since Reader-LM is a small model, it is lightweight in its resource requirements; the Reader-LM-0.5B model in particular can run efficiently on lower-end hardware, such as the free GPUs in Google Colab (see the loading sketch below).
- Despite its small size, the model has strong context-handling capabilities and supports 256K tokens, which lets it handle large and complex web content without compromising performance.
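A minimal loading sketch for constrained hardware, assuming half precision and automatic device placement (neither is stated in the post):

```python
# Sketch: loading the smaller 0.5B model on a modest GPU (e.g. a free Colab T4).
# Half precision and device_map="auto" (which requires the accelerate package)
# are assumptions, not official guidance.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/reader-lm-0.5b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # fits comfortably in a few GB of VRAM
    device_map="auto",           # place weights on the available GPU/CPU
)
```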
Training efficiency
- Reader-LM uses a multi-stage training process to ensure performance on complex HTML content. Compared with general-purpose pretrained models, Reader-LM completes the "selective copy" task from HTML to Markdown more effectively while maintaining high accuracy and processing speed.
Despite its small parameter count, Reader-LM performs well on HTML-to-Markdown tasks, with high accuracy, a low error rate, and strong long-context processing, and it runs efficiently on modest hardware. It outperforms some larger language models, especially in accuracy and task-specific performance, and is very cost-effective.
Training of Reader-LM
Reader-LM is trained in two phases, focusing on data cleaning and on handling long-context tasks. The model is carefully designed and trained to extract and convert Markdown content from raw, noisy HTML. The training process and technical details are as follows:
1. Data preparation
- HTML-to-Markdown conversion pairs: Jina AI used the Jina Reader API to generate a large amount of paired HTML-to-Markdown data. Each pair consists of the original HTML extracted from a web page and its corresponding converted Markdown (see the sketch after this list).
- Synthetic data: In addition to real web page data, synthetic data generated by GPT-4 was also introduced. This data is simpler and more predictable in structure, which helps the model handle HTML of varying complexity.
- High-quality data filtering: The training data was strictly filtered so that only high-quality Markdown entries were included in the training set, improving the overall performance of the model.
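The pair-construction step might look roughly like the following sketch. The r.jina.ai endpoint returning Markdown for a given URL is the public Jina Reader API; the quality filter shown here is only an illustrative heuristic, not Jina AI's actual filtering pipeline.

```python
# Hedged sketch of building (HTML, Markdown) training pairs: the raw page is
# fetched directly, and the Jina Reader API (r.jina.ai) provides the cleaned
# Markdown target. The filtering thresholds below are illustrative assumptions.
import requests

def build_pair(url: str):
    raw_html = requests.get(url, timeout=30).text
    markdown = requests.get(f"https://r.jina.ai/{url}", timeout=30).text

    # Simple quality filter: drop pairs whose Markdown is suspiciously short
    # or longer than the source HTML (an illustrative heuristic, not Jina's).
    if len(markdown) < 200 or len(markdown) > len(raw_html):
        return None
    return {"html": raw_html, "markdown": markdown}

pair = build_pair("https://example.com")
```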
2. Two-stage training process
- Short sequence stage:
In the early stage of training, Reader-LM was trained on HTML + Markdown sequences of up to 32K tokens in length, using a total of 1.5 billion training tokens.
The goal of this stage is for the model to learn to convert short texts and relatively simple HTML structures.
- Long sequence stage:
In the subsequent stage, Reader-LM processed more complex HTML files: the sequence length was extended to 128K tokens, and 1.2 billion training tokens were introduced.
The Zigzag-Ring Attention mechanism (Zilin Zhu's "Ring Flash Attention" implementation) is used so that the model can efficiently process long sequences.
3. Model size and architecture
- Reader-LM provides two models of different sizes:
Reader-LM-0.5B: With 494M parameters, it is a small but efficient model capable of converting HTML to Markdown over long contexts.
Reader-LM-1.5B: With 1.54B parameters, it performs better on long-text processing and complex content extraction.
- Both models support long-context processing of up to 256K tokens, ensuring they remain efficient on long web page content.
4. Dealing with duplication and degradation issues
- Repeated generation problem: A major problem encountered during training is that the model generates repeated content or falls into an infinite loop (known as "degeneration"). To address this, contrastive search and a contrastive loss were introduced during training, effectively reducing repetitive generation.
- Stopping criteria: To further avoid repetition, a simple repetition-based stopping criterion is added: when the model starts to repeat itself, decoding stops automatically, preventing the "infinite loop" problem (see the sketch below).
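Expressed with the Hugging Face generation API, the two mitigations might look like the following sketch: contrastive search via penalty_alpha/top_k, plus a custom stopping criterion that halts decoding when the model keeps emitting the same trailing n-gram. The n-gram size, repeat threshold, and penalty values are illustrative assumptions, not Jina AI's published settings.

```python
# Sketch: contrastive search plus a repetition-based stopping criterion.
# All thresholds below are illustrative assumptions.
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class RepetitionStop(StoppingCriteria):
    """Stop decoding once the trailing n-gram has repeated several times in a row."""

    def __init__(self, ngram: int = 10, repeats: int = 3):
        self.ngram = ngram
        self.repeats = repeats

    def __call__(self, input_ids: torch.LongTensor, scores, **kwargs) -> bool:
        seq = input_ids[0].tolist()
        if len(seq) < self.ngram * self.repeats:
            return False
        tail = seq[-self.ngram:]
        # True only if the last `repeats` n-grams are all identical to the tail.
        return all(
            seq[-(i + 1) * self.ngram : -i * self.ngram or None] == tail
            for i in range(self.repeats)
        )

# Usage during generation (model and input_ids as in the earlier sketch):
# outputs = model.generate(
#     input_ids,
#     penalty_alpha=0.6, top_k=4,   # contrastive search
#     stopping_criteria=StoppingCriteriaList([RepetitionStop()]),
#     max_new_tokens=4096,
# )
```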
5. Training framework and optimization
- A training framework based on the Transformers Trainer is used. To optimize training efficiency for long inputs, chunk-wise model forwarding is adopted to reduce GPU memory (VRAM) usage and improve training efficiency for long-sequence processing.
- When packing data, multiple short texts are concatenated into a long sequence to reduce padding and optimize training speed (see the packing sketch below).
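A minimal sketch of that packing step, assuming token lists separated by an EOS token and an illustrative 8K-token chunk length:

```python
# Sketch of sequence packing: concatenate short tokenized examples (separated
# by EOS) and cut the stream into fixed-length chunks so long-context training
# wastes less compute on padding. The 8192-token chunk size is an assumption.
from typing import Iterable, Iterator

def pack_sequences(
    examples: Iterable[list[int]], eos_id: int, chunk_len: int = 8192
) -> Iterator[list[int]]:
    buffer: list[int] = []
    for tokens in examples:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
    if buffer:  # the final partial chunk still needs padding (or can be dropped)
        yield buffer
```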
6. Experiments and Results
- During the training process, experiments showed that small models (such as 65M and 135M parameter models) performed well when processing shorter inputs, but performed poorly when processing long texts (over 1K tokens). Therefore, the 0.5B and 1.5B models were selected as the publicly released versions.
- The 0.5B model is considered to be the smallest model for handling long contexts , while the 1.5B model shows significant improvements in performance while maintaining high computational efficiency.
Model Download:
- Reader-LM-0.5B: Hugging Face – Reader-LM-0.5B
- Reader-LM-1.5B: Hugging Face – Reader-LM-1.5B
If you want to try it on Google Colab, you can quickly experience the model through the Colab Notebook provided by Jina AI.
The model is published under the CC BY-NC 4.0 license, which allows non-commercial use. If commercial use is required, please contact Jina AI to apply for a license.
- Author: KCGOD
- URL: https://kcgod.com/html-to-markdown-by-jina-ai
- Copyright: Unless otherwise stated, all articles on this blog are licensed under the BY-NC-SA agreement. Please credit the source when reposting!