Salesforce AI announced the open-sourcing of 🍃MINT-1T, the first multimodal interleaved dataset at the one-trillion-token scale. It contains one trillion text tokens and 3.4 billion images, roughly 10 times larger than existing open-source datasets, and it also draws on previously untapped sources such as PDFs and ArXiv papers.
Multimodal interleaved documents are sequences that mix images with text, enabling the training of large multimodal models that reason across both modalities.
Dataset construction principles of MINT-1T
- Scale: MINT-1T has a data volume of one trillion tokens, which is nearly 10 times larger than the previously largest open-source datasets (such as OBELICS and MMC4), allowing researchers to train larger multimodal models.
- Diversity: MINT-1T contains not only HTML documents, but also PDF documents and ArXiv papers. These additional document sources significantly improve the coverage of scientific documents and enrich the diversity of the dataset.
Dataset Contents of MINT-1T
The construction of the MINT-1T dataset involves data collection, processing, and filtering steps from multiple sources to ensure high quality and diversity of the data.
Data Sources of MINT-1T
HTML documents:
- Extract HTML documents from CommonCrawl.
- The processing period is from May 2017 to April 2024, using complete data from October 2018 to April 2024 and partial data from previous years.
- Filter conditions: Exclude documents with no images or more than 30 images, and documents whose image URLs contain inappropriate substrings (such as logo, avatar, porn, xxx).
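The HTML rules above boil down to a couple of cheap checks per document. Below is a minimal sketch, assuming documents are represented simply by their list of image URLs; the function name and document representation are illustrative, not part of the MINT-1T codebase.

```python
# Minimal sketch of the HTML document filter described above.
# The image-count thresholds and URL substrings follow the article;
# the function name and document representation are illustrative.
BAD_URL_SUBSTRINGS = ("logo", "avatar", "porn", "xxx")

def keep_html_document(image_urls: list[str]) -> bool:
    """Return True if a document passes the HTML filtering rules."""
    # Exclude documents with no images or more than 30 images.
    if not image_urls or len(image_urls) > 30:
        return False
    # Exclude documents whose image URLs contain inappropriate substrings.
    return not any(
        bad in url.lower() for url in image_urls for bad in BAD_URL_SUBSTRINGS
    )
```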
PDF documents:
- PDF documents were extracted from CommonCrawl WAT files, with the processing time ranging from February 2023 to April 2024.
- Use the PyMuPDF tool to download and parse PDF files.
- Filters: exclude PDFs larger than 50 MB or with more than 50 pages, skip pages without text, and determine the order in which images are interleaved with the text based on the bounding-box positions of text blocks and images on the page.
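A rough sketch of such a PDF pass with PyMuPDF is shown below. The size and page thresholds follow the article, but sorting text blocks and images purely by their vertical position is a simplified stand-in for the bounding-box-based insertion order described above, and the function is illustrative rather than the actual MINT-1T pipeline.

```python
# Rough sketch of the PDF pass with PyMuPDF (install with `pip install pymupdf`).
# Size/page thresholds follow the article; sorting text blocks and images by
# their vertical position is a simplified stand-in for the bounding-box-based
# insertion order described above.
import os

import fitz  # PyMuPDF

MAX_SIZE_BYTES = 50 * 1024 * 1024  # exclude PDFs larger than 50 MB
MAX_PAGES = 50                     # exclude PDFs with more than 50 pages

def parse_pdf(path: str):
    """Return an interleaved sequence of text blocks and image placeholders."""
    if os.path.getsize(path) > MAX_SIZE_BYTES:
        return None
    doc = fitz.open(path)
    if doc.page_count > MAX_PAGES:
        return None
    sequence = []
    for page in doc:
        items = []
        # Text blocks: (x0, y0, x1, y1, text, block_no, block_type).
        for x0, y0, x1, y1, text, _, block_type in page.get_text("blocks"):
            if block_type == 0 and text.strip():
                items.append((y0, ("text", text.strip())))
        if not items:
            continue  # skip pages without text
        # Images: order by the top edge of each image's bounding box.
        for info in page.get_image_info():
            items.append((info["bbox"][1], ("image", info)))
        # Interleave text and images by vertical position on the page.
        sequence.extend(entry for _, entry in sorted(items, key=lambda t: t[0]))
    return sequence
```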
ArXiv documents:
- Build interleaved documents using LaTeX source code.
- Parse figure tags in LaTeX code and interleave images with text.
- Process multi-file papers, identify the main file and clean up the LaTeX code (e.g. remove imports, references, tables and citation labels).
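The LaTeX handling can be sketched with a few regular expressions, as below. A real pipeline would likely use a proper LaTeX parser, and the specific cleanup patterns here are assumptions rather than the exact rules used for MINT-1T.

```python
# Illustrative sketch of interleaving ArXiv LaTeX sources: locate figure
# environments, pull out the \includegraphics file names, and strip a few
# kinds of noise (comments, citations, cross-references). The patterns are
# assumptions, not the exact cleanup rules used for MINT-1T.
import re

FIGURE_RE = re.compile(r"\\begin\{figure\*?\}.*?\\end\{figure\*?\}", re.DOTALL)
GRAPHICS_RE = re.compile(r"\\includegraphics(?:\[[^\]]*\])?\{([^}]+)\}")

def clean_latex(source: str) -> str:
    """Remove comments and citation/reference commands from LaTeX text."""
    source = re.sub(r"(?<!\\)%.*", "", source)            # comments
    source = re.sub(r"\\cite[tp]?\{[^}]*\}", "", source)  # citations
    source = re.sub(r"\\ref\{[^}]*\}", "", source)        # cross-references
    return source

def extract_figures(source: str) -> list[str]:
    """Return the image file names referenced inside figure environments."""
    files = []
    for env in FIGURE_RE.findall(source):
        files.extend(GRAPHICS_RE.findall(env))
    return files
```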
Data filtering and deduplication of MINT-1T
Text quality filtering:
- Use the FastText model for language identification, excluding non-English documents.
- Remove URLs containing inappropriate substrings (such as NSFW content).
- Apply text filtering rules from RefinedWeb and MassiveText to remove duplicate n-grams and low-quality documents.
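A minimal sketch of the language-identification step, using the off-the-shelf fastText language-ID model (lid.176.bin), is shown below; the confidence threshold is illustrative, not a value reported for MINT-1T.

```python
# Minimal sketch of language filtering with the fastText language-ID model
# (lid.176.bin, downloadable from the fastText website). The confidence
# threshold here is illustrative, not the one used by MINT-1T.
import fasttext

lang_model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.65) -> bool:
    # fastText predicts on a single line, so strip newlines first.
    labels, probs = lang_model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold
```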
Image filtering:
- Try to download all image URLs in the HTML dataset, removing links that cannot be retrieved.
- Filtering criteria: Remove images smaller than 150 pixels (avoid noisy images such as logos and icons) and images larger than 20,000 pixels (usually irrelevant images).
- For HTML documents, remove images with aspect ratio greater than 2; for PDF documents, adjust the threshold to 3 to keep scientific figures.
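Read as per-dimension limits, these image filters can be sketched with Pillow as follows; the function name and the per-dimension interpretation of the 150/20,000-pixel thresholds are assumptions.

```python
# Sketch of the image-size and aspect-ratio filters described above, using
# Pillow. Thresholds follow the article (interpreted per dimension); the
# is_pdf flag switches the aspect-ratio limit from 2 (HTML) to 3 (PDF).
from PIL import Image

def keep_image(path: str, is_pdf: bool = False) -> bool:
    with Image.open(path) as img:
        width, height = img.size
    # Drop tiny images (logos, icons) and extremely large, usually irrelevant ones.
    if min(width, height) < 150 or max(width, height) > 20_000:
        return False
    # Aspect-ratio filter: stricter for HTML, looser to keep scientific figures in PDFs.
    max_ratio = 3 if is_pdf else 2
    return max(width, height) / min(width, height) <= max_ratio
```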
Security filtering:
- Apply an NSFW image detector; if a single NSFW image is found, the entire document is removed.
- Remove personally identifiable information: e-mail addresses are replaced with templates and IP addresses with randomly generated, non-functional addresses.
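A sketch of this PII scrubbing with simple regular expressions is shown below; the exact patterns and replacement formats are assumptions, and a real pipeline would handle many more cases (IPv6, obfuscated e-mails, and so on).

```python
# Illustrative sketch of the PII scrubbing step: replace e-mail addresses
# with a template and IP addresses with randomly generated addresses from a
# reserved (non-routable) range. The regexes and replacement formats are
# assumptions, not the exact patterns used by MINT-1T.
import random
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    text = EMAIL_RE.sub("<EMAIL_ADDRESS>", text)
    # Substitute each IP with a random address from the reserved 240.0.0.0/4 block.
    return IPV4_RE.sub(lambda _: "240.0.0." + str(random.randint(1, 254)), text)
```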
Deduplication:
- Use a Bloom filter for efficient text deduplication with a false-positive rate of 0.01, removing paragraphs that repeat previously seen 13-grams (see the sketch after this list).
- Remove common HTML noise (like “Skip to content” or “Blog Archive”).
- Image deduplication is performed based on SHA256 hash values, removing images that appear more than ten times in a snapshot, as well as duplicate images in a single document.
- During data processing, an average of 2,350 CPU cores was used, totaling approximately 4.2 million CPU hours to build the dataset.
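The two deduplication mechanisms can be sketched as follows: a small hand-rolled Bloom filter for 13-gram paragraph deduplication, plus SHA-256 hashing for image deduplication. The Bloom filter here is only illustrative; MINT-1T's implementation and its exact paragraph-removal policy may differ.

```python
# Sketch of the deduplication step: a minimal Bloom filter for 13-gram
# paragraph dedup (false-positive rate 0.01) and SHA-256 hashing for image
# dedup. Hand-rolled for illustration; a production pipeline would use an
# optimized Bloom filter implementation.
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Standard sizing formulas: m = -n*ln(p)/ln(2)^2, k = (m/n)*ln(2).
        self.size = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing derived from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return ((h1 + i * h2) % self.size for i in range(self.num_hashes))

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] >> bit & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

def is_duplicate_paragraph(paragraph: str, bloom: BloomFilter, n: int = 13) -> bool:
    """Mark a paragraph as duplicate if any of its 13-grams was seen before."""
    words = paragraph.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    seen_flags = [bloom.add(g) for g in ngrams]  # register every n-gram
    return any(seen_flags)

def image_hash(image_bytes: bytes) -> str:
    """SHA-256 hash used to detect repeated images across documents."""
    return hashlib.sha256(image_bytes).hexdigest()
```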
Model experiment
- Pre-training: The XGen-MM multimodal model was pre-trained on MINT-1T, with 50% of the tokens coming from HTML documents and the rest from PDF and ArXiv documents.
- Evaluation: Models trained on MINT-1T outperformed models trained on the previous leading dataset, OBELICS, on image captioning and visual question answering benchmarks.
Dataset analysis
The MINT-1T dataset brings significant improvements in scale, data-source diversity, and quality. The following is a detailed analysis of the dataset:
1. Document composition comparison
- Text token distribution: From a random sample of 50,000 documents, the number of text tokens per document was counted using the GPT-2 tokenizer (a counting sketch follows this comparison). The results show that the HTML subset of MINT-1T has a token distribution similar to OBELICS, but PDF and ArXiv documents are significantly longer on average.
- Image density : Analyzing the image density in documents, we found that the PDF and ArXiv documents of MINT-1T contain more images than HTML documents, with the ArXiv sample having the highest image density.
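A minimal sketch of the token-counting step with the GPT-2 tokenizer from Hugging Face transformers is shown below; document loading and sampling are assumed to happen elsewhere.

```python
# Sketch of the token-count analysis: count text tokens per document with the
# GPT-2 tokenizer from Hugging Face transformers, as described above.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(document_text: str) -> int:
    """Number of GPT-2 tokens in one document's text."""
    return len(tokenizer.encode(document_text))
```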
2. Data sources improve document diversity
- Domain coverage: LDA topic modeling over 100,000 documents shows that OBELICS documents are concentrated mainly in the humanities and social sciences, while the HTML subset of MINT-1T covers a wider range of domains and the PDF subset is concentrated mainly in science and technology.
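A sketch of this kind of topic modeling with scikit-learn's LDA implementation is shown below; the number of topics and vectorizer settings are illustrative, not those used in the MINT-1T analysis.

```python
# Sketch of the domain-coverage analysis: fit an LDA topic model over a sample
# of documents with scikit-learn. Topic count and vectorizer settings are
# illustrative, not the ones used for MINT-1T.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_topic_words(documents: list[str], n_topics: int = 20, n_words: int = 10):
    vectorizer = CountVectorizer(stop_words="english", max_features=50_000)
    counts = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    # For each topic, list the most probable words to label its domain.
    return [
        [vocab[i] for i in topic.argsort()[-n_words:][::-1]]
        for topic in lda.components_
    ]
```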
3. In-context learning performance
- Impact of the number of demonstrations: The model's in-context learning performance was evaluated with 1 to 8 in-context examples; the model trained on MINT-1T outperforms the OBELICS baseline at every number of examples.
4. Performance on different tasks
- Image captioning and visual question answering: On image captioning, models trained on OBELICS perform better, while on visual question answering, models trained on MINT-1T significantly outperform the other baselines.
- Performance in different domains: Analysis on the Massive Multi-discipline Multimodal Understanding and Reasoning benchmark (MMMU) shows that the full MINT-1T mixture significantly outperforms both OBELICS and MINT-1T's own HTML subset in the science and technology domains.
5. Impact of model architecture on performance
- XGen-MM and Idefics2 experiments : Experiments were conducted using different model architectures (XGen-MM and Idefics2). The results showed that MINT-1T (HTML) under the Idefics2 architecture performed well in image caption generation and visual question answering tasks.
Summary of MINT-1T
Through the above analysis, it can be seen that the MINT-1T dataset is significantly superior to existing open source datasets in terms of diversity, quality, and scale, especially in the fields of science and technology. Models trained based on MINT-1T perform well in multimodal tasks, providing a solid foundation and rich resources for future multimodal research.
- Author: KCGOD
- URL: https://kcgod.com/mint-1t
- Copyright: All articles in this blog, unless otherwise stated, are licensed under BY-NC-SA. Please indicate the source when reposting!