LongWriter is an open-source project developed by the Tsinghua University Data Mining Research Group (THUDM) for generating very long texts (more than 10,000 words) with long-context large language models (LLMs).
The project aims to overcome the limitations of current LLMs in generating very long outputs while keeping the generated content coherent and relevant throughout.
- Coherence issues in long text generation:
Current large language models are prone to losing contextual coherence or repeating information when generating long texts. Through specialized training and optimization, LongWriter keeps the text logically coherent and relevant even when generating more than 10,000 words.
- Limits on model generation length:
Traditional models often perform poorly on very long inputs, and the length of the text they can generate is capped, falling short of applications that require long output. LongWriter's models are specially designed to generate extremely long texts within a long context, breaking this limitation.
- Fast generation of very long texts:
Some application scenarios require rapidly generating large amounts of text, but traditional models are slow at producing long outputs. The vLLM-based deployment provided by LongWriter can generate texts of more than 10,000 words in about a minute, greatly improving generation efficiency.
LongWriter's solution
AgentWrite Pipeline
Through an agent-based "plan-write" approach, AgentWrite decomposes the complex long-text generation task into multiple subtasks, each of which generates only a single paragraph. This ensures that each generated paragraph is coherent and high quality; the paragraphs are then merged into a complete long text. In this way, even existing models can generate texts of more than 20,000 words.
LongWriter-6k dataset
A dataset of 6,000 long text outputs (LongWriter-6k) is generated using AgentWrite. These datasets are used to fine-tune existing LLMs, enabling them to generate high-quality texts of more than 10,000 words.
Main Capabilities
Ultra-long text generation capability
LongWriter introduces the AgentWrite pipeline, which enables large language models (LLMs) to generate long texts of more than 10,000 words, or even up to 20,000 words. This is far beyond the current limitation of most long-context models, which can only generate about 2,000 words of text. This capability makes it suitable for application scenarios that need to generate a large amount of coherent content, such as long articles, reports, or book chapters.
High quality output capability
Despite the significant increase in the length of the output, LongWriter can still maintain high-quality text generation. The AgentWrite pipeline uses a two-step "plan-write" approach, first formulating a detailed writing plan for long text generation (including paragraph structure and word count requirements for each paragraph), and then generating content paragraph by paragraph. This approach ensures the coherence and reasonable structure of the generated text, and even for very long texts, it can maintain clear logic and coherence.
Long context handling capabilities
LongWriter uses an advanced long-context large language model that can handle inputs of more than 100,000 tokens. This means it can refer to very long input text and generate very long output that is relevant and consistent with the context.
Automated data construction capabilities
Through the AgentWrite pipeline, LongWriter can automatically construct very long output data. This capability not only improves the efficiency of training data, but also expands the application scenarios of the model when generating long texts.
Long text generation evaluation capability
LongWriter not only improves the model's generation capabilities but also introduces the LongBench-Write benchmark to evaluate a model's ability to generate very long texts. The model performed well on this benchmark, demonstrating superior generation quality and text-length control. The research also further optimized the generation process, for example through direct preference optimization (DPO), to improve the quality of long-text generation.
Direct Preference Optimization (DPO) capabilities
LongWriter can further optimize the model through DPO technology, so that it can meet the user-specified length requirements while improving the quality of the output content. LongWriter can adapt to various types of long text generation tasks, including but not limited to literary creation, academic papers, news reports, etc. This diversity makes LongWriter more widely applicable in practical applications.
Technical Methods of LongWriter
1. AgentWrite pipeline
Overview
AgentWrite is an agent-based segmented writing pipeline that decomposes the task of generating very long text into multiple subtasks. Each subtask corresponds to the generation of a paragraph of text, and finally these paragraphs are combined into a coherent long text.
Step
- Planning stage: The system first generates a detailed writing plan from the user's input. This plan includes the topic and target word count for each paragraph. Plan generation is done by calling an existing language model, which breaks the overall task down into reasonable subtasks.
- Writing stage: Following the generated plan, the system writes the text paragraph by paragraph. When generating each paragraph, the system supplies the previously generated paragraphs as context, keeping the new paragraph consistent and coherent with the preceding text. Although this serial dependency rules out parallel generation, it ensures high output quality.
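The two stages above can be sketched in plain Python. Here `call_llm` is a hypothetical stand-in for whatever chat-model API is used, and the prompt wording is illustrative; this is not code from the LongWriter repository:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a GLM- or OpenAI-style API)."""
    raise NotImplementedError

def agent_write(instruction: str, call_llm=call_llm) -> str:
    # Stage 1 (planning): ask the model for a paragraph-level plan,
    # one line per paragraph with a topic and a word budget.
    plan_prompt = (
        "Break the following writing task into numbered paragraphs. "
        "For each, give a one-sentence topic and a target word count.\n\n"
        f"Task: {instruction}"
    )
    plan = call_llm(plan_prompt)

    # Stage 2 (writing): generate each paragraph sequentially, feeding
    # all previously written text back in as context for coherence.
    paragraphs = []
    for step in plan.splitlines():
        if not step.strip():
            continue
        write_prompt = (
            f"Task: {instruction}\n"
            f"Text so far:\n{''.join(paragraphs)}\n"
            f"Now write only this paragraph: {step}"
        )
        paragraphs.append(call_llm(write_prompt) + "\n\n")
    return "".join(paragraphs)
```

The serial loop is the price of coherence: each paragraph conditions on everything written so far, so paragraphs cannot be generated in parallel.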
2. LongWriter-6k Dataset
Construction
LongWriter-6k is a dataset containing 6,000 very long text output samples. These data are generated by the AgentWrite pipeline and cover a variety of output lengths, ranging from 2,000 words to 32,000 words.
Purpose
This dataset is used to fine-tune the existing language model so that the model can generate very long texts. By introducing this dataset, the generation length limit of the model is significantly increased from the original approximately 2,000 words to more than 10,000 words.
3. Model fine-tuning and training
Supervised Fine-Tuning (SFT)
During fine-tuning, LongWriter combines the LongWriter-6k dataset with the general SFT dataset. Through this hybrid training, the model not only retains its original general capabilities, but also gains the ability to generate long texts.
Loss function adjustment
During training, the loss is averaged per token rather than per sequence. This ensures that each token of a long output contributes fully to the loss function instead of being down-weighted simply because it belongs to a long sequence, improving the model's performance on long-text generation tasks.
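A toy numeric illustration (plain Python, made-up loss values) of why per-token averaging matters when short and long outputs are mixed in a batch:

```python
# Per-token loss values for one short and one long training example.
short_seq_losses = [2.0, 2.0]      # 2-token output
long_seq_losses  = [1.0] * 10      # 10-token output

# Per-sequence averaging: each sequence contributes equally to the batch
# loss, so each token of the long output carries only 1/10 of a
# sequence's weight.
per_seq = (sum(short_seq_losses) / len(short_seq_losses)
           + sum(long_seq_losses) / len(long_seq_losses)) / 2

# Per-token averaging: every token contributes equally, so long outputs
# are not down-weighted during fine-tuning.
all_tokens = short_seq_losses + long_seq_losses
per_token = sum(all_tokens) / len(all_tokens)

print(per_seq)    # 1.5
print(per_token)  # 14/12 ≈ 1.167
```

Under per-sequence averaging the ten tokens of the long example together weigh the same as the two tokens of the short one; per-token averaging removes that bias.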
4. Direct Preference Optimization (DPO)
Overview
DPO is a technique used to further improve the quality of model output, especially when the model needs to strictly follow instruction length requirements.
Implementation
The LongWriter-9B model is trained with DPO to generate long texts with higher quality and better compliance with length constraints. The training data includes general DPO data and preference data specifically for long text instruction generation.
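As a hedged sketch, the standard DPO objective for a single preference pair can be written as follows. This is the general DPO formula from the literature, not code from the LongWriter repository; the log-probabilities are for whole responses under the policy and a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
    Minimizing it pushes the policy to favor the chosen (e.g. length-
    compliant, higher-quality) response over the rejected one, relative
    to the reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably:
    return math.log1p(math.exp(-margin))
```

When the policy prefers the chosen response more strongly than the reference does, the margin is positive and the loss falls below log 2 (the value at zero margin).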
5. Evaluation and Benchmarking
LongBench-Write Benchmark
This benchmark is used to evaluate the performance of models when generating very long texts, including the accuracy of text length and the evaluation of text quality. The evaluation indicators include the relevance, accuracy, coherence, clarity, breadth and depth of the text, and reading experience.
Long text dependency evaluation
The system uses the cumulative mean negative log-likelihood (NLL) test to assess the presence of long-range dependencies in long texts. This helps ensure that the generated long text is logically coherent and interconnected, rather than simply a splicing of unrelated content.
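A minimal sketch of the cumulative mean NLL statistic, assuming per-token negative log-likelihoods have already been computed by some scoring model:

```python
def cumulative_mean_nll(token_nlls):
    """Cumulative mean NLL at each position k: the mean of the per-token
    negative log-likelihoods over the first k tokens, k = 1..n. A curve
    that drops over later positions suggests later tokens are easier to
    predict given the earlier text, i.e. long-range dependencies exist
    rather than a splice of unrelated fragments."""
    out, total = [], 0.0
    for i, nll in enumerate(token_nlls, start=1):
        total += nll
        out.append(total / i)
    return out
```

For example, `cumulative_mean_nll([3.0, 2.0, 1.0])` returns `[3.0, 2.5, 2.0]`: the declining curve indicates each successive token is better predicted from its growing context.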
Experimental Results of LongWriter
1. Test results of long text generation ability
Test setup
The research team used the LongWrite-Ruler test to evaluate the maximum generation length of the model. The test instructions required the model to generate texts ranging from 1,000 to 30,000 words (including Chinese and English instructions).
Test results
- The maximum generation length of the LongWriter model has been extended to between 10k and 20k words, which is a significant improvement compared to existing models that can usually only generate text of around 2k words.
- In generation tasks of [4k, 20k) words, traditional models can hardly reach the required output length, in some cases producing only a third of it. LongWriter, trained with the LongWriter-6k dataset, can generate long texts that meet the length requirement while maintaining high output quality.
2. Assessment of quality and length consistency
Evaluation Metrics
Two main metrics were used to evaluate model performance:
- Length score (Sl): used to measure how close the model output length is to the required length.
- Quality Score (Sq): GPT-4o is used as the evaluation model to score the output quality on six dimensions, including relevance, accuracy, coherence, clarity, breadth and depth, and reading experience.
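For illustration only, here is one plausible way to turn a required and actual word count into a length score; the exact Sl formula used by LongBench-Write may differ, and this hypothetical version simply decays linearly with relative deviation:

```python
def length_score(required_words, output_words):
    """Hypothetical linear length score: 100 when the output length
    exactly matches the requirement, falling to 0 once the relative
    deviation reaches 100%. An illustration, not the exact
    LongBench-Write Sl formula."""
    deviation = abs(output_words / required_words - 1.0)
    return 100.0 * max(0.0, 1.0 - deviation)
```

So a 5,000-word output against a 10,000-word requirement (50% deviation) would score 50, and a 30,000-word output (200% deviation) would score 0.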
Evaluation results
- Length score (Sl): The LongWriter-9B-DPO model performs particularly well in the output length range of [2k, 20k) words, and its length score is significantly better than that of the traditional model. In particular, in the range of [4k, 20k) words, the long text generation score is significantly improved.
- Quality score (Sq): The LongWriter model not only generates longer texts but also maintains a high level of quality, with breadth and depth improving by about 5%, although in some cases coherence and clarity decreased slightly (by about 2%).
3. Effect of DPO optimization
Optimization effect
DPO optimization significantly improves the output quality of the model and its compliance with length requirements:
- Sl score improvement: Compared with the model without DPO processing, the Sl score is improved by about 4%.
- Sq score improvement: The quality score also improved by about 3%, which shows that DPO optimization is effective in the long text generation task.
4. Comparison of different models
Evaluation test results of LongWriter show that it outperforms other existing open source and proprietary models in multiple aspects:
- In tasks with an output length of more than 2,000 words, the average length score of the LongWriter model far exceeds that of most models. In particular, in tasks ranging from 4,000 to 20,000 words, other models can hardly reach the expected length, while the LongWriter model can stably generate long texts that meet the requirements.
- In terms of output quality, the LongWriter model performed particularly well in dimensions such as breadth and depth, coherence, and reading experience, and DPO optimization further improved the scores of these dimensions.
- Comparison with other models: Compared with other common proprietary models (such as Claude 3.5 Sonnet, GPT-4 Turbo), the LongWriter model performs better in very long text generation tasks, especially in tasks that require generating more than 2,000 words, the LongWriter model can better meet the requirements.
- Human preference testing: In a manual comparison of text generated by the LongWriter model and the GPT-4o model, the LongWriter-9B-DPO model was preferred by human reviewers in 58% of the tests.
5. Long text dependency evaluation
Cumulative mean negative log-likelihood (NLL) test
Tests show significant long-range dependencies in text generated by the LongWriter model: the NLL value drops markedly in later parts of the text, meaning the generated text is logically coherent and tightly structured rather than a simple splice of unrelated fragments.
Model Extension and Future Directions of LongWriter
Model expansion
The maximum generation length of the LongWriter model has been expanded to between 10,000 and 20,000 words. In the future, it is possible to further increase the output length of the SFT dataset to enable the model to generate texts of 100k words or even longer.
AgentWrite Optimization
Future research may continue to optimize the AgentWrite framework to obtain higher quality long output data and improve the model's inference efficiency without sacrificing generation quality.
Online Demo: https://huggingface.co/spaces/THUDM/LongWriter
- Author: KCGOD
- URL: https://kcgod.com/longwriter
- Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA. Please credit the source when reposting!