- Publication: arXiv (Tsinghua University)
- Publication Date: August 13th, 2024
- Organizations mentioned: Tsinghua University, Zhipu AI, Anthropic, OpenAI, Meta
- Publication Authors: Yushi Bai, et al.
- Technical background required: High
- Estimated read time (original text): 45 minutes
- Sentiment score: 75%, somewhat positive
Large language models (LLMs) have made significant strides in processing extensive inputs, with some capable of handling over 100,000 tokens. However, a critical limitation persists: these models struggle to generate outputs exceeding 2,000 words, despite their ability to process much longer inputs. This research paper introduces LongWriter, a groundbreaking approach to extend the output length of LLMs from 2,000 to over 10,000 words.
The study addresses a pressing need in AI research, as over 1% of user prompts explicitly request outputs surpassing the current 2,000-word limit. By enabling ultra-long text generation, this research could revolutionize fields including content creation, academic writing, and automated reporting. For businesses and professionals, the advance promises more comprehensive AI-generated reports, in-depth analyses, and longer-form content, potentially transforming how we approach tasks requiring extensive written output.
TLDR
Goal: Researchers from Tsinghua University and Zhipu AI conducted this study to address the output length limitation of current large language models (LLMs), which typically struggle to generate content beyond 2,000 words. The research aims to extend LLMs’ output capacity to over 10,000 words while maintaining output quality, addressing a pressing need identified in user interaction logs where over 1% of prompts request outputs exceeding the current 2,000-word limit.
Methodology:
- The researchers developed AgentWrite, an agent-based pipeline that breaks down long writing tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words (a sketch of this plan-then-write loop follows this list).
- Using AgentWrite, they constructed LongWriter-6k, a dataset containing 6,000 examples with output lengths ranging from 2,000 to 32,000 words, which they incorporated into model training.
- They created LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities, and used it to assess their models alongside existing proprietary and open-source models.
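For intuition, here is a minimal sketch of the plan-then-write loop behind AgentWrite. The `llm` completion function and the prompt wording are illustrative assumptions; the paper's actual prompts, models, and orchestration differ.

```python
# Minimal sketch of the AgentWrite plan-then-write idea.
# `llm` is a hypothetical text-completion function; the paper's
# actual prompts and pipeline details differ.

from typing import Callable, List

def agent_write(instruction: str, llm: Callable[[str], str]) -> str:
    # Step I: ask the model to decompose the task into a numbered plan,
    # one subtask (with a target word count) per line.
    plan = llm(
        "Break the following writing task into a numbered outline of "
        "subtasks, one per line, each with a target word count:\n"
        + instruction
    )
    subtasks: List[str] = [line for line in plan.splitlines() if line.strip()]

    # Step II: write each section in order, conditioning on the original
    # instruction, the full plan, and everything written so far, so that
    # consecutive sections stay coherent.
    sections: List[str] = []
    for subtask in subtasks:
        written_so_far = "\n\n".join(sections)
        prompt = (
            f"Writing task: {instruction}\n"
            f"Overall plan:\n{plan}\n"
            f"Text written so far:\n{written_so_far}\n"
            f"Now write only this section: {subtask}"
        )
        sections.append(llm(prompt))

    return "\n\n".join(sections)
```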
Key findings:
- The study revealed that the constraint on output length in current LLMs is primarily rooted in the characteristics of Supervised Fine-Tuning (SFT) datasets, with models effectively capped by the upper limit of output lengths present in their training data.
- By incorporating the LongWriter-6k dataset into training, the researchers successfully scaled the output window size of existing models to over 10,000 words without compromising output quality.
- The LongWriter models demonstrated significant improvements in output length scores (S_l) for prompts requiring 2,000-4,000 words and 4,000-20,000 words, outperforming existing models, which often failed to meet these length requirements (an illustrative length-scoring function follows this list).
- Direct Preference Optimization (DPO) further enhanced the model’s long-text writing capabilities, improving both output quality and adherence to length requirements.
- The 9B-parameter LongWriter model achieved state-of-the-art performance on the LongBench-Write benchmark, surpassing even much larger proprietary models.
- Analysis of user interaction logs revealed a clear demand for longer outputs, with over 1% of user prompts explicitly requesting outputs exceeding the 2,000-word limit of existing models.
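As a rough illustration of how a length score like S_l can work, the function below gives full marks when the output length matches the request and decays linearly with the relative deviation. The exact shape is an assumption for illustration, not the formula from the paper.

```python
# Illustrative length score in the spirit of LongBench-Write's S_l.
# The linear decay is an assumption, not the paper's exact formula.

def length_score(l_out: int, l_req: int) -> float:
    """Score in [0, 100]; 100 means the requested length was hit exactly."""
    deviation = abs(l_out - l_req) / l_req
    return 100.0 * max(0.0, 1.0 - deviation)

print(length_score(2000, 2000))  # 100.0 (exact match)
print(length_score(1000, 2000))  # 50.0  (half the requested length)
print(length_score(4500, 2000))  # 0.0   (far past the request)
```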
Recommendations:
- Researchers and developers should focus on expanding the maximum output length in SFT datasets to unlock the potential for longer outputs in LLMs, as the study shows that models can generate much longer outputs when trained on appropriate data.
- The AgentWrite pipeline should be considered for automatically constructing long-output training data, as it proved effective in generating coherent outputs up to 20,000 words in length.
- Future work should explore constructing even longer training data to potentially extend LLMs’ output window size beyond the current achievement of 10,000-20,000 words.
- The integration of techniques like Direct Preference Optimization (DPO) should be considered in the development of long-form generation models to improve output quality and adherence to length requirements (a minimal sketch of the DPO objective follows this list).
- As LLMs become capable of generating novel-length content, industries relying on long-form content creation should prepare for potential disruptions and consider how to leverage this technology effectively.
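For reference, DPO trains on pairs of preferred and rejected responses with a contrastive objective. The sketch below assumes sequence log-probabilities have already been computed under the policy being trained and a frozen reference model; it shows the objective only, not the paper's training setup.

```python
# Minimal sketch of the DPO objective on a batch of preference pairs.
# Inputs are summed log-probabilities of the chosen (preferred) and
# rejected responses; how those are computed is assumed, not shown.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response: beta times the
    # policy-vs-reference log-probability ratio.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the preferred response's reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```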
Thinking Critically
Implications
- Content creation revolution: AI-generated ultra-long texts may redefine authorship, challenging traditional publishing models and content authenticity verification methods.
- Educational paradigm shift: AI-assisted research could accelerate knowledge dissemination, prompting a reevaluation of academic assessment methods and critical thinking development.
- Writing profession transformation: Professionals in writing-intensive fields may need to pivot towards AI prompt engineering and high-level content curation skills.
Alternative perspectives
- Quality vs. Quantity: Focus on length may overlook potential issues with coherence and idea dilution in ultra-long outputs.
- Data feedback loop: Using AI-generated data for training could amplify biases and reduce thought diversity in future content generation.
- Neglected priorities: Emphasis on output length might divert resources from improving factual accuracy and reasoning capabilities.
AI predictions
- AI systems capable of drafting entire books and comprehensive research papers will emerge within 2-3 years.
- News organizations will employ AI for generating in-depth investigative reports by 2026, reshaping journalistic roles.
- Specialized, domain-specific AI models for long-form content generation will be developed within 5 years, enhancing accuracy in fields like law and technical documentation.
Glossary
- AgentWrite: A novel agent-based pipeline designed to leverage off-the-shelf LLMs to automatically construct extended, coherent outputs.
- Output window size: Refers to the maximum length of text a language model can generate in a single output.
- Plan-augmented output data: A dataset where the writing plan is concatenated to the beginning of the writing content for training purposes.
- Instruction backtranslation: A method of constructing long-output SFT data by generating instructions for existing long-form texts.
- SFT (Supervised Fine-Tuning): A training process that adapts a pre-trained language model to specific tasks or domains using labeled data, crucial for determining the model’s output capabilities.
- DPO (Direct Preference Optimization): A technique used to improve a model’s output quality and enhance its ability to follow length constraints in instructions.
- Packing training: An efficient training method that involves combining multiple shorter sequences into a single longer sequence during the training process.
- Loss weighting: A strategy in model training where the contribution of each target token to the loss is adjusted to improve performance on tasks with long outputs (see the sketch after this glossary).
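To make the last two entries concrete, the sketch below contrasts averaging loss by token with averaging by sequence over a packed batch. The toy data are assumptions for illustration, not the paper's training code.

```python
# Illustrative loss weighting over a packed batch: two training
# sequences concatenated into one, with per-token losses already
# computed. A short sequence (3 target tokens, loss 1.0 each) is
# followed by a long one (6 target tokens, loss 2.0 each).

import torch

token_losses = torch.tensor([1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0])
seq_ids = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1, 1])  # owning sequence per token

# Averaging by token: every token counts equally, so long-output
# sequences contribute proportionally more to the gradient.
loss_by_token = token_losses.mean()

# Averaging by sequence: each sequence counts equally, so individual
# tokens in long outputs are down-weighted relative to short ones.
per_seq = torch.stack([token_losses[seq_ids == i].mean()
                       for i in seq_ids.unique()])
loss_by_seq = per_seq.mean()

print(loss_by_token.item())  # ~1.667 (long sequence dominates)
print(loss_by_seq.item())    # 1.5    (both sequences weighted equally)
```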