Code-Trained AI Models Outperform in Non-Coding Tasks

The power of code in LLM training: New research from Cohere reveals that including code in the pre-training data of large language models (LLMs) significantly improves their performance on non-coding tasks.

  • Researchers systematically investigated the impact of code data in LLM pre-training on general performance beyond coding tasks.
  • The study used a two-phase training process, continued pre-training followed by a cooldown phase, and tested various ratios of text and code in the training data (a minimal sketch of such a ratio sweep follows this list).
  • Models were evaluated at different scales, from 470 million to 2.8 billion parameters, using benchmarks for world knowledge, natural language reasoning, and code performance.

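The exact data pipeline isn't spelled out in the summary, but the ratio sweep can be pictured as sampling documents from a text pool and a code pool at a fixed proportion. Here is a minimal Python sketch under that assumption; the pools, the ratio values, and the `build_mixture` helper are invented for illustration, not Cohere's actual setup.

```python
import random

def build_mixture(text_docs, code_docs, code_fraction, num_docs, seed=0):
    """Sample a training stream in which roughly `code_fraction` of documents are code."""
    rng = random.Random(seed)
    stream = []
    for _ in range(num_docs):
        pool = code_docs if rng.random() < code_fraction else text_docs
        stream.append(rng.choice(pool))
    return stream

# Illustrative document pools and ratio values, not the study's actual data.
text_docs = ["Paris is the capital of France.",
             "Photosynthesis converts light into chemical energy."]
code_docs = ["def add(a, b):\n    return a + b",
             "for i in range(10):\n    print(i)"]

for frac in (0.0, 0.25, 0.5, 1.0):
    stream = build_mixture(text_docs, code_docs, code_fraction=frac, num_docs=1000)
    realized = sum(doc in code_docs for doc in stream) / len(stream)
    print(f"target code fraction {frac:.2f} -> realized {realized:.2f}")
```
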
Key findings: The inclusion of code in pre-training data consistently improved LLM performance across a wide range of non-coding tasks, with the benefits increasing as model size grew.

  • Models trained on code consistently outperformed text-only models on natural language reasoning tasks, with 100% code pre-training leading to the best performance.
  • A balanced mixture of code and text in pre-training data resulted in the best performance for world knowledge tasks.
  • Both code-only and balanced models outperformed text-only models on generative tasks, indicating that code data improves both reasoning and generation quality.

Impact of model scale: The performance gains from adding code to pre-training data became more pronounced as model size increased, particularly in world knowledge and code performance.

  • The trade-off between natural language tasks and code generation increased with model size.
  • While the study was limited to models up to 2.8 billion parameters due to cost constraints, researchers believe the findings will hold true for larger models.

Quality and diversity of code data: The study found that incorporating high-quality synthetic code and code-adjacent data in pre-training significantly boosted LLM performance.

  • Synthetic code data, created from problem statements paired with formally verified Python solutions, showed great potential for improving model performance (see the sketch after this list).
  • Code-adjacent data, such as GitHub pull requests and commits, improved the models’ performance on reasoning tasks.

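The paper's verification procedure isn't detailed here; one way to picture "formally verified" synthetic code is to pair each problem statement with a candidate solution that must pass checks before it enters the training set. The sketch below assumes test-based checking, and the problem, solution, and `verify` helper are invented for illustration.

```python
# Invented problem statement, candidate solution, and tests, used only to
# illustrate the "generate, then verify before training" idea.
problem = "Write a function `running_max(xs)` that returns the running maximum of a list."

solution = '''
def running_max(xs):
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
'''

def verify(solution_src, tests):
    """Execute the candidate solution and keep it only if every check passes."""
    namespace = {}
    try:
        exec(solution_src, namespace)
        func = namespace["running_max"]
        return all(func(inputs) == expected for inputs, expected in tests)
    except Exception:
        return False

tests = [([3, 1, 4, 1, 5], [3, 3, 4, 4, 5]), ([], [])]

if verify(solution, tests):
    training_example = {"prompt": problem, "completion": solution}
    print("verified; added to the synthetic training set")
```
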
Implications for enterprise applications: The findings have significant implications for businesses looking to leverage LLMs for various applications.

  • Including code in the cooldown phase of training, which is similar to fine-tuning, led to further improvements in non-code-related tasks (a minimal sketch of a cooldown step follows this list).
  • Enterprises can potentially fine-tune pre-trained models with high-quality code from internal codebases and code-adjacent data to achieve better performance for specific applications.

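Details of the cooldown phase are not given in the summary; loosely, it continues training for a short stretch on an upweighted, high-quality mixture while annealing the learning rate. The PyTorch sketch below illustrates that shape only, with a toy model, random batches, and assumed hyperparameters standing in for the real configuration.

```python
import torch

# Stand-ins: a tiny model, random batches, and round-number hyperparameters
# replace the pre-trained LLM and the code-heavy cooldown mixture.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
cooldown_steps = 1000
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=cooldown_steps
)

for step in range(cooldown_steps):
    batch = torch.randn(8, 16)          # placeholder for a batch from the cooldown mixture
    loss = model(batch).pow(2).mean()   # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                    # linearly anneal the learning rate toward zero
```
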
Future directions and industry impact: The research is expected to influence the development and deployment of LLMs for enterprise use, potentially leading to more specialized pre-trained models.

  • Cohere, focused on providing LLMs for enterprise applications, may offer a wider range of pre-trained models with different mixtures of code and text, each optimized for specific types of tasks.
  • The findings are already informing how Cohere thinks about training state-of-the-art models for its clients.

Broader implications: This research challenges conventional thinking about LLM training and opens new avenues for improving AI performance across diverse tasks.

  • The discovery that code improves non-coding tasks suggests that the structure and logic inherent in programming languages may enhance an LLM’s overall reasoning and generative capabilities.
  • As AI continues to evolve, these findings may lead to more interdisciplinary approaches in model training, potentially incorporating data from various specialized fields to create more versatile and capable AI systems.