The power of code in LLM training: New research from Cohere reveals that including code in the pre-training data of large language models (LLMs) significantly improves their performance on non-coding tasks.

  • Researchers systematically investigated the impact of code data in LLM pre-training on general performance beyond coding tasks.
  • The study used a two-phase training process: continued pre-training followed by a cooldown phase, testing various ratios of text and code in the training data (a minimal sketch of such a mixture schedule follows this list).
  • Models were evaluated at different scales, from 470 million to 2.8 billion parameters, using benchmarks for world knowledge, natural language reasoning, and code performance.
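
To make the setup concrete, here is a minimal Python sketch of a two-phase data-mixture schedule of the kind the study varies. The pool contents, the 50/50 and 20/80 ratios, the phase lengths, and the placeholder `model.train_step` are illustrative assumptions, not the paper's actual configuration.

```python
import random

def sample_batch(text_pool, code_pool, code_fraction, batch_size, rng):
    """Draw one training batch with the requested code/text ratio."""
    n_code = round(batch_size * code_fraction)
    batch = rng.choices(code_pool, k=n_code)
    batch += rng.choices(text_pool, k=batch_size - n_code)
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
text_pool = [f"text_doc_{i}" for i in range(1000)]
code_pool = [f"code_doc_{i}" for i in range(1000)]

# Phase 1: continued pre-training on an assumed balanced 50/50 mixture.
for _ in range(1000):
    batch = sample_batch(text_pool, code_pool, 0.5, 8, rng)
    # model.train_step(batch)  # placeholder for the actual update

# Phase 2: cooldown on an assumed code-lighter, higher-quality mixture.
for _ in range(100):
    batch = sample_batch(text_pool, code_pool, 0.2, 8, rng)
    # model.train_step(batch)
```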

Key findings: The inclusion of code in pre-training data consistently improved LLM performance across a wide range of non-coding tasks, with the benefits increasing as model size grew.

  • Models trained on code consistently outperformed text-only models on natural language reasoning tasks, with 100% code pre-training leading to the best performance.
  • A balanced mixture of code and text in pre-training data resulted in the best performance for world knowledge tasks.
  • Both code-only and balanced models outperformed text-only models on generative tasks, indicating that code data improves both reasoning and generation quality.

Impact of model scale: The performance gains from adding code to pre-training data became more pronounced as model size increased, particularly in world knowledge and code performance.

  • The trade-off between natural language tasks and code generation increased with model size.
  • While the study was limited to models up to 2.8 billion parameters due to cost constraints, researchers believe the findings will hold true for larger models.

Quality and diversity of code data: The study found that incorporating high-quality synthetic code and code-adjacent data in pre-training significantly boosted LLM performance.

  • Synthetic code data, created by pairing problem statements with formally verified Python solutions, showed strong potential for improving model performance (a hypothetical sketch of assembling such a record follows this list).
  • Code-adjacent data, such as GitHub pull requests and commits, improved the models’ performance on reasoning tasks.
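
The sketch below shows one way a synthetic code-data record might be assembled: a problem statement paired with a candidate Python solution that is kept only if it passes verification. The record layout is an assumption, and true formal verification is stronger than what is shown; it is approximated here by executing assertion tests, purely for illustration.

```python
def passes_checks(solution_src: str, tests: list[str]) -> bool:
    """Execute the candidate solution and its assertion-based tests."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)   # define the function
        for test in tests:
            exec(test, namespace)       # assertions raise on failure
        return True
    except Exception:
        return False

problem = "Write a function add(a, b) that returns the sum of two integers."
candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]

if passes_checks(candidate, tests):
    record = {"problem": problem, "solution": candidate}
    print(record)  # the record would then join the pre-training corpus
```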

Implications for enterprise applications: The findings have significant implications for businesses looking to leverage LLMs for various applications.

  • Including code in the cooldown phase of training, which is similar to fine-tuning, led to further improvements in non-code-related tasks.
  • Enterprises can potentially fine-tune pre-trained models with high-quality code from internal codebases and code-adjacent data to achieve better performance for specific applications; a sketch of packaging such data appears below.
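
As a rough illustration of that last point, the following sketch packages internal source code and code-adjacent data (here, a commit diff with its message) into a fine-tuning corpus. The JSONL layout and the example records are assumptions for illustration, not a format prescribed by the research.

```python
import json

source_files = {
    "utils.py": "def slugify(s):\n    return s.lower().replace(' ', '-')\n",
}
commits = [
    {
        "message": "Strip whitespace before slugifying",
        "diff": (
            "-    return s.lower().replace(' ', '-')\n"
            "+    return s.strip().lower().replace(' ', '-')\n"
        ),
    },
]

with open("finetune_corpus.jsonl", "w") as f:
    # Plain source files become one training document each.
    for path, code in source_files.items():
        f.write(json.dumps({"source": path, "text": code}) + "\n")
    # Commits are serialized as message-plus-diff documents.
    for commit in commits:
        text = f"Commit message: {commit['message']}\nDiff:\n{commit['diff']}"
        f.write(json.dumps({"source": "commit", "text": text}) + "\n")
```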

Future directions and industry impact: The research is expected to influence the development and deployment of LLMs for enterprise use, potentially leading to more specialized pre-trained models.

  • Cohere, focused on providing LLMs for enterprise applications, may offer a wider range of pre-trained models with different mixtures of code and text, each optimized for specific types of tasks.
  • The findings are already informing how Cohere thinks about training state-of-the-art models for their clients.

Broader implications: This research challenges conventional thinking about LLM training and opens new avenues for improving AI performance across diverse tasks.

  • The discovery that code improves non-coding tasks suggests that the structure and logic inherent in programming languages may enhance an LLM’s overall reasoning and generative capabilities.
  • As AI continues to evolve, these findings may lead to more interdisciplinary approaches in model training, potentially incorporating data from various specialized fields to create more versatile and capable AI systems.
