Less supervision, better results: Study shows AI models generalize more effectively on their own

Artificial intelligence researchers at the University of Hong Kong and UC Berkeley have found that language models perform better when allowed to develop their own solutions through reinforcement learning than when they are trained on human-labeled examples. The finding challenges conventional wisdom about how best to train large language models (LLMs) and vision language models (VLMs).
Key research findings: The study compared supervised fine-tuning (SFT) with reinforcement learning (RL) approaches across both textual and visual reasoning tasks.
- Models trained primarily through reinforcement learning showed superior ability to generalize to new, unseen scenarios
- Excessive use of hand-crafted training examples can actually impair a model’s ability to handle novel situations
- The research used two benchmark tasks: GeneralPoints, which tests arithmetic reasoning, and V-IRL, which evaluates spatial reasoning (a sketch of the arithmetic task appears after this list)
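To make the arithmetic benchmark concrete: GeneralPoints-style problems ask the model to combine a handful of card values into an expression that reaches a target number, so answers can be checked programmatically. The sketch below is a minimal, assumed checker for such a task; the target of 24, the function name, and every implementation detail are illustrative guesses rather than details drawn from the paper.

```python
# Minimal sketch of a GeneralPoints-style answer checker (illustrative; details assumed).
import re
from collections import Counter

def verify_generalpoints(expression: str, cards: list[int], target: int = 24) -> bool:
    # Allow only digits, whitespace, parentheses, and basic arithmetic operators,
    # so the expression can be evaluated safely below.
    if not re.fullmatch(r"[\d\s+\-*/()]+", expression):
        return False
    # Each card value must be used exactly once.
    if Counter(int(tok) for tok in re.findall(r"\d+", expression)) != Counter(cards):
        return False
    try:
        value = eval(expression)  # safe here: the whitelist above excludes names and calls
    except (SyntaxError, ZeroDivisionError):
        return False
    return abs(value - target) < 1e-6

# Example: the cards 3, 3, 8, 8 reach 24 via 8 / (3 - 8 / 3).
print(verify_generalpoints("8 / (3 - 8 / 3)", [3, 3, 8, 8]))  # True
```

Because a check like this scores answers automatically, the task supplies exactly the kind of verifiable signal that reinforcement learning can optimize without human labels.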
Technical methodology: The researchers conducted their experiments using Llama-3.2-Vision-11B as the base model, implementing a hybrid training approach (a skeleton of the recipe follows the list below).
- Models received initial “warm-up” training using a small supervised fine-tuning dataset
- Separate versions were created for each task and training method
- Training was scaled independently for both RL and SFT approaches to compare their effectiveness
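In outline, the hybrid recipe amounts to a brief supervised warm-up followed by reinforcement learning against an automatic reward. The skeleton below sketches that flow under stated assumptions; supervised_finetune, policy_gradient_update, and the task_env interface are hypothetical placeholders, not the researchers' actual training stack.

```python
# Framework-agnostic skeleton of the warm-up-then-RL recipe described above.
# The helpers passed in (supervised_finetune, policy_gradient_update, task_env)
# are hypothetical placeholders standing in for a real training stack.

def train_hybrid(base_model, warmup_data, task_env,
                 supervised_finetune, policy_gradient_update, rl_steps: int):
    # Stage 1: brief SFT warm-up on a small labeled set, mainly to stabilize
    # the output format the model uses for its answers.
    model = supervised_finetune(base_model, warmup_data, epochs=1)

    # Stage 2: reinforcement learning scaled on top of the warmed-up model.
    # Rewards come from a programmatic check, not from human-labeled targets.
    for _ in range(rl_steps):
        prompt = task_env.sample_prompt()          # e.g. a GeneralPoints or V-IRL instance
        response = model.generate(prompt)
        reward = task_env.score(prompt, response)  # verifiable, automatic reward
        model = policy_gradient_update(model, prompt, response, reward)
    return model
```

Running this loop (or the supervised stage alone) once per task and per method would yield the separate model versions the bullets above describe.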
Critical results: The study revealed clear advantages of reinforcement learning over traditional supervised training methods.
- RL-trained models consistently outperformed SFT models when faced with out-of-distribution examples (see the evaluation sketch after this list)
- SFT-trained models showed signs of memorizing training rules rather than truly learning to generalize
- These findings held true across both text-only and multimodal (text and vision) scenarios
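One way to quantify such a comparison is to score each model on held-out prompts from the training distribution and from a shifted variant of the task, then look for a drop. The sketch below assumes a simple prompt-and-verifier interface; the data structures and function names are illustrative rather than taken from the study.

```python
# Sketch of an in-distribution vs. out-of-distribution comparison (illustrative interfaces).

def accuracy(model, examples, verifier) -> float:
    # Fraction of (prompt, spec) pairs where the model's answer passes the check.
    correct = sum(1 for prompt, spec in examples if verifier(model.generate(prompt), spec))
    return correct / len(examples)

def compare_generalization(models: dict, in_dist, out_dist, verifier):
    # A large drop from in-distribution to out-of-distribution accuracy is the
    # memorization signature the study attributes to SFT-heavy training.
    for name, model in models.items():
        print(f"{name}: in-dist {accuracy(model, in_dist, verifier):.2f}, "
              f"out-of-dist {accuracy(model, out_dist, verifier):.2f}")
```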
Practical implications: The research suggests important considerations for future AI model development and deployment.
- A small amount of supervised fine-tuning remains valuable for stabilizing model output format
- Pure reinforcement learning approaches may be particularly valuable for tasks with clearly verifiable results (see the reward sketch after this list)
- This approach could reduce the cost and effort required to create extensive hand-labeled training datasets
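Concretely, "clearly verifiable results" means the reward can come from a checker rather than from annotators. The wrapper below is a minimal, assumed illustration of that pattern; the names are hypothetical.

```python
# Minimal sketch: turning any programmatic answer checker into an RL reward function,
# so training needs no hand-labeled targets (names and design are assumed, not the paper's).

def make_reward_fn(verifier):
    def reward(response: str, task_spec) -> float:
        # Binary reward: 1.0 if the checker accepts the model's answer, else 0.0.
        return 1.0 if verifier(response, task_spec) else 0.0
    return reward

# Usage with a GeneralPoints-style checker like the one sketched earlier:
# reward_fn = make_reward_fn(lambda resp, cards: verify_generalpoints(resp, cards))
```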
Looking ahead: These findings challenge established training paradigms, but the researchers note that their results differ in important ways from other recent work such as DeepSeek-R1-Zero, which skipped supervised warm-up entirely. This suggests that the effectiveness of pure RL training may depend on the architecture of the base model being used.