How subtle biases derail LLM evaluations

Large language models (LLMs) are increasingly deployed as judges and decision-makers in critical domains, but their judgments suffer from systematic biases that threaten reliability. Research from The Collective Intelligence Project reveals that positional preferences, ordering effects, and prompt sensitivity significantly undermine LLMs’ ability to make consistent judgments. Understanding these biases is crucial as AI systems expand into sensitive areas like hiring, healthcare, and legal assessments, where decision-making integrity is paramount.

The big picture: LLMs exhibit multiple systematic biases when used as judges, including positional preferences, ordering effects, and sensitivity to prompt wording, rendering their judgments unreliable.

  • These biases appear across multiple evaluation methods including pairwise comparisons, rubric-based scoring, and various classification approaches.
  • Even small, seemingly innocuous changes to prompts can dramatically alter an LLM’s judgment on identical content.

Key biases identified: In pairwise comparisons, LLMs show strong positional preferences, often favoring the first option regardless of content quality.

  • When using Option A/B labeling in comparisons, models showed a 13.8% preference for Option A over Option B even when the two options contained identical content (a minimal probe for this is sketched after this list).
  • Left-right positional bias exists when comparing items side by side, with a consistent preference for the left-positioned content.
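
The identical-content check is straightforward to reproduce. Below is a minimal sketch, assuming a hypothetical `judge()` function that stands in for whatever judge model you call; here it is a stub that simulates a position-biased judge (the 57% lean is an arbitrary illustration), and the prompt wording is illustrative rather than the wording used in the study.

```python
import random

def judge(prompt: str) -> str:
    """Stand-in for the judge-model call. This stub simulates a judge that
    leans toward answering 'A'; swap in a real LLM call in practice."""
    return "A" if random.random() < 0.57 else "B"

def option_a_rate(question: str, response: str, trials: int = 200) -> float:
    """Present the SAME text as both Option A and Option B and measure how
    often the judge answers 'A'. A position-neutral judge should land near
    0.5; a large gap signals positional bias."""
    a_wins = 0
    for _ in range(trials):
        prompt = (
            f"Question: {question}\n\n"
            f"Option A:\n{response}\n\n"
            f"Option B:\n{response}\n\n"
            "Which option answers the question better? Reply with exactly 'A' or 'B'."
        )
        a_wins += judge(prompt).strip().upper() == "A"
    return a_wins / trials

if __name__ == "__main__":
    rate = option_a_rate("What causes tides?",
                         "Tides are caused mainly by the Moon's gravity.")
    print(f"Option A chosen {rate:.0%} of the time (an unbiased judge would be near 50%)")
```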

Rubric vulnerabilities: LLMs’ scoring is heavily influenced by the order in which evaluation criteria are presented and how scales are described.

  • Models score identical content differently when the evaluation criteria are reordered within the same prompt (see the sketch after this list).
  • When using numerical scales, LLMs show reluctance to assign extreme scores (both high and low), clustering judgments toward the middle.
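
The same kind of probe works for rubric scoring: hold the content fixed, shuffle the criteria order, and look at the spread of scores. A minimal sketch follows; the criteria names, the prompt wording, and the `judge_score()` stub (which fakes a judge whose score depends on which criterion is listed first) are illustrative assumptions, not details from the study.

```python
import random
import statistics

CRITERIA = ["accuracy", "clarity", "completeness", "tone"]

def judge_score(prompt: str) -> int:
    """Stand-in for the judge-model call; should return an integer 1-10.
    This stub nudges the score based on which criterion is listed first,
    mimicking the ordering sensitivity described above."""
    first_criterion = prompt.split("criteria, in order: ")[1].split(",")[0]
    return 6 + (1 if first_criterion == "accuracy" else 0)

def rubric_order_probe(content: str, trials: int = 8) -> list[int]:
    """Score identical content repeatedly with the rubric criteria shuffled;
    any spread in the scores comes purely from criteria ordering."""
    scores = []
    for _ in range(trials):
        order = random.sample(CRITERIA, k=len(CRITERIA))
        prompt = (
            "Rate the text from 1 (poor) to 10 (excellent) against these "
            f"criteria, in order: {', '.join(order)}.\n\n"
            f"Text:\n{content}\n\nReply with a single integer."
        )
        scores.append(judge_score(prompt))
    return scores

if __name__ == "__main__":
    s = rubric_order_probe("The mitochondria is the powerhouse of the cell.")
    print(f"scores={s}  spread={max(s) - min(s)}  stdev={statistics.stdev(s):.2f}")
```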

Prompt sensitivity: Minor wording changes in system prompts can cause dramatic swings in judgment outcomes.

  • Even changing a single word in a prompt can significantly alter an LLM’s assessment of identical content.
  • This unpredictability makes it difficult to establish stable evaluation frameworks using LLMs.

Classification instability: When tasked with classifying content into categories, LLMs show inconsistent behavior and are easily influenced by irrelevant factors.

  • Models frequently reclassify identical content when presented in different contexts.
  • Classification tasks show high sensitivity to how categories are described and ordered in the prompt.

Implications: These biases create serious reliability concerns for using LLMs in high-stakes decision-making environments like hiring, healthcare, and legal assessments.

  • Current approaches to using LLMs as judges may give a false sense of objectivity while actually introducing systematic biases.
  • Measurement biases in AI judgments are often overlooked or undetected in real-world applications.

Why this matters: As AI systems are increasingly deployed in sensitive domains, understanding their decision-making limitations is crucial for responsible implementation.

  • Unaddressed biases could lead to systematic discrimination or unfair outcomes when these systems are used in consequential settings.
  • The presence of these biases challenges the notion that LLMs can serve as objective judges or evaluators.

Mitigation strategies: Researchers recommend using multiple evaluation methods simultaneously and aggregating results to increase reliability.

  • Randomizing the order of options, criteria, and category presentations can help minimize positional biases; a randomize-and-aggregate pairwise comparison is sketched below.
  • Testing prompts with identical content can reveal the presence and magnitude of biases in specific evaluation setups.
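
As a concrete illustration of the randomize-and-aggregate idea for the pairwise case, here is a minimal sketch. The `judge()` stub, its arbitrary lean toward Option A, and the prompt wording are assumptions for illustration; in practice you would substitute a real judge-model call and, per the researchers' advice, combine this with other evaluation methods as well.

```python
import random
from collections import Counter

def judge(prompt: str) -> str:
    """Stand-in for the judge-model call; should return 'A' or 'B'.
    This stub prefers Option A slightly, regardless of content."""
    return "A" if random.random() < 0.57 else "B"

def debiased_compare(question: str, resp_x: str, resp_y: str, trials: int = 11) -> str:
    """Compare two responses with the A/B positions randomized on every
    trial, then aggregate by majority vote so positional preference
    averages out instead of deciding the outcome."""
    votes = Counter()
    for _ in range(trials):
        pairs = [("x", resp_x), ("y", resp_y)]
        random.shuffle(pairs)                 # randomize which response gets label A
        labeled = list(zip("AB", pairs))      # e.g. [('A', ('y', ...)), ('B', ('x', ...))]
        prompt = (
            f"Question: {question}\n\n"
            + "\n\n".join(f"Option {label}:\n{text}" for label, (_, text) in labeled)
            + "\n\nWhich option answers the question better? Reply with exactly 'A' or 'B'."
        )
        verdict = judge(prompt).strip().upper()
        for label, (name, _) in labeled:
            if label == verdict:
                votes[name] += 1              # credit the underlying response, not the label
    return votes.most_common(1)[0][0] if votes else "tie"

if __name__ == "__main__":
    winner = debiased_compare(
        "What causes tides?",
        "The Moon's gravity pulls the oceans.",
        "Tides are caused mainly by the Moon's gravitational pull, with a smaller contribution from the Sun.",
    )
    print(f"aggregate winner: response {winner}")
```
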
Source: LLM Judges Are Unreliable — The Collective Intelligence Project
