LLM Benchmark Report v1.3

Generated: 2026-02-06 11:23:13 | Models: 7 | Dimensions: 11 | Questions: 65 | Includes AI Evaluation

Key Findings

Overall Rankings

Rank  Model                 Overall Score  Hallucination (1.5x)  Source Honesty (1.5x)  Self-Transparency
1     Claude Opus 4.5       95.6%          100.0%                100.0%                 100.0%
2     Claude Sonnet 4.5     95.6%          100.0%                100.0%                 100.0%
3     GPT-4o                94.4%          87.5%                 100.0%                 100.0%
4     Gemini 2.5 Flash-Lite 93.8%          100.0%                100.0%                 93.8%
5     Grok 4 Fast           90.7%          87.5%                 100.0%                 100.0%
6     GPT-4o Mini           89.9%          87.5%                 100.0%                 100.0%
7     Gemini 2.0 Flash      89.6%          75.0%                 100.0%                 81.2%

Detailed Results by Model

Claude Opus 4.5 - 95.6%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 100.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 57.1%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 100.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 100.0%
Practical Task Completion: 90.0%
Self-Transparency: 100.0%
Claude Sonnet 4.5 - 95.6%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 100.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 57.1%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 90.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 100.0%
Practical Task Completion: 100.0%
Self-Transparency: 100.0%
GPT-4o - 94.4%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 87.5%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 100.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 90.0%
Practical Task Completion: 90.0%
Self-Transparency: 100.0%
Gemini 2.5 Flash-Lite - 93.8%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 100.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 70.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 90.0%
Practical Task Completion: 100.0%
Self-Transparency: 93.8%
Grok 4 Fast - 90.7%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 87.5%
Uncertainty Acknowledgment: 80.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 85.7%
Consistency: 100.0%
Temporal Awareness: 90.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 80.0%
Practical Task Completion: 100.0%
Self-Transparency: 100.0%
GPT-4o Mini - 89.9%
Factual Accuracy: 80.0%
Hallucination Resistance (1.5x): 87.5%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 85.7%
Consistency: 100.0%
Temporal Awareness: 70.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 90.0%
Practical Task Completion: 100.0%
Self-Transparency: 100.0%
Gemini 2.0 Flash - 89.6%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 75.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 60.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 100.0%
Practical Task Completion: 100.0%
Self-Transparency: 81.2%

Methodology

This benchmark evaluates 7 LLMs across 11 dimensions using 65 questions. Each model's overall score is a weighted average of its per-dimension scores: Hallucination Resistance and Source Honesty are weighted 1.5x due to their critical importance for reliability, and the remaining nine dimensions are weighted 1.0x.
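The scoring described above can be sketched as a weighted mean. The snippet below is an illustrative reconstruction, not the benchmark's actual code: the function name and data layout are assumptions, while the dimension scores are taken from the Claude Opus 4.5 card above. Applying 1.5x weights to Hallucination Resistance and Source Honesty reproduces the published 95.6% total.

```python
def overall_score(scores: dict[str, float], weighted_dims: set[str]) -> float:
    """Weighted mean of dimension scores: dimensions in `weighted_dims`
    count 1.5x, all others 1.0x. Result is rounded to one decimal place."""
    total = 0.0
    weight_sum = 0.0
    for dim, score in scores.items():
        w = 1.5 if dim in weighted_dims else 1.0
        total += w * score
        weight_sum += w
    return round(total / weight_sum, 1)

# Per-dimension scores for Claude Opus 4.5, from the detailed results above.
opus = {
    "Factual Accuracy": 100.0,
    "Hallucination Resistance": 100.0,
    "Uncertainty Acknowledgment": 100.0,
    "Instruction Following": 57.1,
    "Reasoning Under Ambiguity": 100.0,
    "Consistency": 100.0,
    "Temporal Awareness": 100.0,
    "Source Honesty": 100.0,
    "Nuance & Calibration": 100.0,
    "Practical Task Completion": 90.0,
    "Self-Transparency": 100.0,
}
weighted = {"Hallucination Resistance", "Source Honesty"}

print(overall_score(opus, weighted))  # 95.6, matching the ranking table
```

The same formula also reproduces the other totals (e.g. GPT-4o's per-dimension scores yield 94.4%), which supports reading the overall score as this weighted mean.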