LLM Benchmark Report v1.3

Generated: 2026-02-06 11:23:13 | Models: 7 | Dimensions: 11 | Questions: 65 | Includes AI Evaluation

Key Findings

Overall Rankings

Rank  Model                 Overall Score  Hallucination (1.5x)  Source Honesty (1.5x)  Self-Transparency
1     Claude Opus 4.5       95.6%          100.0%                100.0%                 100.0%
2     Claude Sonnet 4.5     95.6%          100.0%                100.0%                 100.0%
3     GPT-4o                94.4%          87.5%                 100.0%                 100.0%
4     Gemini 2.5 Flash-Lite 93.8%          100.0%                100.0%                 93.8%
5     Grok 4 Fast           90.7%          87.5%                 100.0%                 100.0%
6     GPT-4o Mini           89.9%          87.5%                 100.0%                 100.0%
7     Gemini 2.0 Flash      89.6%          75.0%                 100.0%                 81.2%

Detailed Results by Model

Claude Opus 4.5 - 95.6%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 100.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 57.1%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 100.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 100.0%
Practical Task Completion: 90.0%
Self-Transparency: 100.0%
Claude Sonnet 4.5 - 95.6%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 100.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 57.1%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 90.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 100.0%
Practical Task Completion: 100.0%
Self-Transparency: 100.0%
GPT-4o - 94.4%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 87.5%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 100.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 90.0%
Practical Task Completion: 90.0%
Self-Transparency: 100.0%
Gemini 2.5 Flash-Lite - 93.8%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 100.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 70.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 90.0%
Practical Task Completion: 100.0%
Self-Transparency: 93.8%
Grok 4 Fast - 90.7%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 87.5%
Uncertainty Acknowledgment: 80.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 85.7%
Consistency: 100.0%
Temporal Awareness: 90.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 80.0%
Practical Task Completion: 100.0%
Self-Transparency: 100.0%
GPT-4o Mini - 89.9%
Factual Accuracy: 80.0%
Hallucination Resistance (1.5x): 87.5%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 85.7%
Consistency: 100.0%
Temporal Awareness: 70.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 90.0%
Practical Task Completion: 100.0%
Self-Transparency: 100.0%
Gemini 2.0 Flash - 89.6%
Factual Accuracy: 100.0%
Hallucination Resistance (1.5x): 75.0%
Uncertainty Acknowledgment: 100.0%
Instruction Following: 71.4%
Reasoning Under Ambiguity: 100.0%
Consistency: 100.0%
Temporal Awareness: 60.0%
Source Honesty (1.5x): 100.0%
Nuance & Calibration: 100.0%
Practical Task Completion: 100.0%
Self-Transparency: 81.2%

Methodology

This benchmark evaluates 7 LLMs across 11 dimensions using 65 questions. Each model's overall score is a weighted average of its per-dimension scores: Hallucination Resistance and Source Honesty are weighted 1.5x due to their critical importance for reliability, and the remaining nine dimensions are weighted 1.0x.
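The scoring described above can be sketched as a weighted mean. The snippet below is an illustrative reconstruction, not the benchmark's actual code: the function name and data layout are assumptions, while the dimension scores are taken from the Claude Opus 4.5 card above. Applying 1.5x weights to Hallucination Resistance and Source Honesty reproduces the published 95.6% total.

```python
def overall_score(scores: dict[str, float], weighted_dims: set[str]) -> float:
    """Weighted mean of dimension scores: dimensions in `weighted_dims`
    count 1.5x, all others 1.0x. Result is rounded to one decimal place."""
    total = 0.0
    weight_sum = 0.0
    for dim, score in scores.items():
        w = 1.5 if dim in weighted_dims else 1.0
        total += w * score
        weight_sum += w
    return round(total / weight_sum, 1)

# Per-dimension scores for Claude Opus 4.5, from the detailed results above.
opus = {
    "Factual Accuracy": 100.0,
    "Hallucination Resistance": 100.0,
    "Uncertainty Acknowledgment": 100.0,
    "Instruction Following": 57.1,
    "Reasoning Under Ambiguity": 100.0,
    "Consistency": 100.0,
    "Temporal Awareness": 100.0,
    "Source Honesty": 100.0,
    "Nuance & Calibration": 100.0,
    "Practical Task Completion": 90.0,
    "Self-Transparency": 100.0,
}
weighted = {"Hallucination Resistance", "Source Honesty"}

print(overall_score(opus, weighted))  # 95.6, matching the ranking table
```

The same formula also reproduces the other totals (e.g. GPT-4o's per-dimension scores yield 94.4%), which supports reading the overall score as this weighted mean.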