| Rank | Model | Overall Score | Hallucination Resistance (1.5x) | Source Honesty (1.5x) | Self-Transparency |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 95.6% | 100.0% | 100.0% | 100.0% |
| 2 | Claude Sonnet 4.5 | 95.6% | 100.0% | 100.0% | 100.0% |
| 3 | GPT-4o | 94.4% | 87.5% | 100.0% | 100.0% |
| 4 | Gemini 2.5 Flash-Lite | 93.8% | 100.0% | 100.0% | 93.8% |
| 5 | Grok 4 Fast | 90.7% | 87.5% | 100.0% | 100.0% |
| 6 | GPT-4o Mini | 89.9% | 87.5% | 100.0% | 100.0% |
| 7 | Gemini 2.0 Flash | 89.6% | 75.0% | 100.0% | 81.2% |

This benchmark evaluates 7 LLMs across 11 dimensions using 65 questions. In the overall score, Hallucination Resistance and Source Honesty are weighted 1.5x because of their critical importance for reliability.
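
The 1.5x weighting can be illustrated with a small sketch. The exact aggregation formula is not stated here, so this assumes a plain weighted mean in which every dimension other than the two named above has weight 1.0. The function name `overall_score` and the dimension keys are hypothetical, chosen only for illustration.

```python
# Assumption: the overall score is a weighted mean of per-dimension scores.
# Only the two 1.5x weights come from the text; every other dimension is
# assumed to carry weight 1.0.
WEIGHTS = {"hallucination_resistance": 1.5, "source_honesty": 1.5}
DEFAULT_WEIGHT = 1.0


def overall_score(scores: dict[str, float]) -> float:
    """Return the weighted mean of per-dimension scores (percentages, 0-100)."""
    if not scores:
        raise ValueError("scores must not be empty")
    weighted_sum = 0.0
    total_weight = 0.0
    for name, score in scores.items():
        weight = WEIGHTS.get(name, DEFAULT_WEIGHT)
        weighted_sum += weight * score
        total_weight += weight
    return weighted_sum / total_weight


# Illustrative input: the three dimension columns listed for Gemini 2.0 Flash.
example = {
    "hallucination_resistance": 75.0,
    "source_honesty": 100.0,
    "self_transparency": 81.2,
}
print(round(overall_score(example), 1))  # 85.9
```

Because the table lists only three of the 11 dimensions, this three-column result is not expected to match the Overall Score column; it only shows how a 1.5x weight pulls the mean toward the weighted dimensions.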