Model Rankings (Overall Success Rate)

| Model | Successes | Total | Success Rate |
|---|---|---|---|
| deepseek-r1 | 28 | 28 | 100.0% |
| gemini-2.0-flash-001 | 28 | 28 | 100.0% |
| gemini-2.5-pro-preview-03-25 | 28 | 28 | 100.0% |
| gemma-3-12b-it | 28 | 28 | 100.0% |
| gemma-3-27b-it | 28 | 28 | 100.0% |
| llama-3.1-nemotron-ultra-253b-v1 | 28 | 28 | 100.0% |
| gpt-4.1 | 28 | 28 | 100.0% |
| gpt-4.1-mini | 28 | 28 | 100.0% |
| gpt-4o-mini | 28 | 28 | 100.0% |
| o4-mini | 28 | 28 | 100.0% |
| o4-mini-high | 28 | 28 | 100.0% |
| qwen3-14b | 28 | 28 | 100.0% |
| qwen3-235b-a22b | 28 | 28 | 100.0% |
| qwen3-30b-a3b | 28 | 28 | 100.0% |
| qwen3-32b | 28 | 28 | 100.0% |
| qwen3-8b | 27 | 28 | 96.4% |
| gpt-4.1-nano | 26 | 28 | 92.9% |
| gemini-2.5-flash-preview | 25 | 28 | 89.3% |
| deepseek-prover-v2 | 23 | 28 | 82.1% |
| llama-4-scout | 22 | 28 | 78.6% |
| llama-4-maverick | 19 | 28 | 67.9% |
| deepseek-chat-v3-0324 | 9 | 28 | 32.1% |
| llama-3.3-nemotron-super-49b-v1 | 1 | 28 | 3.6% |
| phi-4-reasoning-plus | 0 | 28 | 0.0% |
| phi-4-reasoning | 0 | 28 | 0.0% |

Success Rate = (number of successful responses across all queries and runs) / (total possible responses) × 100%, where the total possible responses is 28 for every model in the table above.
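
The success-rate formula can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation harness that produced the table: the function name `success_rate` and the one-decimal rounding are assumptions chosen to match the table's formatting, and the sample rows are copied from the table.

```python
def success_rate(successes: int, total: int) -> float:
    """Return successes / total as a percentage, rounded to one decimal place.

    Hypothetical helper: the name and the rounding rule are assumptions,
    chosen so the output matches the "Success Rate" column.
    """
    if total <= 0:
        raise ValueError("total must be a positive integer")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return round(100.0 * successes / total, 1)


# A few (model, successes, total) rows copied from the table.
rows = [
    ("deepseek-r1", 28, 28),
    ("qwen3-8b", 27, 28),
    ("llama-4-maverick", 19, 28),
    ("phi-4-reasoning", 0, 28),
]

for model, successes, total in rows:
    print(f"{model}: {success_rate(successes, total):.1f}%")
```

Because every model shares the same denominator of 28, each additional failure lowers the rate by about 3.6 percentage points, which is why the non-perfect scores in the table fall on a coarse grid.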