Model Rankings (Overall Success Rate)

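| Model | Successes | Total | Success Rate |
|---|---|---|---|
| deepseek-r1 | 28 | 28 | 100.0% |
| gemini-2.0-flash-001 | 28 | 28 | 100.0% |
| gemini-2.5-pro-preview-03-25 | 28 | 28 | 100.0% |
| gemma-3-12b-it | 28 | 28 | 100.0% |
| gemma-3-27b-it | 28 | 28 | 100.0% |
| llama-3.1-nemotron-ultra-253b-v1 | 28 | 28 | 100.0% |
| gpt-4.1 | 28 | 28 | 100.0% |
| gpt-4.1-mini | 28 | 28 | 100.0% |
| gpt-4o-mini | 28 | 28 | 100.0% |
| o4-mini | 28 | 28 | 100.0% |
| o4-mini-high | 28 | 28 | 100.0% |
| qwen3-14b | 28 | 28 | 100.0% |
| qwen3-235b-a22b | 28 | 28 | 100.0% |
| qwen3-30b-a3b | 28 | 28 | 100.0% |
| qwen3-32b | 28 | 28 | 100.0% |
| qwen3-8b | 27 | 28 | 96.4% |
| gpt-4.1-nano | 26 | 28 | 92.9% |
| gemini-2.5-flash-preview | 25 | 28 | 89.3% |
| deepseek-prover-v2 | 23 | 28 | 82.1% |
| llama-4-scout | 22 | 28 | 78.6% |
| llama-4-maverick | 19 | 28 | 67.9% |
| deepseek-chat-v3-0324 | 9 | 28 | 32.1% |
| llama-3.3-nemotron-super-49b-v1 | 1 | 28 | 3.6% |
| phi-4-reasoning-plus | 0 | 28 | 0.0% |
| phi-4-reasoning | 0 | 28 | 0.0% |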

Success Rate = (Number of successful responses across all queries and runs) / (Total possible responses) × 100%
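
As a minimal sketch of the formula above, the percentage can be reproduced from the raw counts. The function names `success_rate` and `count_outcomes`, and the shape of the per-query outcome data, are assumptions made for illustration; the source does not describe how results were stored or how the 28 responses divide into queries and runs.

```python
from typing import Dict, List, Tuple


def success_rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    return round(100.0 * successes / total, 1)


def count_outcomes(outcomes: Dict[str, List[bool]]) -> Tuple[int, int]:
    """Pool per-query run outcomes into (successes, total).

    `outcomes` maps a query id to one boolean per run (hypothetical layout).
    """
    flat = [ok for runs in outcomes.values() for ok in runs]
    return sum(flat), len(flat)


# Counts taken from the rankings table.
print(success_rate(27, 28))  # qwen3-8b -> 96.4
print(success_rate(9, 28))   # deepseek-chat-v3-0324 -> 32.1
```

Because every response is pooled across queries and runs before dividing, each response carries equal weight; averaging per-query rates would give the same figure only when every query has the same number of runs.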