Model Rankings (Correctness Rate)

| Model | Correct Matches | Total | Correctness Rate |
|---|---|---|---|
| gemini-2.0-flash-001 | 28 | 28 | 100.0% |
| gemma-3-27b-it | 28 | 28 | 100.0% |
| o4-mini | 28 | 28 | 100.0% |
| qwen3-14b | 28 | 28 | 100.0% |
| qwen3-30b-a3b | 28 | 28 | 100.0% |
| qwen3-32b | 27 | 28 | 96.4% |
| qwen3-8b | 27 | 28 | 96.4% |
| qwen3-235b-a22b | 26 | 28 | 92.9% |
| gemini-2.5-pro-preview-03-25 | 24 | 28 | 85.7% |
| gemma-3-12b-it | 24 | 28 | 85.7% |
| llama-3.1-nemotron-ultra-253b-v1 | 24 | 28 | 85.7% |
| gpt-4.1-mini | 24 | 28 | 85.7% |
| gpt-4o-mini | 24 | 28 | 85.7% |
| o4-mini-high | 24 | 28 | 85.7% |
| gemini-2.5-flash-preview | 23 | 28 | 82.1% |
| gpt-4.1 | 23 | 28 | 82.1% |
| deepseek-prover-v2 | 19 | 28 | 67.9% |
| llama-4-maverick | 19 | 28 | 67.9% |
| gpt-4.1-nano | 19 | 28 | 67.9% |
| deepseek-r1 | 18 | 28 | 64.3% |
| llama-4-scout | 18 | 28 | 64.3% |
| deepseek-chat-v3-0324 | 9 | 28 | 32.1% |
| phi-4-reasoning-plus | 0 | 28 | 0.0% |
| phi-4-reasoning | 0 | 28 | 0.0% |
| llama-3.3-nemotron-super-49b-v1 | 0 | 28 | 0.0% |

Correctness Rate = (number of responses whose predicted layers correctly match the ground truth, across all queries and runs) / (total possible responses) × 100%
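
As a minimal sketch of this calculation (the function name and input format here are illustrative, not part of the benchmark code), assume one boolean per (query, run) response indicating whether its layers matched the ground truth:

```python
def correctness_rate(matches: list[bool]) -> float:
    """Percentage of responses whose predicted layers match the ground truth.

    `matches` holds one boolean per (query, run) response:
    True if that response's layers matched the ground truth, False otherwise.
    """
    if not matches:
        return 0.0
    return 100.0 * sum(matches) / len(matches)


# Example: 27 correct matches out of 28 total responses, as in the qwen3-32b row
results = [True] * 27 + [False]
print(f"{correctness_rate(results):.1f}%")  # 96.4%
```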