Model Rankings (Evaluation Correct Rate - data_sql)

Model                             Correct Evaluations  Total Evaluated  Correct Rate
deepseek-r1                       18                   18               100.0%
gemini-2.5-flash-preview          18                   18               100.0%
gpt-4.1                           18                   18               100.0%
o4-mini                           12                   12               100.0%
qwen3-32b                         18                   18               100.0%
o4-mini-high                      11                   12               91.7%
llama-4-maverick                  16                   18               88.9%
qwen3-14b                         16                   18               88.9%
qwen3-30b-a3b                     16                   18               88.9%
gpt-4.1-mini                      15                   18               83.3%
gpt-4o-mini                       15                   18               83.3%
deepseek-prover-v2                14                   18               77.8%
llama-4-scout                     14                   18               77.8%
qwen3-8b                          14                   18               77.8%
deepseek-chat-v3-0324             13                   18               72.2%
llama-3.1-nemotron-ultra-253b-v1  13                   18               72.2%
gpt-4.1-nano                      13                   18               72.2%
qwen3-235b-a22b                   13                   18               72.2%
gemini-2.0-flash-001              12                   18               66.7%
gemini-2.5-pro-preview-03-25      4                    6                66.7%
gemma-3-27b-it                    8                    18               44.4%
llama-3.3-nemotron-super-49b-v1   4                    12               33.3%
gemma-3-12b-it                    0                    18               0.0%
phi-4-reasoning-plus              0                    18               0.0%
phi-4-reasoning                   0                    18               0.0%

Correct Rate = (Number of 'correct' evaluations for generated SQL) / (Total SQL queries evaluated) × 100%
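The formula above can be sketched in a few lines of Python. This is only an illustration, not the leaderboard's actual scoring code: the function names `correct_rate` and `format_rate` are hypothetical, and rounding to one decimal place is an assumption inferred from how the percentages are displayed.

```python
def correct_rate(correct: int, total: int) -> float:
    """Return the share of 'correct' evaluations as a percentage.

    Hypothetical helper: mirrors the Correct Rate formula,
    (correct evaluations / total SQL queries evaluated) * 100.
    """
    if total <= 0:
        # A rate is undefined when no queries were evaluated.
        raise ValueError("total must be a positive integer")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct / total * 100


def format_rate(correct: int, total: int) -> str:
    """Format the rate with one decimal place (assumed display format)."""
    return f"{correct_rate(correct, total):.1f}%"


if __name__ == "__main__":
    # Counts taken from the ranking: 11 of 12 and 16 of 18 correct.
    print(format_rate(11, 12))
    print(format_rate(16, 18))
```

Because the denominator is the number of queries actually evaluated for each model, rates for models with different totals (for example 6, 12, or 18) are not computed over the same number of queries.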