Model Rankings (Evaluation Correct Rate - data_sql)
| Model | Correct Evaluations | Total Evaluated | Correct Rate |
|---|---|---|---|
| deepseek-r1 | 18 | 18 | 100.0% |
| gemini-2.5-flash-preview | 18 | 18 | 100.0% |
| gpt-4.1 | 18 | 18 | 100.0% |
| o4-mini | 12 | 12 | 100.0% |
| qwen3-32b | 18 | 18 | 100.0% |
| o4-mini-high | 11 | 12 | 91.7% |
| llama-4-maverick | 16 | 18 | 88.9% |
| qwen3-14b | 16 | 18 | 88.9% |
| qwen3-30b-a3b | 16 | 18 | 88.9% |
| gpt-4.1-mini | 15 | 18 | 83.3% |
| gpt-4o-mini | 15 | 18 | 83.3% |
| deepseek-prover-v2 | 14 | 18 | 77.8% |
| llama-4-scout | 14 | 18 | 77.8% |
| qwen3-8b | 14 | 18 | 77.8% |
| deepseek-chat-v3-0324 | 13 | 18 | 72.2% |
| llama-3.1-nemotron-ultra-253b-v1 | 13 | 18 | 72.2% |
| gpt-4.1-nano | 13 | 18 | 72.2% |
| qwen3-235b-a22b | 13 | 18 | 72.2% |
| gemini-2.0-flash-001 | 12 | 18 | 66.7% |
| gemini-2.5-pro-preview-03-25 | 4 | 6 | 66.7% |
| gemma-3-27b-it | 8 | 18 | 44.4% |
| llama-3.3-nemotron-super-49b-v1 | 4 | 12 | 33.3% |
| gemma-3-12b-it | 0 | 18 | 0.0% |
| phi-4-reasoning-plus | 0 | 18 | 0.0% |
| phi-4-reasoning | 0 | 18 | 0.0% |
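Correct Rate = (number of generated SQL queries evaluated as 'correct') / (total SQL queries evaluated for that model) × 100%. Total Evaluated varies by model (6, 12, or 18), so rates for models with fewer evaluated queries rest on smaller samples.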
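
As a minimal sketch of the formula above, the rate can be recomputed from the two count columns. The function name `correct_rate` and the one-decimal rounding are assumptions chosen to match how the table is displayed; they are not taken from the evaluation code itself.

```python
def correct_rate(correct: int, total: int) -> float:
    """Return the correct rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return round(100.0 * correct / total, 1)


# A few (model, correct, total) rows copied from the table above.
rows = [
    ("o4-mini-high", 11, 12),
    ("gemini-2.5-pro-preview-03-25", 4, 6),
    ("gemma-3-27b-it", 8, 18),
]

rates = {model: correct_rate(correct, total) for model, correct, total in rows}

for model, rate in rates.items():
    print(f"{model}: {rate}%")
```

Because each model's denominator is its own Total Evaluated count, two models can share a rate (for example 12/18 and 4/6) while being measured on different numbers of queries.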