Structured Response Reliability

LLM Comparison Results Summary

  • Total Runs: 4
  • Total Queries: 28
  • Total Unique Models: 25
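
One way to read these totals, assuming queries are tallied once per run (7 queries in each of 4 runs gives 28), is the minimal sketch below. It derives the three figures from a flat list of result records. The `Result` layout and the `summarize` function are hypothetical names for illustration, not taken from the tool that produced this report.

```python
from typing import NamedTuple


class Result(NamedTuple):
    run: int       # run number, e.g. 1..4
    query: str     # query text or label
    model: str     # model identifier
    success: bool  # whether the model succeeded on this query in this run


def summarize(results: list[Result]) -> dict[str, int]:
    """Compute summary counts from per-run result records."""
    runs = {r.run for r in results}
    # Assumption: a query is counted once for every run it appears in.
    queries_per_run = {(r.run, r.query) for r in results}
    models = {r.model for r in results}
    return {
        "total_runs": len(runs),
        "total_queries": len(queries_per_run),
        "total_unique_models": len(models),
    }
```

If the report instead counts distinct query texts across all runs, `queries_per_run` would be replaced by a set of `r.query` values alone.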

All Models vs Queries (Latest Run)

Columns: Query, deepseek-chat-v3-0324, deepseek-prover-v2, deepseek-r1, gemini-2.0-flash-001, gemini-2.5-flash-preview, gemini-2.5-pro-preview-03-25, gemma-3-12b-it, gemma-3-27b-it, llama-4-maverick, llama-4-scout, phi-4-reasoning-plus, phi-4-reasoning, llama-3.1-nemotron-ultra-253b-v1, llama-3.3-nemotron-super-49b-v1, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o-mini, o4-mini, o4-mini-high, qwen3-14b, qwen3-235b-a22b, qwen3-30b-a3b, qwen3-32b, qwen3-8b
Query 1
Query 2
Query 3
Query 4
Query 5
Query 6
Query 7

Combined Run Success Table

Columns: Query, deepseek-chat-v3-0324, deepseek-prover-v2, deepseek-r1, gemini-2.0-flash-001, gemini-2.5-flash-preview, gemini-2.5-pro-preview-03-25, gemma-3-12b-it, gemma-3-27b-it, llama-4-maverick, llama-4-scout, phi-4-reasoning-plus, phi-4-reasoning, llama-3.1-nemotron-ultra-253b-v1, llama-3.3-nemotron-super-49b-v1, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o-mini, o4-mini, o4-mini-high, qwen3-14b, qwen3-235b-a22b, qwen3-30b-a3b, qwen3-32b, qwen3-8b
All the available datasets
All the available datasets in Halifax
Number of boat facilities in Halifax
Number of litter bin and boat facilities within Sambro, Halifax
Number of litter bins within Sambro, Halifax
Total number of communities in Halifax Regional Municipality
World's most populous city

Legend:

  • Success marker: a "success" for that model/query in a run.
  • Failure marker: a "failure" for that model/query in a run.
  • Empty marker: no data for that run.
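
The three-state legend above can be sketched as a small lookup in which each (model, query) cell holds one entry per run: a success, a failure, or no data. The Python below is only an illustration. The `Status` enum, the `combine_runs` function, and the input layout are hypothetical, and the placeholder values are made up for the example rather than taken from the runs reported here.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    NO_DATA = "no data"


def combine_runs(runs, models, queries):
    """Build a combined table with one Status per run for every cell.

    runs: list of dicts, one per run, mapping (model, query) -> bool.
    A missing key means the model/query pair was not recorded in that run.
    """
    table = {}
    for model in models:
        for query in queries:
            cell = []
            for run in runs:
                outcome = run.get((model, query))
                if outcome is None:
                    cell.append(Status.NO_DATA)
                elif outcome:
                    cell.append(Status.SUCCESS)
                else:
                    cell.append(Status.FAILURE)
            table[(model, query)] = cell
    return table


# Placeholder data: three runs, the last of which has no record for the pair.
example_runs = [
    {("model-a", "q1"): True},
    {("model-a", "q1"): False},
    {},
]
example_table = combine_runs(example_runs, ["model-a"], ["q1"])
```

Keeping "no data" as its own state, rather than folding it into failure, avoids penalizing a model for a run in which it was never queried.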

Run 1

Run 2

Run 3

Run 4