## Overview

This dataset contains **60 advanced mathematical problems in English**, part of the multilingual AIME25 benchmark designed to evaluate reasoning capabilities in Large Language Models. Based on AIME-level competition mathematics, these problems are solved by fewer than 5% of top high school math competition participants. Each problem has 4 multiple-choice options with one correct answer and three plausible distractors.

## What It Tests

- **Mathematical Reasoning**: Multi-step problem-solving in number theory, geometry, algebra, combinatorics, and probability
- **Logical Inference**: Complex reasoning requiring creative approaches beyond formula application
- **Numerical Precision**: Exact computation with large numbers and complex calculations

## Who Should Contribute

- **Creators & Validators**: Mathematics competition coaches (AIME/IMO), professional mathematicians, PhD students, advanced math educators
- **Reviewers**: Math professors, Olympiad committee members, assessment specialists
- **Researchers**: Teams evaluating LLM mathematical reasoning and educational AI

## Why Experts?

AIME problems are solved by fewer than 5% of top math competition participants. Contributors need deep expertise to verify correctness, ensure clarity, validate plausible wrong answers, and maintain mathematical rigor.

## Use Cases

Benchmarking LLM mathematical reasoning, evaluating educational AI, multi-step inference research, and cross-lingual assessment (when combined with the Spanish and Chinese versions). A minimal scoring sketch follows.
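Since each item is multiple-choice with a single correct option, scoring reduces to exact-match accuracy. The sketch below shows one way to compute a per-model score under assumed field names (`question`, `choices`, `answer`) and an assumed `model_answers` mapping; these are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical item schema for illustration; the dataset's actual
# field names and answer encoding may differ.
@dataclass
class Item:
    question: str
    choices: list[str]   # 4 options: one correct, three distractors
    answer: str          # letter of the correct option, e.g. "B"

def accuracy(items: list[Item], model_answers: dict[str, str]) -> float:
    """Fraction of items where the model chose the correct option.

    `model_answers` maps each question to the option letter the
    model selected.
    """
    if not items:
        return 0.0
    correct = sum(
        1 for item in items
        if model_answers.get(item.question) == item.answer
    )
    return correct / len(items)
```

A leaderboard entry such as 0.97 would then correspond to the model answering roughly 58 of the 60 prompts correctly under this exact-match scheme.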
| Metric | Value |
|---|---|
| Total Prompts | 60 |
| Scored Responses | 2,440 |
| Contributors | 9 |
| Average Overall Score | 48% |
| Rank | Model | Avg. Score | Prompts Tested | Avg. Response Time |
|---|---|---|---|---|
| 1 | x-ai/grok-4 | 0.97 | 60 | 402ms |
| 2 | google/gemini-3-pro-preview | 0.97 | 60 | 99ms |
| 3 | openai/gpt-5-codex | 0.94 | 60 | 292ms |
| 4 | google/gemini-2.5-pro | 0.90 | 60 | 216ms |
| 5 | x-ai/grok-4-fast | 0.90 | 60 | 148ms |