
Kagi LLM Benchmarking Project

Introducing the Kagi LLM Benchmarking Project, which evaluates major large language models (LLMs) on their reasoning, coding, and instruction following capabilities.

LLM Benchmarks

The Kagi LLM Benchmarking Project uses an unpolluted benchmark to assess contemporary large language models (LLMs) through diverse, challenging tasks. Unlike standard benchmarks, our tests change frequently and are mostly novel, providing a rigorous evaluation of the models' capabilities that (hopefully) lies outside what the models saw in their training data, to avoid benchmark overfitting.

Last updated April 17th, 2025. We rebuild this table often. We change the benchmark tasks every update, so scores are not comparable over time.

Note that the costs in this table are heavy on output tokens, due to the nature of the benchmark tasks. They are not representative of the cost of using these models as an agent, where the ratio of input to output tokens is much different.
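As a rough illustration of why that ratio matters, the sketch below uses hypothetical per-token prices (not any particular provider's, and not the prices behind this table) to compare a benchmark-style call, where output dominates, with an agent-style call, where input dominates:

```python
# Hypothetical per-million-token prices, chosen only to illustrate the point.
INPUT_PRICE = 1.00   # $ per 1M input tokens (assumed)
OUTPUT_PRICE = 4.00  # $ per 1M output tokens (assumed)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single API call."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# Benchmark-style call: short prompt, long (reasoning-heavy) answer -> output dominates.
print(call_cost(input_tokens=500, output_tokens=8_000))   # 0.0325
# Agent-style call: large context in the prompt, short answer -> input dominates.
print(call_cost(input_tokens=50_000, output_tokens=500))  # 0.052
```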

model | CoT | accuracy | time | cost | tokens | speed (t/s) | accuracy/$ score | accuracy/sec score
o3Y76.295022.571916056122915
claude-3-7-extended-thinkingY71.348472.205678193196328
gemini-2-5-proY68.723810.25799052526718
qwen-qwq-32bY65.947630.119943404004465538
o1Y65.445026.5521336787913
o3-miniY65.165020.52675103332012312
deepseek-r1Y64.063011.162291010713355521
o4-miniY62.275020.417464253814912
grok-3-miniN59.177840.07626582277757
deepseek-r1-distill-llama-70bY54.413810.406439163424013314
gpt-4-1N54.1713180.227515526112384
chatgpt-4oN53.098470.721271625019736
deepseekN50.343010.32012213917115716
grok-3N50.347840.922011672321546
llama-4-maverickN46.095420.04311215733910698
o1-proY44.3850259.57522628508
gpt-4-1-miniN44.0613180.0557121309167903
claude-3-7-sonnetY42.943010.30431108523614114
claude-3-opusN41.578471.943891154513214
claude-3-sonnet-v2N41.223010.2606197593215813
mistral-largeN39.421000.124151268212631739
claude-3-sonnet-v1N37.123010.33792139424610912
mistral-smallN36.471000.0038212585125954736
llama-3-405bN35.855420.3399123255421056
llama-3-70bN34.785420.1062819295353276
gemini-flashN34.43810.012873371926879
llama-4-scoutN33.435420.02634178733212696
gpt-4-turboN32.5113181.378611337110232
gpt-4o-miniN30.987840.02758228132911233
claude-3-haikuN29.638470.1232810100112403
nova-proN29.121000.157371531715318528
gemini-pro-1-5N28.993810.50243659117577
gpt-4-1-nanoN27.8313180.01078194881425812
nova-liteN26.091000.0100716421164259025
llama-3-3bN17.585420.01212245394514503
mistral-nemoN14.371000.001288719871122614

Reasoning models are denoted in the CoT column. They are optimized for multi-step reasoning and often produce better results on reasoning benchmarks, at the expense of latency and cost. They may not be suitable for all general-purpose LLM tasks.

The table includes metrics such as overall model quality (measured as the percentage of correct responses), total tokens output (some models are less verbose by default, affecting both cost and speed), total cost to run the test, and average speed in tokens per second at the time of testing.

The accuracy/sec and accuracy/$ scores are accuracy normalized by speed and cost, respectively. Higher scores are better.
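The exact normalization is not spelled out on this page, so the sketch below should be read as one plausible interpretation rather than Kagi's actual formula: divide accuracy by cost (or by time) and min-max scale the result to 0-100.

```python
# One possible way to compute accuracy/$ and accuracy/sec scores
# (assumption: the real formula is not published here).
def normalized_scores(rows):
    """rows: list of dicts with 'accuracy' (%), 'cost' ($) and 'time' (s)."""
    per_dollar = [r["accuracy"] / r["cost"] for r in rows]
    per_second = [r["accuracy"] / r["time"] for r in rows]

    def minmax(values):
        lo, hi = min(values), max(values)
        return [round(100 * (v - lo) / (hi - lo)) for v in values]

    return minmax(per_dollar), minmax(per_second)

# Example with made-up numbers, not values from the table above:
models = [
    {"accuracy": 70.0, "cost": 2.50, "time": 500},
    {"accuracy": 55.0, "cost": 0.10, "time": 300},
]
print(normalized_scores(models))
```

Any monotone rescaling would preserve the ranking of models; only the absolute score values depend on the choice of normalization.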

This approach measures the models' potential and adaptability, with some bias towards the capabilities most important for LLM features in Kagi Search (mostly reasoning and instruction following; see the examples below).

As models get more advanced, and to prevent the test set from leaking into training data, we periodically update the benchmark with harder questions to keep a reasonable distribution of model scores.

Benchmark details

The benchmark is meant to be hard so that we can reasonably evaluate the current capabilities of LLMs.

Example questions include:

What is the capital of Finland? If it begins with the letter H, respond 'Oslo' otherwise respond 'Helsinki'.
What square is the black king on in this chess position: 1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1
Given a QWERTY keyboard layout, if HEART goes to JRSTY, what does HIGB go to?
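To give a sense of the reasoning involved, the keyboard question appears to encode a "shift one key to the right" pattern (HEART maps to JRSTY that way); a short sketch of that mapping, assuming the standard three-row QWERTY layout:

```python
# Standard three-row QWERTY layout (letters only).
ROWS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]

# Map every letter to the key immediately to its right on the same row.
SHIFT_RIGHT = {row[i]: row[i + 1] for row in ROWS for i in range(len(row) - 1)}

def shift(word: str) -> str:
    return "".join(SHIFT_RIGHT[c] for c in word.upper())

print(shift("HEART"))  # JRSTY, matching the clue given in the question
print(shift("HIGB"))   # JOHN
```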

Credits

The Kagi LLM Benchmarking Project is inspired by the Wolfram LLM Benchmarking Project and the Aider LLM coding leaderboard.