Kagi LLM Benchmarking Project

Introducing the Kagi LLM Benchmarking Project, which evaluates major large language models (LLMs) on their reasoning, coding, and instruction-following capabilities.

Kagi Offline Benchmark

The Kagi "offline" Benchmark is an unpolluted benchmark that assesses large language models (LLMs) on their raw strength, without the use of tools (web search, code execution, etc.). This benchmark generally favors models that use chain of thought heavily.

Unlike standard benchmarks, the tasks in this benchmark are unpublished, so they do not appear in training data and cannot be "gamed" during fine-tuning. The task set changes over time (mostly getting more difficult) to better represent the current state of the art.

Last update: August 11th, 2025
Tasks: 110
Input tokens (all tasks): 14,909

NOTE: Since the July update, we have made major changes to the task set. We now test much more aggressively on non-English languages, robustness to polluted tokens and noisy context, and instruction following. We also replaced our hallucination benchmarks with new tasks.

Please see the notes below the table if any results surprise you, or get in touch with us on the user Discord.

| model | accuracy (%) | cost ($) | time/task (s) | tokens | TPS | provider |
|---|---:|---:|---:|---:|---:|---|
| claude-4-opus-thinking | 74.3 | 22.4 | 13.3 | 17058 | 11.0 | kagi (ult) |
| grok-4 | 73.6 | 1.0 | 65.1 | 3660 | 0.5 | kagi (ult) |
| claude-4-sonnet-thinking | 73.0 | 5.4 | 14.1 | 17872 | 10.0 | kagi (ult) |
| gpt-5 | 72.7 | 7.1 | 32.8 | 6282 | 1.6 | kagi (ult) |
| o3-pro | 72.1 | 34.2 | 87.8 | 12054 | 1.1 | kagi (ult) |
| gemini-2-5-pro | 70.3 | 1.7 | 20.9 | 13581 | 5.4 | kagi (ult) |
| gpt-5-mini | 70.3 | 4.9 | 26.7 | 10113 | 3.3 | kagi |
| deepseek-r1 | 69.4 | 9.9 | 33.6 | 40707 | 9.8 | kagi (ult) |
| qwen-3-235b-a22b-thinking | 69.4 | 0.1 | 27.6 | 52601 | 15.8 | kagi |
| o3 | 67.6 | 4.8 | 30.9 | 12127 | 3.3 | kagi (ult) |
| o4-mini | 67.6 | 3.1 | 16.0 | 11224 | 6.1 | kagi |
| arcee-ai/maestro-reasoning | 64.9 | 2.7 | 16.7 | 200565 | 103.4 | openrouter |
| gpt-5-nano | 62.2 | 0.4 | 20.5 | 9587 | 3.9 | kagi |
| grok-3 | 61.3 | 2.6 | 6.0 | 28865 | 42.4 | kagi (ult) |
| grok-3-mini | 61.3 | 0.3 | 7.6 | 8661 | 9.5 | kagi |
| claude-4-opus | 59.6 | 8.4 | 5.7 | 19505 | 29.3 | kagi (ult) |
| gpt-oss-120b | 58.6 | 0.4 | 2.3 | 14764 | 54.8 | kagi |
| gemini-2-5-flash-thinking | 56.8 | 0.5 | 11.4 | 21799 | 16.7 | kagi |
| llama-4-maverick | 55.9 | 0.2 | 0.6 | 33516 | 456.3 | kagi |
| claude-4-sonnet | 55.9 | 1.8 | 5.6 | 20574 | 31.1 | kagi (ult) |
| qwen-3-235b-a22b | 55.0 | 0.4 | 11.3 | 119875 | 85.3 | kagi |
| chatgpt-4o | 54.1 | 2.6 | 2.6 | 42545 | 163.1 | deprecated |
| gpt-oss-20b | 53.2 | 0.5 | 3.3 | 38619 | 96.0 | kagi |
| glm-4-5 | 52.3 | 5.2 | 62.3 | 41434 | 5.4 | kagi (ult) |
| gpt-4-1 | 52.3 | 2.1 | 4.4 | 61935 | 116.2 | deprecated |
| deepseek chat v3 | 52.3 | 0.4 | 3.9 | 36165 | 73.1 | kagi |
| deepseek-r1-distill-llama-70b | 52.3 | 1.0 | 4.7 | 10321 | 17.8 | deprecated |
| qwen-3-coder | 49.5 | 0.8 | 13.8 | 116600 | 67.0 | kagi |
| gpt-4-1-mini | 48.6 | 0.4 | 6.1 | 63524 | 85.9 | deprecated |
| mistral-medium | 45.9 | 0.3 | 4.7 | 24602 | 42.6 | kagi |
| llama-3-405b | 45.0 | 1.1 | 3.0 | 26712 | 77.7 | deprecated |
| baidu/ernie-4.5-300b-a47b | 45.0 | 0.2 | 7.9 | 35671 | 36.2 | openrouter |
| kimi-k2 (see note below) | 45.0 | 1.1 | 3.3 | 84371 | 201.1 | kagi |
| gemini-2-5-flash | 44.1 | 0.4 | 2.0 | 34698 | 152.3 | kagi |
| thudm/glm-4-32b | 42.3 | 0.1 | 10.9 | 28236 | 20.5 | openrouter |
| gemini-2-5-flash-lite | 40.5 | 0.1 | 2.2 | 43989 | 171.8 | kagi |
| thedrummer/anubis-70b-v1.1 | 39.6 | 0.2 | 7.7 | 17933 | 19.3 | openrouter |
| microsoft/phi-4-reasoning-plus | 37.8 | 0.2 | 49.8 | 441209 | 75.6 | openrouter |
| mistral-small | 37.8 | 0.1 | 2.2 | 31453 | 120.0 | kagi |
| gemini-flash | 37.8 | 0.1 | 2.1 | 21882 | 92.4 | deprecated |
| llama-4-scout | 36.9 | 0.1 | 0.8 | 25965 | 272.4 | deprecated |
| llama-3-70b | 35.1 | 0.3 | 1.7 | 21423 | 104.1 | deprecated |
| minimax/minimax-01 | 35.1 | 0.2 | 7.2 | 23132 | 27.4 | openrouter |
| gemma-3-27b | 35.1 | 0.1 | 4.3 | 27732 | 58.4 | deprecated |
| claude-3-haiku | 34.2 | 0.5 | 3.1 | 14891 | 41.1 | deprecated |
| gpt-4-1-nano | 33.3 | 0.1 | 3.7 | 62437 | 135.1 | deprecated |
| arcee-ai/virtuoso-large | 32.4 | 0.3 | 4.6 | 23175 | 43.2 | openrouter |
| qwen-3-32b-thinking | 31.5 | 0.2 | 3.1 | 47478 | 121.6 | kagi |
| google/gemma-3n-e4b-it | 31.5 | 0.0 | 18.2 | 27557 | 12.8 | openrouter |
| qwen-3-32b | 28.8 | 0.3 | 5.7 | 105057 | 146.7 | kagi |
| cohere/command-a | 28.8 | 1.9 | 11.7 | 39471 | 27.5 | openrouter |
| thedrummer/valkyrie-49b-v1 | 28.8 | 0.1 | 4.1 | 19874 | 39.4 | openrouter |
| gpt-4o-mini | 28.8 | 0.1 | 2.5 | 36708 | 119.8 | deprecated |
| ai21/jamba-large-1.7 | 26.1 | 1.3 | 4.4 | 19077 | 35.6 | openrouter |
| google/gemma-3-4b-it | 25.2 | 0.0 | 3.2 | 25891 | 71.6 | openrouter |
| inception/mercury | 21.6 | 0.2 | 5.8 | 22982 | 35.2 | openrouter |
| inception/mercury-coder | 20.7 | 0.1 | 5.8 | 17323 | 26.6 | openrouter |
| bytedance/ui-tars-1.5-7b | 20.7 | 0.0 | 2.2 | 27328 | 104.5 | openrouter |
| arcee-ai/spotlight | 18.9 | 0.0 | 4.1 | 25414 | 52.4 | openrouter |
| microsoft/phi-4-multimodal-instruct | 17.1 | 0.0 | 2.2 | 29028 | 105.1 | openrouter |
| magistral-medium | 16.2 | 22.4 | 105.5 | 1328 | 0.1 | Mistral |
| ai21/jamba-mini-1.7 | 11.7 | 0.1 | 2.1 | 14598 | 58.2 | openrouter |
| arcee-ai/AFM-4.5B | 10.8 | 0.0 | 2.2 | 28208 | 112.4 | together |
| magistral-small | 6.3 | 7.4 | 68.3 | 1039 | 0.1 | Mistral |

Notes on chain of thought: Models that use chain of thought do drastically better in this benchmark. Some models, like kimi-k2, perform worse because our instruction-following prompts (e.g. "answer in only one word") seem to shut down their reasoning. We also test more comprehensively on languages other than English and Chinese, which seems to punish some models (Qwen3-32B).

Model Costs: Costs in the reasoning benchmark come mostly from the models' output tokens. This table's cost column is not representative of input-token-heavy tasks like web search or retrieval.
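
To make the output-token-dominated cost concrete, here is a minimal sketch of how a per-run cost figure like the one in the table can be estimated. The per-token prices are hypothetical placeholders, not Kagi's or any provider's actual rates.

```python
# Hypothetical prices for illustration only -- not actual provider rates.
INPUT_PRICE_PER_MTOK = 3.00    # $ per million input tokens (assumed)
OUTPUT_PRICE_PER_MTOK = 15.00  # $ per million output tokens (assumed)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one full benchmark run."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# All 110 tasks send only ~14,909 input tokens combined, while a
# chain-of-thought model can emit tens of thousands of output tokens,
# so output dominates the bill at almost any realistic price ratio.
print(f"${run_cost(14_909, 40_707):.2f}")  # ~$0.66 at these assumed rates
```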

For example, grok-3-mini uses chain of thought and grok-3 does not. This is why grok-3-mini outperforms its bigger sibling in this benchmark.

Reasoning models may not be the best choice for all tasks! Pick the model that performs best at what you intend to do. We will be including other benchmark tables (search, tool use, agentic task completion) shortly.

Benchmark Questions

The reasoning benchmark is intended to measure models' capacity to self-correct logical mistakes. This is essential for LLM features in Kagi Search. Many of the tasks are translated into other languages to assess model robustness across languages.

The tasks span various capabilities, like chess, coding, and math:

What square is the black king on in this chess position: 1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1
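
For a sense of how mechanical this task is outside an LLM, here is a minimal sketch in plain Python (no chess library assumed) that reads the black king's square directly out of the FEN string:

```python
def black_king_square(fen: str) -> str:
    """Locate the black king ('k') in a FEN piece-placement field."""
    board = fen.split()[0]                 # first FEN field: piece placement
    for rank_index, rank in enumerate(board.split("/")):  # rank 8 down to 1
        file_index = 0
        for ch in rank:
            if ch.isdigit():
                file_index += int(ch)      # digits encode runs of empty squares
            elif ch == "k":
                return "abcdefgh"[file_index] + str(8 - rank_index)
            else:
                file_index += 1
    raise ValueError("no black king in position")

fen = "1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1"
print(black_king_square(fen))  # e7
```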

As well as one-shot pattern matching with knowledge retrieval:

Given an AZERTY keyboard layout, if HEART goes to JRSTY, what does HIGB go to?
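
One plausible reading of the pattern, not confirmed by Kagi: each letter is replaced by its right-hand neighbour, but on a QWERTY layout rather than the AZERTY layout named in the prompt, with the layout mention acting as misdirection. A sketch under that assumption:

```python
# Assumption: HEART -> JRSTY means "shift each letter one key to the
# right on a QWERTY layout" (note A -> S fits QWERTY, not AZERTY).
QWERTY_ROWS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]
RIGHT_OF = {row[i]: row[i + 1]
            for row in QWERTY_ROWS
            for i in range(len(row) - 1)}

def shift_right(word: str) -> str:
    return "".join(RIGHT_OF[c] for c in word)

print(shift_right("HEART"))  # JRSTY -- reproduces the given example
print(shift_right("HIGB"))   # JOHN under this reading
```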

Other tasks set common traps that exploit model overtraining on statistical text patterns.

For instance, the mention of Python trips models up (48% success rate):

Would 3.11 be a bigger number than 3.9 if I used python math libraries to compare?
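
The trap is the gap between numeric ordering and version ordering, which two lines of Python make concrete (the third-party packaging library is used here only for contrast):

```python
# As plain numbers, 3.11 < 3.9 -- the comparison the question actually asks.
print(3.11 > 3.9)  # False

# As version strings, 3.11 is newer than 3.9; models primed on Python
# release chatter tend to answer with this interpretation instead.
from packaging.version import Version
print(Version("3.11") > Version("3.9"))  # True
```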

This one exploits models' urge to give the classic answer to the "surgeon's son" riddle (51% success rate):

A nurse comes to a surgeon and asks: "Sir, you are a dog. You do not hold a valid medical license. Canines cannot be in an operating room". 

She then asks: "why does the hospital keep making these mistakes? It is a riddle to me".

Why can't the surgeon operate on the boy?

Model Attention

We also include tasks that test models' propensity to get distracted by irrelevant text that tends to activate model layers.

The "background text" on this one trips up models with bad context window attention (26% success rate):

```
In this chart, arrows represent actions.
Verbs for the actions are in boxes with doubled lines.
The text in the background of the diagram is noise, don't mind it.
1. What is Mary doing with the apple?
2. What is Jack doing to the apple?
3. What is Charles doing to the bee?
===================================
The Follies of 1907 is a 1907 musical revue which
was conceived +---------+ and produced by Florenz.
An Apple is a |  Jack   | round, edible fruit that
is produced by+---------+ the apple tree. Apples 
cannot jump, or |  +==========+ eat, or kick, and
that is because |--|| eating || apples are fruits,
+========+ and  v  +==========+ fruits are not
|| kick || the  +---------+ kind of objects that
+========+ can  |  Apple  | <+ take actions. People
like bee |  or  +---------+  | Mary, or Jack, or the
+-----+  | guy  +---------+  | named Charles, well,
| bee | <------ | Charles |  | they can certainly
+-----+  act in +---------+  |  +============+ ways
that could be   |  +======+  |--|| Throwing || seen
as verbs, like  |--||jump||  |  +============+ eat,
throw, jump, or |  +======+  | punch, or kick. Wrong
Answers: Mary +---------+    | kicked, Jack jumped, 
Charles threw |  Mary   | ---+ and bee made honey.
More random   +---------+ text: Mary ate the apple.
Charles threw the bee. Mary ate the apple. Jack is 
the one who kicked the apple. Not Mary. She ate it.
============================
```

Credits

The Kagi LLM Benchmarking Project is inspired by the Wolfram LLM Benchmarking Project and the Aider LLM coding leaderboard.