Kagi LLM Benchmarking Project

Introducing the Kagi LLM Benchmarking Project, which evaluates major large language models (LLMs) on their reasoning, coding, and instruction-following capabilities.

Kagi Offline Benchmark

The Kagi "offline" Benchmark is an unpolluted benchmark to assess large language models (LLMs) on their strength without the use of tools (web search, code use, etc.). This benchmark generally favors models that use chain of thought heavily.

Unlike standard benchmarks, the tasks in this benchmark are unpublished, so they are not found in training data and cannot be "gamed" in fine-tuning. The task set changes over time (mostly getting more difficult) to better represent the current state of the art.

Last update: August 11th, 2025
Tasks: 110
Input Tokens (all tasks): 14909

NOTE: Since the July update, we have made major changes to the task set. We test much more aggressively on non-English languages, robustness to polluted tokens, noisy context, and instruction following. We also replaced our hallucination benchmarks with new tasks.

Please see the notes below the table if you find any results surprising, or get in contact with us on the user Discord.

model	%accuracy	Cost($)	time/task	tokens	TPS	provider
claude-4-opus-thinking	74.3	22.4	13.3	17058	11.0	kagi (ult)
grok-4	73.6	1.0	65.1	3660	0.5	kagi (ult)
claude-4-sonnet-thinking	73.0	5.4	14.1	17872	10.0	kagi (ult)
gpt-5	72.7	7.1	32.8	6282	1.6	kagi (ult)
o3-pro	72.1	34.2	87.8	12054	1.1	kagi (ult)
gemini-2-5-pro	70.3	1.7	20.9	13581	5.4	kagi (ult)
gpt-5-mini	70.3	4.9	26.7	10113	3.3	kagi
deepseek-r1	69.4	9.9	33.6	40707	9.8	kagi (ult)
qwen-3-235b-a22b-thinking	69.4	0.1	27.6	52601	15.8	kagi
o3	67.6	4.8	30.9	12127	3.3	kagi (ult)
o4-mini	67.6	3.1	16.0	11224	6.1	kagi
arcee-ai/maestro-reasoning	64.9	2.7	16.7	200565	103.4	openrouter
gpt-5-nano	62.2	0.4	20.5	9587	3.9	kagi
grok-3	61.3	2.6	6.0	28865	42.4	kagi (ult)
grok-3-mini	61.3	0.3	7.6	8661	9.5	kagi
claude-4-opus	59.6	8.4	5.7	19505	29.3	kagi (ult)
gpt-oss-120b	58.6	0.4	2.3	14764	54.8	kagi
gemini-2-5-flash-thinking	56.8	0.5	11.4	21799	16.7	kagi
llama-4-maverick	55.9	0.2	0.6	33516	456.3	kagi
claude-4-sonnet	55.9	1.8	5.6	20574	31.1	kagi (ult)
qwen-3-235b-a22b	55.0	0.4	11.3	119875	85.3	kagi
chatgpt-4o	54.1	2.6	2.6	42545	163.1	deprecated
gpt-oss-20b	53.2	0.5	3.3	38619	96.0	kagi
glm-4-5	52.3	5.2	62.3	41434	5.4	kagi (ult)
gpt-4-1	52.3	2.1	4.4	61935	116.2	deprecated
deepseek chat v3	52.3	0.4	3.9	36165	73.1	kagi
deepseek-r1-distill-llama-70b	52.3	1.0	4.7	10321	17.8	deprecated
qwen-3-coder	49.5	0.8	13.8	116600	67.0	kagi
gpt-4-1-mini	48.6	0.4	6.1	63524	85.9	deprecated
mistral-medium	45.9	0.3	4.7	24602	42.6	kagi
llama-3-405b	45.0	1.1	3.0	26712	77.7	deprecated
baidu/ernie-4.5-300b-a47b	45.0	0.2	7.9	35671	36.2	openrouter
kimi-k2 (see note below)	45.0	1.1	3.3	84371	201.1	kagi
gemini-2-5-flash	44.1	0.4	2.0	34698	152.3	kagi
thudm/glm-4-32b	42.3	0.1	10.9	28236	20.5	openrouter
gemini-2-5-flash-lite	40.5	0.1	2.2	43989	171.8	kagi
thedrummer/anubis-70b-v1.1	39.6	0.2	7.7	17933	19.3	openrouter
microsoft/phi-4-reasoning-plus	37.8	0.2	49.8	441209	75.6	openrouter
mistral-small	37.8	0.1	2.2	31453	120.0	kagi
gemini-flash	37.8	0.1	2.1	21882	92.4	deprecated
llama-4-scout	36.9	0.1	0.8	25965	272.4	deprecated
llama-3-70b	35.1	0.3	1.7	21423	104.1	deprecated
minimax/minimax-01	35.1	0.2	7.2	23132	27.4	openrouter
gemma-3-27b	35.1	0.1	4.3	27732	58.4	deprecated
claude-3-haiku	34.2	0.5	3.1	14891	41.1	deprecated
gpt-4-1-nano	33.3	0.1	3.7	62437	135.1	deprecated
arcee-ai/virtuoso-large	32.4	0.3	4.6	23175	43.2	openrouter
qwen-3-32b-thinking	31.5	0.2	3.1	47478	121.6	kagi
google/gemma-3n-e4b-it	31.5	0.0	18.2	27557	12.8	openrouter
qwen-3-32b	28.8	0.3	5.7	105057	146.7	kagi
cohere/command-a	28.8	1.9	11.7	39471	27.5	openrouter
thedrummer/valkyrie-49b-v1	28.8	0.1	4.1	19874	39.4	openrouter
gpt-4o-mini	28.8	0.1	2.5	36708	119.8	deprecated
ai21/jamba-large-1.7	26.1	1.3	4.4	19077	35.6	openrouter
google/gemma-3-4b-it	25.2	0.0	3.2	25891	71.6	openrouter
inception/mercury	21.6	0.2	5.8	22982	35.2	openrouter
inception/mercury-coder	20.7	0.1	5.8	17323	26.6	openrouter
bytedance/ui-tars-1.5-7b	20.7	0.0	2.2	27328	104.5	openrouter
arcee-ai/spotlight	18.9	0.0	4.1	25414	52.4	openrouter
microsoft/phi-4-multimodal-instruct	17.1	0.0	2.2	29028	105.1	openrouter
magistral-medium	16.2	22.4	105.5	1328	0.1	Mistral
ai21/jamba-mini-1.7	11.7	0.1	2.1	14598	58.2	openrouter
arcee-ai/AFM-4.5B	10.8	0.0	2.2	28208	112.4	together
magistral-small	6.3	7.4	68.3	1039	0.1	Mistral

Notes on chain of thought: Models that use chain of thought do drastically better in this benchmark. For example, grok-3-mini uses chain of thought and grok-3 does not. This is why grok-3-mini outperforms its bigger sibling in this benchmark. Some models, like kimi-k2, perform worse because our instruction-following prompts (e.g., "answer in only one word") seem to shut down their reasoning. We also test more comprehensively on languages other than English and Chinese, which seems to punish some models (Qwen3-32B).

Model Costs: Costs in the reasoning benchmark come mostly from the models' output tokens. This table's cost column is not representative of input-token-heavy tasks like web search or retrieval.

Reasoning models may not be the best choice for all tasks! Pick the model that performs best at what you intend to do. We will be including other benchmark tables (search, tool use, agentic task completion) shortly.
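One way to act on this advice is to filter the table by your own constraints before comparing models. The following is only a minimal sketch: the `shortlist` function, its thresholds, and the choice to sort by cost are illustrative assumptions, not part of the benchmark. The rows are a small sample copied from the table above.

```python
import csv
import io

# A few rows copied from the results table (tab-separated, same column order).
TABLE = """model\t%accuracy\tCost($)\ttime/task\ttokens\tTPS\tprovider
claude-4-opus-thinking\t74.3\t22.4\t13.3\t17058\t11.0\tkagi (ult)
grok-4\t73.6\t1.0\t65.1\t3660\t0.5\tkagi (ult)
gemini-2-5-pro\t70.3\t1.7\t20.9\t13581\t5.4\tkagi (ult)
qwen-3-235b-a22b-thinking\t69.4\t0.1\t27.6\t52601\t15.8\tkagi
gpt-oss-120b\t58.6\t0.4\t2.3\t14764\t54.8\tkagi
"""


def shortlist(table_text, min_accuracy, max_time_per_task):
    """Hypothetical helper: keep models meeting both limits, cheapest first."""
    rows = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    picks = []
    for row in rows:
        accuracy = float(row["%accuracy"])
        cost = float(row["Cost($)"])
        time_per_task = float(row["time/task"])
        if accuracy >= min_accuracy and time_per_task <= max_time_per_task:
            picks.append((row["model"], accuracy, cost, time_per_task))
    # Sort by cost ascending; break ties by higher accuracy.
    return sorted(picks, key=lambda p: (p[2], -p[1]))


for model, accuracy, cost, time_per_task in shortlist(TABLE, 65.0, 30.0):
    print(model, accuracy, cost, time_per_task)
```

With these example thresholds, grok-4 drops out on time per task and gpt-oss-120b on accuracy, which shows how the ranking shifts once accuracy is not the only criterion.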

Benchmark Questions

The reasoning benchmark is intended to measure the models in their capacity for self-correcting logical mistakes. This is essential for LLM features in Kagi Search. Many of the tasks are translated to other languages to assess model robustness across languages.

The tasks cover various capabilities such as chess, coding, and math:

What square is the black king on in this chess position: 1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1
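For readers unfamiliar with the notation, the position is given in FEN, where the first field lists ranks 8 down to 1, digits stand for runs of empty squares, and lowercase letters are black pieces. The sketch below shows how the question can be checked mechanically; `find_piece` is a hypothetical helper written for this illustration, not something the benchmark uses.

```python
def find_piece(fen: str, piece: str) -> list[str]:
    """Return the algebraic squares that hold `piece` in a FEN string."""
    placement = fen.split()[0]  # the first FEN field is the piece placement
    squares = []
    # Ranks are listed from 8 down to 1; files run from a to h within a rank.
    for rank_index, row in enumerate(placement.split("/")):
        file_index = 0
        for ch in row:
            if ch.isdigit():
                file_index += int(ch)  # a digit is a run of empty squares
            else:
                if ch == piece:
                    squares.append("abcdefgh"[file_index] + str(8 - rank_index))
                file_index += 1
    return squares


fen = "1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1"
print(find_piece(fen, "k"))  # lowercase "k" is the black king
```

The task is trivial for a parser, which is the point: the models are tested without tools, so they have to track the squares in their own reasoning.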

As well as one-shot pattern matching with knowledge retrieval:

Given a AZERTY keyboard layout, if HEART goes to JRSTY, what does HIGB go to?

Other tasks set common traps that exploit model overtraining on statistical text patterns.

For instance, the mention of Python trips models up (48% success rate):

Would 3.11 be a bigger number than 3.9 if I used python math libraries to compare?
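The numeric comparison itself is easy to check. The sketch below also shows one plausible reading of the trap, which is our assumption rather than something the benchmark states: next to the word "python", 3.11 and 3.9 look like version numbers, and as versions the ordering flips. The `version_key` helper is hypothetical.

```python
# Compared as numbers, 3.11 is smaller than 3.9.
print(3.11 > 3.9)      # False
print(max(3.11, 3.9))  # 3.9


def version_key(version: str) -> tuple[int, ...]:
    """Hypothetical helper: split a dotted version string into integers."""
    return tuple(int(part) for part in version.split("."))


# Compared as version strings, "3.11" comes after "3.9".
print(version_key("3.11") > version_key("3.9"))  # True
```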

This one exploits models wanting to give the classic answer to the "Surgeon's son" riddle (51% success rate):

A nurse comes to a surgeon and asks: "Sir, you are a dog. You do not hold a valid medical license. Canines cannot be in an operating room". 

She then asks: "why does the hospital keep making these mistakes? It is a riddle to me".

Why can't the surgeon operate on the boy?

Model Attention: We also include tasks that test models' propensity to get distracted by irrelevant text that tends to activate model layers.

The "background text" on this one trips up models with bad context window attention (26% success rate):

In this chart, arrows represent actions.
Verbs for the actions are in boxes with doubled lines.
The text in the background of the diagram is noise, don't mind it.
1. What is Mary doing with the apple?
2. What is Jack doing to the apple?
3. What is Charles doing to the bee?
===================================
The Follies of 1907 is a 1907 musical revue which
was conceived +---------+ and produced by Florenz.
An Apple is a |  Jack   | round, edible fruit that
is produced by+---------+ the apple tree. Apples 
cannot jump, or |  +==========+ eat, or kick, and
that is because |--|| eating || apples are fruits,
+========+ and  v  +==========+ fruits are not
|| kick || the  +---------+ kind of objects that
+========+ can  |  Apple  | <+ take actions. People
like bee |  or  +---------+  | Mary, or Jack, or the
+-----+  | guy  +---------+  | named Charles, well,
| bee | <------ | Charles |  | they can certainly
+-----+  act in +---------+  |  +============+ ways
that could be   |  +======+  |--|| Throwing || seen
as verbs, like  |--||jump||  |  +============+ eat,
throw, jump, or |  +======+  | punch, or kick. Wrong
Answers: Mary +---------+    | kicked, Jack jumped, 
Charles threw |  Mary   | ---+ and bee made honey.
More random   +---------+ text: Mary ate the apple.
Charles threw the bee. Mary ate the apple. Jack is 
the one who kicked the apple. Not Mary. She ate it.
============================

Credits

The Kagi LLM Benchmarking Project is inspired by the Wolfram LLM Benchmarking Project and the Aider LLM coding leaderboard.
