
Kagi LLM Benchmarking Project

Introducing the Kagi LLM Benchmarking Project, which evaluates major large language models (LLMs) on their reasoning, coding, and instruction-following capabilities.

LLM Benchmarks

The Kagi LLM Benchmarking Project uses an unpolluted benchmark to assess contemporary large language models (LLMs) through diverse, challenging tasks. Unlike standard benchmarks, our tests change frequently and are mostly novel. This provides a rigorous evaluation of the models' capabilities on material that is (hopefully) absent from their training data, which avoids rewarding benchmark overfitting.

Last updated July 29, 2024.

| Model | Accuracy (%) | Tokens | Total Cost ($) | Median Latency (s) | Speed (tokens/sec) |
|---|---|---|---|---|---|
| OpenAI gpt-4o | 52.00 | 7482 | 0.14310 | 1.60 | 48.00 |
| Together meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 50.00 | 7767 | 0.07136 | 2.00 | 46.49 |
| Anthropic claude-3.5-sonnet-20240620 | 46.00 | 6595 | 0.12018 | 2.54 | 48.90 |
| Mistral large-latest | 44.00 | 5097 | 0.06787 | 3.08 | 18.03 |
| Groq llama-3.1-70b-versatile | 40.00 | 5190 | 0.00781 | 0.71 | 81.62 |
| Reka reka-core | 36.00 | 6966 | 0.12401 | 6.21 | 17.56 |
| OpenAI gpt-4o-mini | 34.00 | 6029 | 0.00451 | 1.64 | 36.92 |
| DeepSeek deepseek-chat | 32.00 | 7310 | 0.00304 | 4.81 | 17.20 |
| Anthropic claude-3-haiku-20240307 | 28.00 | 5642 | 0.00881 | 1.33 | 55.46 |
| Groq llama-3.1-8b-instant | 28.00 | 6628 | 0.00085 | 2.26 | 82.02 |
| DeepSeek deepseek-coder | 28.00 | 8079 | 0.00327 | 4.13 | 16.72 |
| OpenAI gpt-4 | 26.00 | 2477 | 0.33408 | 1.32 | 16.68 |
| Mistral open-mistral-nemo | 22.00 | 4135 | 0.00323 | 0.65 | 82.65 |
| Groq gemma2-9b-it | 22.00 | 4889 | 0.00249 | 1.69 | 54.39 |
| OpenAI gpt-3.5-turbo | 22.00 | 1569 | 0.01552 | 0.51 | 45.03 |
| Reka reka-edge | 20.00 | 5377 | 0.00798 | 2.02 | 46.87 |
| Reka reka-flash | 16.00 | 5738 | 0.01668 | 3.28 | 28.75 |
| GoogleGenAI gemini-1.5-pro-exp-0801 | 14.00 | 4942 | 0.26325 | 1.82 | 28.19 |
| GoogleGenAI gemini-1.5-flash | 14.00 | 5287 | 0.02777 | 3.02 | 21.16 |

The table includes the following metrics: overall model quality (measured as the percentage of correct responses), total tokens output (some models are less verbose by default, which affects both cost and speed), total cost to run the test, median response latency, and average speed in tokens per second at the time of testing.
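As a rough illustration of how per-question results could be collapsed into the five columns described above, here is a minimal Python sketch. The `QuestionResult` record, its field names, and the sample numbers are hypothetical; the page does not describe the actual harness, and the definition of speed used here (total output tokens divided by total response time) is an assumption.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class QuestionResult:
    """One model response to one benchmark question (hypothetical record layout)."""
    correct: bool        # whether the response matched the expected answer
    output_tokens: int   # tokens generated for this response
    cost_usd: float      # cost billed for this request
    latency_s: float     # time taken to receive the response, in seconds


def summarize(results: list[QuestionResult]) -> dict[str, float]:
    """Collapse per-question results into the five table columns."""
    total_tokens = sum(r.output_tokens for r in results)
    total_time = sum(r.latency_s for r in results)
    return {
        "accuracy_pct": 100.0 * sum(r.correct for r in results) / len(results),
        "tokens": total_tokens,
        "total_cost_usd": sum(r.cost_usd for r in results),
        "median_latency_s": median(r.latency_s for r in results),
        # Assumption: speed is total output tokens over total response time.
        "speed_tps": total_tokens / total_time,
    }


# Made-up sample data, only to show the shape of the computation.
results = [
    QuestionResult(True, 120, 0.0020, 2.0),
    QuestionResult(False, 80, 0.0015, 1.0),
    QuestionResult(True, 100, 0.0018, 3.0),
    QuestionResult(False, 100, 0.0017, 4.0),
]
summary = summarize(results)
print(summary)
```

The median is used for latency because a single slow response would otherwise dominate an average.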

This approach measures the models' potential and adaptability, with some bias towards capabilities essential for the LLM features in Kagi Search (mostly reasoning and instruction following; see the example questions below).

As models become more advanced, and to prevent test questions from leaking into training data, we periodically update the benchmark with harder questions so that model scores remain reasonably distributed.

Benchmark details

The benchmark is meant to be hard so that we can reasonably evaluate the current capabilities of LLMs.

Example questions include:

What is the capital of Finland? If it begins with the letter H, respond 'Oslo' otherwise respond 'Helsinki'.
What square is the black king on in this chess position: 1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1
Given a QWERTY keyboard layout, if HEART goes to JRSTY, what does HIGB go to?
section .data
    a dd 0
    b dd 0

section .text
    global _start

_start:
    mov eax, [a]
    add eax, [b]
    mov [a], eax
    mov eax, [a]
    sub eax, [b]
    mov [b], eax
    mov eax, [a]
    sub eax, [b]
    mov [a], eax

    mov eax, 60
    xor edi, edi
    syscall

What does this program do, in one sentence?
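Some of the example questions above have answers that can be checked mechanically. The Python sketch below does this for the keyboard and chess questions. It assumes the keyboard rule is "replace each letter with the key immediately to its right", which is inferred from HEART going to JRSTY rather than stated by the benchmark. The helper names `shift_right` and `find_piece` are illustrative only.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def shift_right(word: str) -> str:
    """Replace each letter with the key immediately to its right on QWERTY."""
    shifted = []
    for ch in word.lower():
        row = next(r for r in QWERTY_ROWS if ch in r)
        # Raises IndexError for the last key of a row, which has no right neighbour.
        shifted.append(row[row.index(ch) + 1])
    return "".join(shifted).upper()


def find_piece(fen: str, piece: str) -> list[str]:
    """Return the squares (such as 'e7') that hold `piece` in a FEN position."""
    placement = fen.split()[0]  # the first FEN field is the piece placement
    squares = []
    # FEN lists ranks from 8 down to 1, and files from a to h within each rank.
    for rank_offset, rank in enumerate(placement.split("/")):
        file_index = 0
        for ch in rank:
            if ch.isdigit():
                file_index += int(ch)  # a digit means that many empty squares
            else:
                if ch == piece:
                    squares.append("abcdefgh"[file_index] + str(8 - rank_offset))
                file_index += 1
    return squares


# Sanity check against the mapping given in the question itself.
assert shift_right("HEART") == "JRSTY"

fen = "1Bb3BN/R2Pk2r/1Q5B/4q2R/2bN4/4Q1BK/1p6/1bq1R1rb w - - 0 1"
print(shift_right("HIGB"))   # JOHN
print(find_piece(fen, "k"))  # ['e7']  (lowercase 'k' is the black king)
```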

LLM Pricing comparison

In addition to quality and speed, we are also interested in the cost of using contemporary LLMs.

The table below is updated to the best of our abilities. Feel free to submit changes by editing this page.

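| LLM | Context Length | Input price ($ per 1M tokens) | Output price ($ per 1M tokens) |
|---|---|---|---|
| GPT-4o | 128K | 5 | 15 |
| GPT-4o mini | 128K | 0.15 | 0.60 |
| GPT-4-Turbo | 128K | 10 | 30 |
| GPT-4 (8k) | 8K | 30 | 60 |
| GPT-4 (32k) | 32K | 60 | 120 |
| GPT-3.5-Turbo | 16K | 0.5 | 1.5 |
| Claude 3 Haiku | 200K | 0.25 | 1.25 |
| Claude 3.5 Sonnet | 200K | 3 | 15 |
| Claude 3 Opus | 200K | 15 | 75 |
| Gemini 1.5 Pro (128K/1M) | 1M | 3.50/7 | 10.50/21 |
| Gemini 1.5 Flash (128K/1M) | 1M | 0.075/0.15 | 0.3/0.6 |
| Mistral Small | 8K | 2 | 6 |
| Mistral Medium | 8K | 2.7 | 8.1 |
| Mistral Large | 8K | 8 | 24 |
| Reka Core | 128K | 10 | 25 |
| Reka Flash | 128K | 0.8 | 2 |
| Reka Edge | 128K | 0.4 | 1 |
| Cohere Command R+ | 128K | 3 | 15 |
| Cohere Command R | 128K | 0.50 | 1.50 |
| Groq Llama 3 70B | 8K | 0.59 | 0.79 |
| Groq Llama 3 8B | 8K | 0.05 | 0.10 |
| Groq Mixtral 8x7B | 32K | 0.27 | 0.27 |
| Groq Gemma 7B | 8K | 0.10 | 0.10 |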

Kagi Assistant provides access to all the models in bold. Usage is included in your Kagi subscription.
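To make the per-million-token prices in the pricing table concrete, the following Python sketch estimates the cost of a single request. The two price pairs are copied from the table. The token counts and the `request_cost` helper are hypothetical and chosen only for illustration; actual billing may differ by provider.

```python
# (input price, output price) in USD per 1M tokens, copied from the pricing table.
PRICES_PER_M = {
    "GPT-4o": (5.0, 15.0),
    "GPT-4o mini": (0.15, 0.60),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    input_price, output_price = PRICES_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request: 2,000 input tokens and 500 output tokens.
for model in PRICES_PER_M:
    print(f"{model}: ${request_cost(model, 2000, 500):.4f}")
```

With these assumed token counts, the same request costs roughly $0.0175 on GPT-4o and $0.0006 on GPT-4o mini, which shows how strongly the choice of model drives cost.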

Credits

The Kagi LLM Benchmarking Project is inspired by the Wolfram LLM Benchmarking Project and the Aider LLM coding leaderboard.