I like to see stats like that, but I find it very concerning that OpenRouter doesn't mind inspecting its users'/customers' data without shame.
Even if you pretend that the classifier respects anonymity: if I'm paying for the inference, I would expect it to be a closed pipe with my privacy respected.
If it were at least for "safety" checks, I wouldn't like it but I could almost understand; here it's just so they can collect "marketing data".
Imagine (and given the state of the world, it might come soon) that WhatsApp or Telegram inspected all the messages you send in order to publish reports like:
- 20% of our users speak about their health issues
- 30% of messages are about annoying coworkers
- 15% are messages comparing dick sizes
stingraycharles 51 minutes ago [-]
They explicitly give you a discount if you opt in to allowing your data to be used for (anonymized) analytics. That’s pretty fair imho.
heliumtera 16 minutes ago [-]
>I would expect it to be a closed pipe with my privacy respected
Lol hahaha
lukev 3 hours ago [-]
Super interesting data.
I do question this finding:
> the small model category as a whole is seeing its share of usage decline.
It's important to remember that this data is from OpenRouter... an API service. Small models are exactly the ones that can be self-hosted.
It could be the case that total small model usage has actually grown, but people are self-hosting rather than using an API. OpenRouter would not be in a position to determine this.
maikakz 3 hours ago [-]
Thank you & totally agree! The findings are purely observational through OpenRouter’s lens, so they naturally reflect usage on the platform, not the entire ecosystem.
majdalsado 2 hours ago [-]
Very interesting how Singapore ranks 2nd in terms of token volume. I wonder if this is potentially Chinese usage via VPN, or if Singaporean consumers and firms are dominating in AI adoption.
Also interesting how the 'roleplaying' category is so dominant; it makes me wonder whether Google's classifier sees a system prompt with "Act as a X" and classifies that as roleplay rather than as the specific industry the roleplay was intended to serve.
trebligdivad 1 hour ago [-]
The 'Glass slipper' idea makes sense to me; people have a bunch of different ideas to try on AIs, try them as new models come out, and once a model does one well they stick with it for a while.
m0rde 2 hours ago [-]
> The noticeable spike [~20 percentage points] in May in the figure above [tool invocations] was largely attributable to one sizable account whose activity briefly lifted overall volumes.
The fact that one account can have such a noticeable effect on token usage is kind of insane. And also raises the question of how much token usage is coming from just one or five or ten sizeable accounts.
syspec 3 hours ago [-]
According to the report, 52% of all open-source AI is used for *roleplaying*.
They attribute it to fewer content filters and higher creativity.
I'm pretty surprised by that, but I guess that also selects for people who would use OpenRouter.
raincole 3 hours ago [-]
If you rely on AI to write most of your code (instead of using it like Stack Overflow), Claude Code / OpenAI Codex subscriptions are cheaper than buying tokens. So those users are not on OpenRouter.
djfergus 3 hours ago [-]
I'm curious what percentage of Claude/Codex users this is true for - I assumed their business models rely on it not being true for the majority.
bakugo 2 hours ago [-]
Both Claude Code and Codex steer you towards the monthly subscription. Last time I tried Codex, I remember several aspects of it being straight up broken if used with an API key instead of a subscription account.
The business model is likely built upon the assumption that most people aren't going to max out their limits every day, because if they were, it likely wouldn't be profitable.
IMTDb 2 hours ago [-]
Or maybe it's just strange classification. I see a lot of prompts on the internet that look like "act as a senior xxx expert with over 15 years of industry experience and answer the following: [insert simple question]".
I hope those are not classified as "roleplaying"; the "roleplay" here is just a trick to get better answers from the model, often in a professional setting that has nothing to do with creative writing or NSFW stuff.
ceroxylon 1 hour ago [-]
That also stuck out to me. I was wondering if it was video games using OpenRouter for uptime / inference switching; video games would use a lot of tokens generating dialogue for a few programmers' villages.
djfergus 3 hours ago [-]
OpenRouter has an apps tab. If you look at the free, non-coding models, some apps that feature are janitor.ai, sillytavern, and chub.ai. I'd never heard of them, but people seem to be burning millions of tokens enjoying them.
bakugo 2 hours ago [-]
> I guess that also selects for people who would use OpenRouter
It definitely does. OpenRouter is pretty popular among roleplayers and creative writers due to having a wide variety of models available, sometimes providing free access to quality models such as DeepSeek, and lacking any sort of rules against generating "adult" content.
IgorPartola 1 hour ago [-]
Here is the thing: they made good enough open weight models available and affordable, then found that people used them more than before. I am not trying to diminish the value here but I don’t think this is the headline.
paulirish 1 hour ago [-]
I worry that OpenRouter's Apps leaderboard incentivizes tools (e.g. Cline/Kilo) to burn through tokens to climb the ranks, meanwhile penalizing being context-efficient.
Overall really interesting read, but I'm having trouble processing this:
> OpenRouter performs internal categorization on a random sample comprising approximately 0.25% of all prompts
How can you arrive at any conclusion with such a small random sample size?
hoppoli 1 hour ago [-]
Statistical significance comes mostly from N (number of samples) and the variance on the dimension you're trying to measure[1]. If the variance is high, you'll need higher N. If the variance is low, you'll need a lower N. The percentage of the population is not relevant (N = 1000 might be significant and it doesn't matter if it's 1% or 30% of the population)
[1] This is a simplification. I should say that it depends on the standard error of your statistic, i.e., the thing you're trying to measure (if you're estimating the max of a population, that's going to require more samples than if you're estimating the mean). This standard error, in turn, will depend on the standard deviation of the dimension you're measuring. For example, if you're estimating the mean height, the relevant quantity is the standard deviation of height in the population.
For example, even 300 truly random people is enough to correctly ascertain the distribution across the population for some measurement (say, some personality feature).
That’s the basis of all polls and what have you
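To put a rough number on that point, here's a back-of-the-envelope sketch in Python (the 1.96 factor and the p=0.5 worst case are standard textbook assumptions, not anything from the report):

    # The 95% margin of error for an estimated proportion depends on n, not
    # on what fraction of the population n represents (ignoring the tiny
    # finite-population correction).
    import math

    def margin_of_error(p: float, n: int) -> float:
        # Half-width of the 95% interval under the normal approximation.
        return 1.96 * math.sqrt(p * (1 - p) / n)

    print(margin_of_error(0.5, 1000))  # ~0.031, i.e. about +/-3.1 points,
                                       # whether n=1000 is 1% or 30% of users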
gerdesj 1 hour ago [-]
I think you might be thinking of the ~30 samples for a normal distribution and the Central Limit Theorem, and accidentally added a zero!
(OK, on rereading, you did link to a WP article about CLT, so 30 it is!)
piskov 1 hour ago [-]
You’re absolutely right! (c)
300 is the number I had in memory as a safe bet in the case of skewed stuff like log-normal, exponential, etc.
abdullahkhalids 1 hour ago [-]
Because the accuracy of an estimated quantity mostly depends on the size of the sample, not on the size of the population [1]. This does require assumptions like a somewhat homogeneous population, normal distributions, etc. However, these assumptions often hold.
[1] https://stats.stackexchange.com/questions/166/how-do-you-dec...
The open weight model data is very interesting. I missed the release of Minimax M2. The benchmarks seem insanely impressive for its size. I would suspect benchmaxing but why would people be using it if it wasn’t useful?
themanmaran 4 hours ago [-]
> The metric reflects the proportion of all tokens served by reasoning models, not the share of "reasoning tokens" within model outputs.
I'd be interested in a clarification on the reasoning vs non-reasoning metric.
Does this mean the reasoning total is (input + reasoning + output) tokens? Or is it just (input + output)?
Obviously the reasoning tokens would add a ton to the overall count. So it would be interesting to see it on an apples to apples comparison with non reasoning models.
ribosometronome 3 hours ago [-]
As would models that are overly verbose. My experience is that Claude tends to do more than is asked for (e.g. immediately moving on to creating tests and documentation) while other models like Gemini tend to be more concise in what they do.
reeeli 4 hours ago [-]
I'm out of time, but "reasoning input tokens" from Fortune 5000 engineers sounds like a lobotomized LSD dream. Would you care to elaborate on how you distinguish between reasoning and non-reasoning? vs "question on duty"?
themanmaran 4 hours ago [-]
"reasoning" models like GPT 5 et al do a pre-generation step where they:
- Take in the user query (input tokens)
- Break that into a game plan. Ex: "Based on user query: {query} generate a plan of action." (reasoning tokens)
- Answer (output tokens)
Because the reasoning step runs in a loop until it has run through its action plan, it frequently uses far more tokens than the input/output steps.
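For the curious, here is a minimal sketch of where those three buckets show up in practice, assuming an OpenAI-style SDK response; field names vary by provider, so treat it as illustrative:

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o3",  # any reasoning model
        messages=[{"role": "user", "content": "Plan a refactor of module X."}],
    )
    usage = resp.usage
    input_tokens = usage.prompt_tokens  # the user query
    # hidden "game plan" tokens, billed as output:
    reasoning_tokens = usage.completion_tokens_details.reasoning_tokens
    visible_output = usage.completion_tokens - reasoning_tokens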
reeeli 2 hours ago [-]
that was useful, thank you.
I have sooo many issues with the naming scheme of this """"AI"""" industry, it's crazy!
So the LLM gets a prompt, then creates a scheme to pull pre-weighted tokens post-user-phrasing, the constituents of which (the scheme) are called reasoning tokens, which it only explicitly distinguishes as such because there are hundreds or even thousands of output tokens to the hundreds and/or thousands of potential reasoning input tokens that were (almost) equal to the actually chosen reasoning input tokens based on the more or less adequately phrased question/prompt given ... as input ... by the user ...
IgorPartola 1 hour ago [-]
You can call them planning if you want or pre-planning. But I would encourage you to play with the API version of your model of choice to see exactly what this looks like. It’s kind of like a human’s internal monologue: “got an email from my boss asking to write unit tests for the analytics API. First I have to look at the implementation to know how exactly it actually functions, then write out what kinds of tests make sense, then implement the tests. I should write a TODO list of these steps.”
It is essentially a way to expand the prompt further. You can achieve the same exact thing by turning off the “thinking” feature and just being more detailed and step by step in your prompt but this is faster.
My guess is that the next evolution of this will be models that do an edit or review step afterwards to catch whether any of the constraints were broken. But best I can tell, a reasoning model can be approximated by doing two passes of a non-reasoning model: in the first pass you give it the user prompt with instructions that boil down to "make sense of this prompt and formulate a plan", and in the second pass you give it the original prompt, the plan, and instructions to carry out the original prompt by following the plan.
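A minimal sketch of that two-pass approximation, assuming an OpenAI-style client (the model name is just a stand-in for any non-reasoning model); this is the prompt-level imitation described above, not a claim about how reasoning models are actually trained:

    from openai import OpenAI

    client = OpenAI()

    def chat(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # stand-in for any non-reasoning model
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def two_pass(user_prompt: str) -> str:
        # Pass 1: make sense of the prompt and formulate a plan.
        plan = chat("Read the following request and write a step-by-step plan "
                    "for answering it. Do not answer it yet.\n\n" + user_prompt)
        # Pass 2: original prompt + plan, with instructions to follow the plan.
        return chat("Request:\n" + user_prompt +
                    "\n\nPlan:\n" + plan +
                    "\n\nCarry out the request by following the plan.")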
typs 4 hours ago [-]
I believe they're just classifying all models into "reasoning models" (e.g. o3) vs "non-reasoning models" (e.g. 4o) and comparing total tokens (input tokens + hidden reasoning output tokens + shown output tokens).
maikakz 4 hours ago [-]
that's exactly right!
DIAexitNode 2 hours ago [-]
hell yeah, 109 out of 10 doors opened! 99 bonus doors! what are you talking about, man?
asadm 3 hours ago [-]
Who is using grok code and why?
btbuildem 2 hours ago [-]
It was (is?) free with e.g. opencode -- so, an open-source coding agent plus a free SOTA model; it's hard to resist. That said, Grok fast is fast, but not that great when compared to the other top-tier models.
djfergus 3 hours ago [-]
It's a 1.7 trillion token free model. Why wouldn't you try it?
I've been testing free models for coding hobby projects after I burnt through way too many expensive tokens on Replit and Claude. Grok wasn't great, kept getting into loops for me. I had better results using KAT coder on opencode (also free).
verdverm 2 hours ago [-]
> Why wouldn't you try it?
Because of the people behind it, and because I have at least some standards.
joshuamcginnis 2 hours ago [-]
According to https://openrouter.ai/rankings, lots of people are using it - presumably because it performs well and provides value.
bakugo 2 hours ago [-]
Kilo Code lets people use Grok Code Fast 1 for free, using OpenRouter as the provider. And Grok 4.1 Fast was completely free directly on OpenRouter for some time after its release.
So yeah, their statistics are inflated quite a bit, since most of that usage was not paid for, or at least not by the end user.
skywhopper 2 hours ago [-]
This is interesting, but I found it moderately disturbing that they spend a LOT of effort up front talking about how they don’t have any access to the prompts or responses. And then they reveal that they did actually have access to the text and they spend 80% of the rest of the paper analyzing the content.
charcircuit 1 hour ago [-]
>And then they reveal that they did actually have access to the text
I'm not seeing that. All I'm seeing is them analyzing metadata.
nextworddev 2 hours ago [-]
*State of non-enterprise, indie AI
All this data confirms that OpenRouter’s enterprise ambitions will fail. It’s a nice product for running Chinese models tho
IgorPartola 1 hour ago [-]
They have SOTA models from OpenAI and Anthropic and Google and you can access them at a 5.5% premium. What you get is the ability to seamlessly switch between them. And also when one is down you can instantly switch to another. Whether that is valuable to you or not is use case dependent. But it isn’t without value.
What it does have, I think, is a problem that TaskRabbit had: you can hire a house cleaner through TR, but once you find a good one you can just work with them directly and save the middleman fee. So OR is great for experimenting with a ton of models to see which is the cheapest one that still performs the tasks you need, but then you no longer need OR unless it's for reliability.
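As an aside, that failover convenience is roughly a one-line change in practice. A minimal sketch against OpenRouter's OpenAI-compatible endpoint, using the fallback "models" list from their model-routing docs as I recall it (the model IDs here are just examples):

    import requests

    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},
        json={
            "model": "anthropic/claude-3.5-sonnet",  # primary choice
            "models": ["openai/gpt-4o", "google/gemini-pro-1.5"],  # fallbacks
            "messages": [{"role": "user", "content": "Hello"}],
        },
        timeout=60,
    )
    print(resp.json()["choices"][0]["message"]["content"])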
nextworddev 34 minutes ago [-]
Use LiteLLM for model routing
typs 4 hours ago [-]
This is really amazing data. Super interesting read