I’ve heard that a lot of teams have recently started using the number of tokens consumed as the key metric by which they measure their engineering teams.
It’s actually kind of funny that I even feel the need to write this blog post, but I did want to get it on record: I think it’s a bad metric if it’s your primary north star.
Should it be one of many metrics that you use to understand how people on your team are performing? Yes. You definitely want some observability into how your engineers (or non-engineers) are using LLMs. But gamifying it and making it THE key metric is just a recipe for disaster.
As I’m sure some companies have found out by now, there are a number of reasons why this isn’t a good idea:
- Tokens scale linearly with cost. That may not be a problem early on, but I guarantee it will be a huge problem later, when you’re paying through the nose to Anthropic and OpenAI but can’t easily switch the volume off. Token spend tends to be sticky because workflows are hard to change, especially once you have automations running that depend on tokens. It often becomes a project in itself to identify where all the cost is coming from, categorize whether that cost is worthwhile, and then figure out how to stop it, possibly by migrating systems off of LLMs.
- It’s a fast-tracked way to create an organization of Slop Cannons. If you are literally incentivizing tokens, the incentive is for people to spend them as quickly as possible. Even when they’re not outright causing outages, low-quality PRs shipped into production erode quality insidiously over time. You’re rewarding usage over everything else. More generally, tokens don’t tell you whether the work was good: a 1M-token agent run that fixes nothing looks identical on the dashboard to a 1M-token agent run that ships a hard refactor. If your north-star metric can’t distinguish those two, it’s missing a key dimension.
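To make the linear-scaling point concrete, here’s a back-of-the-envelope sketch. The price and usage numbers are hypothetical placeholders, not actual Anthropic or OpenAI rates:

```python
# Sketch of how token spend scales linearly with usage.
# PRICE_PER_MILLION_TOKENS is a hypothetical blended rate, not a real price.
PRICE_PER_MILLION_TOKENS = 10.0

def monthly_cost(tokens_per_engineer_per_day: float,
                 engineers: int,
                 workdays: int = 22) -> float:
    """Cost grows linearly in every factor: double the tokens, double the bill."""
    total_tokens = tokens_per_engineer_per_day * engineers * workdays
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# 50 engineers averaging 5M tokens/day is 5.5B tokens/month,
# which at $10/1M tokens is $55,000/month.
print(monthly_cost(5_000_000, 50))  # 55000.0
```

If a token quota pushes those per-engineer numbers up, the bill follows automatically, and there’s no term in that formula that captures whether any of the work was good.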
But I want people to use AI and to change their behavior!
Great, I do too, but the lesson I keep learning is that you can’t skip the hard work required for behavior change.
I think you should find the people who are genuinely excited to use AI, put them in charge of moving the organization forward, and let them create a wave of excitement about what’s possible now.
The handful of people on your team who are already curious will figure things out faster than any incentive program will. Pair them with engineers who haven’t had their “aha” moment yet. Let them ship something visible. Run internal demos. Share war stories about workflows that went from hours to minutes. Behavior change happens through demonstrated value, not through KPIs denominated in tokens.
The other thing worth saying: if your team isn’t using AI at the rate that you want, the problem is almost never that they need a quota. It’s usually that the tooling is rough, the workflows aren’t obvious, or nobody on the team has shown them what good looks like yet. None of those problems get solved by putting a token counter on the wall.
So what should we actually look at?
If you want metrics, look at outputs rather than inputs. Some questions I’m asking our team:
- Are we shipping more product per engineer than we were six months ago?
- Are we resolving customer issues faster?
- Are people taking on projects they wouldn’t have attempted before?
- When engineers describe their week, do they sound more energized or more drained?
Tokens are an input, and the metrics that matter are almost always outputs. Optimize an input and you’ll get more of it, but you won’t necessarily get the thing you actually wanted.
Should you watch token usage? Definitely! Use it for cost forecasting, for understanding adoption curves, for spotting people who might benefit from a nudge or some coaching. Just don’t make it the only thing that matters.