tldr
Tokens leak from four places: input bloat, output bloat, thinking budget, and stale context. The transcript also compounds: every turn re-sends prior context, so long threads get expensive fast. The biggest structural add-ons I have used are Graphify (repo knowledge graph, fewer blind greps), claude-mem (compressed cross-session memory), and Caveman (prose compression), on top of a strict CLAUDE.md, capped thinking tokens, early compaction, and RTK. Your mileage on any percent or “x” figure will vary. None of it helps if the codebase is a mess.
If you use Claude Code daily, you already know the problem. Sessions burn through tokens faster than you’d expect, long refactors get cut off mid-thought, and regardless of what plan you’re on, the only thing under your control is how much context you generate and consume per task.
This is a working guide. Everything below is something you can apply today, ordered by effort-to-impact. No meme stack, no hype. Just the levers that actually move the meter.
The mental model: four places tokens leak
You need a map of where the budget goes. In a typical Claude Code session, tokens fall into four buckets.
Input bloat. Command output, file contents, grep hits, test logs. Raw text pouring into the context window every time you run something. This is usually the single biggest line item and the least visible.
Output bloat. Claude’s own prose. Preambles, recaps, “here’s what I did,” full-file echoes after a one-line edit, offering three alternatives when you asked for one.
Thinking bloat. Extended thinking tokens. By default, Claude Code allocates close to 32K thinking tokens per request. That default exists for backward compatibility with older Opus limits, not because your task needs it.
Context rot. Long sessions where old, now-irrelevant context sits in the window consuming budget every turn until auto-compact finally fires at 95% full. By that point, compaction itself is expensive.
Every technique below targets one of these four. When you’re choosing what to adopt, ask which bucket it drains.
Why long chats are expensive
Claude Code bills tokens, not a flat per-message count. Each new request usually includes a large slice of the conversation so far (plus system context and tool traces) on the input side. The later you are in a thread, the more you pay to stay in the same context.
If each turn adds a roughly similar amount of new material, the input cost of the session compounds with the number of turns. By the time you are dozens of messages in, a new reply can be many times more input-heavy than an early turn, depending on message length, compaction, and what the client keeps. Shorter, cleaner threads cost less than long, meandering ones.
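A back-of-the-envelope sketch of that compounding, with made-up numbers and assuming the client re-sends the full transcript on every turn:

```sh
per_turn=2000; history=0; total=0         # assume ~2K fresh tokens of new material per turn
for turn in $(seq 1 30); do
  total=$((total + history + per_turn))   # each request pays for all prior context again
  history=$((history + per_turn))
done
echo "$total"                             # 930,000 input tokens over 30 turns, vs 60,000 if each turn stood alone
```

The figures are invented; the shape is the point. Input cost grows roughly with the square of the turn count.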
Three conversation habits (free)
- Edit your last user message instead of stacking tiny follow-ups: one instruction beats ten low-signal turns.
- After about 15-20 back-and-forths, or sooner if the task pivots, start a new chat. That lines up with running /compact before the window is full (Tier 1); sometimes a new session beats rescuing a bloated log.
- When you can, batch several questions into a single user message so the model does one pass on one shared context.
Tier 1: Free wins, do these first
Measure before you optimize
Install ccusage before anything else. It reads the JSONL files Claude Code already writes locally and reports daily, session, and 5-hour-block spend. There’s a statusline integration so you see live consumption in your prompt.
npx ccusage@latest daily
npx ccusage@latest blocks --live
Without this you’re guessing. With it, every change below has a number attached.
Write a CLAUDE.md that refuses waste
Claude Code auto-loads a CLAUDE.md from your repo root (or ~/.claude/CLAUDE.md globally). This is the single most effective change you can make and it takes five minutes. The rules that matter:
- No preamble, no flattery, no “Great question.”
- No summaries of what you just did unless asked.
- Use Edit/patch. Never re-emit a whole file after a small change.
- Don’t echo file contents back after modifying them.
- Don’t offer alternatives I didn’t ask for. Pick one.
- For small tasks, skip the plan and act. For larger ones, 3-5 bullets then stop.
There’s a good token-efficient CLAUDE.md template you can drop into your repo as a starting point. Expect a 15-25% drop in output tokens from this alone.
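A minimal sketch of what those rules look like as an actual file at the repo root; the wording here is mine, not the canonical template:

```markdown
# CLAUDE.md
## Output rules
- No preamble, no flattery, no recap of what you just did.
- Use Edit for changes. Never re-emit or echo a whole file after a small modification.
- Pick one approach. Do not offer alternatives unless asked.
- Small tasks: act directly. Larger tasks: a 3-5 bullet plan, then stop and wait.
```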
Cap the thinking budget
Add one line to your ~/.zshrc (or ~/.bashrc if you’re on bash):
export MAX_THINKING_TOKENS=8000
This persists across every terminal session. Eight thousand is enough for most day-to-day work. Reserve /effort high for the hard problems: architecture decisions, subtle bugs, cross-file refactors. This is the cheapest trick on the list and the one most people don’t know exists.
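If you want the low default day-to-day but more headroom for one hard session, you can override the variable for a single launch instead of touching your rc file; this assumes the claude CLI reads the variable at startup, as the export above implies, and the 24000 is an arbitrary number for illustration:

MAX_THINKING_TOKENS=24000 claude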
Run /compact at 60%, not 95%
Auto-compact fires when your context is nearly full. By then, half the window is stale and compaction is working against a bloated transcript. Run it manually around 60% and tell it what to keep:
/compact keep: current task, file paths in play, last failing test output
You have to do it; nothing automates it for you. It is the single biggest lift to session quality, not just line-item cost. Long sessions stop sliding into a confused mess.
Tier 2: Install once, save forever
RTK: compress command output before it hits context
rtk-ai/rtk is a small Rust binary that sits between your shell and Claude Code. When the agent runs pytest, npm test, cargo build, grep, find, and 100+ other common commands, RTK intercepts the output, strips noise, and compresses it before it enters the context window. Reported reductions land in the 60-90% range on filtered commands.
curl -fsSL https://raw.githubusercontent.com/rtk-ai/rtk/master/install.sh | bash
rtk init -g
Zero dependencies, single binary, under 10ms overhead. It’s hook-transparent, so you don’t change how you invoke anything. Of everything in this guide, RTK is the one I’d call underappreciated.
Graphify: knowledge graph of the repo
Graphify (safishamsi/graphify) builds a knowledge graph of a folder (code, docs, and more) so the assistant orients by structure before spraying Grep/Glob across the tree. The PyPI package is graphifyy (double “y”); the CLI is still graphify. If you see other spellings elsewhere, this is the canonical project.
Requires Python 3.10+.
pip install graphifyy
graphify install
In the project, run /graphify . from Claude Code to build outputs under graphify-out/. Then wire the always-on path:
graphify claude install
That updates CLAUDE.md so Claude reads graphify-out/GRAPH_REPORT.md for architecture-style questions, and adds a PreToolUse hook in Claude Code’s settings that reminds the model to consult the report before heavy file search when a graph exists. Rebuild the graph when architecture shifts.
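For orientation, a PreToolUse hook in .claude/settings.json has roughly this shape. The matcher and command that graphify actually writes may differ, so treat this as an illustrative sketch and check what the installer added:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Grep|Glob",
        "hooks": [
          {
            "type": "command",
            "command": "echo 'Graph exists: consult graphify-out/GRAPH_REPORT.md before broad file search'"
          }
        ]
      }
    ]
  }
}
```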
On some tasks the drop in wasted search and narration is large, because the model stops relearning the same edges on every request. The multiplier depends on repo size, how stale the graph is, and your workflow. Graphify’s site and README are the source of truth for versioned install notes.
For Cursor in the same repo, upstream documents graphify cursor install if you want the parallel integration there.
claude-mem: cross-session memory (with a tradeoff)
thedotmack/claude-mem is a Claude Code plugin that captures what happens in sessions, compresses observations with an LLM, and injects relevant context into later sessions so you re-explain the project less. That can shrink output and repeated narrative in a fresh chat. The worker runs locally (default HTTP on port 37777), but compression uses model calls unless you point it at a cheap or free provider. Read the project docs before you treat it as free on the API line item.
Do not rely on npm install -g claude-mem alone. That path is the SDK, not the full plugin and worker.
From a normal shell:
npx claude-mem install
Or inside a Claude Code session:
/plugin marketplace add thedotmack/claude-mem
/plugin install claude-mem
Restart Claude Code afterward. Once the worker is up, hit http://127.0.0.1:37777/api/health to confirm it is responding (see upstream for the current path). Any percentage you have seen in a post was tied to a specific stack: vector search, SQLite, and your choice of compression model all move the number.
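From the terminal, that check is just (adjust the path if upstream has moved it):

curl -s http://127.0.0.1:37777/api/health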
Route subagent work to Haiku
Opus and Sonnet are expensive because they think. Subagents usually don’t need to. Grep, file reads, doc lookups, pattern-matching: that’s Haiku territory.
The simplest way to do this globally is one environment variable in your ~/.zshrc:
export CLAUDE_CODE_SUBAGENT_MODEL=haiku
This routes all subagent calls to Haiku by default. If you have custom subagents that need more horsepower, you can override per-subagent by setting model: sonnet or model: opus in that subagent’s YAML frontmatter.
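As a sketch, a custom subagent is a markdown file under .claude/agents/ with YAML frontmatter; the name, description, and body below are hypothetical, and the model field is what the per-subagent override hangs on:

```markdown
---
name: deep-reviewer
description: Reviews cross-file refactors for correctness and hidden coupling
tools: Read, Grep, Glob
model: sonnet
---
Review the changed files, trace callers of anything whose signature changed,
and report risks as a short bullet list.
```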
The price delta between Haiku and Opus is roughly an order of magnitude. Once you’ve identified which parts of your workflow are routine, the savings compound fast.
You will see big “x” and percent claims. Treat them as anecdotes until you measure. A graph that saves an order of magnitude of search on one repo might barely matter on a smaller one. claude-mem and Caveman on prose: same kind of spread. Fresh graphs, a cheap compression backend, and a tight CLAUDE.md all change the result. ccusage is how you know what you got.
Tier 3: Situational, still useful
Caveman mode, unironically
JuliusBrussee/caveman is a Claude Code skill that forces the model into caveman-speak for prose output. Code, paths, commands, and technical tokens pass through untouched; only narrative English gets compressed. Output-token reduction is reported around 65-75%.
It reads as a joke. It also works. There’s a March 2026 paper (“Brevity Constraints Reverse Performance Hierarchies in Language Models”) showing that forcing brief outputs actually improved benchmark accuracy. Useful on prose-heavy tasks like code review comments, documentation sweeps, and exploratory conversations.
npx skills add JuliusBrussee/caveman
Prefer Edit over Write; prefer partial reads over whole files
When you touch a file, Edit beats Write for a one-line change. On a 1,200-line file, read a range, not the whole file. “Just to be safe” reads are how the window fills with text the model will never use, even if your CLAUDE.md already says the right thing.
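In practice that means phrasing the request around a range; a hypothetical example, with the file and line numbers made up:

Read lines 640-720 of src/billing/invoice.ts and fix the rounding in the tax calculation. Only read beyond that range if something in it forces you to.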
A recommended rollout order
If you’re starting from zero, do it in this order. Each step’s savings are measurable against the previous baseline thanks to ccusage.
- ccusage, for baseline measurement
- CLAUDE.md rules, for output-side discipline
- MAX_THINKING_TOKENS=8000, for thinking-side discipline
- /compact at 60% and the three conversation habits in Why long chats are expensive, for transcript hygiene
- RTK, for command-output compression
- Graphify, when the codebase is large or unfamiliar (build the graph, then graphify claude install)
- claude-mem, when you want cross-session memory and accept the compression/worker tradeoffs
- Haiku routing for subagents, for model-tier discipline
- Caveman mode, for prose-heavy output
Do steps 1-4 in a single afternoon. You’ll see the shift immediately. Graphify, claude-mem, and the rest are more moving parts; add them when the baseline is already measured and the pain is real.
None of this helps if the codebase is a mess
Tools and environment variables can only compress what the model already has to read. If your codebase is poorly structured, with unclear responsibilities, hidden coupling, and 800-line files that do five things, then the model has to read more, guess more, and produce worse output. You pay for that in tokens every single session.
I wrote about this recently in AI Is Not the Bottleneck. The short version: a model working in a clean system reads fewer files, makes fewer mistakes, and generates smaller diffs. SOLID, DRY, clear folder structure, single-responsibility files. The label does not matter. The result does. A well-organized codebase is cheaper to operate with AI, the same way it’s cheaper to operate with humans.
The same applies to how you work. Break tasks into small, specific pieces before handing them to the agent. Give it one clear job, not “build me a feature.” Write a short spec. Pressure-test it before you start. Then execute incrementally, with small pushes to production and fast feedback loops. The agents that burn the most tokens are the ones given vague instructions inside large, tangled systems.
Small PRs help here too. They’re easier for people to review, easier for AI to review, and easier to verify. If a pull request sits for three days because it’s too large, that’s a workflow problem no token-saving trick will fix.
The bigger picture
Smaller context is cleaner. Models stay sharper, sessions stay coherent, and you stop waiting for a wall of text when a three-line diff would do.
The tools above are scaffolding. What actually cuts spend is saying what you mean the first time.
Quick setup scripts
These scripts add the environment variables to your shell rc, install RTK, and set up global hooks. They’re idempotent, so running them twice won’t duplicate anything.
zsh (macOS default):
curl -fsSL https://aleksandar.xyz/scripts/claude-token-setup-zsh.sh | zsh
bash:
curl -fsSL https://aleksandar.xyz/scripts/claude-token-setup-bash.sh | bash
Open a new terminal after running, or source your rc file.
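If you'd rather not pipe a script from the internet into your shell, the manual equivalent for the two environment variables covered above is roughly this (a zsh sketch; the hosted scripts also handle RTK and hooks):

grep -qs MAX_THINKING_TOKENS ~/.zshrc || echo 'export MAX_THINKING_TOKENS=8000' >> ~/.zshrc
grep -qs CLAUDE_CODE_SUBAGENT_MODEL ~/.zshrc || echo 'export CLAUDE_CODE_SUBAGENT_MODEL=haiku' >> ~/.zshrc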