If you use Claude Code daily, you already know the problem. Sessions burn through tokens faster than you’d expect, long refactors get cut off mid-thought, and regardless of what plan you’re on, the only thing under your control is how much context you generate and consume per task.
This is a working guide. Everything below is something you can apply today, ordered by effort-to-impact. No meme stack, no hype. Just the levers that actually move the needle.
The mental model: four places tokens leak
Before reaching for tools, it helps to know where your budget actually goes. In a typical Claude Code session, tokens disappear into four buckets.
Input bloat. Command output, file contents, grep hits, test logs. Raw text pouring into the context window every time you run something. This is usually the single biggest line item and the least visible.
Output bloat. Claude’s own prose. Preambles, recaps, “here’s what I did,” full-file echoes after a one-line edit, offering three alternatives when you asked for one.
Thinking bloat. Extended thinking tokens. By default, Claude Code allocates close to 32K thinking tokens per request. That default exists for backward compatibility with older Opus limits, not because your task needs it.
Context rot. Long sessions where old, now-irrelevant context sits in the window consuming budget every turn until auto-compact finally fires at 95% full. By that point, compaction itself is expensive.
Every technique below targets one of these four. When you’re choosing what to adopt, ask which bucket it drains.
Tier 1: Free wins, do these first
Measure before you optimize
Install ccusage before anything else. It reads the JSONL files Claude Code already writes locally and reports daily, session, and 5-hour-block spend. There’s a statusline integration so you see live consumption in your prompt.
npx ccusage@latest daily
npx ccusage@latest blocks --live
Without this you’re guessing. With it, every change below has a number attached.
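To get the live statusline, wire ccusage into Claude Code's settings. The snippet below is one way to do it, assuming ccusage's statusline subcommand and Claude Code's statusLine setting in ~/.claude/settings.json; check the ccusage docs for your installed version:

```json
{
  "statusLine": {
    "type": "command",
    "command": "npx ccusage statusline"
  }
}
```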
Write a CLAUDE.md that refuses waste
Claude Code auto-loads a CLAUDE.md from your repo root (or ~/.claude/CLAUDE.md globally). This is the single most effective change you can make and it takes five minutes. The rules that matter:
- No preamble, no flattery, no “Great question.”
- No summaries of what you just did unless asked.
- Use Edit/patch. Never re-emit a whole file after a small change.
- Don’t echo file contents back after modifying them.
- Don’t offer alternatives I didn’t ask for. Pick one.
- For small tasks, skip the plan and act. For larger ones, 3-5 bullets then stop.
There’s a good token-efficient CLAUDE.md template you can drop into your repo as a starting point. Expect a 15-25% drop in output tokens from this alone.
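If you want something to paste right now, a minimal CLAUDE.md along these lines works; the wording is illustrative, so adapt it to your repo:

```markdown
# Response style
- No preamble, no flattery, no "Great question."
- No summaries of completed work unless asked.
- Use Edit with minimal diffs; never re-emit a whole file after a small change.
- Don't echo file contents back after modifying them.
- Pick one approach; don't offer unrequested alternatives.
- Small tasks: act immediately. Larger tasks: 3-5 plan bullets, then stop.
```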
Cap the thinking budget
Add one line to your ~/.zshrc (or ~/.bashrc if you’re on bash):
export MAX_THINKING_TOKENS=8000
This persists across every terminal session. Eight thousand is enough for most day-to-day work. Reserve /effort high for the hard problems: architecture decisions, subtle bugs, cross-file refactors. This is the cheapest trick on the list and the one most people don’t know exists.
Run /compact at 60%, not 95%
Auto-compact fires when your context is nearly full. By then, half the window is stale and compaction is working against a bloated transcript. Run it manually around 60% and tell it what to keep:
/compact keep: current task, file paths in play, last failing test output
This is a habit, not a tool. It’s also the single biggest improvement to session quality, not just cost. Long sessions stop degrading into confused soup.
Tier 2: Install once, save forever
RTK: compress command output before it hits context
rtk-ai/rtk is a small Rust binary that sits between your shell and Claude Code. When the agent runs pytest, npm test, cargo build, grep, find, and 100+ other common commands, RTK intercepts the output, strips noise, and compresses it before it enters the context window. Reported reductions land in the 60-90% range on filtered commands.
curl -fsSL https://raw.githubusercontent.com/rtk-ai/rtk/master/install.sh | bash
rtk init -g
Zero dependencies, single binary, under 10ms overhead. It’s hook-transparent, so you don’t change how you invoke anything. Of everything in this guide, RTK is the one I’d call underappreciated.
Route subagent work to Haiku
Opus and Sonnet are expensive because they think. Subagents usually don’t need to. Grep, file reads, doc lookups, pattern-matching: that’s Haiku territory.
The simplest way to do this globally is one environment variable in your ~/.zshrc:
export CLAUDE_CODE_SUBAGENT_MODEL=haiku
This routes all subagent calls to Haiku by default. If you have custom subagents that need more horsepower, you can override per-subagent by setting model: sonnet or model: opus in that subagent’s YAML frontmatter.
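A per-subagent override looks like this, assuming a custom subagent defined as a markdown file under .claude/agents/ (the subagent name and description here are hypothetical):

```markdown
---
name: code-reviewer
description: Reviews diffs for correctness and style issues.
model: sonnet
---
Review the provided diff. Flag bugs, unclear naming, and missing tests.
```

The frontmatter `model` field beats the environment variable for that one subagent; everything else still falls through to Haiku.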
The price delta between Haiku and Opus is roughly an order of magnitude. Once you’ve identified which parts of your workflow are routine, the savings compound fast.
Tier 3: Situational, still useful
Caveman mode, unironically
JuliusBrussee/caveman is a Claude Code skill that forces the model into caveman-speak for prose output. Code, paths, commands, and technical tokens pass through untouched; only narrative English gets compressed. Output-token reduction is reported around 65-75%.
It reads as a joke. It also works. There’s a March 2026 paper (“Brevity Constraints Reverse Performance Hierarchies in Language Models”) showing that forcing brief outputs actually improved benchmark accuracy. Useful on prose-heavy tasks like code review comments, documentation sweeps, and exploratory conversations.
npx skills add JuliusBrussee/caveman
Prefer Edit over Write; prefer partial reads over whole files
Two habits worth training:
- When modifying a file, insist on Edit (minimal diff) over Write (full rewrite). Your CLAUDE.md should already enforce this, but it’s worth watching for.
- When reading a 1,200-line file, read the range you care about. Reading “just to be safe” is how context fills with material the model will never use.
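The same discipline applies when you pipe file contents into context yourself: hand over a slice, not the whole file. A quick sketch with standard POSIX tools (the file here is generated for the demo; in practice you'd slice a real source file):

```shell
# Demo: build a 300-line file, then pull only lines 200-260.
seq 1 300 > /tmp/big_file.txt
sed -n '200,260p' /tmp/big_file.txt
```

Sixty-one lines enter the context instead of three hundred, and the model never sees the parts it was going to ignore anyway.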
A recommended rollout order
If you’re starting from zero, do it in this order. Each step’s savings are measurable against the previous baseline thanks to ccusage.
- ccusage, for baseline measurement
- CLAUDE.md rules, for output-side discipline
- MAX_THINKING_TOKENS=8000, for thinking-side discipline
- /compact at 60%, for context hygiene
- RTK, for input-side compression
- Haiku routing for subagents, for model-tier discipline
- Caveman mode, for prose compression
Do steps 1-4 in a single afternoon. You’ll see the shift immediately. Steps 5-6 take a little more setup but pay back within a day of normal use.
None of this helps if the codebase is a mess
Tools and environment variables can only compress what the model already has to read. If your codebase is poorly structured, with unclear responsibilities, hidden coupling, and 800-line files that do five things, then the model has to read more, guess more, and produce worse output. You pay for that in tokens every single session.
I wrote about this recently in AI Is Not the Bottleneck. The short version: a model working in a clean system reads fewer files, makes fewer mistakes, and generates smaller diffs. SOLID, DRY, clear folder structure, single-responsibility files. The label does not matter. The result does. A well-organized codebase is cheaper to operate with AI, the same way it’s cheaper to operate with humans.
The same applies to how you work. Break tasks into small, specific pieces before handing them to the agent. Give it one clear job, not “build me a feature.” Write a short spec. Pressure-test it before you start. Then execute incrementally, with small pushes to production and fast feedback loops. The agents that burn the most tokens are the ones given vague instructions inside large, tangled systems.
Small PRs help here too. They’re easier for people to review, easier for AI to review, and easier to verify. If a pull request sits for three days because it’s too large, that’s a workflow problem no token-saving trick will fix.
The bigger picture
Smaller context is cleaner, full stop. Models stay sharper, sessions stay coherent, and you stop waiting for giant responses when a three-line diff would do.
Token discipline is, more than anything, just a way of being more specific about what you’re asking for. The tools above are scaffolding for a habit that makes the agent better regardless of price.
Quick setup scripts
These scripts add the environment variables to your shell rc, install RTK, and set up global hooks. They’re idempotent, so running them twice won’t duplicate anything.
zsh (macOS default):
curl -fsSL https://aleksandar.xyz/scripts/claude-token-setup-zsh.sh | zsh
bash:
curl -fsSL https://aleksandar.xyz/scripts/claude-token-setup-bash.sh | bash
Open a new terminal after running, or source your rc file.
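If you'd rather not pipe a script from the network, the idempotent pattern those scripts rely on is easy to reproduce by hand. A sketch (not the scripts' actual contents): only append a line to your rc file if it isn't already there, so re-running never duplicates entries.

```shell
# Append a line to an rc file only if it isn't already present.
append_once() {
  line="$1"; rc="$2"
  grep -qxF -- "$line" "$rc" 2>/dev/null || printf '%s\n' "$line" >> "$rc"
}

append_once 'export MAX_THINKING_TOKENS=8000' "$HOME/.zshrc"
append_once 'export CLAUDE_CODE_SUBAGENT_MODEL=haiku' "$HOME/.zshrc"
```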