Context management and cutting your AI bill
If you run AI at any volume, most of your bill is wasted tokens. Here is how context discipline, smart model routing through OpenRouter, and cheap models like GLM bring the cost down without hurting quality.
AI tools charge by the token, which is roughly a few characters of text. Every word you send in and every word you get back is metered. For occasional use this is pennies and not worth a thought. But the moment you run AI at volume, a chatbot handling hundreds of chats, an automation touching every email, the bill grows fast, and most of it is waste. The two levers that bring it down are managing your context and choosing the right model for each job.
Why context is where the money goes
Everything the model reads counts as input tokens: your instructions, the conversation so far, and any documents you paste in. People bloat this without realising. They resend the entire history on every message, paste whole manuals when one page would do, and carry a giant system prompt on every call. You pay for all of it, every single time. Worse, stuffing the window full can make answers worse, not better, because the important detail gets lost in the noise.
Managing context: the practical moves
- Send what is relevant, not everything. Retrieve the few passages that matter instead of pasting the whole document. This is the everyday payoff of the retrieval setup covered in our guide on AI memory systems.
- Summarise long conversations. Instead of resending a hundred messages, keep a short running summary and send that. The model keeps the thread without re reading the whole transcript.
- Trim the system prompt. Say what is needed once, clearly. Long, repetitive instructions cost on every single call.
- Cap the output. Ask for concise answers and set a sensible limit, because you pay for what comes back too.
- Use prompt caching where your tools support it. If the same large block of context is reused across calls, caching lets you avoid paying full price for it every time.
The bigger lever: use the right model for the job
Not every task needs the most powerful, most expensive model. Sorting an email, tagging an enquiry or drafting a routine reply can be done well by a far cheaper one. The expensive frontier models should be saved for the genuinely hard work: nuanced reasoning, important writing, anything where a mistake is costly. Sending every task to the top model is like couriering a postcard. Matching each task to the cheapest model that does it well is called routing, and it is where the real savings live.
OpenRouter: one door to every model
OpenRouter is a service that sits in front of dozens of AI models, from OpenAI, Anthropic, Google and the open source world, behind a single connection and a single bill. Instead of wiring up each provider separately, you connect once and choose the model per task in your code. That means you can route cheap, high volume jobs to a budget model and the hard jobs to a premium one, compare prices at a glance, and fall back to another provider if one has an outage. For anyone running AI across several tasks, it takes the friction out of routing.
GLM: a cheap workhorse
GLM is a family of capable models from Zhipu AI, available through OpenRouter among others. The appeal is simple: they are strong enough for a great deal of everyday work and cost a fraction of the frontier models per token. For the high volume, lower stakes jobs, classification, first draft writing, extraction, summarising, a model like GLM can do the work for a small slice of the price. You point the heavy traffic at the cheap workhorse and keep the premium model for the moments that truly need it.
Do you need to bother?
If your AI use is light, no. Optimising a bill of a few pounds a month is a waste of your time, and the simplest setup is the right one. This matters when AI runs at volume in your business: a busy chatbot, an automation touching every order, anything making thousands of calls a month. At that scale, context discipline and smart routing are the difference between a tool that pays for itself and one that quietly eats your margin.
The principle underneath all of it is the same one we apply everywhere: spend effort where it pays back. Get the volume work onto cheap models with lean context, and put your money and your best model where it actually matters.