Cost Optimization · LLM

Two and a Half Methods to Cut LLM Token Costs

Explore some lesser-known techniques to optimize your LLM token usage and reduce costs.

January 22, 2025
8 min read
By Parmot Team

A few weeks ago, I checked in on the bill for a client's in-house LLM-based document parsing pipeline. They use it to automate a bit of drudgery around billing documentation. It turns out "just throw everything at the model" is not always a sensible path forward. Every prompt instruction, every table row, every duplicate OCR pass, every prompt expansion... it all adds up.

By the end of last month, the token spend graph looked like a meme stock pump.

Please learn from our mistakes. Here, we're sharing a few interesting (well... at least we found them interesting) ways to cut LLM token spend.


Prompt Caching for Long, Repeated Prefixes

Many calls repeat the same front-matter (system prompt, tool schemas, policies, style guide, etc.).

  • Providers now cache those repeated prefix tokens so you don't pay full price to reprocess them each time.
  • On Azure OpenAI, cached input tokens are billed at a discount for Standard deployments and can be free (100% discount) on Provisioned.
  • Caches typically persist for a few minutes of inactivity and require ≥1,024 identical leading tokens, with additional cache hits counted in 128-token increments beyond that.

You often can't shrink essential instructions, but you can avoid paying full freight for them over and over.

  • Hoist all stable content to the very start of the messages array (system/developer/tool schemas) to maximize prefix identity and cache hits.
  • Keep those sections deterministic (no timestamps/UUIDs) and long enough to cross the 1,024-token threshold; a request sketch follows this list.
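
Here's a minimal sketch of that layout using the OpenAI Python SDK. The model name and system prompt are placeholders, and the cached-token count is read from usage.prompt_tokens_details as exposed by recent SDK versions:

```python
from openai import OpenAI

client = OpenAI()

# Stable, deterministic front-matter: identical bytes on every call so the
# provider's prefix cache can match it. In practice this holds the full
# parsing instructions, policies, and style guide (>1,024 tokens).
SYSTEM_PROMPT = "You are a billing-document parser. <long, static instructions here>"

def parse_document(document_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any model with prompt caching
        messages=[
            # 1) Stable prefix first: system prompt, schemas, policies.
            {"role": "system", "content": SYSTEM_PROMPT},
            # 2) Variable content last, so it never invalidates the cached prefix.
            {"role": "user", "content": document_text},
        ],
    )
    # On repeat calls within the cache window, the discounted prefix shows up here.
    print("cached prompt tokens:", response.usage.prompt_tokens_details.cached_tokens)
    return response.choices[0].message.content
```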

Tokenizer-Aware Model & Phrasing (Choose Encodings That Yield Fewer Tokens)

Different models tokenize the same text into different token counts.

  • Newer OpenAI models (GPT-4o and the o-series) use the o200k_base encoding, which tends to split text into fewer tokens than the older cl100k_base.
  • Fewer tokens for the same prompt = lower bill.
  • The OpenAI cookbook emphasizes that pricing is by token, so counting and minimizing tokens directly reduces cost.

If model A splits your prompt into 20% fewer tokens than model B (for comparable quality), that's an immediate 20% input-token savings - before any prompt engineering.

  • Measure with a local tokenizer (e.g., tiktoken) before you ship; pick the model whose tokenizer yields fewer tokens for your domain text (see the sketch after this list).
  • Rewrite high-token phrases (numbers, dates, boilerplate) into forms that break into fewer tokens without changing meaning.
  • Always re-count after rewriting, since tokenization quirks vary by model.
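
Here's a quick way to run that comparison with tiktoken; the sample string below is just a stand-in for a representative slice of your own domain text:

```python
import tiktoken

# Stand-in for a representative sample of your own domain text.
sample = (
    "Invoice 2024-00871: net amount EUR 12,430.50, due 2025-02-15. "
    "Line items: consulting services (40h), travel reimbursement, VAT 19%."
)

# cl100k_base: GPT-4 / GPT-3.5 era; o200k_base: GPT-4o and the o-series.
counts = {
    name: len(tiktoken.get_encoding(name).encode(sample))
    for name in ("cl100k_base", "o200k_base")
}
print(counts)

# The relative difference is your input-token saving before any prompt rewriting.
diff = 1 - counts["o200k_base"] / counts["cl100k_base"]
print(f"o200k_base uses {diff:.0%} fewer tokens than cl100k_base on this sample")
```

Run it over a few hundred real documents rather than a toy string before committing to a model.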

LLMLingua-Style Prompt Compression (and its newer variants)

LLMLingua learns which tokens actually matter and deletes the rest before the call. I'm calling this a "half method" because it is lossy: it degrades the input context and can sabotage attempts to extract exact details.

  • The original paper reports up to ~20× prompt compression with little accuracy loss.
  • LLMLingua-2 distills an LLM to do faster, task-agnostic, extractive compression (typically 2-5× compression with 1.6-2.9× end-to-end speedups).
  • LongLLMLingua adapts the idea for long contexts and even improves RAG quality while using fewer tokens.

Fewer input tokens → fewer billed tokens, and often shorter outputs because the model sees less fluff.

  • Use the open-source repo (microsoft/LLMLingua) as a pre-processor in your pipeline; a sketch follows below.
  • Start conservative (e.g., 1.5 - 2× compression) on critical tasks; push higher on retrieval snippets and exemplars.
  • For RAG, compress retrieved chunks per-query (LLMLingua-2/LongLLMLingua) before assembling the final context window.
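
As a rough sketch, pre-processing retrieved chunks with the llmlingua package's PromptCompressor could look like the following. The model name, rate, and argument shapes follow the repo's examples and may differ between versions, so treat them as assumptions to verify:

```python
from llmlingua import PromptCompressor

# LLMLingua-2: a small distilled model for fast, extractive compression.
# Model name taken from the microsoft/LLMLingua examples (assumption: verify
# against the current README).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

def compress_chunks(retrieved_chunks: list[str], question: str) -> str:
    """Compress retrieved RAG chunks per query before assembling the final prompt."""
    result = compressor.compress_prompt(
        retrieved_chunks,
        question=question,
        rate=0.5,                  # ~2x compression: a conservative starting point
        force_tokens=["\n", "?"],  # keep structure-bearing tokens intact
    )
    # The result reports original vs. compressed token counts alongside the text.
    print(result["origin_tokens"], "->", result["compressed_tokens"], "tokens")
    return result["compressed_prompt"]
```

Compression quality is task-dependent, so spot-check outputs on your critical flows before pushing the rate higher.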