Journal/LLM Cost Optimization

We Analyzed 10M API Tokens: Here's Where Your Engineering Team is Wasting Money

NK
Nilesh Kumar
··7 min read
We Analyzed 10M API Tokens: Here's Where Your Engineering Team is Wasting Money
TL;DR: After analyzing over 10 million LLM API tokens, we discovered that engineering teams waste up to 35% of their budget on three distinct anti-patterns: omitting max_tokens limits, injecting redundant context into un-cached system prompts, and failing to compress JSON outputs.

What Are Token Anti-Patterns?

Token anti-patterns are coding structures and prompt designs that force an LLM to ingest or generate significantly more tokens than necessary to complete a task, directly inflating the financial cost of the API request.

Why It Matters

When developers build AI features on their local machines using $5 of credits, optimization seems unnecessary. But at scale, a bloated prompt sent 100,000 times a day becomes a massive liability. Fixing these anti-patterns can reduce a startup's monthly infrastructure bill by thousands of dollars with minimal engineering effort.

How It Works

Anti-Pattern 1: The Infinite Generation Trap

By default, models will generate tokens until they hit a natural stop sequence or their hard context limit. If you omit the max_tokensparameter, a confused model might output 2,000 words of hallucinated nonsense instead of a simple “Yes” or “No”. Because output tokens are priced 3x to 5x higher than input tokens, unconstrained generation is the fastest way to burn money — the kind of silent leak that hard budget caps exist to stop.

Anti-Pattern 2: The Kitchen Sink System Prompt

We found that 40% of input tokens consisted of massive system prompts containing rules utterly irrelevant to the specific user query. For example, injecting 3,000 tokens of “SQL Schema Guidelines” into a prompt where the user simply asked for a UI button color. Unless you are utilizing strict prompt caching, every one of those tokens is billed on every single request.

Anti-Pattern 3: Pretty-Printed JSON

When asking LLMs to return JSON structured data, many developers ask the model to format it with indentation. Every space and newline character is treated as a token. By explicitly instructing the model to return minified JSON, you can reduce output token consumption by 15–20% per request.

Practical Steps for Token Optimization

  1. Enforce max_tokens: Set a strict upper bound on generation. If you expect a summary, set it to 150. If you expect a boolean, set it to 5.
  2. Dynamically Scope Context: Use embedding searches to inject only the exact 3 or 4 paragraphs necessary to answer the prompt, rather than the entire 50-page manual.
  3. Minify Instructions: Use concise, imperative language. Remove conversational pleasantries from system prompts.

Common Mistakes

A frequent error is chaining LLM calls unnecessarily. Developers build multi-agent workflows where Model A summarizes, passes to Model B for extraction, which passes to Model C for formatting. Often, a single well-crafted prompt to GPT-4o accomplishes all three tasks in one shot, consuming 60% fewer total tokens.

FAQ

What is a token anti-pattern?

A coding or prompting practice that results in an LLM processing or generating more tokens than are actually required to solve the task at hand.

How much does pretty-printed JSON cost compared to minified JSON?

Because whitespace characters and line breaks are counted as tokens, pretty-printed JSON can consume up to 20% more output tokens than a tightly minified JSON string.

Why is the max_tokens parameter so important for budget control?

The max_tokens parameter acts as a hard circuit breaker during generation. It prevents models that get caught in recursive or hallucinated loops from generating thousands of expensive, useless output tokens.

Should I remove punctuation to save tokens?

No. Removing punctuation often confuses the model's attention mechanism, leading to poorer reasoning and higher failure rates. Optimize structure, not syntax.

Conclusion

Token optimization is the new performance tuning. Just as engineers optimize database queries to reduce CPU load, modern AI developers must ruthlessly audit their prompts and API parameters to eliminate token bloat. The first step is visibility: track your spend in real time so you can see which anti-pattern is actually costing you money.

Stop flying blind on AI costs

Frugal tracks every dollar across OpenAI, Anthropic, and more — with budget alerts before costs spiral.

Start free →