The Real Cost of One AI Conversation — and the Math Behind It
Your AI assistant has been live for one month. It handled about 10,000 conversations — shoppers asking about sizing, stock, shipping, the assistant answering every one. A good month. Then the invoice lands: $1,150. You read it twice. That is not a server, not an ad budget — that is the price of a chatbot talking.
Nothing went wrong. No bug, no abuse, no traffic spike. The assistant did exactly its job, and doing its job is what cost a four-figure sum. And here is the line that should keep you up at night: the store across the street ran the identical assistant, fielded the same 10,000 conversations, and paid $130. One-ninth the price. Not a discount, not a smarter contract — the same software, the same shoppers. The entire gap is one setting on a screen neither owner was ever shown.
This is the quiet trick in every AI pricing page. They tell you it is "cheap" — a fraction of a cent per chat — and they are technically right. But "a fraction of a cent" is the most expensive phrase in software. It hides a 10× spread, and nobody does the multiplication for you. One fraction of a cent buys the $130 month. Another buys the $1,150 one. Same three words on the page.
So let's do the multiplication. No jargon you have not met before, just the arithmetic behind a single conversation, traced through all three models that quietly bill you, until you can read any pricing page and know exactly what you are about to pay.

There are three different model bills hiding inside one conversation:
- The language model (LLM) — the part that reads the shopper and writes the replies.
- The embedding model — the part that searches your catalog.
- The eval model — the part that quietly grades the assistant's answers.
We will price each one, then add them up.
First: a "conversation" is not one question
The instinct is to think of a chat as one question and one answer. It almost never is. A real shopping conversation looks like this:
Shopper: Do you have running shoes for flat feet?
Assistant: Yes — a few good options. Are these for road or trail?
Shopper: Road, mostly.
Assistant: Then I'd look at these three… (lists products)
Shopper: Is the second one true to size?
Assistant: It runs about half a size small…
Shopper: Okay, add the size 10 to my cart.
Assistant: Done — anything else?
That excerpt is four back-and-forth turns, and plenty of real chats run to six or more. Each turn is a separate request to the language model. And here is the part that surprises people: every turn re-sends the entire conversation so far. The model has no memory between turns, so to write its fourth reply it must be handed every earlier message again, in full.
Hold onto that. It is the single biggest driver of the bill.
The unit you actually pay for: tokens
You do not pay per message or per word. You pay per token. A token is a chunk of text — roughly ¾ of a word in English. "Running shoes for flat feet" is about 6 tokens. A typical sentence is 15–25 tokens. A paragraph is 80–120.
Two rules of thumb:
- ~750 words ≈ 1,000 tokens.
- Model prices are quoted per 1,000,000 tokens (per "1M"). That large unit is why everything looks deceptively cheap — but conversations burn through tokens faster than you would guess.
And tokens come in two kinds, priced differently:
| Token type | What it is | Relative price |
|---|---|---|
| Input | Everything you send to the model: the shopper message, the conversation history, your store instructions, product data | Cheaper |
| Output | Everything the model writes back | 3–5× more expensive |
Output is the expensive one. The assistant writing a paragraph costs several times more than it reading a paragraph.
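If you want to check the rules of thumb yourself, here is a minimal sketch. The 750-words-per-1,000-tokens ratio is only the approximation given above, not a real tokenizer, and the two prices are hypothetical placeholders chosen so that output costs 3× input.

```python
# Rough helpers for the two rules of thumb above.
# Assumption: ~750 English words is about 1,000 tokens (an estimate, not a tokenizer).

def estimate_tokens(word_count: int) -> int:
    """Approximate token count for a given number of English words."""
    return round(word_count * 1000 / 750)

def cost_usd(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` at a price quoted per 1,000,000 tokens."""
    return tokens / 1_000_000 * price_per_million

# Hypothetical prices, for illustration only (output at 3x the input price).
INPUT_PRICE_PER_M = 5.00
OUTPUT_PRICE_PER_M = 15.00

paragraph_tokens = estimate_tokens(75)  # a 75-word paragraph
read_cost = cost_usd(paragraph_tokens, INPUT_PRICE_PER_M)    # model reads it
write_cost = cost_usd(paragraph_tokens, OUTPUT_PRICE_PER_M)  # model writes it

print(f"{paragraph_tokens} tokens: read ${read_cost:.6f}, write ${write_cost:.6f}")
```

Both numbers are tiny, which is the point: per-paragraph prices look like nothing until they are multiplied by turns and by conversations.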
What gets sent on every single turn
When the assistant answers one message, the request is not just that message. It is a stack:
| Piece | What it is | Typical size |
|---|---|---|
| System instructions | "You are a helpful shopping assistant for [Store]. Be concise. Never invent prices…" plus the list of actions it is allowed to take (search, add to cart, check stock) | ~2,000 tokens |
| Conversation history | Every previous message, shopper and assistant, word for word | grows each turn |
| Retrieved product data | The catalog entries pulled in to answer this question | ~800 tokens |
| The new shopper message | What they just typed | ~50 tokens |
The system instructions and retrieved data are sent fresh every turn. The history grows every turn. The shopper typed 50 tokens — but the model receives close to 3,000.
The hidden multiplier: re-sending the conversation
Let's price a six-turn chat like the one above. Assume each shopper message is ~50 tokens, each assistant reply ~150 tokens, system instructions ~2,000 tokens, and retrieved product data ~800 tokens per turn.
Because every turn re-sends everything before it, the input grows turn over turn:
| Turn | Input sent to model | Output written |
|---|---|---|
| 1 | 2,000 + 800 + 0 history + 50 = 2,850 | 150 |
| 2 | 2,000 + 800 + 200 history + 50 = 3,050 | 150 |
| 3 | 2,000 + 800 + 400 history + 50 = 3,250 | 150 |
| 4 | 2,000 + 800 + 600 history + 50 = 3,450 | 150 |
| 5 | 2,000 + 800 + 800 history + 50 = 3,650 | 150 |
| 6 | 2,000 + 800 + 1,000 history + 50 = 3,850 | 150 |
| Total | 20,100 input tokens | 900 output tokens |
Here is the headline. The shopper typed about 300 tokens of text. The conversation consumed 20,100 input tokens, roughly 67× more. The system instructions alone (2,000 × 6 turns = 12,000 tokens) account for about 60% of that input.
This is not waste — it is how the technology works. But it explains why "the messages were so short, why did it cost that much?" has a real answer.
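The turn-by-turn table can be reproduced with a short loop. This is a sketch under the same assumptions as the table (2,000 system tokens, 800 retrieved tokens, 50 per shopper message, 150 per reply); the function and constant names are made up for illustration.

```python
# Reproduce the turn table: every turn re-sends the system block, fresh
# product data, the full history so far, and the new shopper message.
SYSTEM_TOKENS = 2_000
RETRIEVED_TOKENS = 800
SHOPPER_TOKENS = 50
REPLY_TOKENS = 150

def simulate_conversation(turns: int) -> tuple[list[int], int]:
    """Return (input tokens sent on each turn, total output tokens)."""
    history = 0
    inputs = []
    for _ in range(turns):
        inputs.append(SYSTEM_TOKENS + RETRIEVED_TOKENS + history + SHOPPER_TOKENS)
        # After the reply, both messages join the history for the next turn.
        history += SHOPPER_TOKENS + REPLY_TOKENS
    return inputs, turns * REPLY_TOKENS

per_turn_input, total_output = simulate_conversation(6)
total_input = sum(per_turn_input)

print(per_turn_input)             # [2850, 3050, 3250, 3450, 3650, 3850]
print(total_input, total_output)  # 20100 900
```

Change the `6` to `30` and the history term starts to dominate, which is why long chats cost disproportionately more than short ones.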
Bill #1: the language model
Now apply price. Model pricing varies enormously, so let's use two illustrative, vendor-neutral tiers priced ten times apart:
| Tier | Input price /1M | Output price /1M |
|---|---|---|
| Flagship (top-end reasoning model) | $5.00 | $15.00 |
| Mid-tier (fast, capable, cheaper) | $0.50 | $1.50 |
Flagship model:
- Input: 20,100 ÷ 1,000,000 × $5.00 = $0.1005
- Output: 900 ÷ 1,000,000 × $15.00 = $0.0135
- LLM total: ~$0.114 — about 11–12 cents
Mid-tier model:
- Input: 20,100 ÷ 1,000,000 × $0.50 = $0.0101
- Output: 900 ÷ 1,000,000 × $1.50 = $0.0014
- LLM total: ~$0.0115 — about 1 cent
Same conversation. Same shopper. A 10× difference, decided entirely by which model the assistant runs on. That is the choice nobody puts in front of you — and it is the most important one on the page.
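The same arithmetic as a sketch you can rerun with your own provider's prices. The two tiers are the illustrative ones from the table above, not any vendor's actual price list, and `llm_cost` is a hypothetical helper name.

```python
# Price one conversation (20,100 input tokens, 900 output tokens) on each tier.
# Prices are (input, output) in dollars per 1,000,000 tokens; illustrative only.
TIERS = {
    "flagship": (5.00, 15.00),
    "mid_tier": (0.50, 1.50),
}
INPUT_TOKENS = 20_100
OUTPUT_TOKENS = 900

def llm_cost(input_tokens: int, output_tokens: int,
             input_price: float, output_price: float) -> float:
    """Dollar cost of one conversation at per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

costs = {name: llm_cost(INPUT_TOKENS, OUTPUT_TOKENS, *prices)
         for name, prices in TIERS.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.4f} per conversation")
print(f"ratio: {costs['flagship'] / costs['mid_tier']:.1f}x")
```

Swapping in real prices from a pricing page takes one line, and the ratio between tiers is usually the first number worth looking at.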
Bill #2: the embedding model (catalog search)
When the shopper asks for "running shoes for flat feet," the assistant cannot read your whole catalog every time — that would be far too many tokens. Instead it uses embeddings.
An embedding turns a piece of text into a list of numbers that captures its meaning. Products with similar meaning end up with similar numbers. To search, you embed the shopper's question and find the catalog entries whose numbers are closest. This is what lets "flat feet" surface a shoe described as "stability / motion control" even though those exact words never matched.
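To make "closest numbers" concrete, here is a toy sketch using cosine similarity, one common way to measure closeness. The three-number vectors are invented by hand for illustration; a real embedding model produces hundreds or thousands of numbers per text, and you would get them from the model rather than write them yourself.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Closeness of two vectors: near 1.0 means similar meaning, near 0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for embeddings (hypothetical values, not model output).
CATALOG = {
    "Stability road shoe (motion control)": [0.9, 0.8, 0.1],
    "Lightweight trail racer": [0.9, 0.2, 0.1],
    "Leather dress shoe": [0.1, 0.2, 0.9],
}
QUERY = [0.8, 0.9, 0.0]  # stand-in for "running shoes for flat feet"

def search(query_vec: list[float], catalog: dict[str, list[float]],
           top_k: int = 2) -> list[str]:
    """Return the top_k product names ranked by similarity to the query."""
    ranked = sorted(catalog,
                    key=lambda name: cosine_similarity(query_vec, catalog[name]),
                    reverse=True)
    return ranked[:top_k]

print(search(QUERY, CATALOG))
```

The stability shoe ranks first even though the query never says "stability" or "motion control"; the match happens on the numbers, not the words.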
Embeddings have two costs:
a) Indexing your catalog — a one-time cost. Every product description gets embedded once, then re-embedded only when it changes. For a 5,000-product store at ~200 tokens per product:
- 5,000 × 200 = 1,000,000 tokens
- Embedding models are cheap — about $0.02 per 1M tokens
- Indexing the entire catalog: ~$0.02. Two cents. Total, not per shopper.
b) Searching during the conversation — a per-conversation cost. Each search embeds only the shopper's short query. Six searches × ~50 tokens = 300 tokens:
- 300 ÷ 1,000,000 × $0.02 = $0.000006
That is six millionths of a dollar. For practical purposes, the embedding cost of a conversation is zero, and even indexing the whole catalog costs pennies. Good to know, mostly so you are not upsold on it.
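Both embedding costs fit in a few lines. This sketch uses the figures from the text (5,000 products at ~200 tokens each, six ~50-token searches, $0.02 per 1M tokens); the names are illustrative.

```python
# Embedding costs: one-time catalog indexing vs. per-conversation search.
EMBED_PRICE_PER_M = 0.02  # dollars per 1,000,000 tokens, as assumed in the text

def embedding_cost(tokens: int) -> float:
    """Dollar cost of embedding `tokens` tokens."""
    return tokens / 1_000_000 * EMBED_PRICE_PER_M

index_tokens = 5_000 * 200  # 5,000 products x ~200 tokens each
search_tokens = 6 * 50      # six searches x ~50-token query

index_cost = embedding_cost(index_tokens)
search_cost = embedding_cost(search_tokens)

print(f"index the catalog once: ${index_cost:.2f}")
print(f"search in one conversation: ${search_cost:.6f}")
```

Even a catalog ten times larger would index for about twenty cents at this price, so the conclusion does not depend on the exact product count.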
Bill #3: the eval model (quality control)
The third model is the one most store owners have never heard of. An eval model is a second, usually smaller, language model whose job is to grade the assistant's answers — automatically checking things like: did it stay on topic, did it invent a price, was it actually helpful?
You do not need this on every conversation. It is a quality-control sample — like a factory checking 1 in 20 units, not all of them. But when it runs, it is another model call, so it has a cost.
Grading one assistant reply means sending the eval model the question, the answer, and a rubric (~600 input tokens) and getting back a short verdict (~100 output tokens). Eval almost always runs on a cheap small model — say $0.15 /1M input, $0.60 /1M output.
If you graded all six turns of our conversation:
- Input: 6 × 600 = 3,600 ÷ 1M × $0.15 = $0.00054
- Output: 6 × 100 = 600 ÷ 1M × $0.60 = $0.00036
- Eval total: ~$0.0009 — under a tenth of a cent
And if you sample 1 conversation in 20 instead of grading every turn, the eval cost effectively disappears. It is a rounding error either way — but it is real, and it is the reason your assistant keeps getting better instead of quietly drifting.
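Here is the eval arithmetic with a sampling knob added, as a sketch. The token counts and prices are the ones assumed above; `sample_rate` is a hypothetical parameter for the share of conversations that get graded, and the result is an average cost per conversation across all traffic.

```python
# Eval cost per conversation, averaged over all traffic.
EVAL_INPUT_PRICE_PER_M = 0.15
EVAL_OUTPUT_PRICE_PER_M = 0.60
EVAL_INPUT_TOKENS = 600   # question + answer + rubric, per graded reply
EVAL_OUTPUT_TOKENS = 100  # short verdict, per graded reply

def eval_cost(turns_graded: int, sample_rate: float = 1.0) -> float:
    """Average eval dollars per conversation when only `sample_rate` of them are graded."""
    input_cost = turns_graded * EVAL_INPUT_TOKENS / 1_000_000 * EVAL_INPUT_PRICE_PER_M
    output_cost = turns_graded * EVAL_OUTPUT_TOKENS / 1_000_000 * EVAL_OUTPUT_PRICE_PER_M
    return (input_cost + output_cost) * sample_rate

grade_everything = eval_cost(turns_graded=6)
grade_one_in_twenty = eval_cost(turns_graded=6, sample_rate=1 / 20)

print(f"every conversation graded: ${grade_everything:.6f}")
print(f"1 in 20 graded:            ${grade_one_in_twenty:.6f}")
```

At 10,000 conversations a month that is roughly $9 for grading everything versus well under a dollar for a 1-in-20 sample.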
Putting the whole bill together
One six-turn conversation, all three models added up:
| Component | Flagship LLM | Mid-tier LLM |
|---|---|---|
| Language model | $0.1140 | $0.0115 |
| Embedding (search) | $0.000006 | $0.000006 |
| Eval (all turns graded) | $0.0009 | $0.0009 |
| Total per conversation | ~$0.115 | ~$0.013 |
The language model is the bill: about 99% of it on the flagship tier, and still over 90% on the mid-tier, where grading every turn is the only other line item large enough to notice. Embeddings are free in practice. Eval is a rounding error. So when you compare AI assistants, do not get lost in feature lists about "advanced retrieval" or "evaluation pipelines." Ask which language model answers the shopper, and at what tier. That one answer sets your cost.
To make it concrete — at 10,000 conversations a month:
- Flagship: ~$1,150/month
- Mid-tier: ~$130/month
Same traffic. Same store. The gap is a model choice.
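For anyone who wants the unrounded version, this sketch adds the three bills and scales to a month. It uses the per-conversation figures computed earlier without intermediate rounding, so the mid-tier total lands a little under the rounded ~$130 quoted above (closer to $123).

```python
# Add the three per-conversation bills and scale to a month of traffic.
LLM_COST = {"flagship": 0.114, "mid_tier": 0.0114}  # unrounded LLM cost per conversation
EMBED_COST = 0.000006                               # search embeddings per conversation
EVAL_COST = 0.0009                                  # every turn graded
CONVERSATIONS_PER_MONTH = 10_000

per_conversation = {tier: llm + EMBED_COST + EVAL_COST
                    for tier, llm in LLM_COST.items()}
monthly = {tier: cost * CONVERSATIONS_PER_MONTH
           for tier, cost in per_conversation.items()}
llm_share = {tier: LLM_COST[tier] / per_conversation[tier]
             for tier in LLM_COST}

for tier in LLM_COST:
    print(f"{tier}: ${per_conversation[tier]:.4f}/conversation, "
          f"${monthly[tier]:,.0f}/month, LLM share {llm_share[tier]:.0%}")
```

Note the LLM share: on the cheaper tier, eval becomes a visible slice of a much smaller bill, which is one more reason to sample rather than grade everything.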
What actually moves the meter
If you want the bill lower without making the assistant worse, these are the five real levers — in order of impact:
- Model tier. The 10× lever. Many stores do not need a flagship model to recommend shoes; a strong mid-tier model handles ordinary shopping questions well. The best setups route — cheap model for simple chats, flagship only for genuinely hard ones.
- Prompt caching. Those 2,000-token system instructions are identical on every turn. Most providers let you cache that fixed block so re-sending it costs a fraction of full price. On a long conversation this alone can cut the input bill by half or more. Ask if your provider uses it.
- System prompt size. A bloated 5,000-token instruction block is sent on every turn of every conversation forever. Tightening it is a permanent discount.
- History trimming. A 30-turn conversation does not need turn 1 verbatim. Summarizing or dropping old turns stops the input from growing without limit.
- Eval sampling. Grade a representative sample, not every message. You get the quality signal at a fraction of the cost.
Notice what is not on the list: embeddings and search. They are cheap enough that optimizing them saves you nothing.
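Two of these levers, prompt caching and history trimming, are easy to model. The sketch below is a deliberate simplification: `cached_price_ratio` is a made-up discount (0.1 means a cached system block is billed at 10% of full price), real providers each have their own caching rules and fees, and a hard cap on history is a crude stand-in for summarizing old turns.

```python
# Model two levers: prompt caching and history trimming.
# Result is in "full-price token equivalents" so scenarios can be compared.
SYSTEM_TOKENS = 2_000
RETRIEVED_TOKENS = 800
SHOPPER_TOKENS = 50
REPLY_TOKENS = 150

def billed_input_tokens(turns, cached_price_ratio=1.0, max_history_tokens=None):
    """Input billed for one conversation, in full-price token equivalents.

    cached_price_ratio: hypothetical fraction of full price charged for the
        system block after the first turn (1.0 means no caching).
    max_history_tokens: cap on history re-sent each turn (None means no cap).
    """
    total = 0.0
    history = 0
    for turn in range(turns):
        # The first turn pays full price; later turns reuse the cached block.
        system = SYSTEM_TOKENS if turn == 0 else SYSTEM_TOKENS * cached_price_ratio
        sent_history = history if max_history_tokens is None else min(history, max_history_tokens)
        total += system + RETRIEVED_TOKENS + sent_history + SHOPPER_TOKENS
        history += SHOPPER_TOKENS + REPLY_TOKENS
    return total

baseline = billed_input_tokens(6)
with_caching = billed_input_tokens(6, cached_price_ratio=0.1)
long_chat = billed_input_tokens(30)
long_chat_trimmed = billed_input_tokens(30, max_history_tokens=1_000)

print(f"6 turns:  {baseline:,.0f} -> {with_caching:,.0f} with caching")
print(f"30 turns: {long_chat:,.0f} -> {long_chat_trimmed:,.0f} with trimmed history")
```

Under these assumptions caching trims the six-turn input bill by about 45%, and capping history takes roughly a third off a 30-turn chat; your own numbers will depend on your provider's actual caching terms.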
The takeaway for store owners
- A conversation is many model calls, not one — and each call re-sends the whole chat. A handful of short messages can be 20,000+ tokens.
- You pay per token, ~¾ of a word; output costs 3–5× more than input.
- The language model is 90–99% of the cost, depending on tier. Embedding search is effectively free; eval is a rounding error.
- The model tier is a 10× lever — roughly 1¢ vs 12¢ for the identical conversation. It is the number that matters, and it is the one rarely shown to you.
- Before you sign up, ask three questions: Which model answers shoppers? Can it route simple chats to a cheaper model? Do you use prompt caching?
Cheap is not a number. Now you have the number — and the math to check anyone else's.