May 18, 2026 · 9 min read

The Real Cost of One AI Conversation — and the Math Behind It

Your AI assistant has been live for one month. It handled about 10,000 conversations — shoppers asking about sizing, stock, shipping, the assistant answering every one. A good month. Then the invoice lands: $1,150. You read it twice. That is not a server, not an ad budget — that is the price of a chatbot talking.

Nothing went wrong. No bug, no abuse, no traffic spike. The assistant did exactly its job, and doing its job is what cost a four-figure sum. And here is the line that should keep you up at night: the store across the street ran the identical assistant, fielded the same 10,000 conversations, and paid $130. One-ninth the price. Not a discount, not a smarter contract — the same software, the same shoppers. The entire gap is one setting on a screen neither owner was ever shown.

This is the quiet trick in every AI pricing page. They tell you it is "cheap" — a fraction of a cent per chat — and they are technically right. But "a fraction of a cent" is the most expensive phrase in software. It hides a 10× spread, and nobody does the multiplication for you. One fraction of a cent buys the $130 month. Another buys the $1,150 one. Same three words on the page.

So let's do the multiplication. No coding required and no jargon you have not met before: just the arithmetic behind a single conversation, traced through all three models that quietly bill you, until you can read any pricing page and know exactly what you are about to pay.

Five stacks, five different heights, the same small coins. The cost of an AI conversation works exactly this way: the unit is tiny, but how high the stack climbs is decided by choices you control.

There are three different model bills hiding inside one conversation:

The language model (LLM) — the part that reads the shopper and writes the replies.
The embedding model — the part that searches your catalog.
The eval model — the part that quietly grades the assistant's answers.

We will price each one, then add them up.

The shape of the bill. Every shopper turn runs through the language model; that model reaches out to an embedding model to search your catalog and an eval model to grade itself. Three models, three line items — but as you will see, they are wildly unequal.

First: a "conversation" is not one question

The instinct is to think of a chat as one question and one answer. It almost never is. A real shopping conversation looks like this:

Shopper: Do you have running shoes for flat feet?

Assistant: Yes — a few good options. Are these for road or trail?

Shopper: Road, mostly.

Assistant: Then I'd look at these three… (lists products)

Shopper: Is the second one true to size?

Assistant: It runs about half a size small…

Shopper: Okay, add the size 10 to my cart.

Assistant: Done — anything else?

The excerpt shows four exchanges, and real chats often run longer; the worked example below assumes six turns. Each turn is a separate request to the language model. And here is the part that surprises people: every turn re-sends the entire conversation so far. The model has no memory between turns. To answer message #6, it must be handed messages #1 through #5 again, in full.

Hold onto that. It is the single biggest driver of the bill.
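For readers who like to see the mechanism, here is a minimal sketch of that re-sending loop. It assumes the common chat format of a list of role/content messages; `fake_model` is a hypothetical stand-in for a real model call and only returns a canned reply.

```python
# Hypothetical stand-in for a real model API call; it only returns a canned reply.
def fake_model(messages):
    return f"(reply after reading {len(messages)} messages)"

# Assumed message format: one dict per message with a role and its text.
SYSTEM = {"role": "system", "content": "You are a helpful shopping assistant."}

def run_chat(shopper_messages):
    """Send each shopper message together with the full history so far."""
    history = []        # everything said so far, shopper and assistant
    sent_per_turn = []  # how many messages each request carried
    for text in shopper_messages:
        history.append({"role": "user", "content": text})
        request = [SYSTEM] + history  # the whole conversation, every time
        sent_per_turn.append(len(request))
        reply = fake_model(request)
        history.append({"role": "assistant", "content": reply})
    return sent_per_turn

turns = [
    "Do you have running shoes for flat feet?",
    "Road, mostly.",
    "Is the second one true to size?",
    "Okay, add the size 10 to my cart.",
]
print(run_chat(turns))  # [2, 4, 6, 8]
```

The request grows by two messages per turn, because the model is handed its own earlier replies as well as the shopper's words.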

The unit you actually pay for: tokens

You do not pay per message or per word. You pay per token. A token is a chunk of text — roughly ¾ of a word in English. "Running shoes for flat feet" is about 6 tokens. A typical sentence is 15–25 tokens. A paragraph is 80–120.

Two rules of thumb:

~750 words ≈ 1,000 tokens.
Model prices are quoted per 1,000,000 tokens (per "1M"). That large unit is why everything looks deceptively cheap — but conversations burn through tokens faster than you would guess.

And tokens come in two kinds, priced differently:

Token type	What it is	Relative price
Input	Everything you send to the model: the shopper message, the conversation history, your store instructions, product data	Cheaper
Output	Everything the model writes back	3–5× the input price

Output is the expensive one. The assistant writing a paragraph costs several times more than it reading a paragraph.
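The two rules of thumb can be sketched as a pair of small functions. The prices in the example ($1.00 input, $4.00 output per 1M tokens) are made-up placeholders chosen only to sit inside the 3–5× range, not any vendor's rates.

```python
WORDS_PER_1K_TOKENS = 750  # rule of thumb from the text: ~750 words is ~1,000 tokens

def estimate_tokens(words):
    """Rough token count for a given number of English words."""
    return round(words * 1000 / WORDS_PER_1K_TOKENS)

def cost_usd(tokens, price_per_million):
    """Dollar cost of a token count at a price quoted per 1,000,000 tokens."""
    return tokens / 1_000_000 * price_per_million

# Placeholder prices for illustration only (assumption, not real rates).
INPUT_PRICE = 1.00
OUTPUT_PRICE = 4.00

paragraph = estimate_tokens(75)  # a 75-word paragraph is roughly 100 tokens
print(f"reading it: ${cost_usd(paragraph, INPUT_PRICE):.6f}")
print(f"writing it: ${cost_usd(paragraph, OUTPUT_PRICE):.6f}")
```

Swap in the per-1M prices from any pricing page to get that provider's cost for the same paragraph.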

What gets sent on every single turn

When the assistant answers one message, the request is not just that message. It is a stack:

Piece	What it is	Typical size
System instructions	"You are a helpful shopping assistant for [Store]. Be concise. Never invent prices…" plus the list of actions it is allowed to take (search, add to cart, check stock)	~2,000 tokens
Conversation history	Every previous message, shopper and assistant, word for word	grows each turn
Retrieved product data	The catalog entries pulled in to answer this question	~800 tokens
The new shopper message	What they just typed	~50 tokens

The system instructions and retrieved data are sent fresh every turn. The history grows every turn. The shopper typed 50 tokens — but the model receives close to 3,000.
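A quick sketch of that stack for the first turn, using the typical sizes from the table. The sizes are the article's estimates, not measurements from a real system.

```python
# Typical sizes from the table above, in tokens, for the first turn (no history yet).
pieces = {
    "system instructions": 2000,
    "conversation history": 0,
    "retrieved product data": 800,
    "new shopper message": 50,
}

total = sum(pieces.values())
for name, tokens in pieces.items():
    print(f"{name:<24}{tokens:>6}  {tokens / total:6.1%}")
print(f"{'total':<24}{total:>6}")

# Share of the request that the shopper actually typed.
shopper_share = pieces["new shopper message"] / total
```

On these numbers the shopper's own words are under 2% of the first request; the rest is instructions and product data.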

The hidden multiplier: re-sending the conversation

Let's price a six-turn version of the chat above. Assume each shopper message is ~50 tokens, each assistant reply ~150 tokens, system instructions ~2,000 tokens, and retrieved product data ~800 tokens per turn.

Because every turn re-sends everything before it, the input grows turn over turn:

Turn	Input sent to model	Output written
1	2,000 + 800 + 0 history + 50 = 2,850	150
2	2,000 + 800 + 200 history + 50 = 3,050	150
3	2,000 + 800 + 400 history + 50 = 3,250	150
4	2,000 + 800 + 600 history + 50 = 3,450	150
5	2,000 + 800 + 800 history + 50 = 3,650	150
6	2,000 + 800 + 1,000 history + 50 = 3,850	150
Total	20,100 input tokens	900 output tokens

Each bar is one turn. The pale block — the system prompt and product data — is identical every time and never shrinks. The solid block on top is the conversation history, growing turn after turn. The shopper's actual words are a sliver of either.

Here is the headline. The shopper typed about 300 tokens of text. The conversation consumed 20,100 input tokens, roughly 67× more. The system instructions alone (2,000 × 6 turns = 12,000 tokens) account for about 60% of that input.
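The turn-by-turn table can be reproduced with a short loop. This is a sketch under the same assumptions as the table (fixed message, reply, instruction, and product-data sizes); real conversations vary turn to turn.

```python
SYSTEM_TOKENS = 2000    # system instructions, sent every turn
RETRIEVED_TOKENS = 800  # product data, sent every turn
SHOPPER_TOKENS = 50     # each shopper message
REPLY_TOKENS = 150      # each assistant reply

def conversation_tokens(turns):
    """Return (turn, input_tokens, output_tokens) for each turn of a chat."""
    rows = []
    history = 0  # tokens of earlier messages that must be re-sent
    for turn in range(1, turns + 1):
        input_tokens = SYSTEM_TOKENS + RETRIEVED_TOKENS + history + SHOPPER_TOKENS
        rows.append((turn, input_tokens, REPLY_TOKENS))
        history += SHOPPER_TOKENS + REPLY_TOKENS  # this turn joins the history
    return rows

rows = conversation_tokens(6)
total_in = sum(r[1] for r in rows)
total_out = sum(r[2] for r in rows)
typed = 6 * SHOPPER_TOKENS

for turn, tokens_in, tokens_out in rows:
    print(turn, tokens_in, tokens_out)
print("total", total_in, total_out)      # total 20100 900
print("multiplier", total_in / typed)    # multiplier 67.0
```

Because history grows by 200 tokens per turn, the history part of the bill grows with the square of the conversation length, which is why long chats cost disproportionately more than short ones.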

This is not waste — it is how the technology works. But it explains why "the messages were so short, why did it cost that much?" has a real answer.

Bill #1: the language model

Now apply price. Model pricing varies enormously, so let's use two vendor-neutral tiers that bracket the real market:

Tier	Input price /1M	Output price /1M
Flagship (top-end reasoning model)	$5.00	$15.00
Mid-tier (fast, capable, cheaper)	$0.50	$1.50

Flagship model:

Input: 20,100 ÷ 1,000,000 × $5.00 = $0.1005
Output: 900 ÷ 1,000,000 × $15.00 = $0.0135
LLM total: ~$0.114 — about 11–12 cents

Mid-tier model:

Input: 20,100 ÷ 1,000,000 × $0.50 = $0.0101
Output: 900 ÷ 1,000,000 × $1.50 = $0.0014
LLM total: ~$0.0115 — about 1 cent

Same conversation. Same shopper. A 10× difference, decided entirely by which model the assistant runs on. That is the choice nobody puts in front of you — and it is the most important one on the page.
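The same arithmetic as a small function, using the two illustrative tiers from the table. Computed without intermediate rounding, the mid-tier total comes out at $0.0114, a hair under the figure you get by adding the rounded line items.

```python
# Illustrative tiers from the table above: (input price, output price) per 1M tokens.
TIERS = {
    "flagship": (5.00, 15.00),
    "mid-tier": (0.50, 1.50),
}

INPUT_TOKENS = 20_100  # six-turn conversation, from the table earlier
OUTPUT_TOKENS = 900

def llm_cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one conversation on a given tier."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

costs = {
    name: llm_cost(INPUT_TOKENS, OUTPUT_TOKENS, in_price, out_price)
    for name, (in_price, out_price) in TIERS.items()
}

for name, cost in costs.items():
    print(f"{name}: ${cost:.4f}")
print(f"ratio: {costs['flagship'] / costs['mid-tier']:.1f}x")
```

To check a real pricing page, replace the two price pairs and rerun; the token counts stay the same.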

Bill #2: the embedding model (catalog search)

When the shopper asks for "running shoes for flat feet," the assistant cannot read your whole catalog every time — that would be far too many tokens. Instead it uses embeddings.

An embedding turns a piece of text into a list of numbers that captures its meaning. Products with similar meaning end up with similar numbers. To search, you embed the shopper's question and find the catalog entries whose numbers are closest. This is what lets "flat feet" surface a shoe described as "stability / motion control" even though those exact words never matched.
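A toy sketch of that search, assuming cosine similarity as the measure of "closest" (a common choice; the article does not name one). The three-number vectors are invented for illustration; real embeddings have hundreds or thousands of numbers and come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for three catalog entries (assumption, for illustration only).
catalog = {
    "stability / motion control road shoe": [0.9, 0.1, 0.2],
    "waterproof hiking boot": [0.1, 0.9, 0.3],
    "cotton crew socks": [0.2, 0.2, 0.9],
}

# Pretend embedding of the query "running shoes for flat feet".
query_vector = [0.8, 0.2, 0.1]

def search(query_vec, entries, top_k=1):
    """Return the top_k entry names whose vectors are closest to the query."""
    ranked = sorted(
        entries,
        key=lambda name: cosine_similarity(query_vec, entries[name]),
        reverse=True,
    )
    return ranked[:top_k]

print(search(query_vector, catalog))
```

No words are compared here at all, only numbers, which is how a query about "flat feet" can land on a product that never uses that phrase.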

Embeddings have two costs:

a) Indexing your catalog — a one-time cost. Every product description gets embedded once, then re-embedded only when it changes. For a 5,000-product store at ~200 tokens per product:

5,000 × 200 = 1,000,000 tokens
Embedding models are cheap — about $0.02 per 1M tokens
Indexing the entire catalog: ~$0.02. Two cents. Total, not per shopper.

b) Searching during the conversation — a per-conversation cost. Each search embeds only the shopper's short query. Six searches × ~50 tokens = 300 tokens:

300 ÷ 1,000,000 × $0.02 = $0.000006

That is six millionths of a dollar. For practical purposes, the embedding cost of a conversation is zero. It matters for catalog indexing, not for the per-chat bill. Good to know — mostly so you are not upsold on it.
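Both embedding costs in a few lines, using the article's example price of $0.02 per 1M tokens and its catalog and query sizes. The 10,000-conversation month is the traffic figure from the opening example.

```python
EMBED_PRICE_PER_M = 0.02  # example embedding price from the text, per 1M tokens

def embedding_cost(tokens):
    """Dollar cost of embedding a given number of tokens."""
    return tokens / 1_000_000 * EMBED_PRICE_PER_M

# a) One-time indexing: 5,000 products at ~200 tokens each.
indexing_cost = embedding_cost(5_000 * 200)

# b) Per-conversation search: six queries at ~50 tokens each.
search_cost = embedding_cost(6 * 50)

# Search cost across a 10,000-conversation month.
monthly_search_cost = embedding_cost(6 * 50 * 10_000)

print(f"indexing the catalog: ${indexing_cost:.2f}")
print(f"search per conversation: ${search_cost:.6f}")
print(f"search per month: ${monthly_search_cost:.2f}")
```

Even summed over the whole month, search comes to about six cents on these assumptions.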

Bill #3: the eval model (quality control)

The third model is the one most store owners have never heard of. An eval model is a second, usually smaller, language model whose job is to grade the assistant's answers — automatically checking things like: did it stay on topic, did it invent a price, was it actually helpful?

You do not need this on every conversation. It is a quality-control sample — like a factory checking 1 in 20 units, not all of them. But when it runs, it is another model call, so it has a cost.

Grading one assistant reply means sending the eval model the question, the answer, and a rubric (~600 input tokens) and getting back a short verdict (~100 output tokens). Eval almost always runs on a cheap small model — say $0.15 /1M input, $0.60 /1M output.

If you graded all six turns of our conversation:

Input: 6 × 600 = 3,600 ÷ 1M × $0.15 = $0.00054
Output: 6 × 100 = 600 ÷ 1M × $0.60 = $0.00036
Eval total: ~$0.0009 — under a tenth of a cent

And if you sample 1 conversation in 20 instead of grading every turn, the eval cost effectively disappears. It is a rounding error either way — but it is real, and it is the reason your assistant keeps getting better instead of quietly drifting.
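A sketch of the eval bill with and without sampling, using the example prices and token sizes above. The `sample_rate` parameter is a name introduced here for illustration: 1.0 grades every conversation, 0.05 grades 1 in 20.

```python
EVAL_INPUT_PRICE = 0.15   # example small-model price per 1M input tokens
EVAL_OUTPUT_PRICE = 0.60  # example small-model price per 1M output tokens

def eval_cost_per_reply(input_tokens=600, output_tokens=100):
    """Dollar cost of grading one assistant reply."""
    return (input_tokens * EVAL_INPUT_PRICE + output_tokens * EVAL_OUTPUT_PRICE) / 1_000_000

def monthly_eval_cost(conversations, turns, sample_rate):
    """Dollar cost of grading every turn in a sampled share of conversations."""
    graded_conversations = conversations * sample_rate
    return graded_conversations * turns * eval_cost_per_reply()

per_conversation = 6 * eval_cost_per_reply()
grade_everything = monthly_eval_cost(10_000, 6, sample_rate=1.0)
grade_one_in_twenty = monthly_eval_cost(10_000, 6, sample_rate=0.05)

print(f"per conversation, all turns: ${per_conversation:.4f}")
print(f"10,000 conversations, all graded: ${grade_everything:.2f}")
print(f"10,000 conversations, 1 in 20 graded: ${grade_one_in_twenty:.2f}")
```

On these numbers, grading everything costs about $9 a month and sampling 1 in 20 costs about 45 cents.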

Putting the whole bill together

One six-turn conversation, all three models added up:

Component	Flagship LLM	Mid-tier LLM
Language model	$0.1140	$0.0115
Embedding (search)	$0.000006	$0.000006
Eval (all turns graded)	$0.0009	$0.0009
Total per conversation	~$0.115	~$0.013

The whole article in one picture. Identical conversation, identical shopper — the only variable is which language model answered. That single choice is the difference between a $130 month and a $1,150 one.

The language model is the bill: about 99% of it on the flagship tier, and still over 90% on the mid-tier even with every turn graded. Embeddings are free in practice. Eval is a rounding error. So when you compare AI assistants, do not get lost in feature lists about "advanced retrieval" or "evaluation pipelines." Ask which language model answers the shopper, and at what tier. That one answer sets your cost.

To make it concrete — at 10,000 conversations a month:

Flagship: ~$1,150/month
Mid-tier: ~$130/month

Same traffic. Same store. The gap is a model choice.
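The whole bill can be added up in one place. This sketch uses the unrounded component costs, so it lands near $1,149 and $123 a month; the rounder $1,150 and $130 in the text come from rounding the per-conversation totals before multiplying.

```python
CONVERSATIONS_PER_MONTH = 10_000

# Per-conversation component costs in dollars, from the three bills above.
LLM_COST = {"flagship": 0.114, "mid-tier": 0.0114}
EMBEDDING_COST = 0.000006
EVAL_COST = 0.0009  # all six turns graded

def per_conversation(tier):
    """Total cost of one six-turn conversation on the given tier."""
    return LLM_COST[tier] + EMBEDDING_COST + EVAL_COST

def monthly(tier):
    """Total cost of a month of conversations on the given tier."""
    return per_conversation(tier) * CONVERSATIONS_PER_MONTH

def llm_share(tier):
    """Fraction of the per-conversation bill that is the language model."""
    return LLM_COST[tier] / per_conversation(tier)

for tier in LLM_COST:
    print(
        f"{tier}: ${per_conversation(tier):.4f} per conversation, "
        f"${monthly(tier):,.2f} per month, LLM share {llm_share(tier):.1%}"
    )
```

The language model's share is highest on the flagship tier; on the mid-tier the fixed eval cost takes a visibly larger slice, though the model still dominates.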

What actually moves the meter

If you want the bill lower without making the assistant worse, these are the five real levers — in order of impact:

Model tier. The 10× lever. Many stores do not need a flagship model to recommend shoes; a strong mid-tier model handles ordinary shopping questions well. The best setups route — cheap model for simple chats, flagship only for genuinely hard ones.
Prompt caching. Those 2,000-token system instructions are identical on every turn. Most providers let you cache that fixed block so re-sending it costs a fraction of full price. On a long conversation this alone can cut the input bill by half or more. Ask if your provider uses it.
System prompt size. A bloated 5,000-token instruction block is sent on every turn of every conversation forever. Tightening it is a permanent discount.
History trimming. A 30-turn conversation does not need turn 1 verbatim. Summarizing or dropping old turns stops the input from growing without limit.
Eval sampling. Grade a representative sample, not every message. You get the quality signal at a fraction of the cost.

Notice what is not on the list: embeddings and search. They are cheap enough that optimizing them saves you nothing.
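Three of the levers can be sized with the same token assumptions as the six-turn example. The specifics here are assumptions for illustration: cached tokens billed at 10% of the normal input price from the second turn on, history capped at the last four turns, and 20% of chats routed to the flagship model. Real providers and stores will differ.

```python
SYSTEM, RETRIEVED, SHOPPER, REPLY = 2000, 800, 50, 150

def input_tokens_by_turn(turns, max_history_turns=None):
    """Input tokens per turn, optionally keeping only the last N turns of history."""
    result = []
    for earlier_turns in range(turns):
        kept = earlier_turns
        if max_history_turns is not None:
            kept = min(earlier_turns, max_history_turns)
        result.append(SYSTEM + RETRIEVED + kept * (SHOPPER + REPLY) + SHOPPER)
    return result

def cached_input_equivalent(turns, cached_rate=0.10):
    """Input billed, in full-price token equivalents, with the system prompt cached.

    Assumes turn 1 pays full price and later turns pay cached_rate for the
    system prompt (a hypothetical rate, not any provider's actual discount).
    """
    full = sum(input_tokens_by_turn(turns))
    saved = (turns - 1) * SYSTEM * (1 - cached_rate)
    return full - saved

def blended_cost(flagship_share, flagship_cost=0.114, mid_cost=0.0114):
    """Average LLM cost per conversation when only some chats use the flagship."""
    return flagship_share * flagship_cost + (1 - flagship_share) * mid_cost

baseline = sum(input_tokens_by_turn(6))
with_cache = cached_input_equivalent(6)
long_chat = sum(input_tokens_by_turn(30))
long_chat_trimmed = sum(input_tokens_by_turn(30, max_history_turns=4))

print(f"six turns, no cache: {baseline}")
print(f"six turns, cached prompt: {with_cache:.0f}")
print(f"30 turns, full history: {long_chat}")
print(f"30 turns, last 4 turns kept: {long_chat_trimmed}")
print(f"routing 20% to flagship: ${blended_cost(0.20):.4f} per conversation")
```

Under these assumptions caching trims the six-turn input by roughly 45%, and the saving grows as conversations get longer.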

The takeaway for store owners

A conversation is many model calls, not one — and each call re-sends the whole chat. A handful of short messages can be 20,000+ tokens.
You pay per token, ~¾ of a word; output costs 3–5× more than input.
The language model is ~99% of the cost on a flagship model and over 90% on a mid-tier one. Embedding search is effectively free; eval is a rounding error.
The model tier is a 10× lever — roughly 1¢ vs 12¢ for the identical conversation. It is the number that matters, and it is the one rarely shown to you.
Before you sign up, ask three questions: Which model answers shoppers? Can it route simple chats to a cheaper model? Do you use prompt caching?

Cheap is not a number. Now you have the number — and the math to check anyone else's.