
How to A/B test a conversational AI agent

Why guessing is the expensive option

A conversational agent is not one thing you tune. It is a stack of decisions: how it greets a shopper, how warm or terse it sounds, when it offers help, how it recommends products, and how it walks someone through returns or sizing. Each of those is a knob — and with language, small changes move money.

“Want me to narrow it down?” lands differently from “Here are three options.” You cannot reason your way to the right wording, because shopper behaviour rarely matches your intuition. The alternative to testing — edit the prompt, eyeball last week’s numbers, declare victory — is guessing with extra steps. Traffic shifts, a promo runs, and you have no idea whether your change helped, hurt, or did nothing.

In WisWes, A/B testing is an Enterprise feature. It runs two or more variants of the assistant’s prompts or conversation flows at the same time, so the comparison happens under identical conditions instead of week-over-week.

What you can actually test

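The useful tests live at the level of behaviour the shopper feels: the greeting, the tone, when the agent offers help, how it recommends products, and how it walks someone through returns or sizing.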

Start with the surfaces most shoppers hit: the greeting and the recommendation style touch nearly every session, so a win there compounds.

How to run a clean test

Four rules separate a test you can trust from theatre.

1. Change one thing at a time. If Variant B has a new greeting and a new recommendation style, a win tells you nothing about which change did the work. One variable per test.

2. Assign randomly, and keep the session sticky. Each new shopper should be randomly dropped into a variant, then kept on it for the whole session — WisWes does this for you. Stickiness matters more for a conversational agent than for a static page: a chat is a continuous relationship. If a shopper got a warm, guiding persona in message one and a clipped one in message four, the experience breaks and the data is noise.

3. Get enough traffic. A handful of conversations cannot tell you anything. Lower-traffic stores should run tests longer rather than calling them in a day.

4. Pick one primary metric before you start. Decide what “winning” means up front. If you wait until the data is in and then go shopping for a metric where B looks good, you will always find one — and it will mean nothing.
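
Rule 2 is handled by WisWes, but the idea behind sticky random assignment is easy to sketch. The example below is a minimal illustration, not WisWes's implementation: the function name `assign_variant`, the session identifier, and the experiment label are all hypothetical. It hashes the session ID so that the same session always lands in the same variant, while different sessions spread roughly evenly across variants.

```python
import hashlib


def assign_variant(session_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically map a session to one variant (hypothetical helper).

    The same session_id and experiment always return the same variant,
    which is what keeps a conversation "sticky" from message one onward.
    """
    # Prefix with the experiment name so assignments in one test
    # do not line up with assignments in another test.
    key = f"{experiment}:{session_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Treat the first 8 bytes as an integer and bucket it evenly.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Illustrative usage: every message in this session sees the same variant.
variants = ["control", "variant_b"]
first = assign_variant("session-123", "greeting-test", variants)
later = assign_variant("session-123", "greeting-test", variants)
assert first == later
```

Hashing rather than storing a random draw means no lookup table is needed to keep the session on its variant, at the cost of not being able to rebalance traffic mid-test.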

What to measure

WisWes tracks three things per variant: conversations, win-backs, and completed checkouts. Choose your primary metric from the bottom of that list, not the top.

| Metric | What it tells you | Primary? |
| --- | --- | --- |
| Conversations | How many shoppers engaged the variant | No (it’s the denominator, not a result) |
| Win-backs | Shoppers pulled back from leaving and converted | Yes, if the test targets the proactive layer |
| Completed checkouts | Sessions that ended in a purchase | Yes (the default primary metric) |

Message count, session length, and “deflection” are vanity traps. A variant that drives more completed checkouts made the store money. That is unambiguous.
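
To show why conversations are the denominator rather than a result, here is a small sketch that turns the three tracked counts into rates and a relative lift. The class and field names are hypothetical (this is not the WisWes API), and the counts are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Per-variant counts; names are illustrative, not a WisWes schema."""
    name: str
    conversations: int
    win_backs: int
    checkouts: int

    @property
    def checkout_rate(self) -> float:
        # Completed checkouts per conversation: the default primary metric.
        return self.checkouts / self.conversations if self.conversations else 0.0

    @property
    def win_back_rate(self) -> float:
        # Win-backs per conversation: primary only for proactive-layer tests.
        return self.win_backs / self.conversations if self.conversations else 0.0


def relative_lift(control: VariantStats, variant: VariantStats) -> float:
    """Relative change in checkout rate of the variant versus the control."""
    if control.checkout_rate == 0:
        raise ValueError("control has no checkouts; lift is undefined")
    return (variant.checkout_rate - control.checkout_rate) / control.checkout_rate


# Made-up counts for illustration only.
control = VariantStats("Control", conversations=1000, win_backs=30, checkouts=50)
variant_b = VariantStats("Variant B", conversations=1000, win_backs=36, checkouts=60)
print(f"{variant_b.name}: {relative_lift(control, variant_b):+.1%} checkout rate vs control")
```

Comparing rates instead of raw checkout counts matters whenever the variants did not receive exactly the same number of conversations.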

An A/B campaign in WisWes — control versus variant, with conversations, win-backs and checkout rate tracked per variant, and a clear leader.

Reading results and calling a winner

The single most common mistake is stopping too early. Run a test for a day, see Variant B ahead, ship it, and next week it underperforms because the early lead was noise. Hold the line until the lead has stayed steady across several days.

When one variant is clearly and consistently ahead on your primary metric, promote it in the dashboard. It becomes the new control, and your next test challenges it.
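
To put a rough number on "too early", you can estimate before the test how many conversations each variant needs. The sketch below uses the standard normal-approximation formula for comparing two proportions; it is a planning aid under stated assumptions, not something the WisWes dashboard is known to compute. The function names, the 5% significance level, and the 80% power are assumptions you should adjust.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Conversations needed per variant to detect the given relative lift.

    Two-sided test on two proportions, normal approximation.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_needed(n_per_variant: int, daily_conversations: int,
                num_variants: int = 2) -> int:
    """Days required if traffic is split evenly across the variants."""
    return math.ceil(n_per_variant * num_variants / daily_conversations)


# Illustrative inputs: 5% baseline checkout rate, hoping to detect a 20% lift.
n = sample_size_per_variant(baseline_rate=0.05, relative_lift=0.20)
print(n, "conversations per variant;",
      days_needed(n, daily_conversations=500), "days at 500 conversations/day")
```

With these illustrative inputs the requirement is on the order of eight thousand conversations per variant, which is why a lower-traffic store needs weeks rather than a day.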

A worked example

Say you want to test how the agent opens a conversation.

You change only the greeting; everything downstream stays identical. Primary metric: completed checkouts. A week in, Variant B is consistently ahead and the gap is holding steady across days, not bouncing. That is a winner — you promote B, and B becomes your new control.
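
One way to check that a lead like Variant B's is more than noise is a two-proportion z-test on completed checkouts. This is a generic statistical sketch, not a description of how WisWes declares a leader; the function name is hypothetical and the counts are invented for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(checkouts_a: int, n_a: int,
                          checkouts_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in checkout rates.

    A is the control, B is the challenger; n_* are conversation counts.
    """
    p_a = checkouts_a / n_a
    p_b = checkouts_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (checkouts_a + checkouts_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (for example, zero checkouts in both variants).
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented counts: control converts 400 of 8000, Variant B 480 of 8000.
z, p = two_proportion_z_test(400, 8000, 480, 8000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Run the check once, at the sample size you planned for, rather than every morning: testing repeatedly and stopping at the first significant result inflates the false-positive rate.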

Now the next question writes itself: does the guided opener work even better as a full “help me choose” flow with structured steps? You have your next test, and a control worth beating. Test one knob, measure revenue, promote the winner, repeat — and over a quarter, a string of small honest wins compounds into an agent that is measurably better at selling. You can model the cost side of all this usage with the AI agent cost calculator.
