May 24, 2026 · 8 min read · Day 3

Day 3: One prompt, twenty-five searches — keyword expansion with Gemini

Day 3. Today I taught the agent to ask the same question twenty-five ways.

Day 2 finished with 12 products per prompt — whatever AliExpress put on page one. Today, Gemini Flash fans the prompt into ~25 search variants, the fetcher hits them in parallel, and the JSON gets a new shape. The run worked. The result was half a win — and the half-loss is the kind of thing build-in-public is for.

Chunk-numbering note: Day 2 teased "chunk 4 = keyword expansion" and "chunk 5 = relevance check". I have renumbered to keep one chunk per day, so what was chunk 4 is now chunk 3, and so on. Margin calc gets bumped to chunk 7: which product to pick (quality) is a different problem from how much you'd make on it (economics).

1. Where Day 2 left us — and the problem it exposed

Yesterday's win was the wire: a prompt comes in on Shopify, a workflow runs on GitHub Actions, a JSON file with twelve AliExpress products lands on main. Fourteen seconds from button to commit.

Yesterday's loss was the data. I typed iphone 18 replica as a test prompt and the top hit was this:

[Image: The SERVO A50 PRO Mini Foldable Phone on AliExpress, a purple plastic flip phone with two SIM slots and a 2.4 inch screen.]

Top result on Day 2 for `iphone 18 replica`: the SERVO A50 PRO foldable flip phone. The Day 2 article calls this out: "the wire works, the data does not". It's the problem the next few chunks (this one included) attack.

The single-prompt search has two structural problems, not one:

Vocabulary mismatch. A shopper types blue kids pants. The shelf with the best margin is probably listed by the supplier as boys denim trousers or children jeans. We never see it — AliExpress ranks listings by keyword overlap with the literal query.
Single-page ceiling. Even the perfect query shows only what fits above the fold of the first page. Lots of good listings sit on pages 2–10 with the same keywords; we never reach them.

Both problems shrink the candidate pool the rest of the pipeline has to work with. Today's chunk doesn't solve relevance (that's chunk 4, when an LLM filters out the foldable flip phone). Today's chunk grows the pool — wider net, more raw material, then later chunks throw the bad stuff back.

Three ways to widen the net (and why I picked one)

Option	How it works	Why not / why yes
A. Pagination	Walk pages 2-10 of the same query, dedupe by product_id.	Doesn't fix vocabulary mismatch. Page 7 of "blue kids pants" is still ordered by the same ranker for the same words.
B. Synonym dictionary	Hand-maintained word list: kids→children→toddler, pants→trousers→jeans...	Combinatorial blow-up; needs constant tending; biased by whoever wrote it.
C. LLM expansion	Ask a small model to fan one prompt into ~25 plausible search variants.	Cheap, automatic, biased only by training data. Handles synonyms, sibling categories, and common misspellings the same way a real shopper would. Picked.

2. Picking the LLM — why Gemini Flash specifically

Keyword expansion is the kind of task small models are great at: the output is short, structured, and verifiable. There's no chain-of-thought, no tool use, no long-form generation. A frontier model would be overkill and slow. So the candidate set was the small-and-fast tier:

Model	Free tier?	Cost (input/output, per 1M tokens)	Verdict
Gemini 2.5 Flash	1,500 req/day, ~1M tokens/min	$0.075 / $0.30 once you exceed free tier	Picked.
GPT-4o-mini	No (small free credits on signup only)	$0.15 / $0.60	No always-free tier — fork-and-run breaks for anyone without an OpenAI account.
Claude Haiku 4.5	No (Anthropic SDK has free trial credits only)	$1.00 / $5.00	Same problem + an order of magnitude more expensive for this workload.
Self-hosted (Llama, Qwen, etc.)	Free	GPU rental or local box	Would need a GPU somewhere. Breaks the "no infrastructure" story from chunk 2.

Gemini 2.5 Flash hits the five things this chunk actually needs. Each one matters more than it sounds:

Free forever, not just trial credits. Margeen is a public-repo build-in-public project. If a reader forks wiswes/margeen and tries to run the workflow, they need to get a key in 30 seconds without entering a credit card; otherwise the "zero infra cost" story from chunk 2 breaks the moment they hit Day 3. Of the mainstream LLM APIs compared here, Gemini is the only one whose free tier (1,500 requests/day, 1 million tokens/minute) meets that bar today.
Cost math even if you exceed the free tier. One expansion call sends ~50 tokens of prompt and gets back ~250 tokens. At Gemini's paid prices that's (50 × 0.075 + 250 × 0.30) / 1,000,000 ≈ $0.000079 per search. You could run a search a minute, every minute, all month, for under $4. The cost stops being a real number; the rate limit on the AliExpress side will hurt you first.
JSON mode (responseMimeType). Gemini, OpenAI, and Anthropic all support a "return only valid JSON" mode. With it on, the response body is guaranteed to be JSON I can json.loads directly — no regex stripping of ```json fences, no hallucinated "Here's your list:" preamble. This matters disproportionately for short structured tasks. (More on this in the next section.)
Stdlib-only client. No SDK. The Gemini REST endpoint is a single POST to a JSON URL. The script already shells out to curl for AliExpress; for Gemini I used urllib.request from the Python stdlib. So requirements.txt stays empty.
Latency is low enough to ignore. Flash typically returns in 400–800 ms for a request this small. The fetch loop (25 AliExpress fetches in parallel) is the cost centre, not the expansion — even doubling the LLM call would barely show up in the total wall time.
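
The cost arithmetic in the list above is easy to check. A minimal sketch, using only the token counts and per-million prices quoted in this section; the function and constant names are illustrative and not taken from the repo.

```python
# Prices quoted above, in USD per 1M tokens (paid tier).
INPUT_PRICE_PER_M = 0.075
OUTPUT_PRICE_PER_M = 0.30


def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one expansion call at the prices above."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


per_search = cost_per_call(50, 250)      # ~50 tokens in, ~250 tokens out
searches_per_month = 60 * 24 * 30        # one search a minute for 30 days
monthly = per_search * searches_per_month

print(f"per search: ${per_search:.6f}")
print(f"per month:  ${monthly:.2f}")
```

At one search per minute for 30 days that is 43,200 calls, which lands at roughly $3.40 for the month and matches the "under $4" figure.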

Setup, for anyone forking: get a key at aistudio.google.com/apikey (no card needed), add it to the repo as a secret named GEMINI_API_KEY, wire it as an env var in the workflow. Forks without the secret are not broken — the script logs GEMINI_API_KEY not set — skipping expansion and falls back to the Day-2 single-prompt search.

3. The expander — prompt design and the parsing it enables

Prompt engineering for tiny structured tasks is a different sport from prompt engineering for chat. Three things matter: the rules in the prompt, the response format constraint, and the post-processing that catches the model being almost-right.

The prompt, with annotations

# scripts/search_aliexpress.py — expand_keywords()
user_prompt = (
    f'Generate up to {MAX_VARIANTS} short AliExpress search-query '
    f'variants a shopper might type to find this: "{prompt}".\n\n'
    "Rules:\n"
    "- Each variant 1-5 words, lowercase, no punctuation.\n"
    "- Include synonyms, common misspellings, sibling categories, "
    "brand-name variants.\n"
    "- No questions, no sentences, no duplicates of the original.\n"
    "- Return ONLY a JSON array of strings. No prose, no code fence."
)

Each line was added because of a specific failure I saw in early runs:

Rule	What goes wrong without it
"a shopper might type"	Without it, you get marketing-flavoured variants — "premium denim collection for children" — that no real AliExpress search box ever sees.
1–5 words	LLMs love to be helpful. Long variants ("comfortable dark blue jeans for boys aged 4 to 12") return zero matches because no listing repeats that whole phrase verbatim.
lowercase, no punctuation	AliExpress search is case-insensitive but the URL encoding for commas / quotes is messy. Normalising up front makes downstream dedupe trivial.
synonyms, misspellings, sibling categories, brand variants	Naming the categories explicitly nudges the model to fan out instead of returning 25 paraphrases of the same noun.
no questions, no sentences, no duplicates	Negatively scoped rules. Without them you get "what are blue kids pants" and the original prompt verbatim taking up slots.
Return ONLY a JSON array	Combined with response-format=JSON below, makes the parser a one-liner.
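
Because each rule in the table is mechanical, a model response can also be checked against the rules in code. The helper below is a hypothetical sketch (it is not part of the script shown in this article) that reports which of the format rules a single variant breaks.

```python
import re


def rule_violations(variant: str, original: str) -> list[str]:
    """Return the prompt rules this variant breaks (empty list = OK)."""
    problems = []
    words = variant.split()
    if not 1 <= len(words) <= 5:
        problems.append("length")        # rule: 1-5 words
    if variant != variant.lower():
        problems.append("case")          # rule: lowercase
    if re.search(r"[^\w\s]", variant):
        problems.append("punctuation")   # rule: no punctuation
    if variant.strip().lower() == original.strip().lower():
        problems.append("duplicate")     # rule: not the original prompt
    return problems
```

A check like this is mainly useful while iterating on the prompt wording: counting violations per run shows whether a rule is actually changing the model's behaviour.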

JSON mode: the unsexy parsing win

The request includes responseMimeType: "application/json" in generationConfig. That instructs Gemini to constrain decoding to valid JSON tokens — not as a soft prompt instruction ("please return JSON") but at the sampler level. The response body is guaranteed parseable.

body = json.dumps({
    "contents": [{"role": "user", "parts": [{"text": user_prompt}]}],
    "generationConfig": {
        "temperature": 0.7,            # warm enough for variety, not so
                                       #   warm it invents fake products
        "responseMimeType": "application/json",
    },
}).encode("utf-8")

req = urllib_request.Request(
    f"{GEMINI_ENDPOINT}?key={api_key}",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib_request.urlopen(req, timeout=timeout) as r:
    response = json.loads(r.read().decode("utf-8"))
text = response["candidates"][0]["content"]["parts"][0]["text"]
raw = json.loads(text)   # <-- parses cleanly, no regex strip needed

Without JSON mode you spend at least one debugging round on a response like ```json\n[...]\n``` or Sure! Here are 25 variants: [...]. With JSON mode the worst case is a syntactically-valid empty list, which the caller can detect cleanly.
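
JSON mode covers the text payload, but the caller still has to handle the envelope around it (for example a response with no candidates) and a payload that is valid JSON but not an array. A minimal sketch of a tolerant extractor, assuming the `candidates[0].content.parts[0].text` path used in the snippet above; the function name is hypothetical.

```python
import json


def extract_variants(response: dict) -> list:
    """Pull the JSON array out of a generateContent-style response dict.

    Returns [] when the envelope is missing pieces or the payload is not
    a JSON array, so the caller has a single failure signal to check.
    """
    try:
        text = response["candidates"][0]["content"]["parts"][0]["text"]
        raw = json.loads(text)
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        return []
    return raw if isinstance(raw, list) else []
```

Collapsing every failure into an empty list keeps the calling code to one check: an empty result means "fall back to the original prompt".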

Temperature 0.7 — not 0, not 1

Temperature 0 makes every call deterministic, which sounds nice for tests but produces nearly identical lists for similar prompts — the variety I want comes from sampling, not from re-prompting. Temperature 1 starts inventing brand names that don't exist. 0.7 is the sweet spot for "diverse but plausible" outputs on short structured tasks.

Defensive normalisation

Even with JSON mode and clear rules, the model occasionally leaves trailing whitespace, uppercases the brand-name variants ("Apple iPhone"), returns the original prompt as one of the items, or repeats an entry. The post-processor handles all of those without complaining. One caveat: near-duplicates that differ only by punctuation ("blue kids pants" vs "blue kids' pants") are not collapsed by the code below, because it normalises whitespace and case only:

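seen: set[str] = set()
cleaned: list[str] = []
if not isinstance(raw, list):      # e.g. the model returned a JSON object
    return base
for item in raw:
    if not isinstance(item, str):
        continue
    # canonical form: collapse whitespace, trim, lowercase
    v = re.sub(r"\s+", " ", item).strip().lower()
    if not v or v == prompt.strip().lower() or v in seen:
        continue
    cleaned.append(v)
    seen.add(v)
# base is [prompt]; cap so the total never exceeds MAX_VARIANTS
return base + cleaned[: MAX_VARIANTS - 1]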

Three things are happening: whitespace collapse + strip + lower (canonical form), dedupe via a seen set, and the cap at MAX_VARIANTS - 1 because the original prompt is always variant #0 (so the total fanout — including the unchanged prompt — never exceeds 25).
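
The same three steps can be packaged as a standalone function that is easy to test in isolation. This is a sketch rather than the script's actual code: the function name and the optional `strip_punctuation` flag are assumptions added for illustration, the latter showing one way to also collapse punctuation-only near-duplicates.

```python
import re


def normalise_variants(raw, prompt: str, max_variants: int = 25,
                       strip_punctuation: bool = False) -> list[str]:
    """Canonicalise, dedupe and cap model output; prompt is always item 0."""
    def canon(s: str) -> str:
        if strip_punctuation:
            s = re.sub(r"[^\w\s]", "", s)   # assumption: not in the original
        return re.sub(r"\s+", " ", s).strip().lower()

    if not isinstance(raw, list):
        return [prompt]
    original = canon(prompt)
    seen: set[str] = set()
    cleaned: list[str] = []
    for item in raw:
        if not isinstance(item, str):
            continue
        v = canon(item)
        if not v or v == original or v in seen:
            continue
        seen.add(v)
        cleaned.append(v)
    return [prompt] + cleaned[: max_variants - 1]
```

With the flag off, "blue kids' pants" survives as its own variant. With it on, the apostrophe is removed before comparison and the entry is dropped as a duplicate of the original prompt.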

Failure as a first-class state

The function always returns a non-empty list. No key, API error, timeout, malformed JSON — all of them produce the single-element fallback [prompt]. Downstream code never branches on "did expansion work?"; it just loops over whatever list it got. This matters for the public workflow — a fork without the secret behaves identically to Day 2's single-prompt search, no extra code paths.

def expand_keywords(prompt: str, timeout: int = 20) -> list[str]:
    base = [prompt]
    api_key = os.environ.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        print("[margeen] GEMINI_API_KEY not set - skipping expansion")
        return base
    # ... build request ...
    try:
        with urllib_request.urlopen(req, timeout=timeout) as r:
            response = json.loads(r.read().decode("utf-8"))
    except (urllib_error.URLError, TimeoutError, json.JSONDecodeError) as e:
        print(f"[margeen] expansion API error: {e} - falling back")
        return base
    # ... parse + normalise (above) ...
    return base + cleaned[: MAX_VARIANTS - 1]

What Gemini actually returned

For the headline blue kids pants run, Gemini's 24 generated variants looked like this (extracted from latest.json); together with the original prompt they make up the 25 searches:

# Synonyms for "kids"
blue toddler pants     blue baby pants      blue children pants
blue youth pants       kids blue jeans      toddler blue jeans

# Synonyms for "pants"
blue kids trousers     blue kids leggings   blue kids joggers
blue kids sweatpants   blue children trousers
blue children joggers  children blue bottoms

# Gender variants
blue boy pants         blue girl pants      boys blue jeans
girls blue jeans       boy blue trousers    girl blue trousers
kids blue bottoms

# Shade variants
dark blue kids pants   light blue kids pants
navy kids pants        denim kids pants

Four buckets, all plausible, all distinct from the original — and all (in principle) hitting different ranking shelves on the AliExpress side. That's the model earning its keep.

4. Twenty-five searches without burning a minute

Day 2's single fetch took ~14 seconds. Naively, 25 fetches in series is 350 seconds. A six-minute search is dead on arrival — GitHub Actions queues other runs behind it, the Shopify Flow timeout fires, the article gets boring to read.

The Python standard library has ThreadPoolExecutor, which is exactly what this needs: each fetch is I/O-bound (curl waiting on the network), the GIL is not in the way, and there's no setup beyond a context manager.

from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL_FETCHES = 8

records = []
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_FETCHES) as pool:
    futures = {pool.submit(search_variant, v): v for v in variants}
    for fut in as_completed(futures):
        records.append(fut.result())

Each variant's fetch is wrapped in search_variant(), which catches its own exceptions and returns a record, so one flaky variant can't take down the whole run.
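
To see the "never raises" pattern end to end, here is a self-contained sketch with a stand-in fetcher. `fake_fetch` and `run_all` are hypothetical names for illustration; the real `search_variant()` in the script shells out to curl and returns a richer record.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(keyword: str) -> int:
    """Stand-in for the real network fetch; fails for one keyword."""
    time.sleep(0.05)
    if keyword == "bad":
        raise RuntimeError("simulated network failure")
    return len(keyword)  # pretend this is the response size in bytes


def search_variant(keyword: str) -> dict:
    """Never raises: errors are folded into the returned record."""
    try:
        size = fake_fetch(keyword)
        return {"keyword": keyword, "response_bytes": size, "error": None}
    except Exception as e:
        return {"keyword": keyword, "response_bytes": 0, "error": str(e)}


def run_all(variants: list[str], max_workers: int = 8) -> list[dict]:
    records = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(search_variant, v) for v in variants]
        for fut in as_completed(futures):   # completion order, not input order
            records.append(fut.result())
    return records


records = run_all(["blue kids pants", "bad", "navy kids pants"])
```

Because `as_completed` yields futures as they finish, the records arrive in completion order; anything that needs input order should key on the `keyword` field.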

5. New JSON shape (v2)

The output schema is now margeen.search/v2. Three things changed:

expansion block. Records whether expansion was on, which model produced the variants, and how many variants were fetched.
variants[] array. One record per search variant: keyword, HTTP status, response bytes, blocked flag, error string if any, candidate count. The audit trail for "why this product was, or wasn't, in the result".
seen_in on every candidate. Which variant(s) surfaced this product. When two variants surface the same product_id, that signals stronger relevance — chunk 4 will use this.
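
A sketch of how `seen_in` could be built while merging per-variant results. The input shape (a dict from keyword to a list of product dicts) and the sort by number of variants are assumptions for illustration; only `product_id` and `seen_in` come from the schema described above.

```python
def merge_candidates(variant_results: dict[str, list[dict]]) -> list[dict]:
    """Dedupe by product_id, recording which variants surfaced each product."""
    merged: dict[str, dict] = {}
    for keyword, products in variant_results.items():
        for p in products:
            entry = merged.setdefault(p["product_id"], {**p, "seen_in": []})
            if keyword not in entry["seen_in"]:
                entry["seen_in"].append(keyword)
    # Products surfaced by more variants first; ties keep first-seen order
    # because sorted() is stable.
    return sorted(merged.values(), key=lambda c: -len(c["seen_in"]))


results = {
    "blue boy pants": [{"product_id": "1", "title": "A"},
                       {"product_id": "2", "title": "B"}],
    "blue children trousers": [{"product_id": "2", "title": "B"},
                               {"product_id": "3", "title": "C"}],
}
merged = merge_candidates(results)
```

In this toy input, product "2" appears under both keywords, so it ends up first with two entries in `seen_in`.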

[Image: candidates/latest.json on GitHub showing the new v2 schema: expansion block at the top with model name and variant_count: 25, followed by a variants array where each entry has keyword, http_status, response_bytes, blocked, error, candidate_count.]

The new `candidates/latest.json`. The `variants` array makes the rest of the article more or less write itself: you can read the result of every individual search and decide what to do about it.

6. The first real run

Prompt: blue kids pants (same as Day 2's headline run, so the numbers are comparable). Gemini returned 25 variants — synonyms (blue toddler pants), siblings (boys blue jeans), category-shifts (navy kids pants, blue children joggers). The fetcher fanned out.

✓ Run #16 (73dc516)

Run on "blue kids pants" — expansion enabled. 21 seconds end-to-end, 24 unique candidates. Compared to Day 2: 50% more wall time for ~2× the candidates. But also: 23 of 25 variants were bot-blocked by AliExpress (more on that below).

Open on GitHub →

[Image: GitHub Actions run summary for Margeen search #16: status Success, total duration 21s, one artifact named search-result (4.54 KB).]

Run #16: 21 seconds, one artifact, 4.54 KB. Day 2's headline run was 14 seconds and 2.46 KB. The artifact grew because the JSON now carries 25 variant records, not just the products.

[Image: GitHub Actions runs list for Margeen search, with run #16 at the top showing 21 seconds, and the earlier Day 2 runs at 12-14 seconds.]

Run #16 (top, 21s) next to the Day-2 single-prompt runs (12–14s). Twenty-five HTTP fetches in 50% more wall time: that's thread-pool concurrency doing its job.

7. Twenty-three of twenty-five got bot-blocked

Reading the variants[] array of that run is sobering. Two variants returned ~600 KB of real product HTML (blue boy pants, blue children trousers). The other twenty-three returned ~2 KB each: AliExpress's anti-bot interstitial.

Status	Variants	What it means
✓ Real HTML	2 of 25	12 candidates each, no overlap → 24 unique products
✗ 2 KB bot page	23 of 25	Caught correctly by looks_blocked() — recorded, not garbage

AliExpress's anti-bot fired because GitHub Actions runs on a small range of US IPs, and the script just hit it with 25 requests in five seconds. From the server's side, that's a textbook crawler signature. The first two got through; everything after that hit the captcha wall.

And yet: the expansion still doubled the candidate count vs Day 2, because the 2 variants that did get through had non-overlapping results. The mechanism works; the rate at which it works is the next problem.
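
The article does not show how `looks_blocked()` works, so the following is only a guess at the kind of heuristic that separates the two cases seen in this run: ~2 KB interstitials versus ~600 KB product pages. The threshold, the marker string and the function name are all assumptions; the repo's implementation may differ.

```python
BLOCK_SIZE_THRESHOLD = 20_000   # bytes; assumed cut-off between ~2 KB and ~600 KB
BLOCK_MARKERS = (b"captcha",)   # assumed marker string, not verified


def looks_blocked_sketch(body: bytes) -> bool:
    """Heuristic: tiny responses or captcha markers near the top mean blocked."""
    if len(body) < BLOCK_SIZE_THRESHOLD:
        return True
    head = body[:4096].lower()
    return any(marker in head for marker in BLOCK_MARKERS)


def tally(bodies: list[bytes]) -> dict:
    """Count real vs blocked responses, like the table above."""
    blocked = sum(looks_blocked_sketch(b) for b in bodies)
    return {"real": len(bodies) - blocked, "blocked": blocked}
```

Applied to 2 large responses and 23 small ones, `tally` reproduces the 2 / 23 split from this run.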

8. Read every result on GitHub

Every Margeen search writes a JSON file to the repo on main. The git history is the audit log. Three places to look:

[Image: The wiswes/margeen commit history showing the Day 3 keyword-expansion commit (73dc516) followed by an auto-commit from the workflow (c1a3ff2) titled 'search: blue kids pants'. Earlier in the list are the Day 2 commit and several other auto-commits from prior searches.]

The commit history on `main` after Day 3. `73dc516` is the code change. The next commit (`c1a3ff2`, "search: blue kids pants") was pushed by the workflow itself: the agent committing its own findings. Every search appears as one commit; the repo is the database.

[Image: The candidates/ folder on GitHub showing one timestamped JSON file per search across multiple days, plus the stable candidates/latest.json pointer and a debug.html artifact from a blocked run.]

The `candidates/` folder keeps the last ten runs. `candidates/latest.json` is a stable pointer for "just give me the most recent result". A `.debug.html` file gets dropped alongside the JSON when a fetch was bot-blocked, so the raw response is recoverable.

Direct links to today's evidence

Want to run a search yourself? Fork wiswes/margeen, open the Actions tab on your fork, click Margeen search → Run workflow, type any prompt, hit run. Without a Gemini key it does Day-2 single-prompt search; with a key (yours, added as GEMINI_API_KEY in repo Secrets) you get the full 25-variant fanout.

9. What lands tomorrow

Two threads:

Chunk 3.5 (small fix): stagger fetches. ThreadPool with max_workers capped lower (maybe 2), plus a small jitter between requests. The goal is to look less like a crawler from AliExpress's side — turning 23 blocked into 0–3 blocked. That's probably one commit, not a chunk of its own.
Chunk 4 (relevance check): the planned next chunk. Once the candidate pool is genuinely fuller, an LLM reads {prompt, title} for each candidate and drops the ones that don't match — the SERVO A50 PRO finally gets filtered out of iphone 18 replica.
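
One possible shape for the chunk 3.5 fix, as a sketch: wrap the fetch function so every call sleeps for a randomised delay first, and run it with a small pool. The delay values and function names are placeholders, and whether this actually reduces the block rate is untested here.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def staggered(fetch, base_delay: float = 1.0, jitter: float = 0.5):
    """Wrap `fetch` so each call first sleeps base_delay +/- jitter seconds."""
    def wrapper(keyword):
        delay = base_delay + random.uniform(-jitter, jitter)
        time.sleep(max(0.0, delay))
        return fetch(keyword)
    return wrapper


def run_staggered(variants, fetch, max_workers: int = 2, **delay):
    """Fetch all variants with a small pool; results keep input order."""
    polite_fetch = staggered(fetch, **delay)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(polite_fetch, variants))
```

With two workers and about one second per request, 25 variants would take roughly 13 seconds of deliberate waiting on top of the fetch time, which is the trade being made: more wall time in exchange for a less crawler-like request pattern.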

Day 3 of 40. The mechanism works. AliExpress's anti-bot has opinions.

One prompt → 25 Gemini variants → 25 parallel fetches → 24 unique candidates in 21 seconds. Twenty-three of twenty-five variants got bot-blocked. That's tomorrow's small fix; the relevance check is the chunk after.

Read the script → · See the latest.json →