May 23, 202611 min readDay 2

Day 2: Building AI search across AliExpress

A dense bundle of assorted electric cables, twisted and tied together. — What the inside of a brand-new project actually looks like. Photo by John Barkiple on Unsplash.

Day 2. Today I built the part that searches AliExpress.

A prompt goes in on the Shopify side, twenty seconds later a list of real products — with prices — lands in the repo. Below: how I picked the runtime, how I actually wired it together, and what a live result looks like today. Every step links to the real commit or run page on GitHub.

1. Picking the runtime

Margeen needs somewhere to run. The code has to wake up when something happens on the Shopify side, hit AliExpress, and write the result to a place I can read later. Three platforms were obvious-ish candidates, in the order I considered them.

Shopify

Tempting — the user already pays for Shopify.

Out

Shopify is a great platform for storefronts and a poor platform for hosting arbitrary code. Functions only runs at checkout/cart extension points. Flow is a no-code automation tool — it can call external HTTP, but it can't run Python. There is no general-purpose "Shopify Lambda". So Shopify can be the trigger surface and the storefront, but not the brain.

Cloudflare Workers

Free tier, fast cold start, real runtime.

Out

Workers are great for synchronous webhook replies. They are not so great for "run a few seconds of work, then store the result somewhere I can link to later". To make that work I'd need Workers plus a KV/D1 store plus a way to render the result. The total system grows into two services and a database — and then the articles need to explain all three. Out for that reason.

GitHub Actions

The runtime, the audit log, the approval UI, and the database — all the same git history.

✓ Picked

Truly free for public repos. The repository_dispatch REST endpoint lets any external system (Shopify Flow, in our case) start a run. Each run can commit its own result back to the same repo. The repo becomes the database. There is no separate state to lose, no infrastructure to forget about, no extra credentials to rotate. Slower at the edges — a 5–30s queue between dispatch and execution — but for this use case, totally fine.

The Margeen search workflow runs page on GitHub Actions, listing fifteen runs — most green, two red — each tagged Manually run by wiswes with a duration in seconds. — The actual outer wire, in evidence. Every row on the public workflow page is one prompt → one run → one commit. Greens are searches that wrote a JSON file to the repo; reds are the broken pushes I had to fix.

2. Actually building it

Everything below is live on the public repo. The workflow file is .github/workflows/search.yml; the Python script is scripts/search_aliexpress.py; all the past runs sit on the workflow page.

The fetch — Python `requests` versus curl

With requests.get(...) and a real browser User-Agent, the response was HTTP 200, two kilobytes — a bot-check interstitial. With curl from the same machine, identical headers, six hundred kilobytes of real product HTML. The difference is the TLS fingerprint — curl negotiates HTTP/2 with browser-like cipher suites, requests does not. So the fetcher shells out to curl via subprocess.run; every GitHub Actions runner has curl preinstalled.

What the 2 KB bot-block page contains, verbatim:

<html>
  <head><title>请输入验证码</title></head>
  <body>
    <script src="//g.alicdn.com/sd/punish/2.5.6/punish.js"></script>
    <div id="J_RecaptchaContainer"></div>
    <script>window._dida_config_ = { ... };</script>
  </body>
</html>

What a single product card looks like in the real 600 KB response:

<a href="//www.aliexpress.us/item/3256812035515440.html?algo_pvid=…"
   class="multi--container--..."
   title="5-14Y Boys Jeans Pants Spring Kids Wide Leg Straight Trousers …">
  <div class="multi--image--…">…image tags…</div>
  <div class="multi--price--sale--current">$17.22</div>
</a>

And the actual Python that does the fetch:

# scripts/search_aliexpress.py
import subprocess

result = subprocess.run(
    ["curl", "-sSL", "--max-time", "30",
     "-A", USER_AGENT,
     "-H", "Accept-Language: en-US,en;q=0.9",
     url],
    capture_output=True, text=True, check=False,
)
body = result.stdout

Parsing the cards — and the .us trap

With 600 KB of real HTML in hand, I wrote a regex to pull product cards. Result: zero products. Twelve product IDs were in the response — I could grep them — but the parser found none. I made the script dump the raw HTML to candidates/<ts>.debug.html when parsing produced nothing, then opened the saved file. The hrefs were there, but on aliexpress.us, not .com. GitHub Actions runs on US IPs and AliExpress redirects US visitors to its US subdomain. My regex was hardcoded to .com. One character class change later, twelve products extracted.

A product photo from AliExpress: a child wearing blue jeans against a textured wall — the kind of product card the parser pulls out of the search HTML. — One of the cards the parser pulls out for `blue kids pants` — the same product the regex now finds at `aliexpress.us/item/3256812035515440`. Open the listing · see it in latest.json.

# Old: only aliexpress.com
href="(//www\.aliexpress\.com/item/(\d+)\.html[^"]*)"

# New: any 2-3 letter TLD on the aliexpress domain
href="(//(?:www\.)?aliexpress\.[a-z]{2,3}/item/(\d+)\.html[^"]*)"

Adding prices

Names and URLs are useful; prices are the point. Inside each card chunk I look for the first $X.YY or US $X.YY. AliExpress renders the headline price in the first one or two spans of the card body, so the first match is reliable enough for today.

# inside parse_items()
price_pat = re.compile(r'(?:US ?\$|\$)\s?([0-9]+(?:[.,][0-9]{1,2})?)')
price_m = price_pat.search(chunk)
items.append({
    "product_id": pid,
    "url": url,
    "title": title,
    "price_str": price_m.group(0) if price_m else None,
    "price_usd": float(price_m.group(1)) if price_m else None,
})

The trigger

Shopify Flow's HTTP action POSTs to GitHub's repository_dispatch endpoint with the prompt as the payload. The full body looks like this:

# Shopify Flow → HTTP request action
URL:    https://api.github.com/repos/wiswes/margeen/dispatches
Method: POST
Headers:
  Authorization: Bearer <PAT>
  Accept: application/vnd.github+json
  X-GitHub-Api-Version: 2022-11-28
Body:
  {
    "event_type": "margeen_search",
    "client_payload": { "prompt": "{{ product.title }}" }
  }

The .github/workflows/search.yml file open on GitHub, showing the YAML for the Margeen search workflow — name, workflow_dispatch and repository_dispatch triggers, permissions, and the search job. — The whole workflow file — `.github/workflows/search.yml`. Two ways in: the manual **Run workflow** button on the Actions page (`workflow_dispatch`) and Shopify Flow's HTTP action (`repository_dispatch`).

Real evidence — failed and working runs

This is build-in-public. Two real runs to click and verify — one broken, one green, both on the actual project repo:

✗

Run #12 — 10d525b

Day 1: surface results on the run page (step summary + artifact). YAML parse error. I had embedded a Python heredoc inside a bash run: block to render the top-5 markdown list. The heredoc body sits at column 0, which YAML interprets as the end of the literal block. The workflow never started — "Invalid workflow file" at line 108. Swapped the heredoc for a jq one-liner and pushed the fix.

Open on GitHub →

✓

Run #14 — 6851c97

Run on "blue kids pants". 12 candidates extracted with prices. The run page's Summary tab shows the comparison table (cheapest first), the JSON is committed to candidates/, and candidates/latest.json updated. Total elapsed: 14 seconds.

Open on GitHub →

The GitHub Actions summary page for Margeen search #14: a green check, Status: Success, Total duration 14s, one artifact (search-result, 2.46 KB), the single search job ran in 10s. — Run #14 on the public Actions page: one job, ten seconds of compute, fourteen seconds wall-clock, one committed artifact. The whole outer loop end-to-end.

What landed in the repo

Day 2 added five files to wiswes/margeen on top of the Day-1 manifesto skeleton. The current shape of the repo:

The wiswes/margeen repository page on GitHub showing the root folders and files: .github/workflows, assets, candidates, chunks/01, docs, scripts, plus .gitignore, LICENSE, README.md. — The current state of `wiswes/margeen`. The commit message on the top row ("search: iphone 18 replica") is from the run that wrote the very file in the next screenshot — Margeen is its own audit log.

candidates/latest.json open on GitHub — a JSON file listing AliExpress products with product_id, url, title, price_str, and price_usd fields, plus a sidebar of timestamped per-search files in the candidates folder. — `candidates/latest.json` — the per-run output, committed to `main` every time a search completes. The sidebar on the left is the history: every previous run is still on disk, named by timestamp + slug.

Four direct links to the live files, in case the screenshots above weren't enough:

.github/workflows/search.yml — the workflow that runs on every Shopify Flow dispatch
scripts/search_aliexpress.py — the ~120-line fetcher + parser
candidates/latest.json — the most recent run's JSON output
All workflow runs — the public pipeline page, with every run logged

What AliExpress actually sends back

The wire works. The data is a different story. I tried a prompt I genuinely wanted to test — iphone 18 replica — and the workflow returned twelve products. The first one was this:

A product page on AliExpress showing the SERVO A50 PRO Foldable Mobile Phone — a purple plastic flip phone with a small inner screen, listed at about 1,092 hryvnia (~$26 USD). — The top hit for `iphone 18 replica` on this AliExpress listing. A purple plastic flip phone marketed as the SERVO A50 PRO. Calling it "an iPhone 18 replica" would be generous.

That's the relevance gap in one image. AliExpress's search ranks by keyword overlap + listing popularity + sponsored placement — like any e-commerce forward index. The word iPhone appears in many unrelated listings as a comparison ("works just like an iPhone"); the word replica cuts across categories (foam fruit replicas showed up in an earlier test). The parser has no semantic idea whether what came back is what you asked for. It just extracts whatever AliExpress put on the page.

How keyword expansion will close that gap

Today the script does one HTTP call per prompt. That single call inherits whatever AliExpress thinks is relevant. Chunk 4 wraps the fetch in a loop, with an LLM expanding the prompt first into 20–30 carefully-chosen variants. Concretely, for the same iphone 18 replica prompt, the keyword expander would generate something like:

# chunk-4 sketch — what Gemini Flash will be asked to produce
prompt = "iphone 18 replica"
keywords = expand(prompt)
# returns ~25 variants:
#   "iphone 18 clone"
#   "iphone 18 dupe"
#   "iphone 18 lookalike"
#   "iphone 18 dummy phone"
#   "iphone 18 mockup display unit"
#   "iphone 18 fake working"
#   "iphone 18 1 to 1 replica"
#   "android phone iphone 18 design"
#   "iphone 18 clone unlocked"
#   "iphone 18 style smartphone"
#   ... 15 more
candidates = []
for k in keywords:
    candidates.extend(search_aliexpress(k))
candidates = dedupe(candidates, by="product_id")
# then chunk 5 does the relevance check

Two things change. One, we get more candidates — different keywords surface different shelves of AliExpress's catalogue. Two, the next stage (chunk 5) reads {prompt, title, supplier_description} for each candidate and drops the ones that don't actually match the user's intent. That is where the foldable flip phone gets filtered out.

scripts/search_aliexpress.py open on GitHub — a Python file with module docstring, standard library imports (subprocess, re, json, os), and the start of search/parse helpers. — `scripts/search_aliexpress.py` — what built the white blocks of the inner pipeline today: HTTP fetch (via `subprocess`), parse + dedupe (with `re`), commit JSON. Keyword expansion and the relevance check are the next chunks; they will sit at the top of this same file.

3. What live results look like today

Every successful run does three things:

Output	Where	For
Per-search JSON	`candidates/<ts>_<slug>.json`	Audit log + future replays
Stable latest pointer	`candidates/latest.json`	One URL — never changes
Run-page summary	GitHub Actions run page	Comparison table, cheapest first, clickable titles

Until today, the only way to see a result was to know the repo's file layout. Now the GitHub Actions run page itself shows the prompt, the candidate count, the link to the JSON, and a sorted comparison table — markdown, clickable, generated by jq. Sample of what landed on a real run:

Margeen search — blue kids pants

Candidates found: 12 · Blocked: false

Price	Product (clickable)
$6.74	Summer Baby Denim Pants — Casual Soft Blue…
$9.92	3–12 Yrs Boys Jeans Slim Fit Denim…
$10.55	Children Jeans Boys Cotton Print Splice…
$10.89	Fashion Bow Embroidery Girls Jeans…
$13.17	Baby Girls' Jeans New Style Children's…

What renders on the GitHub Actions run page below the job tile.

To run one yourself right now:

Open the workflow → Run workflow, type a category, hit Run.
Within ~15 seconds, the run page's Summary shows the comparison table.
For the most-recent result at a stable URL, bookmark candidates/latest.json.

Day 2 of 40. The search loop holds. A foldable flip-phone is not an iPhone.

Tomorrow: the keyword-expansion loop above, with Gemini Flash generating the synonyms. Same JSON, one extra step, the same public commit log.

Follow the code on GitHub →See the latest search →