May 17, 202613 min read

30 Interview Questions for an AI Python Agentic Engineer

Two people shaking hands across a desk at the end of a job interview — Photo: Cytonn Photography / Unsplash

Hiring for agentic AI roles is hard. The field moves fast, job titles are inconsistent, and a candidate who can recite the transformer architecture may still struggle to ship a reliable tool-calling loop. The questions below are organised to test what actually matters on the job: Python fundamentals, LLM mechanics, agent design, and production reliability.

Each question includes a note on what a strong answer looks like, so you can use this as an interviewer's guide or a study sheet. Whether you are hiring an engineer or preparing for the chair on the other side of the table, the goal is the same — find the people who can build agents that hold up in production, not just in a demo.

Part 1 — Python fundamentals for AI work

1. How do async and await work, and why do they matter for agent code?

Strong answer: agents spend most of their time waiting on network I/O — LLM calls, tool APIs. asyncio lets a single process handle many concurrent calls without threads. Look for a mention of the event loop and that await yields control rather than blocking.

2. What is the difference between concurrency and parallelism, and where does the GIL fit?

The GIL serialises bytecode execution, so CPU-bound work needs multiprocessing; I/O-bound work — which is most agent work — is fine with asyncio or threads.

3. How would you stream tokens from an LLM API to a client in real time?

Look for generators and async generators, server-sent events (SSE) or WebSockets, and yielding chunks as they arrive rather than buffering the whole response.

4. Explain Python generators and a case where they save memory in an AI pipeline.

Lazy evaluation — for example, streaming a large document through a chunker and embedder without loading it all into RAM.

5. How do you use pydantic to validate structured LLM output?

Define a model, parse the LLM's JSON, catch ValidationError, and feed the error back to the model for a retry.

6. What are type hints good for in agent codebases, and what are their limits?

Editor and static-analysis help, plus self-documenting tool signatures — many frameworks generate JSON schemas from them — but there is no runtime enforcement without something like pydantic.

7. How do you manage dependencies for a reproducible AI project?

Lockfiles (uv, poetry, pip-tools), pinned versions, and isolating CUDA and torch versions so the build is the same everywhere.

8. How would you write a unit test for code that calls an LLM?

Mock the API client, test the logic around the call — parsing, retries, routing — and use recorded fixtures or evals rather than hitting the live model in CI.

Part 2 — LLM and prompting mechanics

A candidate and an interviewer talking across a table by a window — Photo: Christina @ wocintechchat.com / Unsplash

9. What is a context window, and how do you handle inputs that exceed it?

Truncation, summarisation, retrieval (RAG), chunking, and tracking token counts with a tokenizer.

10. Explain temperature and top-p, and when you would set temperature to 0.

Temperature 0 for deterministic or structured tasks — extraction, tool routing — and higher values for creative generation.

11. What is prompt caching and why does it matter for agents?

Agents resend a large, stable system prompt and tool definitions on every turn. Caching that prefix cuts latency and cost significantly. Cache the stable parts and keep volatile content at the end.

12. How do tokens affect cost and latency, and how do you estimate them?

Pricing is per token in and out, latency scales with output tokens, and you estimate ahead of time with the model's tokenizer.

13. What is structured output / function calling, and how is it different from parsing JSON out of text?

The model is constrained — or strongly biased — to emit a schema-valid object, which is far more reliable than scraping JSON out of prose. A good answer mentions JSON mode and grammar-constrained decoding.

14. What causes hallucinations, and what concrete techniques reduce them?

Grounding with retrieval, citations, "say I don't know" instructions, lower temperature, and verification steps — not just "better prompts".

15. How do you decide between a bigger model, a smaller fine-tuned model, and RAG?

Cost, latency and accuracy trade-offs: RAG for knowledge that changes, fine-tuning for format and behaviour, bigger models for hard reasoning.

Part 3 — Agent design and architecture

16. What makes a system "agentic"? Define it without buzzwords.

An LLM that decides which actions to take in a loop — calling tools, observing results, and re-planning — rather than producing a single response.

17. Walk me through the ReAct loop: reason, act, observe.

The model emits a thought plus a tool call, the runtime executes the tool, the result is appended to context, and the loop repeats until the model returns a final answer.

18. How do you design a tool for an agent to use reliably?

A clear name and description, a tight typed schema, narrow scope, idempotency where possible, and error messages written for the model to recover from.

19. When should you use a single agent versus a multi-agent system?

Multi-agent for genuinely separable roles and contexts — an orchestrator plus specialists. A single agent is simpler and usually the right default. Watch for candidates who reach for multi-agent reflexively.

20. How do you stop an agent from looping forever?

Max-iteration caps, token and cost budgets, loop and repeat detection, timeouts, and a fallback final answer.

21. How does an agent manage memory across a long task or session?

Distinguish short-term memory (the conversation buffer), summarised or compacted history, and long-term memory — a vector store or a file or database the agent reads and writes.

22. What is context engineering, and why is it replacing "prompt engineering"?

Deliberately curating everything in the window — tool results, retrieved documents, history, instructions — managing relevance and ordering, and pruning noise. The window is a budget, not a dumping ground.

23. How would you implement human-in-the-loop approval for risky actions?

Classify actions by risk, pause the loop to surface a confirmation, persist state so the run can resume, and default to deny for irreversible or outward-facing actions.

24. Compare orchestration frameworks — when would you skip the framework?

Frameworks give state machines, checkpointing and observability; a custom loop is fine and clearer for simple cases. The candidate should have an opinion and know the cost of the abstraction.

25. How do you build a RAG pipeline, end to end?

Ingest, chunk, embed, store in a vector database, retrieve — often hybrid: semantic plus keyword — rerank, and inject into the prompt with citations. Bonus points for evaluating retrieval quality separately from generation.

Part 4 — Production, evaluation and reliability

A laptop on a desk showing source code, ready for a live coding exercise — Photo: Christopher Gower / Unsplash

26. How do you evaluate an agent? "It looks good" is not an answer.

An eval set of representative tasks, automated scoring (exact match, LLM-as-judge, task success rate), regression testing on every prompt or model change, and tracing of individual tool calls.

27. An agent works in testing but fails 20% of the time in production. How do you debug it?

Tracing and observability, isolating the failing step, and checking for non-determinism, tool errors and context overflow — a systematic approach, not "tweak the prompt".

28. How do you handle LLM API failures — rate limits, timeouts, 5xx?

Exponential backoff with jitter, retries on idempotent calls, circuit breakers, fallback models or providers, and graceful degradation.

29. What are the security risks of giving an LLM tools, and how do you mitigate them?

Prompt injection — especially from tool outputs and retrieved content — over-broad tool permissions, and secret leakage. Mitigations: least-privilege tools, sandboxing, input and output filtering, never trusting model output as a command, and human approval for sensitive actions.

30. How do you control and monitor cost in a production agent system?

Per-request token and cost budgets, model routing — a cheap model for easy steps — prompt caching, capped loop iterations, and dashboards with alerts on cost per task.

How to use these questions

Screening (30 min): pick two from each part — questions 1, 9, 16 and 26 are good anchors.
Deep dive (60 min): focus on Parts 3 and 4, and pair them with a live coding exercise.
Best signal: turn question 17, 18 or 25 into a hands-on task — ask the candidate to build a small tool-calling loop. Talking about agents and shipping reliable ones are different skills.

The strongest candidates will not just recite definitions — they will talk about trade-offs, failure modes, and what they would actually do when things break.

Agentic engineering is also what powers modern conversational commerce. If you want to see these ideas in a shipped product — a tool-calling agent that reads a live catalogue and guides shoppers to checkout — read our guide to conversational commerce or take a look at what WisWes does.