An agent that worked perfectly in a notebook fails on day three in production. The user asked something slightly off-script, the model hallucinated a tool call, and now your support team is cleaning up tickets. We’ve shipped enough AI agents on Claude, OpenAI, n8n, and LangSmith to know this is almost never a model problem. It’s a prompt structure problem.
These five patterns are what we reach for first. They’re not novel. They are, however, the difference between a demo and a system you can leave running on a Tuesday.
The 5 patterns we use on every production agent
- Structured output schema. Force the model to emit JSON conforming to a defined schema, validated before any downstream action runs. Claude’s tool use and OpenAI’s structured outputs both let you declare that schema at the API level; validate the result in your own code anyway. If your agent is still parsing freeform text with regex, you’re one weird user message away from a production incident.
- Role and task decomposition in the system prompt. Don’t ask the model to “help the user with their billing question.” Decompose: “You are a billing triage agent. Your job is one of three actions: (1) answer from the FAQ, (2) escalate to human, (3) request more info. Pick one, then justify the choice.” By our rough estimate, decomposing the task up front cuts ambiguity by about 60 percent.
- Tool-result framing. When you feed a tool’s output back to the model, wrap it in explicit XML or JSON tags with the tool name and timestamp. Don’t paste raw results into the conversation. The model needs to know which call returned what, especially across multi-turn agent loops where two tool calls might return similar-looking data.
- Self-check before final output. Add a verification step where the model reviews its own draft response against the original request. “Before responding, check: did you answer all parts of the question? Did you cite the source document? If not, revise.” This is cheap (one extra model pass) and catches a surprising amount of drift.
- Fallback patterns. Build explicit “I don’t know” and escalation paths into the prompt. Models hallucinate when they feel cornered. Give them a graceful exit: “If you cannot answer with confidence above 80 percent, output {action: ‘escalate’, reason: ‘…’} and stop.” Reliability comes from giving the model permission to fail loudly.
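To make the tool-result framing pattern concrete, here is a minimal Python sketch using only the standard library. The tag name `tool_result`, the `call_id` attribute, and the function name are assumptions made for illustration; use whatever envelope your own agent loop expects.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape, quoteattr


def frame_tool_result(tool_name: str, call_id: str, result: str) -> str:
    """Wrap a raw tool result in a tagged envelope before it goes back to the model.

    The envelope layout is hypothetical; the point is that the tool name and a
    timestamp travel with the data.
    """
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (
        f"<tool_result tool={quoteattr(tool_name)} "
        f"call_id={quoteattr(call_id)} timestamp={quoteattr(timestamp)}>\n"
        # Escape the payload so tool output cannot close the tag early.
        f"{escape(result)}\n"
        f"</tool_result>"
    )
```

Calling `frame_tool_result("lookup_invoice", "call_1", raw_text)` for every tool response means two similar-looking payloads still arrive with distinct names and timestamps attached.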
Example: a structured output prompt template
Here’s a stripped-down version of the system prompt we use for a triage agent. It combines patterns 1, 2, and 5.
You are a support triage agent for a B2B SaaS product.
Your job: classify each incoming user message and return a JSON object.
Allowed actions:
- "answer_from_kb" : message can be resolved from the knowledge base
- "escalate_billing" : message involves a billing dispute or refund
- "escalate_tech" : message describes a bug or outage
- "request_info" : message is too vague to act on
- "decline" : message is out of scope (sales, legal, HR)
Required output schema:
{
  "action": "<one of the allowed actions>",
  "confidence": <float 0.0 to 1.0>,
  "reason": "<one sentence, max 20 words>",
  "kb_article_ids": [<array of strings, only if action is answer_from_kb>]
}
Rules:
1. If confidence is below 0.75, action MUST be "request_info" or "escalate_tech". Never guess.
2. Never output any text outside the JSON object.
3. If the message contains PII, set action to "escalate_tech" and include "PII_detected" in the reason field.
User message:
{{user_message}}
That’s roughly 30 lines, and it does most of the work. The downstream code validates the JSON against a schema, routes the action, and logs confidence to LangSmith for monitoring. Downstream code never has to interpret a free-text decision.
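The validation step could look something like the sketch below. It uses only the Python standard library and mirrors the allowed actions and the 0.75 rule from the template. The names `TriageDecision`, `InvalidDecision`, `parse_decision`, and `decide_or_escalate` are hypothetical, and falling back to "escalate_tech" on malformed output is an assumption, not something the template specifies.

```python
import json
from dataclasses import dataclass, field

ALLOWED_ACTIONS = {
    "answer_from_kb", "escalate_billing", "escalate_tech", "request_info", "decline",
}
LOW_CONFIDENCE_ACTIONS = {"request_info", "escalate_tech"}
CONFIDENCE_FLOOR = 0.75  # mirrors rule 1 in the prompt


class InvalidDecision(ValueError):
    """Raised when the model output does not match the expected schema."""


@dataclass
class TriageDecision:
    action: str
    confidence: float
    reason: str
    kb_article_ids: list[str] = field(default_factory=list)


def parse_decision(raw: str) -> TriageDecision:
    """Parse and validate the model's JSON before anything downstream runs."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidDecision(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidDecision("top-level value must be a JSON object")

    action = data.get("action")
    confidence = data.get("confidence")
    reason = data.get("reason")
    kb_ids = data.get("kb_article_ids", [])

    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise InvalidDecision(f"unknown action: {action!r}")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise InvalidDecision("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise InvalidDecision("confidence must be between 0.0 and 1.0")
    if not isinstance(reason, str) or not reason.strip():
        raise InvalidDecision("reason must be a non-empty string")
    if not isinstance(kb_ids, list) or not all(isinstance(i, str) for i in kb_ids):
        raise InvalidDecision("kb_article_ids must be a list of strings")
    if kb_ids and action != "answer_from_kb":
        raise InvalidDecision("kb_article_ids is only allowed with answer_from_kb")
    if confidence < CONFIDENCE_FLOOR and action not in LOW_CONFIDENCE_ACTIONS:
        raise InvalidDecision("low-confidence decisions must request info or escalate")

    return TriageDecision(action, float(confidence), reason, kb_ids)


def decide_or_escalate(raw: str) -> TriageDecision:
    """Never act on malformed output; hand it to a human instead (assumed policy)."""
    try:
        return parse_decision(raw)
    except InvalidDecision as exc:
        return TriageDecision("escalate_tech", 0.0, f"invalid_model_output: {exc}")
```

Enforcing rule 1 in code as well as in the prompt is deliberate: the prompt asks the model to comply, and the validator checks that it did.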
Anti-patterns to avoid
We’ve also developed strong opinions about what not to do. These three account for most of the agent failures we audit.
- Chain-of-thought leakage to the user. If you prompt the model to “think step by step,” and you don’t strip the reasoning before showing output to a user, you’re shipping internal monologue to customers. Use a separate reasoning field in the schema, then drop it before display.
- Over-prompting. A 4,000-token system prompt with 47 edge-case rules will usually perform worse than a 400-token prompt with 5 clear ones. Models get confused by long, conflicting instructions. If your prompt has grown past 1,500 tokens, refactor into smaller specialized agents with handoffs.
- No eval suite. If you can’t replay 50 historical inputs against a new prompt version and see pass/fail counts, you’re not iterating, you’re guessing. We use LangSmith or simple pytest harnesses for this. The eval suite is more important than the prompt itself.
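A replay harness for the eval-suite point does not need to be elaborate. The sketch below is one possible shape in plain Python, not a LangSmith integration: `EvalCase`, `run_evals`, and the `classify` callable are hypothetical names, and `classify` stands in for whatever function sends a message through your current prompt version and returns the chosen action.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    message: str          # a historical user input
    expected_action: str  # the action a human reviewer says is correct


def run_evals(classify: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Replay every case against a prompt version and count passes and failures."""
    failures = []
    for case in cases:
        try:
            got = classify(case.message)
        except Exception as exc:
            # A crash counts as a failed case; it should not abort the run.
            got = f"error: {exc}"
        if got != case.expected_action:
            failures.append((case.message, case.expected_action, got))
    return {
        "passed": len(cases) - len(failures),
        "failed": len(failures),
        "failures": failures,
    }
```

Run it once per prompt version over the same cases and compare the `passed` counts; the `failures` list shows which inputs regressed. The same function drops into a pytest test by asserting on `failed`.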
When to use an agent versus a workflow
None of these patterns matter if you picked the wrong architecture. Agents are right when the path through the task is not knowable in advance. For predictable, multi-step flows, a deterministic workflow with one LLM call inside it almost always beats an autonomous agent. We wrote about that distinction at length in AI agents vs workflow automation.
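For contrast, a deterministic workflow with a single model call might be sketched like this. The queue names and the `classify` callable are assumptions for illustration; `classify` represents the one LLM call, and everything around it is ordinary code with a fixed path.

```python
from typing import Callable

# Hypothetical queue names, for illustration only.
ROUTES = {
    "answer_from_kb": "queue:auto_reply",
    "escalate_billing": "queue:billing",
    "escalate_tech": "queue:engineering",
    "request_info": "queue:follow_up",
    "decline": "queue:closed",
}
FALLBACK_ROUTE = "queue:human_review"


def handle_ticket(message: str, classify: Callable[[str], str]) -> str:
    """Fixed three-step workflow: pre-check, one model call, table lookup."""
    text = message.strip()
    if not text:
        # Step 1: empty input needs no model at all.
        return ROUTES["request_info"]
    try:
        # Step 2: the only LLM call in the workflow.
        action = classify(text)
    except Exception:
        return FALLBACK_ROUTE
    # Step 3: routing is a lookup, so the model cannot invent a destination.
    return ROUTES.get(action, FALLBACK_ROUTE)
```

The model chooses a label and nothing else. Which step runs next is decided by code, which is what makes the flow predictable and easy to test.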
Now what? If you’re scoping an agent build and want a second set of eyes on the architecture, see how we approach autonomous AI agents in production.