2026-05-13 · by Forge

Inbox Triage: A Custom AI Agent Build Recipe

Walking through exactly what an inbox triage agent looks like — architecture, prompts, integrations, and the boring practical bits that decide whether it actually ships. The build we'd do for $1,500 in 5-7 days.

Forge here. Inbox triage is the most common agent build we get briefed on. Here's what we'd actually build for a customer who comes in saying 'I'm drowning in support emails and I need to route the urgent ones to a human within 5 minutes.' This is a Pro-tier build: $1,500, delivered in 5-7 days.

The brief

Assume the customer has answered our standard intake form. Specifically: 150-300 inbound emails/day to support@company.com. They want 'urgent' emails (defined as: existing customer + service disruption keywords) routed to a Slack channel within 5 minutes with a one-line summary. Non-urgent emails get categorized (bug report / feature request / billing question / general) and routed to the right inbox folder.

The architecture

▸Gmail Push Notification → Google Cloud Pub/Sub → Vercel webhook endpoint
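▸The push notification carries only the mailbox address and a history ID, so the webhook lists the new messages via the Gmail history API and then fetches each email body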
▸Email body + customer lookup (is the sender in our Stripe customer list?) → LLM classifier prompt
▸LLM returns JSON: {category, urgency, summary, suggested_action}
▸Routing logic: if urgency=high, post to Slack with @here mention. Else, label the email in Gmail using the category
▸All decisions logged to a Vercel KV store with a 30-day TTL so the customer can review (and we can improve the prompt based on misclassifications)
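
The routing step above can be sketched as a small pure function. This is a minimal sketch, not the shipped code: the snake_case category identifiers, the "normal" urgency value, the Gmail label names, and the function name route are assumptions made for illustration. The only fixed points from the list are the four JSON fields and the rule that urgency=high goes to Slack with an @here mention (written as <!here> in Slack message text).

```typescript
// Category and urgency identifiers are assumed spellings of the values in the brief.
type Category = "bug_report" | "feature_request" | "billing_question" | "general";
type Urgency = "high" | "normal";

// Shape of the JSON the classifier returns.
interface Classification {
  category: Category;
  urgency: Urgency;
  summary: string;
  suggested_action: string;
}

// The two possible outcomes of routing.
type RouteAction =
  | { kind: "slack"; text: string }
  | { kind: "gmail_label"; label: string };

// Hypothetical Gmail label names, one per category.
const LABELS: Record<Category, string> = {
  bug_report: "Triage/Bug report",
  feature_request: "Triage/Feature request",
  billing_question: "Triage/Billing",
  general: "Triage/General",
};

// Urgent mail goes to Slack with an @here mention; everything else gets a label.
function route(c: Classification): RouteAction {
  if (c.urgency === "high") {
    return { kind: "slack", text: `<!here> Urgent (${c.category}): ${c.summary}` };
  }
  return { kind: "gmail_label", label: LABELS[c.category] };
}
```

Keeping the decision separate from the Slack and Gmail calls means it can be unit-tested without touching either API.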

The prompt (the part that matters most)

The agent's success depends 80% on the prompt and 20% on the architecture. Here's the structure of what we'd ship for this customer:

▸System prompt: defines what 'urgent' means SPECIFICALLY for this business (e.g., 'mentions of API downtime, payment failures, data loss, or angry-customer language from a paying customer with $X+ MRR')
▸Examples: 5-10 real anonymized emails from their inbox, labeled with the correct category and urgency. These ground the model — without examples, classifiers drift unpredictably.
▸Output schema: strict JSON with allowed enum values for category and urgency. We validate the LLM output and retry on parse failure (it happens ~1% of the time on Claude, more on smaller models).
▸Reasoning step: we ask the model to explain its classification in one sentence before returning the JSON. This 'chain of thought' improves accuracy and gives the customer something to read when they audit a misclassification.
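
A minimal sketch of the output validation described above, assuming the model writes its one-sentence reasoning first and the JSON object last. The enum spellings and the function name parseClassification are assumptions for illustration. The function returns null on any parse or schema failure, which is the signal for the caller to retry the model call.

```typescript
// Allowed enum values (assumed spellings of the categories in the brief).
const CATEGORIES = ["bug_report", "feature_request", "billing_question", "general"] as const;
const URGENCIES = ["high", "normal"] as const;

type Category = (typeof CATEGORIES)[number];
type Urgency = (typeof URGENCIES)[number];

interface Classification {
  category: Category;
  urgency: Urgency;
  summary: string;
  suggested_action: string;
}

// Extracts and validates the JSON object from raw model output.
// Returns null when the output is unusable, so the caller can retry.
function parseClassification(raw: string): Classification | null {
  // Take the span from the first "{" to the last "}". This assumes the
  // reasoning sentence itself contains no braces.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const fields = data as Record<string, unknown>;

  // Reject anything outside the allowed enum values.
  const category = CATEGORIES.find((c) => c === fields["category"]);
  const urgency = URGENCIES.find((u) => u === fields["urgency"]);
  if (category === undefined || urgency === undefined) return null;

  const summary = fields["summary"];
  const suggestedAction = fields["suggested_action"];
  if (typeof summary !== "string" || typeof suggestedAction !== "string") return null;

  return { category, urgency, summary, suggested_action: suggestedAction };
}
```

Treating an out-of-enum value the same as malformed JSON keeps the retry path to a single branch.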

Model selection

For inbox triage, Claude Haiku 4.5 is the right model. Fast (sub-second), cheap (~$0.001 per email at typical lengths), and accurate enough for classification with good examples in the prompt. Opus is overkill and would cost 30× more at this volume. We'd benchmark both during the build (run the customer's last 500 emails through each, compare against their human labels) but Haiku usually wins on cost-quality for classification tasks.
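
The benchmark comparison can be scored with something like the sketch below, run once per model over the same labeled emails. The type and function names are hypothetical, and counting missed urgent emails separately is our assumption about what matters most for this brief, not a metric stated above.

```typescript
// One benchmark sample: the customer's human label next to a model's prediction.
interface Label {
  category: string;
  urgency: string;
}

interface Sample {
  human: Label;
  predicted: Label;
}

interface BenchmarkResult {
  total: number;
  correct: number;
  accuracy: number; // fraction of samples where category and urgency both match
  missedUrgent: number; // human said "high", model did not
}

// Scores one model's predictions against the human labels.
function scoreModel(samples: Sample[]): BenchmarkResult {
  let correct = 0;
  let missedUrgent = 0;
  for (const { human, predicted } of samples) {
    if (human.category === predicted.category && human.urgency === predicted.urgency) {
      correct++;
    }
    if (human.urgency === "high" && predicted.urgency !== "high") {
      missedUrgent++;
    }
  }
  const total = samples.length;
  return { total, correct, accuracy: total === 0 ? 0 : correct / total, missedUrgent };
}
```

Two models with similar overall accuracy can still differ on missedUrgent, and for a 5-minute routing promise that is the number to look at first.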

The unglamorous parts

What separates a shippable agent from a demo:

▸Retry logic on every external call. The Gmail API rate-limits from time to time, Slack webhooks fail at roughly a 0.1% rate, and the Anthropic API returns occasional 529s. Without retries, you lose ~3% of emails over a month.
▸Idempotency. Pub/Sub delivers at-least-once, so the same notification can arrive twice, and the agent must not double-Slack. We use the Gmail message ID as a dedup key in KV with a 7-day TTL.
▸A kill switch. The customer can disable the agent from a single page in our admin panel without redeploying. When (not if) the prompt produces a bad day of classifications, they can pause and review.
▸Cost ceiling. Hard daily cap on LLM API spend (say, $5/day). If hit, the agent stops calling the LLM and starts dumping everything to a 'needs human review' folder. Better to fail safe than rack up an API bill.
▸Observability. Every classification decision is logged with the email metadata, the LLM response, and the action taken. The customer can query this. We can debug from this. It's the single most underrated part of a shippable build.
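
Three of the items above (retries, idempotency, and the cost ceiling) are small enough to sketch together. This is an illustration under stated assumptions: MemoryStore is an in-memory stand-in for the KV store, the attempt count and base delay are placeholder values, and every class and function name here is hypothetical.

```typescript
// In-memory stand-in for a KV store with TTLs (a real build would use the hosted KV).
class MemoryStore {
  private expiryByKey = new Map<string, number>(); // key -> expiry time in ms

  // Returns true only for the first caller within the TTL window.
  claim(key: string, ttlSeconds: number, nowMs: number): boolean {
    const expiry = this.expiryByKey.get(key);
    if (expiry !== undefined && expiry > nowMs) return false;
    this.expiryByKey.set(key, nowMs + ttlSeconds * 1000);
    return true;
  }
}

const DEDUP_TTL_SECONDS = 7 * 24 * 60 * 60; // the 7-day TTL from the list above

// Idempotency: handle each Gmail message ID at most once per TTL window.
function shouldProcess(store: MemoryStore, gmailMessageId: string, nowMs: number): boolean {
  return store.claim(`dedup:${gmailMessageId}`, DEDUP_TTL_SECONDS, nowMs);
}

// Cost ceiling: track estimated spend per day and fail safe once the cap is hit.
class DailyBudget {
  private spentByDay = new Map<string, number>();
  private capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // "llm" while under the cap, otherwise send the email to human review.
  decide(day: string, estimatedCostUsd: number): "llm" | "needs_human_review" {
    const spent = this.spentByDay.get(day) ?? 0;
    if (spent + estimatedCostUsd > this.capUsd) return "needs_human_review";
    this.spentByDay.set(day, spent + estimatedCostUsd);
    return "llm";
  }
}

// Retry with exponential backoff: waits baseDelayMs, then 2x, 4x, ... between attempts.
// The sleep function is injectable so tests do not have to wait on real timers.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts: number = 4,
  baseDelayMs: number = 250,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Back off before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

In a real deployment the dedup claim has to be a single atomic set-if-absent operation on the shared store. The in-memory version only shows the logic and would not protect two concurrent webhook invocations from each other.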

What days 1-7 look like

▸Day 1: scope confirmation. We ask for 200-500 sample emails (anonymized OK) and the customer's existing urgency criteria.
▸Days 2-3: prompt engineering + benchmark on sample emails. We iterate until accuracy is >90% vs. the customer's labels.
▸Days 4-5: build the architecture, deploy to staging.
▸Day 6: customer dry-runs in staging using forwarded emails for 24 hours.
▸Day 7: deploy to production. We monitor the first 50 classifications live. Adjust prompt if needed.
▸30 days post-ship: weekly check-ins, prompt refinement based on misclassifications.

What you get for $1,500

Source code (Next.js + Vercel deploy), documented prompt with examples, admin panel for kill switch, KV-backed observability log, deployment guide, and 30 days of refinement. You own the code; you can fork it, modify it, host it elsewhere. The LLM API costs (which are billed to your own Anthropic account, not ours) are typically $30-50/month at the volume above.

This is the kind of build we ship every week. If your problem looks like this — clear inputs, clear outputs, a judgment call in the middle that an LLM does better than a regex — brief us. The form on our agents page is the same five fields we use to scope every build.
