This build delivers a support/billing agent you can actually ship. It follows Brain+Hands separation, explicit tool contracts, a deterministic state machine, and secure backend tools. Stack: WordPress (frontend), Python FastAPI (tools + orchestrator), Redis (state/cache), Postgres (KB + logs), vector DB (memory), OpenAI/Groq/Anthropic (LLM).
1) Architecture overview
– Brain (LLM policy): Plans, chooses tools, reasons. No direct DB/API access.
– Hands (tools): Idempotent HTTP endpoints with strict schemas. Observable, rate-limited, auditable.
– Orchestrator: State machine controlling turns, tool calls, retries, and timeouts.
– Memory:
– Short-term: per-session scratchpad (Redis).
– Knowledge: vector search over product docs/FAQ.
– Facts: authoritative store lookups (billing, orders).
– Guardrails: auth, PII redaction, tool allowlist, cost/time budgets, circuit breakers.
– Integration: WordPress plugin sends chat events to orchestrator; streaming tokens back to UI.
2) Contracts first (make tools boring and safe)
Define JSON schemas for every tool. Keep them narrow, idempotent, and testable.
Example tool manifest (slice):
{
  "name": "get_order_status",
  "description": "Return current order state for a given order_id.",
  "method": "POST",
  "url": "https://api.example.com/tools/get_order_status",
  "input_schema": {
    "type": "object",
    "properties": {
      "order_id": {"type": "string", "pattern": "^[A-Z0-9-]{6,}$"}
    },
    "required": ["order_id"],
    "additionalProperties": false
  },
  "output_schema": {
    "type": "object",
    "properties": {
      "status": {"type": "string"},
      "updated_at": {"type": "string", "format": "date-time"}
    },
    "required": ["status"]
  },
  "timeouts_ms": 2500,
  "retries": 1
}
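Before any tool is invoked, the orchestrator can check arguments against the manifest and hand structured errors back to the Brain. A minimal hand-rolled sketch of the checks implied by the schema above (a real service would use a JSON Schema library or Pydantic):

```python
import re

# Mirrors input_schema above: pattern on order_id, required field,
# additionalProperties: false. Hand-rolled for illustration only.
ORDER_ID_RE = re.compile(r"^[A-Z0-9-]{6,}$")

def validate_order_args(args: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the args pass."""
    errors = []
    extra = set(args) - {"order_id"}
    if extra:  # additionalProperties: false
        errors.append(f"unexpected properties: {sorted(extra)}")
    order_id = args.get("order_id")
    if not isinstance(order_id, str):
        errors.append("order_id is required and must be a string")
    elif not ORDER_ID_RE.match(order_id):
        errors.append("order_id must match ^[A-Z0-9-]{6,}$")
    return errors
```

Returning errors rather than raising makes it easy to feed them back to the LLM for a repair pass.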
3) Hands: secure Python FastAPI tools
– Enforce schema at the edge.
– Require JWT with narrow scopes.
– Add rate limits and audit logs.
from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel, Field
import time

app = FastAPI(title="SupportAgentTools")

class OrderReq(BaseModel):
    order_id: str = Field(min_length=6, pattern=r"^[A-Z0-9-]{6,}$")

class OrderResp(BaseModel):
    status: str
    updated_at: str | None = None

def auth(scope: str):
    def _auth(token=Depends(...)):  # your JWT dependency
        if scope not in token.scopes:
            raise HTTPException(403, "forbidden")
        return token.sub
    return _auth

@app.post("/tools/get_order_status", response_model=OrderResp)
def get_order_status(req: OrderReq, _=Depends(auth("order:read"))):
    start = time.time()
    # query the read replica here; log latency against the tool's SLO
    status, updated_at = fetch_order(req.order_id)  # your data-access layer
    audit_log("get_order_status", req.order_id, time.time() - start)
    return OrderResp(status=status, updated_at=updated_at)
Routing rules:
– Order and billing lookups -> the matching tools (get_order_status, get_invoice_pdf).
– General product usage -> retrieve_docs.
– Anything else -> answer concisely or ask a clarifying question.
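Obvious intents can be pre-routed with cheap keyword heuristics before the LLM policy is consulted. A hypothetical sketch (`route` and its keyword lists are illustrative, not part of the stack above):

```python
import re

# Loose order-ID shape, matching the tool schema's character class.
ORDER_ID_RE = re.compile(r"\b[A-Z0-9-]{6,}\b")

def route(message: str) -> str:
    """Cheap pre-router; the LLM policy still makes the final call."""
    text = message.lower()
    if "order" in text or "invoice" in text or ORDER_ID_RE.search(message):
        return "tool"           # order/billing lookup tools
    if any(k in text for k in ("how do i", "feature", "docs", "setup")):
        return "retrieve_docs"  # general product usage
    return "answer"             # answer concisely or ask a clarifying question
```

This keeps trivially routable turns off the planning model entirely, which helps the latency targets later in the post.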
6) Orchestrator: a small, reliable state machine
States:
– RECEIVE -> PLAN -> EXECUTE_TOOL? -> OBSERVE -> RESPOND -> END
Pseudo:
def handle_turn(msg, session_id):
    budget = Budget(tokens=3000, tools=3, wall_ms=8000)
    state = "PLAN"
    memory = load_session(session_id)
    while budget.ok() and state != "END":
        if state == "PLAN":
            action = llm_policy(memory, tool_manifest)
            if action.type == "tool":
                state = "EXECUTE_TOOL"
            else:
                state = "RESPOND"
        elif state == "EXECUTE_TOOL":
            result = call_tool(action.name, action.args,
                               timeout=manifest[action.name].timeouts_ms)
            record_observation(result)
            state = "OBSERVE"
        elif state == "OBSERVE":
            memory.update_with_observation(result)
            if need_more_tools(result):
                state = "PLAN"
            else:
                state = "RESPOND"
        elif state == "RESPOND":
            reply = llm_response(memory)
            emit_stream(reply)
            state = "END"
Controls:
– Max 2 tool calls/turn for latency.
– Per-tool circuit breaker: open it after consecutive failures or sustained latency above 2x p95.
– Token + time budgets enforced per turn.
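The `Budget` object the pseudocode assumes can be tiny. A sketch enforcing the token, tool-call, and wall-clock limits per turn:

```python
import time

class Budget:
    """Per-turn limits; ok() gates every loop iteration in handle_turn."""
    def __init__(self, tokens: int, tools: int, wall_ms: int):
        self.tokens_left = tokens
        self.tools_left = tools
        self.deadline = time.monotonic() + wall_ms / 1000.0

    def charge(self, tokens: int = 0, tools: int = 0) -> None:
        self.tokens_left -= tokens
        self.tools_left -= tools

    def ok(self) -> bool:
        return (self.tokens_left > 0
                and self.tools_left >= 0
                and time.monotonic() < self.deadline)
```

Using `time.monotonic()` for the deadline keeps the wall-clock budget immune to system clock adjustments.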
7) Retrieval that doesn’t hallucinate
– Chunk docs to 300–500 tokens with overlap 50–100.
– Store title, URL, product tags, and last_updated.
– Rerank top 20 -> 5 with a fast cross-encoder or LLM-judge at small context.
– In answers, include citations ("According to <doc title> …") with the source URL.
– Evict stale docs with last_updated TTL checks.
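The chunking rule above (300–500 tokens with 50–100 overlap) is easy to approximate on whitespace tokens. A sketch:

```python
def chunk_words(text: str, size: int = 400, overlap: int = 75) -> list[str]:
    """Split text into ~size-word chunks, each sharing `overlap` words
    with its predecessor (word counts stand in for token counts here)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

A real pipeline would count tokens with the model's tokenizer and prefer splitting on headings and paragraphs, but the sliding-window shape is the same.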
8) Error handling and fallbacks
– Tool error classes: 4xx user-fixable (show guidance), 5xx transient (retry with jitter), timeout (offer manual escalation).
– If tools unavailable, switch to knowledge-only mode and surface a status note to the user.
– Log: request_id, user_hash, tool_calls, latencies, token_usage, model, success_flag.
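For the transient (5xx/timeout) class, a minimal retry helper with exponential backoff and jitter; `TransientError` is a stand-in for whatever retryable exceptions your HTTP client raises:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for retryable failures (5xx, timeouts)."""

def call_with_retry(fn, retries: int = 2, base_ms: int = 200):
    """Run fn, retrying transient failures with exponential backoff + jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == retries:
                raise  # exhausted: caller escalates (manual handoff, etc.)
            time.sleep((base_ms * 2 ** attempt + random.uniform(0, base_ms)) / 1000)
```

The jitter term keeps retries from many sessions from synchronizing against an already struggling backend.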
9) WordPress integration (plugin sketch)
– Shortcode [aiguy_chat] renders chat UI.
– Frontend calls /wp-json/aiguy/v1/chat (nonce protected).
– Server proxy signs JWT to orchestrator and streams chunks back.
PHP (very abbreviated):
add_action('rest_api_init', function () {
    register_rest_route('aiguy/v1', '/chat', [
        'methods'  => 'POST',
        'callback' => 'aiguy_chat',
        // "nonce protected": verify the REST nonce rather than allowing everyone
        'permission_callback' => function (WP_REST_Request $r) {
            return (bool) wp_verify_nonce($r->get_header('X-WP-Nonce'), 'wp_rest');
        },
    ]);
});

function aiguy_chat(WP_REST_Request $r) {
    $jwt  = make_scoped_jwt(['aud' => 'orchestrator', 'scopes' => ['chat:send']]);
    $resp = wp_remote_post('https://agent.example.com/chat', [
        'headers' => ['Authorization' => "Bearer $jwt"],
        'body'    => ['session_id' => get_session_id(), 'message' => $r->get_param('message')],
        'timeout' => 15,
    ]);
    return rest_ensure_response(wp_remote_retrieve_body($resp));
}
10) Models and performance
– Use fast model for planning (e.g., gpt-4o-mini, llama-3.1-70b-instruct via Groq) and a stronger model for final generation when needed.
– Target p95 < 2.5s single-turn with 0–1 tool; < 4.5s with 2 tools.
– Cache retrieval and deterministic tool schemas to reduce tokens.
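Caching retrieval can be as simple as memoizing on the normalized query. A sketch, where `vector_search` is a stand-in for the real index call and `CALLS` exists only to show the cache working:

```python
from functools import lru_cache

CALLS = {"n": 0}  # instrumentation only

def vector_search(query: str) -> list[str]:
    """Stand-in for the real vector index lookup."""
    CALLS["n"] += 1
    return [f"doc-for:{query}"]

@lru_cache(maxsize=2048)
def _cached_search(normalized: str) -> tuple[str, ...]:
    return tuple(vector_search(normalized))  # tuples are hashable/cacheable

def search(query: str) -> tuple[str, ...]:
    # Normalize first so trivially different phrasings share a cache entry.
    return _cached_search(query.strip().lower())
```

An in-process `lru_cache` only helps within one worker; the Redis layer already in the stack is the natural place for a shared version with TTLs.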
11) Security checklist
– Tool allowlist + strict JSON schemas.
– JWT with narrow scopes + rotation.
– PII redaction before logs; encrypt sensitive fields at rest.
– Separate read/write tools; require user confirmation for writes.
– Rate limit per IP/user/session; WAF on tool API.
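The per-IP/user/session limit can be a small token bucket. An in-process sketch (production would back the state with Redis so limits survive restarts and scale-out):

```python
import time

class TokenBucket:
    """Per-key limiter: refill `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.state: dict[str, tuple[float, float]] = {}  # key -> (tokens, ts)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        tokens, last = self.state.get(key, (float(self.capacity), now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.state[key] = (tokens, now)
            return False
        self.state[key] = (tokens - 1.0, now)
        return True
```

Keying on `f"{user_id}:{session_id}"` or the client IP gives each of the checklist's dimensions its own bucket.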
12) Observability
– OpenTelemetry spans: plan, tool call, observe, generate.
– Emit metrics: latency p50/p95, tool error rate, deflection rate, CSAT.
– Log all prompts/responses with redaction; enable replay in a sandbox.
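A minimal redaction pass applied before anything reaches the logs; the patterns are illustrative and would need extending to your actual PII inventory:

```python
import re

# Illustrative patterns only; a real deployment needs a fuller PII inventory.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder before logging."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Typed placeholders (`<email>`, `<card>`) keep redacted logs useful for debugging and replay without storing the raw values.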
13) Deployment
– Dockerize orchestrator + tools. Separate autoscaling for tools with spiky I/O.
– Blue/green deploy; health checks include LLM warmup and tool canary.
– Backpressure: queue requests when model or DB under load; shed load gracefully with a user-facing “email me results” fallback.
14) Minimal smoke test
– Happy path: “Where is my order ABC123?”
– Tool timeout path: inject a 3s delay (past the 2.5s tool timeout); verify graceful message.
– Hallucination guard: ask for unavailable feature; ensure denial with doc citation.
Deliverables you can ship this week
– Tool API (FastAPI) with 3 tools: get_order_status, get_invoice_pdf, retrieve_docs.
– Orchestrator with state machine, budgets, and streaming.
– WordPress plugin with nonce-protected REST route and basic chat UI.
– Vector index with top 20 support articles and citations in answers.
– Dashboards: latency, tool errors, deflection, cost.
This is a fantastic architectural overview; I really appreciate the focus on safety and determinism with the Brain+Hands pattern. How have you found it best to handle cases where the Brain hallucinates a tool call or provides malformed arguments?
One thing I’m curious about: how are you validating and repairing tool calls end-to-end? For example, do you do strict JSON Schema validation at the orchestrator, then run a “repair” pass (or constrained decoding / function calling) when args don’t parse or don’t match types/enums?
Also, what’s your retry strategy—do you re-prompt the Brain with the validation errors, auto-coerce common cases (dates, IDs, booleans), or just reject and ask the user for clarification? Finally, when arguments are still malformed after N attempts, what’s your fallback path (safe human handoff, “no-op + ask a question,” or a deterministic default)?
Excellent breakdown—we follow a similar path of strict JSON schema validation, re-prompting with the error for retries, and a safe human handoff as the final fallback.