This post walks through a production-ready support agent built on a Brain + Hands separation, with WordPress on the front end and Django on the back end. The goals: predictable behavior, fast responses, measurable quality, and an easy handoff to humans.
Use case
– Tier-1 support for order status, returns, product info, and FAQ
– Handoff to human when confidence is low or user requests it
– Works in a WordPress site widget, Slack, and email (shared backend)
Architecture (high level)
– Front-end: WordPress chat widget (vanilla JS) -> Django REST endpoint
– Brain: LLM for reasoning + routing (no direct data access)
– Hands: Tools in Django (Postgres + Redis) exposed via function-calling schemas
– Memory: Short-term thread memory (Redis), long-term knowledge (Postgres + pgvector)
– Orchestrator: Deterministic state machine (Django service + Celery tasks)
– RAG: Product/FAQ index with embeddings; constrained retrieval
– Observability: Request logs, traces, tool latency, outcomes, cost
– Deployment: Docker, Nginx, Gunicorn, Celery, Redis, Postgres
Brain + Hands separation
– Brain (LLM): Planning, deciding which tool to call, assembling final answer. No raw DB/API keys. Receives tool specs only.
– Hands (Tools): Deterministic, side-effect aware, with strict input/output schemas. Tools never “think”—they do.
Core tools (Hands)
– search_kb(query, top_k): RAG over Postgres+pgvector. Returns citations with IDs and source.
– get_order(email|order_id): Reads order status from internal service.
– create_ticket(email, subject, body, priority): Creates support case in helpdesk.
– handoff_human(reason, transcript_excerpt): Flags for live agent queue with context.
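The Brain sees these tools only as function-calling specs. A minimal sketch of what search_kb's spec and argument validation could look like (the envelope keys follow common function-calling APIs and vary by provider; the min/max constraints and `validate_args` helper are assumptions, not from the original):

```python
# Hypothetical function-calling spec for the search_kb tool.
# The "parameters" body is JSON Schema, as used by common
# function-calling APIs; exact envelope keys vary by provider.
SEARCH_KB_SPEC = {
    "name": "search_kb",
    "description": "Retrieve FAQ/product snippets with citations.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 8},
        },
        "required": ["query"],
    },
}

def validate_args(spec, args):
    """Minimal check that required keys are present and typed correctly."""
    props = spec["parameters"]["properties"]
    type_map = {"string": str, "integer": int}
    for key in spec["parameters"].get("required", []):
        if key not in args:
            return False
    return all(
        isinstance(value, type_map[props[key]["type"]])
        for key, value in args.items() if key in props
    )
```

Validating arguments before dispatch is what lets the orchestrator catch a wrong-schema call deterministically instead of trusting the model.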
Tool contracts (JSON schema examples)
– search_kb input: { query: string, top_k: integer }
Orchestrator state machine
– Turn flow: PLAN -> (TOOL_LOOP)* -> DRAFT -> GUARDRAIL -> RESPOND
– TOOL_LOOP is capped at 3 tool calls per turn
– If the Brain calls an unknown tool or violates a schema: correct and retry once; otherwise fall back to handoff_human
– Timeouts: 3s per tool; overall SLA 6s; degrade mode returns a partial answer plus “We’re checking further via email” and opens a ticket
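The tool loop above can be sketched as follows; the LLM client, tool registry, and message shapes are stubbed, and all names are assumptions:

```python
# Sketch of the per-turn orchestrator loop: the Brain either emits a
# tool_call or a final draft, and tool calls are capped per turn.
MAX_TOOL_CALLS = 3

def run_turn(brain, tools, context):
    """PLAN -> (TOOL_LOOP)* -> DRAFT, capped at MAX_TOOL_CALLS tool calls."""
    for _ in range(MAX_TOOL_CALLS):
        step = brain(context)  # one Brain step: a tool_call or a final draft
        if step["type"] != "tool_call":
            return step  # DRAFT reached; guardrails run next
        name = step["name"]
        if name not in tools:
            # Unknown tool: tell the Brain and let it retry within budget
            context.append({"role": "system",
                            "content": f"Unknown tool {name}; available: {sorted(tools)}"})
            continue
        result = tools[name](**step["args"])
        context.append({"role": "tool", "name": name, "content": result})
    # Tool budget exhausted: degrade to a human handoff
    return {"type": "tool_call", "name": "handoff_human",
            "args": {"reason": "tool budget exhausted"}}
```

Keeping the loop deterministic and bounded is what makes latency and cost predictable: the Brain can never spin on tools indefinitely.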
Guardrails
– Content filter: block sensitive/abusive content; offer handoff
– PII sanitizer: mask tokens before vector search
– Citation checker: if answer references kb, verify at least one valid citation is present
– Safety fallback: neutral response + create_ticket when filter trips
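The PII sanitizer can be as simple as a pass of regex substitutions applied before the query is embedded or sent to the vector store. A sketch (the patterns, including the ORD- order-ID format, are illustrative assumptions, not an exhaustive PII detector):

```python
import re

# Sketch of the PII sanitizer: mask emails, card-like numbers, and
# order-like IDs before vector search. Patterns are illustrative.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"), "<CARD>"),
    (re.compile(r"\bORD-\d{6,}\b"), "<ORDER_ID>"),  # assumed order-ID format
]

def sanitize(text: str) -> str:
    """Replace PII-like spans with stable placeholder tokens."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Stable placeholder tokens (rather than deletion) keep the query readable for retrieval while keeping raw identifiers out of the embedding pipeline.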
RAG implementation
– Storage: Postgres with pgvector for embeddings
– Chunking: 512–800 tokens, overlap 80
– Metadata: doc_id, section, source, updated_at, allowed_channels
– Query: Hybrid BM25 + vector; re-rank top 8 to 3
– Response: Return only snippets + URLs; Brain composes final with citations “(See: Title)”
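The chunking step above (fixed-size windows with overlap) can be sketched like this; whitespace tokens stand in for model tokens, whereas a real pipeline would use the embedding model's tokenizer:

```python
# Sketch of the chunker: fixed-size sliding windows with overlap.
# Whitespace splitting approximates tokenization for illustration.
def chunk(text: str, size: int = 512, overlap: int = 80):
    tokens = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        piece = tokens[start:start + size]
        chunks.append(" ".join(piece))
        if start + size >= len(tokens):
            break  # final window already covers the tail
    return chunks
```

The overlap ensures a sentence that straddles a chunk boundary still appears whole in at least one chunk, which matters for retrieval quality.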
Error handling
– Tool failures: exponential backoff (200ms, 400ms); then circuit-break for 60s
– LLM failures: switch to fallback model on timeout; respond with concise generic + ticket
– Data drift: if RAG index empty or stale, disable search_kb and escalate
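The tool-failure policy (retries with exponential backoff, then a circuit that fails fast for a cooldown window) can be sketched as a small wrapper; class and method names are illustrative:

```python
import time

# Sketch of the tool-failure policy: retries with backoff (200 ms,
# 400 ms by default), then the circuit opens for the cooldown window
# and subsequent calls fail fast instead of hammering a dead service.
class CircuitBreaker:
    def __init__(self, cooldown: float = 60.0, backoffs=(0.2, 0.4)):
        self.cooldown = cooldown
        self.backoffs = backoffs
        self.open_until = 0.0  # monotonic timestamp; 0 means closed

    def call(self, fn, *args, **kwargs):
        if time.monotonic() < self.open_until:
            raise RuntimeError("circuit open; failing fast")
        last_err = None
        for delay in (0.0,) + tuple(self.backoffs):
            if delay:
                time.sleep(delay)
            try:
                return fn(*args, **kwargs)
            except Exception as err:
                last_err = err
        # All attempts failed: open the circuit for the cooldown window
        self.open_until = time.monotonic() + self.cooldown
        raise last_err
```

Failing fast while the circuit is open is what keeps the 6s turn SLA intact: the orchestrator gets an immediate error it can route to degrade mode instead of stacking timeouts.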
WordPress integration
– Front-end widget: Minimal JS injects a floating chat; posts to /api/agent/messages with thread_id and a WordPress nonce for CSRF protection
– Auth: Public sessions are rate-limited by IP + device fingerprint; logged-in users attach a JWT issued by WordPress, which Django verifies via a shared secret
– Webhooks: Ticket created -> WordPress admin notice and email; agent takeover -> support Slack channel
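The shared-secret JWT handoff can be sketched on the Django side with only the standard library, assuming HS256 tokens (production code would use a JWT library and also check expiry, which is omitted here):

```python
import base64
import hashlib
import hmac
import json

# Sketch of verifying a WordPress-issued HS256 JWT on the Django side.
# Stdlib-only for illustration; a real JWT library also validates
# registered claims (exp, iss, aud), which this sketch skips.
def b64url_decode(part: str) -> bytes:
    """Base64url with the padding JWTs strip off restored."""
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

def verify_jwt(token: str, secret: bytes):
    header_b64, payload_b64, sig_b64 = token.split(".")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels on the signature check.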
Django endpoints (concise)
– POST /api/agent/messages: { thread_id, user_msg }
– GET /api/agent/thread/{id}: returns last N messages + status
– POST /api/agent/feedback: thumbs_up/down, tags
– Admin: /admin/agent/tools, /admin/agent/kb, /admin/agent/metrics
Celery tasks
– run_brain_step(thread_id)
– execute_tool(call_id)
– rebuild_kb_index()
– nightly_eval() against golden test set
Model selection
– Primary: a function-calling LLM with low latency (e.g., GPT-4o mini or Claude Haiku). Keep token limits reasonable.
– Fallback: cheaper model with same tool schema to maintain compatibility.
– Temperature: 0.2 for tool routing, 0.5 for final drafting.
Cost and latency targets
– P50: 1.4s response (no tools), 2.8s with RAG, 3.5s with order lookup
– P95: <5s
– Cost: X%
Deployment notes
– Docker services: web (Gunicorn), worker (Celery), scheduler (Celery Beat), redis, postgres, nginx
– Readiness probes: tool ping, RAG index freshness, model API status
– Secrets: mounted via Docker secrets; rotate quarterly
– Blue/green deploy: drain workers, warm RAG cache, switch traffic
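The service layout above could be expressed in a compose file roughly like this (image tags, build context, and the `project` module name are placeholders, not from the original):

```yaml
# Illustrative docker-compose layout for the services listed above.
services:
  web:
    build: .
    command: gunicorn project.wsgi --bind 0.0.0.0:8000
    depends_on: [postgres, redis]
  worker:
    build: .
    command: celery -A project worker
    depends_on: [postgres, redis]
  scheduler:
    build: .
    command: celery -A project beat
    depends_on: [redis]
  redis:
    image: redis:7
  postgres:
    image: pgvector/pgvector:pg16   # Postgres with the pgvector extension
  nginx:
    image: nginx:1.25
    ports: ["80:80"]
    depends_on: [web]
```

Web and worker share one image and differ only in their command, which keeps the Brain/Hands code paths identical across processes.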
Minimal data models
– threads(id, user_id, channel, status, created_at)
– messages(id, thread_id, role, content, tool_name?, tool_payload?, created_at)
– kb_docs(id, title, url, text, embedding, updated_at, allowed_channels)
– tickets(id, thread_id, external_id, status, priority, created_at)
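The data models above can be sketched as DDL; SQLite is used here so the sketch is self-contained, while production would use Postgres with `kb_docs.embedding` as a pgvector column rather than a blob:

```python
import sqlite3

# Sketch of the minimal data models as SQLite DDL. Production runs
# Postgres; embedding would be a pgvector column, not a BLOB.
DDL = """
CREATE TABLE threads (
    id INTEGER PRIMARY KEY, user_id INTEGER, channel TEXT,
    status TEXT, created_at TEXT);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY, thread_id INTEGER REFERENCES threads(id),
    role TEXT, content TEXT, tool_name TEXT, tool_payload TEXT,
    created_at TEXT);
CREATE TABLE kb_docs (
    id INTEGER PRIMARY KEY, title TEXT, url TEXT, text TEXT,
    embedding BLOB, updated_at TEXT, allowed_channels TEXT);
CREATE TABLE tickets (
    id INTEGER PRIMARY KEY, thread_id INTEGER REFERENCES threads(id),
    external_id TEXT, status TEXT, priority TEXT, created_at TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

tool_name and tool_payload are nullable on messages, so one table holds user, assistant, and tool-call turns alike.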
Snippet: tool call flow (pseudo)
– User -> /messages
– Orchestrator builds context from Redis + last N messages
– Brain returns tool_call: search_kb
– Celery executes search_kb, stores items
– Brain drafts answer with citations
– Guardrail checks
– Respond; optionally create_ticket if unresolved
Rollout plan
– Phase 1: FAQ-only RAG; no order lookups; human-in-the-loop
– Phase 2: Enable get_order with safe whitelist; add evals
– Phase 3: Enable create_ticket + SLA timers
– Phase 4: Add Slack channel and email ingestion to same backend
What to avoid
– Letting the Brain call HTTP endpoints directly
– Unbounded memory growth in Redis
– RAG over unreviewed or user-generated content
– Returning tool stack traces to users
Repository checklist
– /orchestrator: state machine, guardrails
– /tools: deterministic functions, schemas, tests
– /brain: prompt templates, model client, retries
– /kb: loaders, chunker, embeddings, indexer
– /web: Django views, serializers, auth
– /ops: docker-compose, nginx, CI, eval harness, dashboards
This pattern gives you a predictable, support-ready agent that integrates cleanly with WordPress, scales under load, and stays auditable.