Overview
This tutorial shows how to implement a production-ready Retrieval-Augmented Generation (RAG) API using Django + Postgres (pgvector) and consume it from WordPress via a secure shortcode plugin. We’ll cover schema, ingestion, embeddings, retrieval, generation, auth, caching, and operational hardening.
What you’ll build
– Django service: /api/rag/query for answers with citations
– Postgres + pgvector for semantic search
– Background ingestion + batched embeddings
– OpenAI gpt-4o-mini (or any chat model) for grounded responses
– WordPress plugin: [rag_ask] shortcode with a simple UI, JWT auth, and result caching
Prerequisites
– Python 3.10+, Django 4+
– Postgres 14+ with pgvector extension
– OpenAI API key
– WordPress 6+, admin access
– A domain with HTTPS
1) Provision Postgres with pgvector
Enable extension:
CREATE EXTENSION IF NOT EXISTS vector;
Recommended Postgres settings:
– shared_buffers: 25% RAM
– effective_cache_size: 50–75% RAM
– work_mem: 64–256MB
– maintenance_work_mem: 512MB+
– wal_compression = on
2) Django project setup
mkdir rag_service && cd rag_service
python -m venv .venv && source .venv/bin/activate
pip install django djangorestframework psycopg[binary,pool] pydantic openai==1.* numpy
django-admin startproject core .
python manage.py startapp rag
In core/settings.py
– Add rest_framework, rag
– Configure DATABASES for Postgres
– Set ALLOWED_HOSTS, CSRF_TRUSTED_ORIGINS
– Add a simple JWT secret for signed requests (e.g., RAG_JWT_SECRET in env)
3) Models and migrations
rag/models.py
from django.db import models

class Document(models.Model):
    source_id = models.CharField(max_length=255, unique=True)
    title = models.CharField(max_length=500)
    url = models.URLField(blank=True, null=True)
    created_at = models.DateTimeField(auto_now_add=True)

class Chunk(models.Model):
    document = models.ForeignKey(Document, on_delete=models.CASCADE, related_name="chunks")
    ordinal = models.IntegerField()
    text = models.TextField()
    # Optional raw float32 bytes; in production we write straight to the
    # embedding_vec vector column added in psql below.
    embedding = models.BinaryField(blank=True, null=True)
    created_at = models.DateTimeField(auto_now_add=True)
Run:
python manage.py makemigrations
python manage.py migrate
Create embedding index
In psql:
-- text-embedding-3-large is 3072-dimensional by default, which exceeds ivfflat's
-- 2000-dimension limit; request dimensions=1536 from the API (the code below does),
-- or use text-embedding-3-small, which is natively 1536.
ALTER TABLE rag_chunk ADD COLUMN IF NOT EXISTS embedding_vec vector(1536);
CREATE INDEX IF NOT EXISTS idx_chunk_embedding_vec ON rag_chunk USING ivfflat (embedding_vec vector_cosine_ops) WITH (lists = 100);
Note: in production, write embeddings directly into embedding_vec rather than duplicating them in the binary column; the code below does exactly that.
4) Settings and env
.env
OPENAI_API_KEY=sk-…
RAG_JWT_SECRET=super-long-random
DJANGO_DEBUG=False
Load env in settings or via your process manager.
5) Embedding utilities
rag/embeddings.py
import os
from openai import OpenAI
from django.db import connection

EMBED_MODEL = "text-embedding-3-large"
EMBED_DIM = 1536  # must match vector(1536) and stay under ivfflat's 2000-dim limit
_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def get_embedding(text: str) -> list[float]:
    resp = _client.embeddings.create(model=EMBED_MODEL, input=text, dimensions=EMBED_DIM)
    return resp.data[0].embedding

def insert_chunk_embedding(chunk_id: int, emb: list[float]):
    # pgvector accepts a '[x,y,...]' text literal cast to vector
    literal = "[" + ",".join(map(str, emb)) + "]"
    with connection.cursor() as cur:
        cur.execute("UPDATE rag_chunk SET embedding_vec = %s::vector WHERE id = %s",
                    (literal, chunk_id))
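The one-request-per-chunk get_embedding above is fine for small corpora, but the batched ingestion promised up front needs fewer, larger requests (the OpenAI embeddings endpoint accepts a list of inputs). A minimal sketch of the batching logic, with embed_fn standing in for a call like _client.embeddings.create on a list of texts:

```python
from typing import Callable

def embed_in_batches(texts: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 100) -> list[list[float]]:
    """Embed texts in fixed-size batches; embed_fn takes a list of strings
    and returns one embedding per string, in order."""
    out: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        out.extend(embed_fn(texts[i:i + batch_size]))
    return out
```

Wrap your real API call (returning `[d.embedding for d in resp.data]`) as embed_fn and pass it in; the helper stays testable without network access.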
6) Ingestion and chunking
rag/ingest.py
from .models import Document, Chunk
from .embeddings import get_embedding, insert_chunk_embedding

def chunk_text(text: str, size: int = 1600):
    # naive character chunking (~400 tokens at ~4 chars/token); use tiktoken in production
    for i in range(0, len(text), size):
        yield text[i:i + size]

def ingest_document(source_id: str, title: str, url: str | None, text: str):
    doc, _ = Document.objects.get_or_create(source_id=source_id,
                                            defaults={"title": title, "url": url})
    if doc.chunks.exists():
        return doc  # already ingested
    for idx, piece in enumerate(chunk_text(text)):
        c = Chunk.objects.create(document=doc, ordinal=idx, text=piece)
        insert_chunk_embedding(c.id, get_embedding(piece))
    return doc
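Fixed windows cut sentences at arbitrary boundaries. A common mitigation is overlapping chunks, so text near a cut appears in both neighbors; a sketch (sizes are illustrative, tune to your corpus):

```python
def chunk_with_overlap(text: str, size: int = 1600, overlap: int = 200):
    """Yield chunks of `size` characters, each sharing `overlap` characters
    with its predecessor, so sentences split at a boundary survive in one piece."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    if not text:
        return
    step = size - overlap
    for i in range(0, max(len(text) - overlap, 1), step):
        yield text[i:i + size]
```

Overlap costs extra embedding calls and storage (roughly overlap/size more), which is usually worth it for retrieval quality.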
7) Retrieval
rag/retrieval.py
from django.db import connection

def search_chunks(query_emb: list[float], k: int = 8):
    literal = "[" + ",".join(map(str, query_emb)) + "]"
    with connection.cursor() as cur:
        cur.execute("""
            SELECT c.id, c.text, d.title, d.url,
                   1 - (c.embedding_vec <=> %s::vector) AS score
            FROM rag_chunk c
            JOIN rag_document d ON c.document_id = d.id
            ORDER BY c.embedding_vec <=> %s::vector
            LIMIT %s
        """, (literal, literal, k))
        rows = cur.fetchall()
    return [{"id": r[0], "text": r[1], "title": r[2], "url": r[3], "score": float(r[4])}
            for r in rows]
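For intuition on the score column: `<=>` is pgvector's cosine-distance operator, so `1 - distance` yields cosine similarity in [-1, 1] (1 means identical direction). A pure-Python equivalent of what the database computes:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """What `1 - (embedding_vec <=> query)` computes in SQL:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```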
8) Generation
rag/generate.py
import os
from openai import OpenAI

_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
GEN_MODEL = "gpt-4o-mini"
SYSTEM = "You are a factual assistant. Use the provided context only. Cite sources by title and URL if present."

def answer(query: str, contexts: list[dict]):
    ctx_str = "\n\n---\n\n".join(c["text"] for c in contexts)
    prompt = (
        f"Question: {query}\n\nContext:\n{ctx_str}\n\n"
        "Instructions:\n"
        "- Answer concisely.\n"
        "- If unsure, say you don't know.\n"
        "- Provide 2-4 citations with title and URL if available."
    )
    resp = _client.chat.completions.create(
        model=GEN_MODEL,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content
9) RAG API endpoint
rag/api.py
import os, time, hmac, hashlib, base64, json
from rest_framework.decorators import api_view
from rest_framework.response import Response
from rest_framework import status
from .embeddings import get_embedding
from .retrieval import search_chunks
from .generate import answer

SECRET = os.getenv("RAG_JWT_SECRET", "")

def _b64decode(seg: str) -> bytes:
    # restore base64url padding before decoding
    return base64.urlsafe_b64decode(seg + "=" * (-len(seg) % 4))

def verify_token(token: str) -> bool:
    # token = b64url(header).b64url(payload).b64url(signature); plain HMAC-SHA256 for the demo
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode()
        expected = hmac.new(SECRET.encode(), signing_input, hashlib.sha256).digest()
        return hmac.compare_digest(_b64decode(sig_b64), expected)
    except Exception:
        return False

def parse_payload(token: str) -> dict:
    return json.loads(_b64decode(token.split(".")[1]).decode())
@api_view(["POST"])
def rag_query(request):
    try:
        token = request.headers.get("Authorization", "").removeprefix("Bearer ")
        if not token or not verify_token(token):
            return Response({"error": "unauthorized"}, status=status.HTTP_401_UNAUTHORIZED)
        payload = parse_payload(token)
        if payload.get("exp", 0) < time.time():
            return Response({"error": "token expired"}, status=status.HTTP_401_UNAUTHORIZED)
        # Optional: also enforce the issuing domain or a nonce from the payload
        q = request.data.get("q", "").strip()
        if not q:
            return Response({"error": "missing q"}, status=status.HTTP_400_BAD_REQUEST)
        q_emb = get_embedding(q)
        hits = search_chunks(q_emb, k=8)
        # keep the top 4 contexts, deduplicated by document URL or title
        seen, contexts = set(), []
        for h in hits:
            key = h["url"] or h["title"]
            if key in seen:
                continue
            seen.add(key)
            contexts.append(h)
            if len(contexts) >= 4:
                break
        content = answer(q, contexts)
        citations = [{"title": c["title"], "url": c["url"], "score": c["score"]}
                     for c in contexts]
        return Response({"answer": content, "citations": citations})
    except Exception:
        return Response({"error": "server_error"}, status=500)
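To exercise verify_token locally before the WordPress side exists, the same token shape can be minted in Python, mirroring the PHP signer in section 12 (the secret must equal RAG_JWT_SECRET; the iss value here is a stand-in):

```python
import base64, hashlib, hmac, json, time

def mint_token(secret: str, ttl: int = 60) -> str:
    """Build an HS256-style token compatible with verify_token above."""
    def seg(obj) -> str:
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).decode().rstrip("=")
    now = int(time.time())
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"iss": "https://example.com", "iat": now, "exp": now + ttl}
    signing_input = f"{seg(header)}.{seg(payload)}"
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + base64.urlsafe_b64encode(sig).decode().rstrip("=")
```

Pass the result as the Bearer header when curling /api/rag/query against a dev server.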
core/urls.py
from django.urls import path
from rag.api import rag_query

urlpatterns = [path("api/rag/query", rag_query)]
10) Basic rate limiting and timeouts
– Put the Django app behind a reverse proxy (nginx) with:
– proxy_read_timeout 30s
– limit_req zone=rag burst=10 nodelay
– Use gunicorn with workers ≈ 2 × cores + 1, timeout = 60
– Consider django-ratelimit if needed.
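If you want app-level limiting without django-ratelimit, a minimal in-process token bucket is enough to start (a sketch: per-process only, so back it with Redis for multi-worker deployments; the injectable clock is just for testability):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Keep one bucket per client key (e.g. token iss claim) in a dict, and return 429 when allow() is False.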
11) Caching embeddings and answers
– Cache per (q normalized) for 5–30 minutes in Redis.
– Cache retrieval hits keyed by embedding hash for 1–5 minutes during spikes.
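Cache keys should normalize the query first so trivial variations hit the same entry; one simple normalizer (a sketch, the "rag:answer:" prefix is our convention):

```python
import hashlib

def answer_cache_key(q: str) -> str:
    """Collapse whitespace and lowercase before hashing, so 'What is X?'
    and '  what   is x?  ' share one cache entry."""
    normalized = " ".join(q.lower().split())
    return "rag:answer:" + hashlib.sha256(normalized.encode()).hexdigest()
```

Use the result as the Redis key for the answer-plus-citations JSON, with a 5-30 minute TTL as above.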
12) WordPress plugin (shortcode client)
Create wp-content/plugins/rag-client/rag-client.php
<?php
/**
 * Plugin Name: RAG Client
 * Description: [rag_ask] shortcode that queries a remote RAG API with short-lived HMAC tokens.
 */

add_shortcode('rag_ask', function ($atts) {
    $atts = shortcode_atts([
        'endpoint' => esc_url(get_option('rag_client_endpoint', '')),
    ], $atts);
    // AJAX action name 'rag_client_token' is this plugin's own choice (hooked below)
    $ajax = esc_url(admin_url('admin-ajax.php?action=rag_client_token'));
    ob_start(); ?>
    <div class="rag-box">
      <form class="rag-form">
        <input type="text" name="q" placeholder="Ask a question..." required/>
        <button type="submit">Ask</button>
      </form>
      <pre class="rag-result"></pre>
    </div>
    <script>
    (function(){
      const box = document.currentScript.previousElementSibling;
      const form = box.querySelector('.rag-form');
      const out = box.querySelector('.rag-result');
      async function signPayload() {
        // Token is minted server-side so the HMAC secret never reaches the browser.
        const r = await fetch('<?php echo $ajax; ?>', {credentials: 'same-origin'});
        if (!r.ok) throw new Error('token');
        return r.text();
      }
      form.addEventListener('submit', async (e) => {
        e.preventDefault();
        const q = new FormData(form).get('q');
        out.textContent = 'Thinking...';
        try {
          const token = await signPayload();
          const r = await fetch('<?php echo esc_url($atts['endpoint']); ?>', {
            method: 'POST',
            headers: {
              'Content-Type': 'application/json',
              'Authorization': 'Bearer ' + token
            },
            body: JSON.stringify({q})
          });
          if (!r.ok) { out.textContent = 'Error. Try again.'; return; }
          const data = await r.json();
          const cites = (data.citations || [])
            .map(c => `- ${c.title}${c.url ? ' (' + c.url + ')' : ''}`).join('\n');
          out.textContent = data.answer + '\n\nSources:\n' + cites;
        } catch (err) { out.textContent = 'Network error.'; }
      });
    })();
    </script>
    <?php return ob_get_clean();
});

// Settings page: Settings > RAG Client, stores the remote endpoint URL.
add_action('admin_menu', function () {
    add_options_page('RAG Client', 'RAG Client', 'manage_options', 'rag-client', function () { ?>
      <div class="wrap"><h1>RAG Client</h1>
      <form method="post" action="options.php">
        <?php settings_fields('rag_client'); ?>
        <input type="url" name="rag_client_endpoint"
               value="<?php echo esc_attr(get_option('rag_client_endpoint', '')); ?>"
               style="width:420px;" required/>
        <p>Example: https://api.example.com/api/rag/query</p>
        <?php submit_button(); ?>
      </form></div>
    <?php });
});
add_action('admin_init', function () {
    register_setting('rag_client', 'rag_client_endpoint');
});

// Token endpoint: mints a 60-second HMAC token for the JS above.
add_action('wp_ajax_rag_client_token', 'rag_client_token');
add_action('wp_ajax_nopriv_rag_client_token', 'rag_client_token');
function rag_client_token() {
    $payload = ['iss' => site_url(), 'iat' => time(), 'exp' => time() + 60];
    $header = ['alg' => 'HS256', 'typ' => 'JWT'];
    $seg = function ($x) { return rtrim(strtr(base64_encode(json_encode($x)), '+/', '-_'), '='); };
    $signing_input = $seg($header) . '.' . $seg($payload);
    $sig = hash_hmac('sha256', $signing_input, get_option('rag_client_local_hmac', 'local-demo'), true);
    header('Content-Type: text/plain');
    echo $signing_input . '.' . rtrim(strtr(base64_encode($sig), '+/', '-_'), '=');
    wp_die();
}
Security notes:
– Do not hardcode secrets in JS; the token is minted server-side.
– The signing key in rag_client_local_hmac must equal RAG_JWT_SECRET on the Django service, or verification will fail.
– Store it in wp-config.php and copy it into the option once:
define('RAG_CLIENT_LOCAL_HMAC', 'long-random');
update_option('rag_client_local_hmac', RAG_CLIENT_LOCAL_HMAC);
Usage in posts/pages
[rag_ask]
13) Hardening and ops
– HTTPS end-to-end. Block Django endpoint to only accept requests from your WP origin(s) via firewall or middleware.
– Add CORS: allow only your domain.
– Set timeouts: embedding 10s, completion 20–30s.
– Retries: exponential backoff (max 2) on 429/5xx.
– Observability: log query latency, retrieval hits, tokens used, cache hit rate.
– Data retention: do not log raw user questions if sensitive.
– Backups: nightly Postgres base + WAL archiving.
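The retry guidance above can be sketched as a small helper (illustrative: RetryableError stands in for whatever exception your HTTP client raises on 429/5xx, and the injectable sleep is just for testability):

```python
import time

class RetryableError(Exception):
    """Stand-in for a 429/5xx response from the upstream API."""

def with_retries(fn, max_retries: int = 2, base_delay: float = 0.5,
                 sleep=time.sleep):
    """Call fn(); on RetryableError, retry up to max_retries times with
    exponential backoff (base_delay, 2*base_delay, ...), then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
```

Wrap the embedding and completion calls in with_retries so transient upstream failures do not surface as 500s.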
14) Quick ingestion example
Create a management command:
mkdir -p rag/management/commands
touch rag/management/__init__.py rag/management/commands/__init__.py
rag/management/commands/ingest_demo.py
from django.core.management.base import BaseCommand
from rag.ingest import ingest_document

SAMPLE = """Your docs or policy text here…"""

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        ingest_document("sample-001", "Sample Docs", "https://example.com/docs", SAMPLE)
        self.stdout.write(self.style.SUCCESS("Ingested"))
Run:
python manage.py ingest_demo
15) Simple load test
– Insert 10–50 docs, 200–1,000 chunks.
– Run hey:
hey -n 200 -c 10 -m POST -H "Authorization: Bearer $TOKEN" -D body.json https://api.example.com/api/rag/query
body.json:
{"q":"What is covered by our policy?"}
Expect p95 < 2.5s with warmed caches and IVFFLAT.
16) Cost and performance tips
– Use text-embedding-3-small if quality is acceptable, or shrink the dimensions parameter, for lower memory use and faster ANN search.
– Pre-filter by metadata (doc type, section).
– Cache answers. Deduplicate contexts.
– Tune IVFFLAT lists (64–256) and per-query probes (e.g. SET LOCAL ivfflat.probes = 10 inside a transaction; values of 5–20 are typical).
That’s it. You now have a deployable RAG API with a clean WordPress integration.