Reverse Proxy Architecture for Local AI Services

Series 7 — Part 6 of 6

Nginx sits in front of all the AI microservices as a reverse proxy. It terminates TLS, routes each request to the correct backend service, checks an API key at the proxy layer, and applies a separate rate limit to each service. This article walks through the Nginx configuration for a self-hosted AI stack.

Routing to AI Services

upstream kokoro_tts  { server 127.0.0.1:9010; }
upstream whisper_stt { server 127.0.0.1:9011; }
upstream audio_conv  { server 127.0.0.1:9012; }
upstream ollama_llm  { server 127.0.0.1:11434; }
upstream chromadb    { server 127.0.0.1:8000; }

server {
    listen 443 ssl;
    server_name ai.govindpreetsingh.com;

    ssl_certificate     /etc/letsencrypt/live/ai.govindpreetsingh.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.govindpreetsingh.com/privkey.pem;

    # Internal API key auth at proxy layer
    location /tts/ {
        auth_request /auth;
        proxy_pass http://kokoro_tts/;
        limit_req zone=tts_zone burst=5 nodelay;
    }

    location /stt/ {
        auth_request /auth;
        proxy_pass http://whisper_stt/;
        limit_req zone=stt_zone burst=3 nodelay;
        client_max_body_size 10M;  # voice notes can exceed the 1m default
    }

    location /llm/ {
        auth_request /auth;
        proxy_pass http://ollama_llm/;
        limit_req zone=llm_zone burst=2 nodelay;
        proxy_read_timeout 120s;  # default is 60s; long generations need headroom
    }

    # Internal auth sub-request handler
    location = /auth {
        internal;
        proxy_pass http://127.0.0.1:8080/internal/auth;
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";
        proxy_set_header X-API-Key $http_x_api_key;
    }
}
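The `/auth` location forwards each request's `X-API-Key` to a service at `127.0.0.1:8080/internal/auth`, which the article doesn't show. Below is a minimal sketch of what that service could look like using only Python's standard library. `auth_request` allows the request when the subrequest returns any 2xx code and passes 401 or 403 back to the client; any other status becomes a 500. The key source (a comma-separated `AI_PROXY_KEYS` environment variable) and the names `check_api_key`, `AuthHandler`, and `serve` are hypothetical, not part of the original stack.

```python
"""Hypothetical API-key checker for Nginx auth_request (a sketch, not the article's service)."""
import hmac
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Assumption: valid keys come from a comma-separated env var with a hypothetical name.
VALID_KEYS = frozenset(
    key.strip()
    for key in os.environ.get("AI_PROXY_KEYS", "").split(",")
    if key.strip()
)


def check_api_key(presented, valid_keys=VALID_KEYS):
    """Return the status auth_request should see: 204 allows, 401 denies."""
    if not presented:
        return 401
    candidate = presented.encode()
    # compare_digest avoids leaking how many leading characters matched.
    if any(hmac.compare_digest(candidate, key.encode()) for key in valid_keys):
        return 204
    return 401  # also the result when no keys are configured (fail closed)


class AuthHandler(BaseHTTPRequestHandler):
    def _answer(self):
        if self.path != "/internal/auth":
            status = 404  # auth_request turns this into a 500 for the client
        else:
            status = check_api_key(self.headers.get("X-API-Key"))
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    # Answer every common method the same way; the body is never needed
    # because the /auth location sets proxy_pass_request_body off.
    do_GET = do_HEAD = do_POST = do_PUT = do_PATCH = do_DELETE = _answer

    def log_message(self, format, *args):
        pass  # silence per-request access logging


def serve(host="127.0.0.1", port=8080):
    """Listen on loopback only, matching the proxy_pass target in the /auth location."""
    ThreadingHTTPServer((host, port), AuthHandler).serve_forever()
```

Call `serve()` under a process manager such as systemd to run it. Returning 204 keeps the allow response empty, and an empty key set fails closed, so every request gets 401 until keys are configured.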

Rate Limiting Per Service

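http {
    # Shown wrapped in http {} for context: nginx allows only one http block, so add
    # these lines to the existing one (limit_req_zone is only valid in http context).
    # Rate limit zones, keyed per client IP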
    limit_req_zone $binary_remote_addr zone=tts_zone:10m rate=10r/m;
    limit_req_zone $binary_remote_addr zone=stt_zone:10m rate=6r/m;
    limit_req_zone $binary_remote_addr zone=llm_zone:10m rate=4r/m;
    # LLM gets the tightest limit — most resource-intensive
}
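To see what `rate` and `burst=... nodelay` allow in practice, here is a small Python model of the leaky-bucket arithmetic the limit_req module applies per key. It's a simplification under stated assumptions: one client, float seconds instead of Nginx's integer milliseconds, and only the `nodelay` behavior used in the locations above. The class name `LimitReq` is hypothetical.

```python
"""A simplified model of Nginx limit_req with nodelay, for a single client key."""
from typing import Optional


class LimitReq:
    """Leaky bucket: rate_per_min mirrors `rate=Nr/m`, burst mirrors `burst=N`."""

    def __init__(self, rate_per_min: float, burst: int) -> None:
        self.rate_per_sec = rate_per_min / 60.0
        self.burst = burst
        self.excess = 0.0                  # requests currently above the steady rate
        self.last: Optional[float] = None  # time of the last accepted request

    def allow(self, now: float) -> bool:
        """Return True if a request at time `now` (seconds) is accepted."""
        if self.last is None:
            # First request for this key starts with an empty bucket.
            self.last = now
            self.excess = 0.0
            return True
        elapsed = max(0.0, now - self.last)
        # Drain at the configured rate, add this request, never go below zero.
        excess = max(0.0, self.excess - self.rate_per_sec * elapsed + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not change the bucket
        self.excess = excess
        self.last = now
        return True  # with nodelay, accepted requests are forwarded immediately


if __name__ == "__main__":
    tts = LimitReq(rate_per_min=10, burst=5)  # tts_zone with burst=5
    print([tts.allow(0.0) for _ in range(7)])
    # [True, True, True, True, True, True, False]
```

Under this model, `burst=5 nodelay` on the TTS zone lets a client send 6 requests at once (one plus five burst slots) before being rejected, after which one slot frees up every 6 seconds at 10r/m. Rejected requests get a 503 by default; `limit_req_status 429;` changes that if clients should see Too Many Requests instead.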

SSL Termination for Local Services

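The AI microservices (Kokoro, Whisper, etc.) speak plain HTTP with no TLS. Nginx terminates TLS at the proxy and forwards plain HTTP to them over loopback. This is a standard pattern: as long as each backend binds only to 127.0.0.1, the unencrypted hop never leaves the machine, so it offers protection comparable to end-to-end TLS. A backend listening on 0.0.0.0 breaks that assumption, because it becomes reachable from the network and bypasses the proxy's TLS, auth, and rate limits.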
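That argument holds only while every backend listens on loopback. A quick sketch of an audit, assuming you feed it the local-address column from `ss -ltn` (numeric `host:port` or `[ipv6]:port` strings); the function names `host_of` and `non_loopback` are hypothetical.

```python
"""Flag listening sockets that are not bound to loopback (a sketch)."""
import ipaddress
from typing import List


def host_of(addr: str) -> str:
    """Return the host part of 'host:port' or '[ipv6]:port', minus any %iface suffix."""
    if addr.startswith("["):
        host = addr[1:addr.index("]")]
    else:
        host = addr.rpartition(":")[0]
    return host.split("%")[0]


def non_loopback(addrs: List[str]) -> List[str]:
    """Return every address whose host is not a loopback IP."""
    exposed = []
    for addr in addrs:
        try:
            if ipaddress.ip_address(host_of(addr)).is_loopback:
                continue
        except ValueError:
            pass  # wildcards like '*' or hostnames are treated as exposed
        exposed.append(addr)
    return exposed
```

Anything this returns for ports 9010-9012, 11434, or 8000 is a backend that clients could reach without going through Nginx.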

What to Watch For

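proxy_read_timeout for LLM: Ollama can take 30-60 seconds for a long response, and Nginx's default proxy_read_timeout is also 60 seconds, so a slow generation can end in an unexpected 504. Set it explicitly on the LLM location with headroom (for example 120s). The timeout applies between two successive reads from the backend, not to the whole response, so non-streaming requests (where Ollama sends nothing until it finishes) are the ones most likely to hit it.
Client body size for audio upload: WhatsApp voice notes can be 4MB+, while Nginx's default client_max_body_size is 1m; anything larger is rejected with 413 before it reaches the backend. Set client_max_body_size 10M for the STT and converter routes.
Rate limit storage: the 10m in limit_req_zone is the shared memory zone size (10 megabytes), not the rate. The Nginx docs put one megabyte at about 16,000 64-byte states or 8,000 128-byte states, and a $binary_remote_addr state takes 64 bytes on 32-bit platforms and 128 bytes on 64-bit ones. A 10m zone therefore tracks roughly 160,000 client IPs on 32-bit but about 80,000 on a typical 64-bit server.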
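The zone-size arithmetic is easy to check. This sketch divides the zone size by the per-state sizes the limit_req documentation gives for `$binary_remote_addr` keys (64 bytes on 32-bit, 128 bytes on 64-bit platforms). Treat the result as an upper bound, since it ignores the shared-memory allocator's own overhead; `approx_states` is a hypothetical helper name.

```python
"""Rough capacity of a limit_req_zone keyed on $binary_remote_addr."""

# Per-state sizes from the ngx_http_limit_req_module documentation.
STATE_BYTES = {"32-bit": 64, "64-bit": 128}


def approx_states(zone_mb: int, platform: str = "64-bit") -> int:
    """Zone bytes divided by state size; an upper bound that ignores allocator overhead."""
    return zone_mb * 1024 * 1024 // STATE_BYTES[platform]


if __name__ == "__main__":
    for platform in STATE_BYTES:
        print(platform, approx_states(10, platform))
    # 32-bit 163840
    # 64-bit 81920
```

Each zone is allocated separately, so the three 10m zones above reserve 30 MB of shared memory in total.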
