Series 4 — Part 4 of 8

Ollama runs Llama 3.1 locally with an OpenAI-compatible API. For a SaaS platform handling sensitive business conversations, local inference means conversation data never leaves your infrastructure. This article covers setup, integration, prompt engineering, and latency management.

Running Llama 3.1 with Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Start the server manually only if the installer did not already start it as a service
ollama serve &

# Pull the model (8B fits on 16GB RAM; 70B needs ~48GB)
ollama pull llama3.1:8b

# Verify with a health check
curl http://localhost:11434/api/tags

Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions. Point any OpenAI SDK client at http://localhost:11434/v1 with any string as the API key.
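
The curl health check can also be run from application code before routing traffic to the model. The following is a minimal sketch using only the standard library. It assumes the /api/tags response contains a "models" list whose entries carry "name" and "digest" fields; the helper names (installed_models, fetch_tags, model_is_ready) are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used in this article


def installed_models(tags_payload: dict) -> dict[str, str]:
    """Map each installed model name to its digest, from an /api/tags response."""
    return {
        entry["name"]: entry.get("digest", "")
        for entry in tags_payload.get("models", [])
    }


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Call GET /api/tags and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def model_is_ready(model: str, tags_payload: dict) -> bool:
    """True when the requested model tag has been pulled."""
    return model in installed_models(tags_payload)
```

With the server running, model_is_ready("llama3.1:8b", fetch_tags()) reports whether the model has been pulled. Keeping the parsing separate from the HTTP call makes the check easy to unit test without a live server.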

OpenAI-Compatible Integration

from openai import OpenAI

ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores this but the SDK requires it
)

def generate_response(system_prompt: str, messages: list, model: str = "llama3.1:8b") -> str:
    response = ollama_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            *messages,
        ],
        temperature=0.7,
        max_tokens=800,
        stream=False,
        timeout=30.0,  # Never wait longer than this for a local model
    )
    return response.choices[0].message.content or ""
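
When the 30-second limit is hit, the SDK raises an exception rather than returning a value (openai.APITimeoutError in current 1.x releases), so the caller needs a plan for that case. Below is one possible sketch of a wrapper that falls back to a canned reply. The wrapper name, the fallback wording, and the decision to catch a broad Exception are assumptions made for illustration; production code would normally catch the SDK's timeout and connection errors specifically.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)

# Hypothetical wording; adapt to your product's tone.
FALLBACK_REPLY = "Sorry, I can't respond right now. A team member will follow up shortly."


def generate_or_fallback(
    generate: Callable[[str, list], str],
    system_prompt: str,
    messages: list,
    fallback: str = FALLBACK_REPLY,
) -> str:
    """Call `generate`; return `fallback` if it raises or produces empty text."""
    try:
        reply = generate(system_prompt, messages)
    except Exception:  # assumption: narrow this to the SDK's timeout/connection errors
        logger.exception("Local model call failed; using fallback reply")
        return fallback
    return reply.strip() or fallback
```

A call such as generate_or_fallback(generate_response, system_prompt, history) then always returns something the bot can send, even when the local model is overloaded.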

Prompt Engineering for Sales Contexts

Llama 3.1 follows instructions well but needs explicit constraints for sales contexts:

  • Role clarity — State the persona in the first sentence. Llama's default "helpful assistant" persona will bleed through if not overridden explicitly.
  • Constraint exhaustion — List what the bot should NOT do: do not offer discounts, do not make promises about features not in the knowledge base, do not schedule appointments without availability data.
  • Output format — If you need structured output (JSON for CRM field extraction), include a RESPONSE FORMAT section in the system prompt with an example.
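
The three guidelines above can be combined into a small prompt builder. This is only a sketch: the function name, the persona, the section labels other than RESPONSE FORMAT, and the CRM field names are hypothetical and should be replaced with your own.

```python
import json


def build_sales_system_prompt(
    persona: str, prohibitions: list[str], example_output: dict
) -> str:
    """Assemble a system prompt: persona first, then constraints, then output format."""
    constraint_lines = "\n".join(f"- Do not {item}." for item in prohibitions)
    return (
        f"{persona}\n\n"
        "CONSTRAINTS\n"
        f"{constraint_lines}\n\n"
        "RESPONSE FORMAT\n"
        "Reply with a single JSON object shaped like this example:\n"
        f"{json.dumps(example_output, indent=2)}"
    )


prompt = build_sales_system_prompt(
    # Hypothetical persona, stated in the first sentence of the prompt
    persona="You are Maya, a sales assistant for Acme Scheduling.",
    prohibitions=[
        "offer discounts",
        "make promises about features not in the knowledge base",
        "schedule appointments without availability data",
    ],
    # Hypothetical CRM fields for structured extraction
    example_output={"reply": "Happy to help!", "lead_email": None, "intent": "pricing_question"},
)
```

Keeping the prohibitions in a list makes them easy to review and extend as new failure cases show up in conversation logs.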

Latency Management

Llama 3.1 8B on a Raspberry Pi 5 generates ~10-15 tokens/second. At that rate a 200-token response takes roughly 13-20 seconds, which is too long for a synchronous webhook. Stream tokens as they are generated and hand the request to the Celery task queue (see Celery + Redis Task Queue for AI) so the webhook can return before generation finishes.

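from typing import Iterator

def generate_streaming(system_prompt: str, messages: list, model: str = "llama3.1:8b") -> Iterator[str]:
    """Yield response text piece by piece as the model generates it."""
    stream = ollama_client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt}, *messages],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content or ""
        if delta:
            yield delta  # send to caller progressively

# A generator's return value is not visible to a for loop, so callers that
# also need the complete text should join the pieces themselves:
# full_text = "".join(generate_streaming(system_prompt, messages))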
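
The latency arithmetic at the start of this section can be turned into a quick check for whether a request should leave the synchronous path. This is a rough sketch: it ignores prompt processing and network overhead, and the function names and the 10-second webhook limit are hypothetical.

```python
def estimate_generation_seconds(response_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply, ignoring prompt processing and network overhead."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return response_tokens / tokens_per_second


def should_offload(
    response_tokens: int, tokens_per_second: float, webhook_timeout_s: float
) -> bool:
    """True when the estimated generation time would reach the webhook's time limit."""
    return estimate_generation_seconds(response_tokens, tokens_per_second) >= webhook_timeout_s


slow = estimate_generation_seconds(200, 10)  # 200 tokens at 10 tokens/second
fast = estimate_generation_seconds(200, 15)  # 200 tokens at 15 tokens/second

# With a hypothetical 10-second webhook limit, a 200-token reply should be queued
needs_queue = should_offload(200, 10, webhook_timeout_s=10.0)
```

Measure tokens/second on your own hardware before relying on an estimate like this, since throughput varies with quantization, prompt length, and concurrent load.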

What to Watch For

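  • Model drift on pull — ollama pull always fetches whatever the model tag currently points to, so the weights behind a tag such as llama3.1:8b can change between pulls. In production, pin models to specific digests and verify the digest that /api/tags reports at deploy time.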
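  • Context window limits — Llama 3.1 8B supports a 128k-token context, but practical throughput degrades after ~8k tokens. Keep your combined prompt + history under 8k tokens. Ollama also applies its own default context length (num_ctx), typically well below 128k, so confirm the configured value before relying on long histories.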
  • Resource contention — If Ollama and ChromaDB run on the same machine, an ingestion job during peak traffic will starve inference. Schedule ingestion outside business hours.
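
One way to enforce the 8k budget from the context window item is to drop the oldest messages until the conversation fits. The sketch below assumes roughly 4 characters per token, which is a crude heuristic and not the model's tokenizer; the function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)


def trim_history(system_prompt: str, messages: list[dict], budget: int = 8000) -> list[dict]:
    """Keep the most recent messages that fit the budget alongside the system prompt."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[dict] = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > remaining:
            break  # this message and everything older is dropped
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Passing trim_history(system_prompt, history) instead of the raw history keeps each request within the budget. For tighter control, swap the heuristic for a real tokenizer count.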