Ollama runs Llama 3.1 locally with an OpenAI-compatible API. For a SaaS platform handling sensitive business conversations, local inference means conversation data never leaves your infrastructure. This article covers setup, integration, prompt engineering, and latency management.
Running Llama 3.1 with Ollama
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start the server (skip this if the installer already started Ollama as a background service)
ollama serve &
# Pull the model (8B fits on 16GB RAM; 70B needs ~48GB)
ollama pull llama3.1:8b
# Verify with a health check
curl http://localhost:11434/api/tags
Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions. Point any OpenAI SDK client at http://localhost:11434/v1 with any string as the API key.
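The same health check can be run from application code before any traffic is routed to the model. The sketch below assumes the /api/tags response is a JSON object with a "models" list whose entries carry a "name" field. The helper names `fetch_tags` and `installed_models` are illustrative and not part of any Ollama SDK.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

OLLAMA_URL = "http://localhost:11434"  # default address used in this article


def installed_models(tags_payload: dict) -> List[str]:
    """Extract model names from a decoded /api/tags response body."""
    return [entry.get("name", "") for entry in tags_payload.get("models", [])]


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> Optional[dict]:
    """Return the decoded /api/tags payload, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON
        return None


if __name__ == "__main__":
    payload = fetch_tags()
    if payload is None:
        print("Ollama is not reachable")
    elif "llama3.1:8b" not in installed_models(payload):
        print("Server is up, but llama3.1:8b has not been pulled")
    else:
        print("Ready")
```

Keeping the parsing in a separate function from the network call makes the check easy to unit-test without a running server.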
OpenAI-Compatible Integration
from openai import OpenAI

ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores this but the SDK requires it
)

def generate_response(system_prompt: str, messages: list, model: str = "llama3.1:8b") -> str:
    response = ollama_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            *messages,
        ],
        temperature=0.7,
        max_tokens=800,
        stream=False,
        timeout=30.0,  # Never wait longer than this for a local model
    )
    return response.choices[0].message.content or ""
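Because the whole conversation history is sent on every call, it helps to cap its size before it reaches the model. The following is a minimal sketch of one way to do that. It drops the oldest messages first and uses a rough four-characters-per-token estimate, which is an assumption and not Llama's real tokenizer. `approx_tokens` and `trim_history` are hypothetical helper names, and the 8,000-token default reflects the budget this article recommends.

```python
from typing import Dict, List


def approx_tokens(text: str) -> int:
    """Crude size estimate. Assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(
    system_prompt: str,
    messages: List[Dict[str, str]],
    budget: int = 8000,
) -> List[Dict[str, str]]:
    """Keep the most recent messages that fit in the budget alongside the system prompt."""
    remaining = budget - approx_tokens(system_prompt)
    kept: List[Dict[str, str]] = []
    # Walk backwards so the newest turns are kept and the oldest are dropped
    for msg in reversed(messages):
        cost = approx_tokens(msg["content"])
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept
```

A caller would pass the result in place of the raw history, for example `generate_response(system_prompt, trim_history(system_prompt, history))`. For tighter control, a real tokenizer could replace the character heuristic.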
Prompt Engineering for Sales Contexts
Llama 3.1 follows instructions well but needs explicit constraints for sales contexts:
- Role clarity: State the persona in the first sentence. Llama's default "helpful assistant" persona will bleed through if it is not overridden explicitly.
- Constraint exhaustion: List what the bot should NOT do: do not offer discounts, do not make promises about features not in the knowledge base, do not schedule appointments without availability data.
- Output format: If you need structured output (JSON for CRM field extraction), include a RESPONSE FORMAT section in the system prompt with an example.
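The three guidelines above can be combined in a small prompt builder. This is a sketch under stated assumptions, not a prescribed template: `build_system_prompt` and `parse_crm_fields` are hypothetical helpers, and the example CRM fields are invented for illustration. The parser tolerates extra chatter around the JSON object, since a model may not always return bare JSON.

```python
import json
from typing import List, Optional


def build_system_prompt(persona: str, prohibitions: List[str], example: dict) -> str:
    """Assemble persona, explicit prohibitions, and a RESPONSE FORMAT section."""
    lines = [persona, "", "You must NOT:"]
    lines += [f"- {item}" for item in prohibitions]
    lines += [
        "",
        "RESPONSE FORMAT",
        "Reply with a single JSON object shaped like this example:",
        json.dumps(example, indent=2),
    ]
    return "\n".join(lines)


def parse_crm_fields(raw: str) -> Optional[dict]:
    """Pull the outermost JSON object out of a model reply, or return None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


prompt = build_system_prompt(
    persona="You are a sales assistant for Acme Corp.",  # hypothetical company
    prohibitions=[
        "offer discounts",
        "promise features that are not in the knowledge base",
        "schedule appointments without availability data",
    ],
    example={"lead_name": "Jane Doe", "budget": "unknown"},  # illustrative fields
)
```

Returning `None` on a malformed reply lets the caller decide whether to retry or to fall back to free-text handling.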
Latency Management
Llama 3.1 8B on a Raspberry Pi 5 generates ~10-15 tokens/second. At that rate a 200-token response takes roughly 13-20 seconds, which is too long for a synchronous webhook. Use streaming and the Celery task queue (see Celery + Redis Task Queue for AI) to avoid timing out.
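def generate_streaming(system_prompt, messages, model="llama3.1:8b"):
    """Yield response text piece by piece as the model generates it."""
    stream = ollama_client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt}, *messages],
        stream=True,
    )
    full_text = ""
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices; skip them
            continue
        delta = chunk.choices[0].delta.content or ""
        full_text += delta
        yield delta  # send to caller progressively
    # A generator's return value is only visible as StopIteration.value,
    # so callers looping with `for` should accumulate the deltas themselves.
    return full_text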
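Whether a request needs the task queue at all comes down to simple arithmetic: response length divided by generation speed, compared against the caller's timeout. The sketch below works through that calculation. The 10-second webhook timeout is a hypothetical value chosen for illustration, and the function names are not from any library.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimated wall-clock time to generate a response of the given length."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def needs_background_task(tokens: int, slowest_tps: float, webhook_timeout_s: float) -> bool:
    """True when the worst-case generation time exceeds the webhook's timeout."""
    return generation_seconds(tokens, slowest_tps) > webhook_timeout_s


best_case = generation_seconds(200, 15)   # about 13.3 seconds
worst_case = generation_seconds(200, 10)  # 20.0 seconds
offload = needs_background_task(200, slowest_tps=10, webhook_timeout_s=10.0)  # True
```

Planning against the slowest observed speed, not the average, keeps the estimate conservative.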
What to Watch For
- Model drift on pull: ollama pull always fetches the latest version of the model tag. Pin model names to specific digests in production.
- Context window limits: Llama 3.1 8B has a 128k token context but practical throughput degrades after ~8k. Keep your combined prompt + history under 8k tokens.
- Resource contention: If Ollama and ChromaDB run on the same machine, an ingestion job during peak traffic will starve inference. Schedule ingestion outside business hours.
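One way to detect model drift is to compare the digest Ollama reports for an installed model against a value recorded at deploy time. The sketch below assumes the decoded /api/tags payload lists each model with "name" and "digest" fields. Whether the digest carries a "sha256:" prefix is not assumed, so the prefix is stripped before comparing. The helper names and the digest values in the example are placeholders, not real model digests.

```python
from typing import Optional


def find_digest(tags_payload: dict, model: str) -> Optional[str]:
    """Return the digest reported for `model`, or None if it is not installed."""
    for entry in tags_payload.get("models", []):
        if entry.get("name") == model:
            return entry.get("digest")
    return None


def _normalize(digest: str) -> str:
    """Lowercase and drop an optional 'sha256:' prefix so both forms compare equal."""
    digest = digest.strip().lower()
    prefix = "sha256:"
    return digest[len(prefix):] if digest.startswith(prefix) else digest


def digest_matches(tags_payload: dict, model: str, expected: str) -> bool:
    """True only when the installed model's digest equals the pinned one."""
    actual = find_digest(tags_payload, model)
    return actual is not None and _normalize(actual) == _normalize(expected)


# Placeholder payload standing in for a decoded GET /api/tags response
sample = {"models": [{"name": "llama3.1:8b", "digest": "ABC123"}]}
pinned_ok = digest_matches(sample, "llama3.1:8b", "sha256:abc123")
```

Running such a check at startup, and refusing to serve traffic on a mismatch, turns a silent model change into a visible deployment failure.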