ChromaDB lets you give each tenant its own vector collection, which keeps tenant data isolated at the collection level while preserving fast semantic search. This article covers the ingestion pipeline, query-time retrieval, and how to keep knowledge bases fresh without rebuilding from scratch.
One Collection Per Tenant
ChromaDB supports multiple collections in a single instance. Use one collection per client, named by client slug: rag_acme, rag_legalfirm. Never share a collection across clients.
```python
import chromadb

chroma = chromadb.HttpClient(host="localhost", port=8000)


def get_client_collection(client_slug: str):
    # One collection per tenant, indexed with cosine distance
    return chroma.get_or_create_collection(
        name=f"rag_{client_slug}",
        metadata={"hnsw:space": "cosine"},
    )


# Isolation check: never query without scoping to the client collection
def retrieve(client_slug: str, query: str, top_k: int = 5) -> list[str]:
    col = get_client_collection(client_slug)
    emb = embed(query)  # your embedding model
    res = col.query(query_embeddings=[emb], n_results=top_k)
    # Results are grouped per query embedding; we sent one, so take index 0
    return res["documents"][0] if res["documents"] else []
```
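Because the collection name is built directly from the client slug, it helps to validate the slug before it reaches ChromaDB. The sketch below is one possible guard, not part of the original pipeline: `collection_name` and the pattern it uses are hypothetical, and the pattern is a deliberately conservative assumption (lowercase letters, digits, underscores, and hyphens, at most 50 characters) rather than a copy of ChromaDB's own naming rules. It rejects unexpected input instead of normalizing it, so two different slugs can never map to the same collection.

```python
import re

# Conservative, assumed pattern: starts and ends with a lowercase letter or
# digit, with letters, digits, underscores, or hyphens in between (max 50 chars).
_SLUG_RE = re.compile(r"[a-z0-9](?:[a-z0-9_-]{0,48}[a-z0-9])?")


def collection_name(client_slug: str) -> str:
    """Return the tenant collection name, or raise if the slug looks unsafe."""
    if not _SLUG_RE.fullmatch(client_slug):
        raise ValueError(f"Invalid client slug: {client_slug!r}")
    return f"rag_{client_slug}"
```

With this in place, `get_client_collection` could pass `collection_name(client_slug)` as the `name` argument instead of formatting the string inline.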
Document Ingestion Pipeline
```python
import hashlib

import docx  # provided by the python-docx package
import pypdf


def ingest_document(client_slug: str, file_path: str, source_label: str) -> int:
    # chunk_text, embed, and now_iso are your own helpers
    text = extract_text(file_path)  # PDF or DOCX
    chunks = chunk_text(text, size=400, overlap=50)
    if not chunks:
        return 0  # nothing to store; skip the empty upsert
    col = get_client_collection(client_slug)
    ids, embeddings, metadatas, documents = [], [], [], []
    for i, chunk in enumerate(chunks):
        # Deterministic ID: same source, position, and content give the same ID
        chunk_id = hashlib.sha256(f"{source_label}:{i}:{chunk}".encode()).hexdigest()[:16]
        ids.append(chunk_id)
        embeddings.append(embed(chunk))
        metadatas.append({"source": source_label, "chunk": i, "ingested_at": now_iso()})
        documents.append(chunk)
    # Upsert: safe to re-run on document update
    col.upsert(ids=ids, embeddings=embeddings, metadatas=metadatas, documents=documents)
    return len(chunks)


def extract_text(path: str) -> str:
    lower = path.lower()  # accept .PDF and .DOCX as well
    if lower.endswith(".pdf"):
        reader = pypdf.PdfReader(path)
        # Extract each page once and skip pages with no text layer
        pages = (p.extract_text() for p in reader.pages)
        return "\n".join(t for t in pages if t)
    if lower.endswith(".docx"):
        doc = docx.Document(path)
        return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
    raise ValueError(f"Unsupported format: {path}")
```
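The ingestion code calls `chunk_text` and `now_iso` without defining them. The following is a minimal sketch of what they might look like, assuming that `size` and `overlap` are counted in whitespace-separated words as a rough stand-in for tokens. A real pipeline would likely count tokens with the embedding model's own tokenizer, so treat both functions as illustrative placeholders.

```python
from datetime import datetime, timezone


def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def now_iso() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.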
Incremental Updates
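Use upsert() instead of add(). Because each chunk ID is deterministic (a hash of source label, chunk index, and chunk content), re-ingesting an unchanged document writes the same IDs with the same content and produces no net change apart from a refreshed ingested_at timestamp. A modified chunk behaves differently: its content is part of the hash, so it receives a new ID, and the old chunk stays in the collection until it is deleted. After re-ingesting a changed document, delete any IDs stored for that source that are not in the new ID set. Note that the pipeline as written still re-embeds every chunk on each run; to save embedding cost, skip chunks whose IDs already exist.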
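To make the ID behavior concrete, here is a small sketch that reuses the article's hashing scheme. The helper names `chunk_ids` and `stale_ids` are hypothetical. It shows that unchanged chunks keep their IDs across versions, that an edited chunk gets a new one, and how to work out which stored IDs the latest version no longer produces.

```python
import hashlib


def chunk_ids(source_label: str, chunks: list[str]) -> list[str]:
    """Same scheme as the ingestion code: hash of source, index, and content."""
    return [
        hashlib.sha256(f"{source_label}:{i}:{chunk}".encode()).hexdigest()[:16]
        for i, chunk in enumerate(chunks)
    ]


def stale_ids(stored_ids: list[str], current_ids: list[str]) -> list[str]:
    """IDs stored for a source that the latest version no longer produces."""
    current = set(current_ids)
    return [i for i in stored_ids if i not in current]


# Two versions of the same document; only the middle chunk was edited
v1 = chunk_ids("handbook.pdf", ["intro", "pricing", "support"])
v2 = chunk_ids("handbook.pdf", ["intro", "new pricing", "support"])
orphans = stale_ids(v1, v2)  # contains only the ID of the old "pricing" chunk
```

Against a live collection, `stored_ids` would presumably come from `col.get(where={"source": source_label})["ids"]`, and a non-empty result of `stale_ids` would be passed to `col.delete(ids=...)` after the upsert.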
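To delete a document: fetch all chunks whose source metadata equals source_label with col.get(where={"source": source_label}), collect their IDs, then call col.delete(ids=ids). col.delete also accepts the same where filter directly. For GDPR erasure, the chunks must be removed from the vector store itself; hiding them at query time is not sufficient.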
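The deletion steps can be sketched as a single function. `delete_source` is a hypothetical helper. It assumes `col` behaves like a ChromaDB collection, where `get(where=...)` returns a dict containing an `"ids"` list and `delete(ids=...)` removes those records. The follow-up check is an optional addition, included so that an erasure request can be logged only once nothing remains.

```python
def delete_source(col, source_label: str) -> int:
    """Delete every chunk whose `source` metadata equals source_label.

    Returns the number of chunks that were removed.
    """
    where = {"source": source_label}
    ids = col.get(where=where)["ids"]
    if ids:
        col.delete(ids=ids)
    # Verify the erasure before reporting success
    remaining = col.get(where=where)["ids"]
    if remaining:
        raise RuntimeError(
            f"{len(remaining)} chunks still stored for {source_label!r}"
        )
    return len(ids)
```

A call might look like `delete_source(get_client_collection("acme"), "handbook.pdf")`. Calling it again for the same source returns 0, so the operation is safe to retry.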
What to Watch For
- Chunk size tuning: 400 tokens is a starting point, not a law. Legal documents need larger chunks (800+) to preserve argument context. Product FAQs work with smaller chunks (200).
- Embedding model consistency: Once you embed a collection with one model, you must query it with the same model. Changing the model requires full re-ingestion.
- Top-k relevance floor: If the top retrieved chunk has cosine similarity below 0.6, it is probably noise. ChromaDB reports cosine distance (1 - similarity) rather than similarity, so this corresponds to a distance above 0.4. Treat 0.6 as a starting value that depends on the embedding model, and add a similarity floor before including chunks in the prompt.
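The relevance floor from the last point can be sketched as a small filter over a query result. `filter_by_similarity` is a hypothetical helper. It assumes the result came from `col.query(..., include=["documents", "distances"])` on a collection created with `{"hnsw:space": "cosine"}`, so that similarity can be recovered as 1 - distance. The 0.6 default is the article's starting value, not a tuned threshold.

```python
def filter_by_similarity(res: dict, min_similarity: float = 0.6) -> list[str]:
    """Keep documents for the first query whose cosine similarity meets the floor."""
    documents = res.get("documents") or [[]]
    distances = res.get("distances") or [[]]
    kept = []
    for doc, dist in zip(documents[0], distances[0]):
        similarity = 1.0 - dist  # cosine distance back to cosine similarity
        if similarity >= min_similarity:
            kept.append(doc)
    return kept


# Example result shaped like a single-query response (values are made up)
example = {
    "documents": [["refund policy", "shipping times", "unrelated footer"]],
    "distances": [[0.1, 0.35, 0.7]],
}
relevant = filter_by_similarity(example)  # drops "unrelated footer"
```

If the filter returns an empty list, it is usually better to tell the model that no relevant context was found than to pad the prompt with low-similarity chunks.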