Series 4 — Part 3 of 8

ChromaDB gives each tenant their own vector collection — strict data isolation with fast semantic search. This article covers the ingestion pipeline, query-time retrieval, and how to keep knowledge bases fresh without rebuilding from scratch.

One Collection Per Tenant

ChromaDB supports multiple collections in a single instance. Use one collection per client, named by client slug: rag_acme, rag_legalfirm. Never share a collection across clients.

import chromadb

chroma = chromadb.HttpClient(host="localhost", port=8000)

def get_client_collection(client_slug: str):
    return chroma.get_or_create_collection(
        name=f"rag_{client_slug}",
        metadata={"hnsw:space": "cosine"}
    )

# Isolation check: never query without scoping to the client collection
def retrieve(client_slug: str, query: str, top_k: int = 5) -> list[str]:
    col  = get_client_collection(client_slug)
    emb  = embed(query)  # your embedding model
    res  = col.query(query_embeddings=[emb], n_results=top_k)
    return res['documents'][0] if res['documents'] else []
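
Because the collection name is built directly from the client slug, it can help to validate the slug before it reaches ChromaDB. The sketch below is one way to do that. The helper name collection_name and the slug pattern are assumptions made for illustration: the pattern is deliberately conservative, and the naming rules ChromaDB actually enforces depend on the version you run.

```python
import re

# Assumed, conservative rule (not ChromaDB's official one): lowercase letters,
# digits and hyphens, alphanumeric at both ends, at most 59 characters so that
# "rag_" + slug stays within 63 characters.
_SLUG_RE = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,57}[a-z0-9])?$")

def collection_name(client_slug: str) -> str:
    """Return the per-tenant collection name, rejecting malformed slugs."""
    if not _SLUG_RE.fullmatch(client_slug):
        raise ValueError(f"Invalid client slug: {client_slug!r}")
    return f"rag_{client_slug}"
```

get_client_collection could call this helper instead of formatting the name inline, so a malformed or user-supplied slug fails loudly rather than creating or opening an unexpected collection.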

Document Ingestion Pipeline

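import hashlib

import docx   # provided by the python-docx package
import pypdf

# embed(), chunk_text() and now_iso() are helpers you supply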

def ingest_document(client_slug: str, file_path: str, source_label: str):
    text   = extract_text(file_path)           # PDF or DOCX
    chunks = chunk_text(text, size=400, overlap=50)
    col    = get_client_collection(client_slug)

    ids, embeddings, metadatas, documents = [], [], [], []
    for i, chunk in enumerate(chunks):
        chunk_id = hashlib.sha256(f"{source_label}:{i}:{chunk}".encode()).hexdigest()[:16]
        ids.append(chunk_id)
        embeddings.append(embed(chunk))
        metadatas.append({"source": source_label, "chunk": i, "ingested_at": now_iso()})
        documents.append(chunk)

    # Upsert: re-running on unchanged content rewrites the same IDs.
    # Edited chunks hash to new IDs, so their old versions need a separate delete.
    col.upsert(ids=ids, embeddings=embeddings, metadatas=metadatas, documents=documents)
    return len(chunks)

def extract_text(path: str) -> str:
    lower = path.lower()  # also match .PDF / .DOCX
    if lower.endswith('.pdf'):
        reader = pypdf.PdfReader(path)
        # Extract each page once; skip pages with no extractable text (e.g. scans)
        pages = (p.extract_text() for p in reader.pages)
        return "\n".join(t for t in pages if t)
    if lower.endswith('.docx'):
        doc = docx.Document(path)
        return "\n".join(p.text for p in doc.paragraphs if p.text.strip())
    raise ValueError(f"Unsupported format: {path}")
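
The pipeline above calls chunk_text without defining it. A minimal sketch of a sliding-window chunker follows. It is an assumption, not the series' own implementation: it counts whitespace-separated words as a rough stand-in for tokens, so swap in your model's tokenizer if you need exact token budgets.

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window. Words approximate tokens here (an assumption)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window has reached the end of the text, so the tail
        # is not emitted again as a tiny duplicate chunk.
        if start + size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of storing some words twice.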

Incremental Updates

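Use upsert() instead of add(). Each chunk ID is derived deterministically from the source label, the chunk index, and the chunk text, so re-ingesting an unchanged document writes the same IDs again and leaves the stored chunks as they were (only the ingested_at metadata is refreshed). A modified chunk, however, hashes to a new ID: upsert stores it as a new record and the previous version stays in the collection. After re-ingesting an edited document, delete any IDs stored under that source that are not in the new ID list.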
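
Since the chunk ID includes a hash of the chunk text, an edited chunk gets a new ID and its old version is not overwritten. One way to clean up is sketched below. The function name prune_stale_chunks is hypothetical, and the sketch assumes col is a ChromaDB collection whose chunks carry a "source" metadata field as in the ingestion code.

```python
def prune_stale_chunks(col, source_label: str, current_ids: list[str]) -> list[str]:
    """Delete chunks stored under `source_label` whose IDs are not in
    `current_ids` (the IDs produced by the latest ingestion run).
    Returns the IDs that were removed."""
    # get() with a metadata filter returns every matching record's ID
    existing = col.get(where={"source": source_label})["ids"]
    stale = sorted(set(existing) - set(current_ids))
    if stale:  # avoid calling delete() with an empty ID list
        col.delete(ids=stale)
    return stale
```

Calling this right after the upsert in ingest_document, with the ids list built in that function, would leave exactly one version of each chunk per source.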

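To delete a document, fetch every chunk whose source metadata equals source_label with col.get(where={"source": source_label}), collect the returned IDs, then call col.delete(ids=ids). col.delete() also accepts the same where filter directly. For GDPR erasure, removing the original file is not enough: the chunk text and embeddings stored in the collection must be deleted as well.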
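
The deletion steps can be sketched as a small helper. The name delete_document and the post-delete check are assumptions added for illustration. col is expected to be the client's ChromaDB collection, with chunks tagged by a "source" metadata field as in the ingestion code.

```python
def delete_document(col, source_label: str) -> int:
    """Remove every chunk ingested from `source_label`.
    Returns the number of chunks deleted."""
    ids = col.get(where={"source": source_label})["ids"]
    if ids:  # nothing to do if the document was never ingested
        col.delete(ids=ids)
    # Re-check so an erasure request can be confirmed rather than assumed
    remaining = col.get(where={"source": source_label})["ids"]
    if remaining:
        raise RuntimeError(f"{len(remaining)} chunks still present for {source_label!r}")
    return len(ids)
```

Returning the count gives you something concrete to log against the erasure request.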

What to Watch For

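  • Chunk size tuning: 400 tokens is a starting point, not a rule. Legal documents need larger chunks (800+) to preserve argument context. Product FAQs work with smaller chunks (200).
  • Embedding model consistency: Once you embed a collection with one model, you must query it with the same model, because vectors from different models are not comparable. Changing the model requires full re-ingestion.
  • Top-k relevance floor: If the top retrieved chunk has cosine similarity below 0.6, it is probably noise. Add a similarity floor before including chunks in the prompt. Note that ChromaDB returns distances, not similarities; with the cosine space, distance = 1 - similarity, so a 0.6 floor means dropping chunks with distance above 0.4. Treat 0.6 as a starting value and tune it for your embedding model.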
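
The relevance floor from the last bullet can be sketched as a filter over the query result. The function name filter_by_similarity is hypothetical. The sketch assumes the collection uses the cosine space (as configured earlier) and that the result dict includes the distances field, where distance = 1 - cosine similarity.

```python
def filter_by_similarity(res: dict, min_similarity: float = 0.6) -> list[str]:
    """Keep only documents whose cosine similarity to the query meets the floor.
    Expects the result of a single-query col.query() call."""
    docs = (res.get("documents") or [[]])[0]
    dists = (res.get("distances") or [[]])[0]
    # Convert cosine distance back to similarity before comparing
    return [doc for doc, dist in zip(docs, dists) if 1.0 - dist >= min_similarity]
```

In retrieve, this would replace the final return line: pass res through the filter and return its output, accepting that some queries will then yield no context at all.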