decision doc · infra

Memory Stack: GPU Options

Keep qwen3-embedding, get it off the CPU, and finally put that 3090 Ti to work. Four ways to do it, ranked.

  • 85% CPU load
  • 0/37 layers on GPU
  • 49 s p95 latency
  • 23 GB VRAM idle
01

Current architecture

CLI Agents · Claude Code (hooks) · Codex (wrapper) · Gemini CLI (hooks) · OpenCode (plugin)
fires per prompt + per tool call
  ↓
qmem_mcp · MCP server (Python) · embed_texts() sequential loop
  ↓ HTTP /api/embeddings
Ollama · CPU · qwen3-embedding · 85% CPU · 6.1 GB RAM · GPU idle
  ↓
Qdrant · vectors · code_memory, ops, session, raw_events (all suffixed _qwen)
02

The problem

CPU · today · the bottleneck

  • CPU at 85%
  • 0 / 37 layers on GPU
  • 49 s tail latency
  • 1 request per call · no batching
  • load average 133 (15-min)

GPU · RTX 3090 Ti · idle

  • VRAM: 153 MiB / 23 GB (0.7%)
  • Ampere · compute capability 8.6
  • CUDA 13.1 driver
  • 0% utilization · only Xorg/Hyprland on the card

Ollama detects the GPU (found 1 CUDA devices: NVIDIA GeForce RTX 3090 Ti) but then offloads 0 of 37 layers to it: the new --ollama-engine path defaults to num_gpu=0 for this embedder. Combined with the single-request-per-call pattern in ollama_provider.py, every hook stalls on the CPU.

03

Four ways to fix it

A · Fix Ollama via Modelfile
Force num_gpu=99 on load · effort: 5 min

Pros

  • Zero code changes
  • Same stack, same ports
  • Fastest to ship

Cons

  • Still no batching
  • Still HTTP overhead per call
  • Ollama embed path still weakest link
printf 'FROM qwen3-embedding\nPARAMETER num_gpu 99\n' | ollama create qwen3-gpu -f -
B · text-embeddings-inference · Recommended
HF's Rust server · GPU-native · true batching · effort: 30 min

Pros

  • 10–50× throughput
  • Batched /embed endpoint
  • OpenAI-compatible API
  • Purpose-built for this

Cons

  • New service to run
  • New provider (~30 lines)
  • Docker/GPU plumbing
docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:86-latest --model-id Qwen/Qwen3-Embedding-0.6B
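The "new provider (~30 lines)" con is roughly this much code. A minimal sketch using only the stdlib; the embed_texts name, module layout, and port mapping are assumptions rather than the repo's actual interface, while the {"inputs": [...]} request shape is TEI's documented /embed payload:

```python
"""Hypothetical TEI provider sketch for qmem (names assumed, not from the repo)."""
import json
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # assumption: port mapping from the docker run above


def build_payload(texts: list[str]) -> bytes:
    # TEI's /embed endpoint takes {"inputs": [...]} and embeds the whole batch.
    return json.dumps({"inputs": texts}).encode()


def embed_texts(texts: list[str], url: str = TEI_URL) -> list[list[float]]:
    # One HTTP round-trip for the entire batch, not one request per text.
    req = urllib.request.Request(
        url,
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())  # list of vectors, one per input text
```

Swapping this in for the Ollama provider removes both the CPU inference and the per-text HTTP loop in one move.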
C · sentence-transformers in-process
Provider already exists in the repo · effort: 10 min

Pros

  • No extra service
  • Native batching
  • No HTTP hop
  • Code path already wired

Cons

  • Model loaded per qmem process
  • Cold-start on restart
  • Heavier qmem memory footprint
EMBEDDING_BACKEND=sentence-transformers EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B
D · vLLM embed mode
Continuous batching · highest throughput · effort: 1 hr

Pros

  • Best throughput of any option
  • Scales to high QPS
  • OpenAI-compatible

Cons

  • Overkill for this workload
  • Heavy VRAM footprint
  • Most ops complexity
vllm serve Qwen/Qwen3-Embedding-0.6B --task embed
04

Recommendation

pick one
Go with B (TEI), or C (sentence-transformers) if you'd rather not run another service.
Both keep qwen3-embedding, both actually use the 3090 Ti, and both add native batching. Also batch the qmem hook writes: the for text in texts loop at ollama_provider.py:17 is the hidden second bottleneck. Fixing the backend without batching still leaves per-request HTTP overhead on every hook fire.
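The batching fix itself is mechanical. A sketch under stated assumptions: embed_batch stands in for whichever backend call you pick, and the real signature in ollama_provider.py may differ:

```python
"""Sketch of replacing the per-text loop with chunked batch calls (names assumed)."""
from typing import Callable


def chunked(texts: list[str], size: int) -> list[list[str]]:
    # Split the input into batches of at most `size` texts each.
    return [texts[i:i + size] for i in range(0, len(texts), size)]


def embed_texts(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    # One backend call per chunk instead of one HTTP request per text.
    vectors: list[list[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With a GPU backend that batches natively, a hook that writes 30 texts goes from 30 round-trips to 1.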
05

Migration note

dimension lock
Qdrant collections are dimension-locked per model. qwen3-embedding is 1024-dim; if you ever swap to a smaller model you need to rebuild the collections. Good news — your collection names are already suffixed _qwen (code_memory_qwen, ops_memory_qwen, session_memory_qwen, raw_events_qwen), so past-you already planned for this. Keep qwen3, keep the collections, just move the compute.
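To check a collection's locked dimension before and after the move, Qdrant's REST API exposes it directly (assumes Qdrant on its default port 6333; for qwen3-embedding this should report 1024):

```shell
# Read the vector size the collection was created with.
curl -s http://localhost:6333/collections/code_memory_qwen \
  | python3 -c 'import json,sys; c=json.load(sys.stdin); print(c["result"]["config"]["params"]["vectors"]["size"])'
```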
06

Deep dive · why not vLLM?

vLLM isn't bad — it's the wrong tool for this job. It's the best-in-class engine for generative LLMs. Its two headline features (PagedAttention and continuous batching) solve problems embedding workloads don't have. Picking it here is like renting a semi-truck to deliver a pizza — it'll get there, but you're paying for capacity you'll never use.

The two workloads look nothing alike

Generation (what vLLM is for) · LLM chat

  1. Prompt arrives
  2. Forward pass → token 1 (KV cache: 1 entry)
  3. Forward pass → token 2 (KV cache: 2 entries)
  … hundreds of tokens, KV cache growing each step …
  N. Done, seconds later

Embedding (your workload) · qmem hooks

  1. Batch of texts arrives
  2. One forward pass
  3. Vectors out, milliseconds later

No iteration. No KV cache. Nothing to schedule around.

What's inside vLLM — and what you don't need

vLLM pipeline · embed mode · 7 stages

  1. Request queue: accepts incoming HTTP requests
  2. Scheduler: rebuilds the batch every forward-pass step (unused for embed)
  3. PagedAttention allocator: manages fixed-size KV-cache pages like OS virtual memory (unused for embed)
  4. KV cache pool: pre-reserved GPU memory for generation sequences (unused for embed)
  5. Continuous batcher: handles variable-length generation without head-of-line blocking (unused for embed)
  6. Forward pass: custom CUDA kernels · the one stage you actually need
  7. Vectors out: pooled embeddings returned

Stages 2–5 exist to solve generation problems — scheduling around growing KV caches, batching across requests with different output lengths, avoiding head-of-line blocking when one request emits 10 tokens and another emits 1000. For a single-pass embedding, all four carry cost with zero payoff.

TEI does what you need — nothing more

TEI pipeline · 4 stages

  1. Batch queue: accumulates requests for a short window (10–100 ms)
  2. Padding batcher: groups by sequence length, pads to a common size
  3. Forward pass: Rust-native, hand-tuned kernels for embedding workloads
  4. Vectors out: OpenAI-compatible response

VRAM footprint at idle · 24 GB card

  TEI                    ~2 GB
  sentence-transformers  ~3 GB
  Ollama (fixed)         ~3.5 GB
  vLLM (tuned)           ~4 GB
  vLLM (default)         ~21 GB reserved

vLLM defaults to --gpu-memory-utilization=0.9, pre-allocating a KV pool you don't need. You can shrink it, but then you've configured away the thing that makes vLLM fast.
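If you do run vLLM anyway, the reservation is tunable. The flag is real; the 0.15 fraction here is an illustrative guess for an embed-only 0.6B model, not a benchmarked value:

```shell
# Cap the pre-allocated pool so embed-only vLLM doesn't reserve ~21 GB.
vllm serve Qwen/Qwen3-Embedding-0.6B --task embed --gpu-memory-utilization 0.15
```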

Your actual request rate · from the ollama logs

embedding requests · past hour · bursty pattern
  peak ≈ 2 req/s · avg ≈ 0.3 req/s · long idle gaps

vLLM earns its keep at 100+ concurrent streams. You have 4 CLI agents firing occasional hooks — TEI with a 50ms batch window saturates the 3090 Ti well under your peak rate.

When vLLM IS the right call

Scenario                                vLLM      TEI        ST
1–5 req/s bursty · small model (you)    overkill  ideal      fine
100+ concurrent streams                 yes       yes        no
1000+ req/s sustained                   yes       tune       no
Multiple models on one GPU              yes       one        one
Multi-GPU tensor parallelism            yes       no         no
Already running vLLM for LLMs           yes       extra svc  extra svc
Fastest to ship · fewest moving parts   no        mid        yes
verdict
You can use vLLM. You shouldn't. It solves problems you don't have and reserves VRAM you could give back to something else. TEI is purpose-built for exactly this workload, ships with true batching, and leaves the GPU free for you to run real generative models alongside it. If you ever hit ~100 concurrent requests or start hosting multiple embedding models on the same GPU, revisit vLLM — by then it becomes the right tool.