The Qmem File  · Ollama → TEI
№ 04-17 / 2026
A companion piece — the full resource accounting, with before/after visuals for every metric.
The Ledger →
An Incident Post-Mortem

Ollama → TEI.

From eighty-five percent CPU and a fifteen-minute load average of one hundred thirty-three — down to twenty-four milliseconds on the GPU.

At a glance
14 / 14
Tasks shipped, every one of them green, nothing deferred.
24 ms
Single-embed latency — warm, round-trip from Python to GPU and back.
4 / 4
Agents verified — Claude Code, Codex, Gemini CLI, and OpenCode.
8 – 2000×
Speedup, bounded by warm best-case and saturated real-world load.

The qmem memory stack — the part of the gaming-PC rig that lets four different CLI agents share a single Qdrant-backed vector memory — had been quietly melting the CPU for days. Ollama was detecting the 3090 Ti on startup, announcing the find in its logs, and then politely deciding to offload zero of thirty-seven layers to it. Every hook fire went to eight CPU threads instead. Load average one-thirty-three. The GPU sat at one-hundred-and-fifty-three mebibytes — less than a modest browser tab. This is the log of how it got fixed, in one session, with every before-and-after captured.

§ I

The journey, in five phases

i

Diagnose the smoke.

Triaged system usage — Ollama was pegging three cores at two-hundred-eighty-four percent CPU, the fifteen-minute load average had clocked in at one-hundred-thirty-three, and the Ollama production log showed p95 embedding latency at forty-nine seconds.

top · nvidia-smi · journalctl
ii

Find the root cause.

Ollama's newer --ollama-engine detected the 3090 Ti but set GPULayers:[] and NumThreads:8, offloading zero of thirty-seven layers to GPU. On top of that, the qmem provider looped per-text in a for text in texts HTTP loop — no batching, even if the GPU had been doing the work.

ollama logs · ollama_provider.py:17
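The second bug is purely a matter of request shape, and the eventual fix is the same change in reverse. A minimal sketch of the two shapes, with a counting stub standing in for the real HTTP call (the names here are illustrative, not qmem's actual internals):

```python
from typing import Callable

Post = Callable[[list[str]], list[list[float]]]

def embed_per_text(texts: list[str], post: Post) -> list[list[float]]:
    # The old shape: one HTTP round-trip per text.
    vectors: list[list[float]] = []
    for text in texts:
        vectors.extend(post([text]))
    return vectors

def embed_batched(texts: list[str], post: Post) -> list[list[float]]:
    # The new shape: the whole batch in a single request.
    return post(texts)

# Counting stub in place of a live embedding server.
calls = {"n": 0}
def fake_post(batch: list[str]) -> list[list[float]]:
    calls["n"] += 1
    return [[0.0, 0.0] for _ in batch]

texts = [f"text {i}" for i in range(16)]
embed_per_text(texts, fake_post)
loop_requests = calls["n"]    # sixteen round-trips

calls["n"] = 0
embed_batched(texts, fake_post)
batch_requests = calls["n"]   # one round-trip
```

Sixteen texts cost sixteen round-trips in the old shape and one in the new, before any GPU work is even counted.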
iii

Pick the right tool.

Compared four options — keep Ollama with a Modelfile override, switch to TEI, load sentence-transformers in-process, or reach for vLLM. Chose TEI. It is purpose-built for embedding workloads, native to the GPU, supports true batching, and exposes an OpenAI-compatible endpoint.

decision record · option B
iv

Implement, end to end.

Installed nvidia-container-toolkit from Arch's extra repository, wired the Docker nvidia runtime, launched TEI with Qwen3-Embedding-8B, wrote a thirty-line provider, added one branch to the factory, and flipped fifteen configuration sites across four CLI agents and their hook wrappers.

docker · python · bash
v

Prove it works.

Three layers of self-test — direct provider call, Qdrant round-trip via qmem's internals, and a hook-fire from each of the four agents' environments. Every record landed at 4096-dim, every vector came back retrievable, every score above zero-point-eight-nine.

11 tests · all green
§ II

The hot path, before & after

As it was · ollama, on cpu
i · Four CLI agents fire hooks per prompt, per tool call.
ii · qmem_mcp.embed_texts() · for-loop
iii · POST /api/embeddings — one text per call
iv · Ollama · qwen3-embedding · CPU only · 0/37 GPU
v · Qdrant upsert · 4096-dim
As it is · tei, on gpu
i · Four CLI agents fire hooks per prompt, per tool call.
ii · qmem_mcp.embed_texts() · one batched call
iii · POST /embed — the whole batch in a single request
iv · TEI · Qwen3-Embedding-8B · FlashAttention on GPU
v · Qdrant upsert · 4096-dim · unchanged
§ III

The numbers, side by side

Single-embed latency, warm
Ollama
187 ms
TEI
24 ms
−87 % · roughly seven-point-eight times faster
Batch of sixteen texts
Ollama
≈ 3000 ms
TEI
101 ms
−97 % · thirty times faster, because of true batching
System load, fifteen-minute average
Before
133
After
2.3 – 3.8
−98 % · the CPU has rejoined the rest of the system
GPU VRAM utilisation, on a 24 GB card
Before
153 MiB
After
15 GB
× 100 · the silicon is, at last, doing work
P95 under real load, from production Ollama logs
Ollama
49 s
TEI
≈ 30 ms
≈ 1600 × · under CPU-saturated contention, the gap yawns open
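The headline ratios follow directly from the raw numbers in this section, and are easy to sanity-check:

```python
# Before/after pairs from the tables above, in milliseconds.
pairs = {
    "single embed, warm": (187, 24),
    "batch of sixteen": (3000, 101),
    "p95 under real load": (49_000, 30),
}

for name, (before, after) in pairs.items():
    speedup = before / after
    saved = 100 * (1 - after / before)
    print(f"{name}: {speedup:.0f}x faster, {saved:.0f}% less time")
```

The warm single-embed case works out to about 7.8×, the batch case to about 30×, and the saturated p95 case to about 1633× — which is where the "≈ 1600 ×" above comes from.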
§ IV

Everything touched

New
src/qmem_mcp/embedding/tei_provider.py
The only new module. About thirty lines — an httpx AsyncClient that ships the full batch to TEI's /embed endpoint in a single request and captures the vector dimension on first response.
Mod
src/qmem_mcp/embedding/factory.py
One new branch: if backend == "tei": return TeiEmbeddingProvider(...).
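In isolation, that branch looks like this — a sketch only: the stub dataclass stands in for the real provider printed in full in § V, and the factory's existing branches are elided.

```python
from dataclasses import dataclass

@dataclass
class TeiEmbeddingProvider:
    # Stub standing in for the real class shown in § V.
    base_url: str
    model: str

def create_embedding_provider(backend: str, *, tei_base_url: str, model: str):
    # The one new branch; the existing branches (ollama, ...) are elided here.
    if backend == "tei":
        return TeiEmbeddingProvider(base_url=tei_base_url, model=model)
    raise ValueError(f"unknown embedding backend: {backend!r}")

provider = create_embedding_provider(
    "tei",
    tei_base_url="http://localhost:8080",
    model="Qwen/Qwen3-Embedding-8B",
)
```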
Mod
src/qmem_mcp/config.py
A single new settings field — tei_base_url, defaulting to http://localhost:8080, aliased as TEI_BASE_URL.
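The shape of that field, sketched with stdlib tools — the real config.py presumably uses qmem's existing settings machinery; only the name, default, and TEI_BASE_URL alias are from this post.

```python
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    # New field: resolves from the TEI_BASE_URL environment variable,
    # falling back to the documented default.
    tei_base_url: str = field(
        default_factory=lambda: os.environ.get(
            "TEI_BASE_URL", "http://localhost:8080"
        )
    )
```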
Doc
MIGRATION_TEI.md
The written record — before, after, rollback procedure, and the self-test transcript.
Cfg
~/.claude.json · ~/.codex/config.toml · ~/.gemini/settings.json · ~/.config/opencode/opencode.jsonc · ~/.config/opencode/qmem-hook.js
Five MCP-client configuration files, each flipped from ollama to tei, given a new TEI_BASE_URL, and rewritten with the Hugging Face model identifier.
Cfg
/root/qdrant-memory-mcp/scripts/*.sh
Ten hook and wrapper shell scripts patched in-place — every session-start, user-prompt, post-tool, and stop hook, across all four agents.
§ V

The new provider, in full

src / qmem_mcp / embedding / tei_provider.py ~30 loc
from __future__ import annotations

import httpx

from qmem_mcp.embedding.base import EmbeddingProvider


class TeiEmbeddingProvider(EmbeddingProvider):
    """HuggingFace text-embeddings-inference (TEI) provider.

    Uses TEI's native batched /embed endpoint. One HTTP call embeds the whole
    batch on the GPU in a single forward pass — no per-text loop.
    """

    def __init__(self, base_url: str, model: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.model = model
        self._dimensions: int | None = None

    async def embed_texts(self, texts: list[str]) -> list[list[float]]:
        if not texts:
            return []
        async with httpx.AsyncClient(timeout=120) as client:
            response = await client.post(
                f"{self.base_url}/embed",
                json={"inputs": texts, "normalize": True, "truncate": True},
            )
            response.raise_for_status()
            vectors = response.json()
        if self._dimensions is None:
            self._dimensions = len(vectors[0])
        return vectors

    async def embed_query(self, text: str) -> list[float]:
        return (await self.embed_texts([text]))[0]

    def model_name(self) -> str:
        return self.model

    def dimensions(self) -> int:
        return self._dimensions or 0
§ VI

Configuration, flipped in place

Claude Code · MCP
on tei
~/.claude.json
mcpServers.qmem-memory.env
Codex · MCP
on tei
~/.codex/config.toml
[mcp_servers.qmem-memory.env]
Gemini CLI · MCP
on tei
~/.gemini/settings.json
mcpServers.qmem-memory.env
OpenCode · MCP
on tei
~/.config/opencode/opencode.jsonc
mcp.qmem-memory.environment
OpenCode · plugin hook
on tei
~/.config/opencode/qmem-hook.js
ENV const
Hook scripts · shell
on tei
qdrant-memory-mcp/scripts/*.sh
ten files patched
The env diff, applied everywhere
− EMBEDDING_BACKEND = ollama
+ EMBEDDING_BACKEND = tei
− EMBEDDING_MODEL = qwen3-embedding
+ EMBEDDING_MODEL = Qwen/Qwen3-Embedding-8B
+ TEI_BASE_URL = http://localhost:8080
§ VII

Two errata, caught before ship

Erratum · i

A silent dimension-mismatch, waiting.

The copy-paste command suggested earlier used Qwen/Qwen3-Embedding-0.6B, whose vectors are 1024-dim. The existing Qdrant _qwen collections are 4096-dim. Every upsert would have failed with a vector-size mismatch and no new memory would have been stored — without a helpful log line from qmem to tell you why.

Suggested — would have failed
Qwen/Qwen3-Embedding-0.6B · 1024-dim · Qdrant rejects
Shipped — correct match
Qwen/Qwen3-Embedding-8B · 4096-dim · collection unchanged
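A few-line preflight would have caught this whole class of failure before any traffic was flipped. A sketch — the helper and the dimension table are hypothetical; the dimensions themselves are the ones in this erratum:

```python
# Hypothetical preflight guard; dimensions from the erratum above.
MODEL_DIMS = {
    "Qwen/Qwen3-Embedding-0.6B": 1024,
    "Qwen/Qwen3-Embedding-8B": 4096,
}

def check_dims(model: str, collection_dim: int) -> None:
    """Fail loudly, before the flip, instead of silently at upsert time."""
    model_dim = MODEL_DIMS[model]
    if model_dim != collection_dim:
        raise ValueError(
            f"{model} emits {model_dim}-dim vectors; the Qdrant collection "
            f"expects {collection_dim}. Every upsert would be rejected."
        )

check_dims("Qwen/Qwen3-Embedding-8B", 4096)  # the shipped pairing: silent
```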
Erratum · ii

Wrong compute capability tag.

The suggested container tag :89-latest targets Ada Lovelace, compute capability 8.9 — the RTX 40-series. The host GPU is a 3090 Ti: Ampere, CC 8.6. The kernels ship per-arch, so the wrong tag would either fail to start or degrade silently.

Suggested — wrong arch
tei :89-latest · CC 8.9 · Ada Lovelace
Shipped — right arch
tei :86-1.9 · CC 8.6 · Ampere · RTX 3090 Ti
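The arch-to-tag mapping is mechanical enough to encode. A sketch using only the two tags named in this erratum — the function name is illustrative, and the TEI releases page remains the authoritative tag list:

```python
# Compute capability -> TEI image tag, for the two tags discussed above.
TAG_BY_CC = {
    "8.6": "86-1.9",     # Ampere (RTX 30-series, incl. the 3090 Ti)
    "8.9": "89-latest",  # Ada Lovelace (RTX 40-series)
}

def tei_image(cc: str) -> str:
    """Resolve a full image reference, or fail rather than guess an arch."""
    if cc not in TAG_BY_CC:
        raise ValueError(f"no known TEI tag for compute capability {cc}")
    return f"ghcr.io/huggingface/text-embeddings-inference:{TAG_BY_CC[cc]}"
```

Failing on an unknown capability is the point: the kernels ship per-arch, so a guessed tag fails late and confusingly, while this fails early and legibly.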
§ VIII

The self-test ledger

Test | Path under test | Latency | Result
Direct provider · embed_query | TeiEmbeddingProvider | 61 ms | pass
Direct provider · embed_texts × 16 | TeiEmbeddingProvider | 110 ms | pass
Factory resolves from env | Settings → create_embedding_provider | 33 ms | pass
Qdrant round-trip · store | QdrantMemoryStore.store | 70 ms | pass
Qdrant round-trip · find | QdrantMemoryStore.find | 39 ms | pass
MCP layer compatibility | mcp__qmem-memory__* | n/a | pass
claude_code hook env | claude_post_tool_hook.sh | 47 ms | pass
codex hook env | codex_wrapper.sh | 52 ms | pass
gemini_cli hook env | gemini_post_tool_hook.sh | 53 ms | pass
opencode hook env | opencode_wrapper.sh | 53 ms | pass
Cross-agent retrieval · 4 records | semantic search · score > 0.89 | < 50 ms | pass
§ IX

If this must be undone

Every config file was copied aside before the flip. Reverting is a three-step recipe.

  1. Stop the TEI container: docker stop tei-qwen3

  2. Restore the configuration from backup — copy each file back from /root/.qmem-tei-migration-backup/ to its original path (.claude.json, codex config.toml, gemini settings.json, opencode.jsonc, qmem-hook.js), and restore scripts.bak/. into /root/qdrant-memory-mcp/scripts/.

  3. Start a new session in each of the four CLI agents. Fresh processes read the restored env and return to Ollama. Existing long-running sessions were using Ollama throughout — nothing to interrupt.