From eighty-five percent CPU and a fifteen-minute load average of one hundred thirty-three — down to twenty-four milliseconds on the GPU.
The qmem memory stack — the part of the gaming-PC rig that lets four different CLI agents share a single Qdrant-backed vector memory — had been quietly melting the CPU for days. Ollama was detecting the 3090 Ti on startup, announcing the find in its logs, and then politely deciding to offload zero of thirty-seven layers to it. Every hook fire went to eight CPU threads instead. Load average one-thirty-three. The GPU sat at one-hundred-and-fifty-three mebibytes — less than a modest browser tab. This is the log of how it got fixed, in one session, with every before-and-after captured.
Triaged system usage — Ollama was pegging three cores at two-hundred-eighty-four percent CPU, the fifteen-minute load average had climbed to one-hundred-thirty-three, and the Ollama production log showed p95 embedding latency at forty-nine seconds.
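The triage itself is plain command-line forensics. A sketch of the kind of commands behind those numbers (exact invocations are assumptions, guarded so the snippet runs on any box):

```shell
# Triage sketch: load, per-process CPU, and GPU memory in three commands.
uptime                      # 1/5/15-minute load averages: the 133 shows here
top -b -n 1 | head -n 20    # batch mode: ollama pinned near 284% CPU
command -v nvidia-smi >/dev/null && nvidia-smi || true   # ~153 MiB used, no compute processes: GPU idle
```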
Ollama's newer --ollama-engine detected the 3090 Ti but set GPULayers:[] and NumThreads:8, offloading zero of thirty-seven layers to GPU. On top of that, the qmem provider looped per-text in a for text in texts HTTP loop — no batching, even if the GPU had been doing the work.
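To make the per-text loop concrete, here is an illustrative sketch (not qmem's actual code) of the two request shapes; `post` is a stand-in for the HTTP call, returning one embedding per input text:

```python
from typing import Awaitable, Callable

# `post` stands in for an HTTP POST to an embedding endpoint.
Post = Callable[[list[str]], Awaitable[list[list[float]]]]


async def embed_per_text(texts: list[str], post: Post) -> list[list[float]]:
    # Old shape: one round-trip and one forward pass per text (N requests).
    vectors: list[list[float]] = []
    for text in texts:
        vectors.extend(await post([text]))
    return vectors


async def embed_batched(texts: list[str], post: Post) -> list[list[float]]:
    # New shape: one round-trip, one batched forward pass on the GPU.
    return await post(texts)
```

For a batch of sixteen texts, the old shape costs sixteen round-trips and sixteen forward passes; the new one costs exactly one of each.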
Compared four options — keep Ollama with a Modelfile override, switch to TEI, load sentence-transformers in-process, or reach for vLLM. Chose TEI. It is purpose-built for embedding workloads, native to the GPU, supports true batching, and exposes an OpenAI-compatible endpoint.
Installed nvidia-container-toolkit from Arch's extra repository, wired the Docker nvidia runtime, launched TEI with Qwen3-Embedding-8B, wrote a thirty-line provider, added one branch to the factory, and flipped fifteen configuration sites across four CLI agents and their hook wrappers.
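The launch step looks roughly like the following. This is a hedged sketch, not the exact command from the session: the volume name is an assumption, the Ampere tag is inferred from the tag scheme discussed below, and the container name matches the one stopped in the revert recipe. TEI's container listens on port 80 internally.

```shell
# Sketch of the TEI launch (tag and volume are assumptions; verify the
# Ampere-targeted tag against the TEI releases page before pulling).
docker run -d --name tei-qwen3 --gpus all -p 8080:80 \
  -v tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:86-latest \
  --model-id Qwen/Qwen3-Embedding-8B
```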
Three layers of self-test — direct provider call, Qdrant round-trip via qmem's internals, and a hook-fire from each of the four agents' environments. Every record landed at 4096-dim, every vector came back retrievable, every score above zero-point-eight-nine.
Four touch points, then the provider in full:

- The provider itself: an httpx AsyncClient that ships the full batch to TEI's /embed endpoint in a single request and captures the vector dimension on first response.
- The factory: one new branch, `if backend == "tei": return TeiEmbeddingProvider(...)`.
- The settings: a `tei_base_url` field, defaulting to `http://localhost:8080`, aliased as `TEI_BASE_URL`.
- The configuration sites: each flipped from `ollama` to `tei`, given a new `TEI_BASE_URL`, and rewritten with the Hugging Face model identifier.

```python
from __future__ import annotations

import httpx

from qmem_mcp.embedding.base import EmbeddingProvider


class TeiEmbeddingProvider(EmbeddingProvider):
    """HuggingFace text-embeddings-inference (TEI) provider.

    Uses TEI's native batched /embed endpoint. One HTTP call embeds the
    whole batch on the GPU in a single forward pass — no per-text loop.
    """

    def __init__(self, base_url: str, model: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.model = model
        self._dimensions: int | None = None

    async def embed_texts(self, texts: list[str]) -> list[list[float]]:
        if not texts:
            return []
        async with httpx.AsyncClient(timeout=120) as client:
            response = await client.post(
                f"{self.base_url}/embed",
                json={"inputs": texts, "normalize": True, "truncate": True},
            )
            response.raise_for_status()
        vectors = response.json()
        if self._dimensions is None:
            self._dimensions = len(vectors[0])
        return vectors

    async def embed_query(self, text: str) -> list[float]:
        return (await self.embed_texts([text]))[0]

    def model_name(self) -> str:
        return self.model

    def dimensions(self) -> int:
        return self._dimensions or 0
```
The copy-paste command suggested earlier used Qwen/Qwen3-Embedding-0.6B, whose vectors are 1024-dim. The existing Qdrant _qwen collections are 4096-dim. Every upsert would have failed with a vector-size mismatch and no new memory would have been stored — without a helpful log line from qmem to tell you why.
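A pre-flight check would have caught this before any config flip. A minimal stdlib-only sketch (the endpoint and probe text are assumptions from this setup):

```python
import json
from urllib.request import Request, urlopen

TEI_URL = "http://localhost:8080"  # assumed local TEI endpoint
EXPECTED_DIM = 4096                # dimension of the existing _qwen collections


def dims_match(actual: int, expected: int = EXPECTED_DIM) -> bool:
    """True only when the served model matches the collection dimension."""
    return actual == expected


def probe_dimension(base_url: str = TEI_URL) -> int:
    """Embed one probe string and return the vector length TEI produces."""
    req = Request(
        f"{base_url}/embed",
        data=json.dumps({"inputs": ["probe"]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req, timeout=30) as resp:
        return len(json.load(resp)[0])
```

Qwen3-Embedding-0.6B fails this check at 1024 dimensions; the 8B model passes at 4096.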
The suggested container tag :89-latest targets Ada Lovelace, compute capability 8.9 — the RTX 40-series. The host GPU is a 3090 Ti: Ampere, CC 8.6. The kernels ship per-arch, so the wrong tag would either fail to start or degrade silently.
| Test | Path under test | Latency | Result |
|---|---|---|---|
| Direct provider · embed_query | TeiEmbeddingProvider | 61 ms | pass |
| Direct provider · embed_texts × 16 | TeiEmbeddingProvider | 110 ms | pass |
| Factory resolves from env | Settings → create_embedding_provider | 33 ms | pass |
| Qdrant round-trip · store | QdrantMemoryStore.store | 70 ms | pass |
| Qdrant round-trip · find | QdrantMemoryStore.find | 39 ms | pass |
| MCP layer compatibility | mcp__qmem-memory__* | — | pass |
| claude_code hook env | claude_post_tool_hook.sh | 47 ms | pass |
| codex hook env | codex_wrapper.sh | 52 ms | pass |
| gemini_cli hook env | gemini_post_tool_hook.sh | 53 ms | pass |
| opencode hook env | opencode_wrapper.sh | 53 ms | pass |
| Cross-agent retrieval · 4 records | semantic search · score > 0.89 | < 50 ms | pass |
Every config file was copied aside before the flip. Reverting is a three-step recipe.
1. Stop the TEI container: `docker stop tei-qwen3`
2. Restore the configuration from backup — copy each file back from /root/.qmem-tei-migration-backup/ to its original path (.claude.json, codex config.toml, gemini settings.json, opencode.jsonc, qmem-hook.js), and restore scripts.bak/. into /root/qdrant-memory-mcp/scripts/.
3. Start a new session in each of the four CLI agents. Fresh processes read the restored env and return to Ollama. Existing long-running sessions were using Ollama throughout — nothing to interrupt.