Keep qwen3-embedding, get it off the CPU, and finally put that 3090 Ti to work. Four ways to do it, ranked.
Ollama detects the GPU (found 1 CUDA devices: NVIDIA GeForce RTX 3090 Ti) then offloads 0 of 37 layers to it. The new --ollama-engine defaults num_gpu=0 for this embedder. Combined with single-request-per-call in ollama_provider.py, every hook stalls on CPU.
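Worth one sanity check before swapping backends: pass num_gpu explicitly through the embed API and see if the layers move. A minimal sketch — the endpoint and option name are Ollama's REST API; whether the new engine honors num_gpu for this embedder is exactly the open question, and the layer count 37 comes from the log above:

```python
import json
import urllib.request

# Ask Ollama to offload all 37 layers, overriding the num_gpu=0 default.
payload = {
    "model": "qwen3-embedding",
    "input": "probe text",
    "options": {"num_gpu": 37},
}
req = urllib.request.Request(
    "http://localhost:11434/api/embed",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment against a live daemon, then check `ollama ps` for GPU residency:
# with urllib.request.urlopen(req) as r:
#     print(len(json.load(r)["embeddings"][0]))
```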
Both alternatives serve qwen3-embedding, both actually use the 3090 Ti, and both add native batching.

Also batch the qmem hook writes — the for text in texts loop in
ollama_provider.py:17 is the hidden second bottleneck. Fixing the backend without
batching still leaves per-request HTTP overhead on every hook fire.
qwen3-embedding is 1024-dim; if you
ever swap to a smaller model you need to rebuild the collections. Good news — your collection names
are already suffixed _qwen (code_memory_qwen, ops_memory_qwen,
session_memory_qwen, raw_events_qwen), so past-you already planned for this.
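Since the _qwen suffix already encodes the model family, a small guard can make the rebuild requirement explicit rather than a silent corruption. A sketch using the 1024-dim figure from above — the helper names are hypothetical:

```python
EXPECTED_DIM = 1024  # qwen3-embedding output size

def collection_for(base: str, family: str = "qwen") -> str:
    """code_memory -> code_memory_qwen, matching the existing suffix scheme."""
    return f"{base}_{family}"

def check_vector(vec: list[float]) -> list[float]:
    """Fail loudly if a swapped model starts emitting a different dimension."""
    if len(vec) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM}-dim vector, got {len(vec)}")
    return vec
```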
Keep qwen3, keep the collections, just move the compute.
No iteration. No KV cache. Nothing to schedule around.
Stages 2–5 exist to solve generation problems — scheduling around growing KV caches, batching across requests with different output lengths, avoiding head-of-line blocking when one request emits 10 tokens and another emits 1000. For a single-pass embedding, all four carry cost with zero payoff.
vLLM defaults --gpu-memory-utilization=0.9 — pre-allocating a KV pool you don't need. You can shrink it, but then you've configured away the thing that makes vLLM fast.
vLLM earns its keep at 100+ concurrent streams. You have 4 CLI agents firing occasional hooks — TEI with a 50ms batch window saturates the 3090 Ti well under your peak rate.
| Scenario | vLLM | TEI | sentence-transformers |
|---|---|---|---|
| 1–5 req/s bursty · small model (you) | overkill | ideal | fine |
| 100+ concurrent streams | yes | yes | no |
| 1000+ req/s sustained | yes | tune | no |
| Multiple models on one GPU | yes | one | one |
| Multi-GPU tensor parallelism | yes | no | no |
| Already running vLLM for LLMs | yes | extra svc | extra svc |
| Fastest to ship · fewest moving parts | no | mid | yes |
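The table collapses into a small chooser; a sketch with the row conditions as arguments — the thresholds are the article's rough cut-offs, not benchmarks:

```python
def pick_backend(req_per_s: float, concurrent_streams: int,
                 models_on_gpu: int = 1, multi_gpu: bool = False,
                 vllm_already_running: bool = False) -> str:
    # Rows only vLLM satisfies: multi-GPU, multi-model, or it's already there.
    if multi_gpu or models_on_gpu > 1 or vllm_already_running:
        return "vllm"
    # 1000+ req/s sustained: TEI can follow with tuning; vLLM is the default.
    if req_per_s >= 1000:
        return "vllm"
    # 100+ streams: both handle it; TEI stays the lighter service.
    if concurrent_streams >= 100:
        return "tei"
    # Bursty 1-5 req/s, small model: the "(you)" row.
    return "tei or sentence-transformers"
```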
If you ever hit ~100 concurrent requests or start hosting multiple embedding models on
the same GPU, revisit vLLM — by then it becomes the right tool.