decision doc · infra

Memory Stack: GPU Options

Keep qwen3-embedding, get it off the CPU, and finally put that 3090 Ti to work. Four ways to do it, ranked.

  • 85% CPU load
  • 0/37 layers on GPU
  • 49 s p95 latency
  • 23 GB VRAM idle
01

Current architecture

CLI Agents · Claude Code (hooks) · Codex (wrapper) · Gemini CLI (hooks) · OpenCode (plugin)
fires per prompt + per tool call
  ↓
qmem_mcp · MCP server (Python) · embed_texts() sequential loop
  ↓ HTTP /api/embeddings
Ollama · CPU · qwen3-embedding · 85% CPU · 6.1 GB RAM · GPU idle
  ↓
Qdrant · vectors · code_memory, ops, session, raw_events (all suffixed _qwen)
02

The problem

CPU · today · the bottleneck

  • CPU at 85%
  • 0 / 37 layers on GPU
  • 49 s tail latency
  • 1 request per call · no batching
  • load average 133 (15-min)

GPU · RTX 3090 Ti · idle

  • VRAM: 153 MiB / 23 GB (0.7%)
  • Ampere · compute capability 8.6
  • CUDA 13.1 driver
  • 0% utilization · only Xorg/Hyprland on the card

Ollama detects the GPU (found 1 CUDA devices: NVIDIA GeForce RTX 3090 Ti) but then offloads 0 of 37 layers to it: the new --ollama-engine path defaults to num_gpu=0 for this embedder. Combined with the single-request-per-call pattern in ollama_provider.py, every hook stalls on the CPU.

03

Four ways to fix it

A · Fix Ollama via Modelfile
Force num_gpu=99 on load · effort: 5 min

Pros

  • Zero code changes
  • Same stack, same ports
  • Fastest to ship

Cons

  • Still no batching
  • Still HTTP overhead per call
  • Ollama embed path still weakest link
printf 'FROM qwen3-embedding\nPARAMETER num_gpu 99\n' | ollama create qwen3-gpu -f -
B · text-embeddings-inference · Recommended
HF's Rust server · GPU-native · true batching · effort: 30 min

Pros

  • 10–50× throughput
  • Batched /embed endpoint
  • OpenAI-compatible API
  • Purpose-built for this

Cons

  • New service to run
  • New provider (~30 lines)
  • Docker/GPU plumbing
docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:86-latest --model-id Qwen/Qwen3-Embedding-0.6B
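The "new provider (~30 lines)" con is roughly this much code. A minimal sketch using only the stdlib; the embed_texts name, module layout, and port mapping are assumptions rather than the repo's actual interface, while the {"inputs": [...]} request shape is TEI's documented /embed payload:

```python
"""Hypothetical TEI provider sketch for qmem (names assumed, not from the repo)."""
import json
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # assumption: port mapping from the docker run above


def build_payload(texts: list[str]) -> bytes:
    # TEI's /embed endpoint takes {"inputs": [...]} and embeds the whole batch.
    return json.dumps({"inputs": texts}).encode()


def embed_texts(texts: list[str], url: str = TEI_URL) -> list[list[float]]:
    # One HTTP round-trip for the entire batch, not one request per text.
    req = urllib.request.Request(
        url,
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())  # list of vectors, one per input text
```

Swapping this in for the Ollama provider removes both the CPU inference and the per-text HTTP loop in one move.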
C · sentence-transformers in-process
Provider already exists in the repo · effort: 10 min

Pros

  • No extra service
  • Native batching
  • No HTTP hop
  • Code path already wired

Cons

  • Model loaded per qmem process
  • Cold-start on restart
  • Heavier qmem memory footprint
EMBEDDING_BACKEND=sentence-transformers EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B
D · vLLM embed mode
Continuous batching · highest throughput · effort: 1 hr

Pros

  • Best throughput of any option
  • Scales to high QPS
  • OpenAI-compatible

Cons

  • Overkill for this workload
  • Heavy VRAM footprint
  • Most ops complexity
vllm serve Qwen/Qwen3-Embedding-0.6B --task embed
04

Recommendation

pick one
Go with B (TEI), or C (sentence-transformers) if you'd rather not run another service.
Both keep qwen3-embedding, both actually use the 3090 Ti, and both add native batching. Also batch the qmem hook writes: the for text in texts loop at ollama_provider.py:17 is the hidden second bottleneck. Fixing the backend without batching still leaves per-request HTTP overhead on every hook fire.
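The batching fix itself is mechanical. A sketch under stated assumptions: embed_batch stands in for whichever backend call you pick, and the real signature in ollama_provider.py may differ:

```python
"""Sketch of replacing the per-text loop with chunked batch calls (names assumed)."""
from typing import Callable


def chunked(texts: list[str], size: int) -> list[list[str]]:
    # Split the input into batches of at most `size` texts each.
    return [texts[i:i + size] for i in range(0, len(texts), size)]


def embed_texts(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    # One backend call per chunk instead of one HTTP request per text.
    vectors: list[list[float]] = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With a GPU backend that batches natively, a hook that writes 30 texts goes from 30 round-trips to 1.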
05

Migration note

dimension lock
Qdrant collections are dimension-locked per model. qwen3-embedding is 1024-dim; if you ever swap to a smaller model you need to rebuild the collections. Good news — your collection names are already suffixed _qwen (code_memory_qwen, ops_memory_qwen, session_memory_qwen, raw_events_qwen), so past-you already planned for this. Keep qwen3, keep the collections, just move the compute.
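To check a collection's locked dimension before and after the move, Qdrant's REST API exposes it directly (assumes Qdrant on its default port 6333; for qwen3-embedding this should report 1024):

```shell
# Read the vector size the collection was created with.
curl -s http://localhost:6333/collections/code_memory_qwen \
  | python3 -c 'import json,sys; c=json.load(sys.stdin); print(c["result"]["config"]["params"]["vectors"]["size"])'
```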
06

Deep dive · why not vLLM?

vLLM isn't bad — it's the wrong tool for this job. It's the best-in-class engine for generative LLMs. Its two headline features (PagedAttention and continuous batching) solve problems embedding workloads don't have. Picking it here is like renting a semi-truck to deliver a pizza — it'll get there, but you're paying for capacity you'll never use.

The two workloads look nothing alike

Generation (what vLLM is for) · LLM chat

  1. Prompt arrives
  2. Forward pass → token 1 (KV cache: 1 entry)
  3. Forward pass → token 2 (KV cache: 2 entries)
  … hundreds of tokens, KV cache growing each step …
  N. Done, seconds later

Embedding (your workload) · qmem hooks

  1. Batch of texts arrives
  2. One forward pass
  3. Vectors out, milliseconds later

No iteration. No KV cache. Nothing to schedule around.

What's inside vLLM — and what you don't need

vLLM pipeline · embed mode · 7 stages

  1. Request queue: accepts incoming HTTP requests
  2. Scheduler: rebuilds the batch every forward-pass step (unused for embed)
  3. PagedAttention allocator: manages fixed-size KV-cache pages like OS virtual memory (unused for embed)
  4. KV cache pool: pre-reserved GPU memory for generation sequences (unused for embed)
  5. Continuous batcher: handles variable-length generation without head-of-line blocking (unused for embed)
  6. Forward pass: custom CUDA kernels · the one stage you actually need
  7. Vectors out: pooled embeddings returned

Stages 2–5 exist to solve generation problems — scheduling around growing KV caches, batching across requests with different output lengths, avoiding head-of-line blocking when one request emits 10 tokens and another emits 1000. For a single-pass embedding, all four carry cost with zero payoff.

TEI does what you need — nothing more

TEI pipeline · 4 stages

  1. Batch queue: accumulates requests for a short window (10–100 ms)
  2. Padding batcher: groups by sequence length, pads to a common size
  3. Forward pass: Rust-native, hand-tuned kernels for embedding workloads
  4. Vectors out: OpenAI-compatible response

VRAM footprint at idle · 24 GB card

  TEI                    ~2 GB
  sentence-transformers  ~3 GB
  Ollama (fixed)         ~3.5 GB
  vLLM (tuned)           ~4 GB
  vLLM (default)         ~21 GB reserved

vLLM defaults to --gpu-memory-utilization=0.9, pre-allocating a KV pool you don't need. You can shrink it, but then you've configured away the thing that makes vLLM fast.
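If you do run vLLM anyway, the reservation is tunable. The flag is real; the 0.15 fraction here is an illustrative guess for an embed-only 0.6B model, not a benchmarked value:

```shell
# Cap the pre-allocated pool so embed-only vLLM doesn't reserve ~21 GB.
vllm serve Qwen/Qwen3-Embedding-0.6B --task embed --gpu-memory-utilization 0.15
```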

Your actual request rate · from the ollama logs

embedding requests · past hour · bursty pattern
  peak ≈ 2 req/s · avg ≈ 0.3 req/s · long idle gaps

vLLM earns its keep at 100+ concurrent streams. You have 4 CLI agents firing occasional hooks — TEI with a 50ms batch window saturates the 3090 Ti well under your peak rate.

When vLLM IS the right call

Scenario                                vLLM      TEI        ST
1–5 req/s bursty · small model (you)    overkill  ideal      fine
100+ concurrent streams                 yes       yes        no
1000+ req/s sustained                   yes       tune       no
Multiple models on one GPU              yes       one        one
Multi-GPU tensor parallelism            yes       no         no
Already running vLLM for LLMs           yes       extra svc  extra svc
Fastest to ship · fewest moving parts   no        mid        yes
verdict
You can use vLLM. You shouldn't. It solves problems you don't have and reserves VRAM you could give back to something else. TEI is purpose-built for exactly this workload, ships with true batching, and leaves the GPU free for you to run real generative models alongside it. If you ever hit ~100 concurrent requests or start hosting multiple embedding models on the same GPU, revisit vLLM — by then it becomes the right tool.