The Qmem File  ·  The Ledger
A Resource Accounting

What we were losing.

For five hours and seventeen minutes, your machine ran Ollama. Ollama's share of that wall clock, in CPU time: five hours and eleven minutes. A forensic ledger of the cost, the waste, and what TEI returned.

Ollama's CPU burn ratio, final 5h 17m session
98.2%

For every minute the machine was up, Ollama burned fifty-nine seconds of CPU — almost an entire core, continuously, to produce the same embeddings TEI now serves from idle GPU in milliseconds.

Source: journalctl -u ollama at shutdown — Consumed 5h 11min 10s CPU time over 5h 17min 31s wall clock time, 10G memory peak, 1.4G memory swap peak.
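A quick sanity check on that ratio, recomputed from the quoted figures; a back-of-the-envelope sketch, nothing more:

```python
# Recompute the burn ratio from the journalctl accounting quoted above.
cpu_s  = 5 * 3600 + 11 * 60 + 10    # "Consumed 5h 11min 10s CPU time"
wall_s = 5 * 3600 + 17 * 60 + 31    # "over 5h 17min 31s wall clock time"

burn = cpu_s / wall_s
print(f"burn ratio: {burn:.1%}")                               # roughly 98% of wall clock
print(f"CPU seconds per wall-clock minute: {burn * 60:.0f}")   # ≈ 59, as above
```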
§ I

The silicon's share of work

Same hardware, same model, same workload — here is where the cycles went.

CPU utilisation · under load
98.2% · Ollama on CPU
Three cores pegged, load average 133. The 24-thread machine was effectively a doorstop.
GPU VRAM utilisation · earning its keep
65% · TEI on GPU
Ampere tensor cores, a persistent model, a 24 GB card doing what it was bought for.
§ II

The ledger, itemised

Twelve rows. Each one a line on the balance sheet — what we spent, what we save now.

Resource | Before · Ollama/CPU | After · TEI/GPU | Delta
CPU time / wall time (ollama.service, 5h 17m session) | 5h 11min | ≈ 0 s | −98.2% burn
Top CPU pressure (three cores pegged) | 284% | < 1% | ≈ 300× less
System load, 15-min (peak during hook storm) | 133 | 2.3 – 3.8 | machine responsive again
RAM peak (ollama runner resident set) | 10 GB | < 100 MB | 10 GB returned
Swap peak (memory pressure) | 1.4 GB | 0 | no swap contention
GPU VRAM (of 24 GB available) | 153 MiB | 15 GB | ≈ 100× · finally used
GPU utilisation (sampled during load) | 0% | 20 – 65% | asset → earning
Single-embed latency, warm (per one text) | 187 ms | 24 ms | 7.8× faster
Batch of 16 texts (typical hook volume) | ≈ 3 s | 101 ms | 30× faster
p95 under load (from ollama production log) | 49 s | ≈ 30 ms | ≈ 1600× faster
HTTP calls for 16 embeds (the chatty for-loop) | 16 | 1 | 16× fewer
Disk footprint (model blobs, both homes) | 8.8 GB | 0 | 8.8 GB reclaimed
The 24-thread box was quietly running as a single-threaded embedding machine. Everything else — the web servers, the agents, the shells — competed for what was left.
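The delta column for the latency rows is plain division; a quick recomputation from the before/after values in the table, with the dictionary as scaffolding only:

```python
# Recompute the latency speed-ups in the ledger from the before/after measurements (ms).
before_after_ms = {
    "single embed, warm": (187, 24),       # table row: 7.8x faster
    "batch of 16 texts":  (3_000, 101),    # table row: 30x faster
    "p95 under load":     (49_000, 30),    # table row: ~1600x faster
}
for name, (before, after) in before_after_ms.items():
    print(f"{name}: {before / after:.1f}x faster")
```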
§ III

Anatomy of a single hook fire

Sixteen texts arrive at embed_texts(). Here is every step each system takes, with timings.

Ollama · on CPU
seven stages · sixteen calls
i. Agent hook fires. Gather 16 texts into a list. · ~0 ms
ii. Enter the per-text for-loop in ollama_provider.py. · ~0 ms
HTTP POST /api/embeddings · one text at a time · sixteen round-trips (× 16)
iii. Ollama routes to the CPU backend. Zero of 37 layers on GPU; eight AVX2 threads.
iv. Load 4 GB of model weights from RAM, matmul through 37 layers. · 187 ms
v. Pool the hidden states, return one 4096-dim vector.
vi. Repeat for the next text. And the next. And the next. · × 15 more
vii. Upsert 16 vectors to Qdrant. · ~40 ms
Blocking time ≈ 3 000 ms
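For reference, the shape of the old path. This is a minimal sketch of a per-text loop like the one in ollama_provider.py, assuming Ollama's default port and the standard /api/embeddings request body (model plus prompt); it is not the verbatim provider code:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"   # assumed default Ollama port

def embed_texts(texts: list[str], model: str) -> list[list[float]]:
    """The chatty path: one HTTP round-trip and one full CPU forward pass per text."""
    vectors = []
    for text in texts:                                  # 16 texts -> 16 POSTs -> ~16 x 187 ms
        resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": text})
        resp.raise_for_status()
        vectors.append(resp.json()["embedding"])
    return vectors
```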
TEI · on GPU
five stages · one call
i. Agent hook fires. Same 16-text list. · ~0 ms
ii. Single HTTP POST /embed with all 16 inputs in one JSON body. · ~2 ms
iii. TEI's Rust scheduler pads + batches, dispatches to the resident GPU model. · ~5 ms
iv. One forward pass on Ampere tensor cores with FlashAttention · 16 vectors returned. · ~55 ms
v. Upsert 16 vectors to Qdrant. · ~40 ms
Blocking time ≈ 101 ms
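And the new path, sketched under the same caveat: the TEI port, the collection name, and the integer point IDs are placeholders, but the single /embed call carrying every input in one JSON body and the single Qdrant upsert are the structure described above.

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

TEI_URL = "http://localhost:8080/embed"          # assumed host port for the TEI container
qdrant  = QdrantClient(url="http://localhost:6333")

def embed_and_store(texts: list[str], collection: str = "qmem_memories") -> None:
    """The quiet path: one POST for the whole batch, one upsert for all 16 vectors."""
    resp = requests.post(TEI_URL, json={"inputs": texts})   # every text in one JSON body
    resp.raise_for_status()
    vectors = resp.json()                                    # one vector per input text

    points = [
        PointStruct(id=i, vector=vec, payload={"text": text})
        for i, (text, vec) in enumerate(zip(texts, vectors))
    ]
    qdrant.upsert(collection_name=collection, points=points)
```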

Agents used to wait 2.9 seconds longer on every hook fire. Multiply by every prompt, every tool call, every session-start context-fetch — that is the invisible tax we were paying.

§ IV

The shape of the latency curve

Warm best-case to p95 under contention — the distance between them tells the real story.

Ollama · CPU warm 187 ms → p95 49 s
TEI · GPU warm 24 ms → p95 ≈ 30 ms
(latency axis: 10 ms to 50 s, log scale)

Ollama's distribution had a fat tail that swallowed agent sessions whole. Under CPU saturation — exactly when embeddings are most needed — the tail reached forty-nine seconds. TEI's entire distribution fits inside Ollama's best-case warm latency.
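If you want to reproduce the warm numbers yourself, a rough harness looks like this. The URL and payload match TEI's /embed; for Ollama, swap in the per-text /api/embeddings shape. Note that the 49-second p95 came from production logs under full CPU contention, which a quiet benchmark like this will not reproduce on its own:

```python
import statistics
import time
import requests

EMBED_URL = "http://localhost:8080/embed"   # point at whichever backend you are measuring
SAMPLES = 200

def measure(texts: list[str]) -> None:
    requests.post(EMBED_URL, json={"inputs": texts})          # warm-up request, not timed
    latencies_ms = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        requests.post(EMBED_URL, json={"inputs": texts}).raise_for_status()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]        # 95th-percentile cut point
    print(f"median {statistics.median(latencies_ms):.1f} ms · p95 {p95:.1f} ms")

measure(["a single warm-path text"])
```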

§ V

Four dimensions of waste

CPU, memory, disk, and network chatter — each drawn to scale, before against after.

CPU pressure · % of one core
Ollama 284% → TEI < 1%
Ollama held three cores continuously. TEI's HTTP+tokenisation path barely registers.
Host RAM occupied · peak resident
Ollama 10 GB → TEI < 100 MB
The model now lives in VRAM. Host memory is free for the two dozen other services on this box.
Disk footprint · model blobs
Ollama 8.8 GB → TEI 0
Two full copies of the model lived on disk — one at /var/lib/ollama, another at /root/.ollama. Both gone; TEI keeps its weights in its own container cache.
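If you want to confirm what those blob homes hold before deleting them, here is a rough du-style walk over the two paths named above; a convenience sketch, not part of qmem:

```python
from pathlib import Path

def du_gb(root: str) -> float:
    """Sum regular-file sizes under a directory tree, in GB (a rough `du -s` equivalent)."""
    path = Path(root)
    if not path.exists():
        return 0.0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

for blob_home in ("/var/lib/ollama", "/root/.ollama"):
    print(f"{blob_home}: {du_gb(blob_home):.1f} GB")
```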
HTTP round-trips per batch · 16 texts
Ollama 16 → TEI 1
Ollama's /api/embeddings accepts one text at a time. TEI's /embed accepts the whole batch.
§ VI

The throughput ceiling

How many embeddings the hardware can actually deliver, per minute.

embeddings per minute · steady-state
Ollama · one-at-a-time · 187 ms each → ≈ 320 / min
TEI · single calls · 24 ms each → 2 500 / min
TEI · batched (16) · 101 ms per 16, 6.3 ms amortised → ≈ 9 500 / min
On the batched path, TEI delivers thirty times more embeddings per minute than Ollama could have — and still mostly leaves the GPU idle, ready for whatever else you want to run there.
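Those steady-state figures fall straight out of the measured latencies; a quick recomputation:

```python
# Recompute embeddings-per-minute from the measured latencies above.
per_text_cpu_ms  = 187    # Ollama, one text per call
per_text_gpu_ms  = 24     # TEI, one text per call
per_batch_gpu_ms = 101    # TEI, 16 texts per call
batch_size       = 16

print(60_000 / per_text_cpu_ms)                  # ≈ 320 embeds/min
print(60_000 / per_text_gpu_ms)                  # = 2 500 embeds/min
print(60_000 / per_batch_gpu_ms * batch_size)    # ≈ 9 500 embeds/min, 101/16 ≈ 6.3 ms amortised
```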
§ VII

What that meant, in plain terms

Numbers are easy to skim. Here is what they described.

Your 24-thread machine was effectively single-threaded for embedding workloads. Load average hit 133 — the shell's own keystrokes stuttered.
Every agent hook fire could stall up to 49 seconds. Agents couldn't chain tool calls cleanly; long-running sessions silently timed out on memory writes.
The 3090 Ti — your biggest piece of hardware — sat at 0.7% utilisation. 23 gigabytes of VRAM doing nothing while the CPU rendered itself unusable.
For every sixteen-text embed, 16 HTTP round-trips bounced through the kernel. TEI now does it in one.
Two copies of a four-gigabyte model slept on disk at all times — one in /var/lib/ollama, another in /root/.ollama. 8.8 GB reclaimed.
Ollama's process consumed 98.2% of its wall clock in CPU time. It was not using the machine; it was holding it hostage.