For five hours and seventeen minutes, your machine ran Ollama. Ollama's share of that wall-clock, in CPU time: five hours and eleven minutes. A forensic ledger of the cost, the waste, and what TEI returned.
Ollama's CPU burn ratio, final 5h 17m session
98.2%
For every minute the machine was up, Ollama burned fifty-nine seconds of CPU — almost an entire core, continuously, to produce the same embeddings TEI now serves from idle GPU in milliseconds.
Source: journalctl -u ollama at shutdown — Consumed 5h 11min 10s CPU time over 5h 17min 31s wall clock time, 10G memory peak, 1.4G memory swap peak.
§ I
The silicon's share of work
Same hardware, same model, same workload — here is where the cycles went.
CPU utilisation · under load
98.2% · Ollama · CPU
Three cores pegged, load average 133. The 24-thread machine was effectively a doorstop.
GPU VRAM utilisation · earning its keep
65% · TEI · GPU
Ampere tensor cores, persistent model, 24 GB card doing what it was bought for.
§ II
The ledger, itemised
Twelve rows. Each one a line on the balance sheet — what we spent, what we save now.
| Resource | Before · Ollama/CPU | After · TEI/GPU | Delta |
|---|---|---|---|
| CPU time / wall time (ollama.service, 5h 17m session) | 5h 11min | ≈ 0s | −98.2% burn |
| Top CPU pressure (three cores pegged) | 284% | < 1% | ≈ 300× less |
| System load · 15-min (peak during hook storm) | 133 | 2.3 – 3.8 | machine responsive again |
| RAM peak (ollama runner resident set) | 10 GB | < 100 MB | 10 GB returned |
| Swap peak (memory pressure) | 1.4 GB | 0 | no swap contention |
| GPU VRAM (of 24 GB available) | 153 MiB | 15 GB | 100× · finally used |
| GPU utilisation (sampled during load) | 0% | 20 – 65% | asset → earning |
| Single-embed latency, warm (per one text) | 187 ms | 24 ms | 7.8× faster |
| Batch of 16 texts (typical hook volume) | ≈ 3 s | 101 ms | 30× faster |
| p95 under load (from ollama production log) | 49 s | ≈ 30 ms | ≈ 1600× faster |
| HTTP calls for 16 embeds (the chatty for-loop) | 16 | 1 | 16× fewer |
| Disk footprint (model blobs, both homes) | 8.8 GB | 0 | 8.8 GB reclaimed |
The 24-thread box was quietly running as a single-threaded embedding machine. Everything else — the web servers, the agents, the shells — competed for what was left.
§ III
Anatomy of a single hook fire
Sixteen texts arrive at embed_texts(). Here is every step each system takes, with timings.
Ollama · on CPU
seven stages · sixteen calls
i · Agent hook fires. Gather 16 texts into a list. · ~0 ms
ii · Enter the per-text for-loop in ollama_provider.py. · ~0 ms
↻ · HTTP POST /api/embeddings · one text at a time · sixteen round-trips · × 16
iii · Ollama routes to the CPU backend. Zero of 37 layers on GPU; eight AVX2 threads. · —
iv · Load 4 GB of model weights from RAM, matmul through 37 layers. · 187 ms
v · Pool the hidden states, return one 4096-dim vector. · —
vi · Repeat for the next text. And the next. And the next. · × 15 more
vii · Upsert 16 vectors to Qdrant. · ~40 ms
Blocking time · ≈ 3 000 ms
TEI · on GPU
five stages · one call
i · Agent hook fires. Same 16-text list. · ~0 ms
ii · Single HTTP POST /embed with all 16 inputs in one JSON body. · ~2 ms
iii · TEI's Rust scheduler pads + batches, dispatches to the resident GPU model. · ~5 ms
iv · One forward pass on Ampere tensor cores with FlashAttention · 16 vectors returned. · ~55 ms
v · Upsert 16 vectors to Qdrant. · ~40 ms
Blocking time · ≈ 101 ms
Agents used to wait 2.9 seconds longer on every hook fire. Multiply by every prompt, every tool call, every session-start context-fetch — that is the invisible tax we were paying.
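The two request shapes above can be sketched in a few lines of Python. The ports, the placeholder model name, and the helper names here are illustrative assumptions, not the article's actual code; the point is only the payload shape — Ollama's /api/embeddings takes one prompt per call, TEI's /embed takes the whole batch in one body.

```python
import json

# Assumed default ports for each service; adjust to your deployment.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
TEI_URL = "http://localhost:8080/embed"

def ollama_requests(texts, model="embedding-model"):
    """The chatty path: one JSON body per text, sixteen round-trips for sixteen texts."""
    return [(OLLAMA_URL, json.dumps({"model": model, "prompt": t})) for t in texts]

def tei_request(texts):
    """The batched path: all texts in a single JSON body, one round-trip."""
    return (TEI_URL, json.dumps({"inputs": texts}))

texts = [f"text {i}" for i in range(16)]
print(len(ollama_requests(texts)))              # 16 bodies to POST, one per text
url, body = tei_request(texts)
print(len(json.loads(body)["inputs"]))          # 16 inputs, carried by 1 POST
```

Actually POSTing the bodies is elided; only the request construction is shown, since that is where the 16-to-1 difference lives.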
§ IV
The shape of the latency curve
Warm best-case to p95 under contention — the distance between them tells the real story.
Ollama · CPU · warm 187 ms → p95 49 s
TEI · GPU · warm 24 ms → p95 ≈ 30 ms
Ollama's distribution had a fat tail that swallowed agent sessions whole. Under CPU saturation — exactly when embeddings are most needed — the tail reached forty-nine seconds. TEI's entire distribution fits inside Ollama's best-case warm latency.
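Reproducing a tail comparison like this takes nothing more than timing each request and taking the 95th percentile. The sketch below uses synthetic latencies shaped like the article's numbers — not measured data — purely to show how a small fraction of 49-second stalls comes to own p95.

```python
import random
import statistics

def p95(samples_ms):
    # statistics.quantiles with n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100)[94]

random.seed(0)
# Synthetic latencies for illustration only:
tei_like = [random.gauss(24, 3) for _ in range(1000)]
ollama_like = [random.gauss(187, 40) for _ in range(950)] + [49_000.0] * 50

print(round(p95(tei_like)))     # stays near the ~29 ms analytic p95 of the tight curve
print(round(p95(ollama_like)))  # tens of thousands of ms: the 5% stalls own the tail
```

Swap the synthetic lists for real per-request timings around your embed call and the same function gives you the production curve.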
§ V
Four dimensions of waste
CPU, memory, disk, and network chatter — each drawn to scale, before against after.
CPU pressure
% of one core
Ollama
284%
TEI
< 1%
Ollama held three cores continuously. TEI's HTTP+tokenisation path barely registers.
Host RAM occupied
peak resident
Ollama
10 GB
TEI
< 100 MB
The model now lives in VRAM. Host memory is free for the two dozen other services on this box.
Disk footprint
model blobs
Ollama
8.8 GB
TEI
0
Two full copies of the model lived on disk — one at /var/lib/ollama, another at /root/.ollama. Both gone; TEI keeps its weights in its own container cache.
HTTP round-trips per batch
16 texts
Ollama
16
TEI
1
Ollama's /api/embeddings accepts one text at a time. TEI's /embed accepts the whole batch.
§ VI
The throughput ceiling
How many embeddings the hardware can actually deliver, per minute.
embeddings per minute · steady-state
Ollama · one-at-a-time · 187 ms each · 320
TEI · single calls · 24 ms each · 2 500
TEI · batched (16) · 101 ms per 16 · 6.3 ms amortised · 9 500
On the batch path, TEI delivers thirty times more embeddings per minute than Ollama ever managed — and still leaves the GPU mostly idle, ready for whatever else you want to run there.
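The ceiling numbers are straight arithmetic over the measured latencies — a serial caller can issue at most 60 s ÷ latency calls per minute, times the texts carried per call. A quick check:

```python
def per_minute(latency_ms, texts_per_call=1):
    """Steady-state embeddings per minute for one serial caller."""
    return texts_per_call * 60_000 / latency_ms

print(int(per_minute(187)))      # 320  -- Ollama, one text at a time
print(int(per_minute(24)))       # 2500 -- TEI, single calls
print(int(per_minute(101, 16)))  # 9504 -- TEI batched: 101 ms per 16, ~6.3 ms amortised
```

The batched figure also confirms the amortised cost in the row above: 101 ms ÷ 16 ≈ 6.3 ms per embedding.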
§ VII
What that meant, in plain terms
Numbers are easy to skim. Here is what they described.
§
Your 24-thread machine was effectively single-threaded for embedding workloads. Load average hit 133 — the shell's own keystrokes stuttered.
¶
Every agent hook fire could stall up to 49 seconds. Agents couldn't chain tool calls cleanly; long-running sessions silently timed out on memory writes.
†
The 3090 Ti — your biggest piece of hardware — sat at 0.7% utilisation. 23 gigabytes of VRAM doing nothing while the CPU rendered itself unusable.
⁂
For every sixteen-text embed, 16 HTTP round-trips bounced through the kernel. TEI now does it in one.
☙
Two copies of a four-gigabyte model slept on disk at all times — one in /var/lib/ollama, another in /root/.ollama. 8.8 GB reclaimed.
◊
Ollama's process consumed 98.2% of its wall clock in CPU time. It was not using the machine, it was holding it hostage.