The Qmem File  ·  The Ledger
A Resource Accounting

What we were losing.

For five hours and seventeen minutes, your machine ran Ollama. Ollama's share of that wall clock, in CPU time: five hours and eleven minutes. A forensic ledger of the cost, the waste, and what TEI returned.

Ollama's CPU burn ratio, final 5h 17m session
98.2%

For every minute the machine was up, Ollama burned fifty-nine seconds of CPU — almost an entire core, continuously, to produce the same embeddings TEI now serves from idle GPU in milliseconds.

Source: journalctl -u ollama at shutdown — Consumed 5h 11min 10s CPU time over 5h 17min 31s wall clock time, 10G memory peak, 1.4G memory swap peak.
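A quick sanity check on that ratio, recomputed from the quoted figures; a back-of-the-envelope sketch, nothing more:

```python
# Recompute the burn ratio from the journalctl accounting quoted above.
cpu_s  = 5 * 3600 + 11 * 60 + 10    # "Consumed 5h 11min 10s CPU time"
wall_s = 5 * 3600 + 17 * 60 + 31    # "over 5h 17min 31s wall clock time"

burn = cpu_s / wall_s
print(f"burn ratio: {burn:.1%}")                               # roughly 98% of wall clock
print(f"CPU seconds per wall-clock minute: {burn * 60:.0f}")   # ≈ 59, as above
```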
§ I

The silicon's share of work

Same hardware, same model, same workload — here is where the cycles went.

CPU utilisation · under load
98.2% · Ollama on CPU
Three cores pegged, load average 133. The 24-thread machine was effectively a doorstop.
GPU VRAM utilisation · earning its keep
65% · TEI on GPU
Ampere tensor cores, a persistent model, a 24 GB card doing what it was bought for.
§ II

The ledger, itemised

Twelve rows. Each one a line on the balance sheet — what we spent, what we save now.

Resource | Before · Ollama/CPU | After · TEI/GPU | Delta
CPU time / wall time (ollama.service, 5h 17m session) | 5h 11min | ≈ 0 s | −98.2% burn
Top CPU pressure (three cores pegged) | 284% | < 1% | ≈ 300× less
System load, 15-min (peak during hook storm) | 133 | 2.3 – 3.8 | machine responsive again
RAM peak (ollama runner resident set) | 10 GB | < 100 MB | 10 GB returned
Swap peak (memory pressure) | 1.4 GB | 0 | no swap contention
GPU VRAM (of 24 GB available) | 153 MiB | 15 GB | ≈ 100× · finally used
GPU utilisation (sampled during load) | 0% | 20 – 65% | asset → earning
Single-embed latency, warm (per one text) | 187 ms | 24 ms | 7.8× faster
Batch of 16 texts (typical hook volume) | ≈ 3 s | 101 ms | 30× faster
p95 under load (from ollama production log) | 49 s | ≈ 30 ms | ≈ 1600× faster
HTTP calls for 16 embeds (the chatty for-loop) | 16 | 1 | 16× fewer
Disk footprint (model blobs, both homes) | 8.8 GB | 0 | 8.8 GB reclaimed
The 24-thread box was quietly running as a single-threaded embedding machine. Everything else — the web servers, the agents, the shells — competed for what was left.
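The delta column for the latency rows is plain division; a quick recomputation from the before/after values in the table, with the dictionary as scaffolding only:

```python
# Recompute the latency speed-ups in the ledger from the before/after measurements (ms).
before_after_ms = {
    "single embed, warm": (187, 24),       # table row: 7.8x faster
    "batch of 16 texts":  (3_000, 101),    # table row: 30x faster
    "p95 under load":     (49_000, 30),    # table row: ~1600x faster
}
for name, (before, after) in before_after_ms.items():
    print(f"{name}: {before / after:.1f}x faster")
```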
§ III

Anatomy of a single hook fire

Sixteen texts arrive at embed_texts(). Here is every step each system takes, with timings.

Ollama · on CPU
seven stages · sixteen calls
i. Agent hook fires. Gather 16 texts into a list. · ~0 ms
ii. Enter the per-text for-loop in ollama_provider.py. · ~0 ms
HTTP POST /api/embeddings · one text at a time · sixteen round-trips (× 16)
iii. Ollama routes to the CPU backend. Zero of 37 layers on GPU; eight AVX2 threads.
iv. Load 4 GB of model weights from RAM, matmul through 37 layers. · 187 ms
v. Pool the hidden states, return one 4096-dim vector.
vi. Repeat for the next text. And the next. And the next. · × 15 more
vii. Upsert 16 vectors to Qdrant. · ~40 ms
Blocking time ≈ 3 000 ms
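For reference, the shape of the old path. This is a minimal sketch of a per-text loop like the one in ollama_provider.py, assuming Ollama's default port and the standard /api/embeddings request body (model plus prompt); it is not the verbatim provider code:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/embeddings"   # assumed default Ollama port

def embed_texts(texts: list[str], model: str) -> list[list[float]]:
    """The chatty path: one HTTP round-trip and one full CPU forward pass per text."""
    vectors = []
    for text in texts:                                  # 16 texts -> 16 POSTs -> ~16 x 187 ms
        resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": text})
        resp.raise_for_status()
        vectors.append(resp.json()["embedding"])
    return vectors
```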
TEI · on GPU
five stages · one call
i. Agent hook fires. Same 16-text list. · ~0 ms
ii. Single HTTP POST /embed with all 16 inputs in one JSON body. · ~2 ms
iii. TEI's Rust scheduler pads + batches, dispatches to the resident GPU model. · ~5 ms
iv. One forward pass on Ampere tensor cores with FlashAttention · 16 vectors returned. · ~55 ms
v. Upsert 16 vectors to Qdrant. · ~40 ms
Blocking time ≈ 101 ms
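And the new path, sketched under the same caveat: the TEI port, the collection name, and the integer point IDs are placeholders, but the single /embed call carrying every input in one JSON body and the single Qdrant upsert are the structure described above.

```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

TEI_URL = "http://localhost:8080/embed"          # assumed host port for the TEI container
qdrant  = QdrantClient(url="http://localhost:6333")

def embed_and_store(texts: list[str], collection: str = "qmem_memories") -> None:
    """The quiet path: one POST for the whole batch, one upsert for all 16 vectors."""
    resp = requests.post(TEI_URL, json={"inputs": texts})   # every text in one JSON body
    resp.raise_for_status()
    vectors = resp.json()                                    # one vector per input text

    points = [
        PointStruct(id=i, vector=vec, payload={"text": text})
        for i, (text, vec) in enumerate(zip(texts, vectors))
    ]
    qdrant.upsert(collection_name=collection, points=points)
```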

Agents used to wait 2.9 seconds longer on every hook fire. Multiply by every prompt, every tool call, every session-start context-fetch — that is the invisible tax we were paying.

§ IV

The shape of the latency curve

Warm best-case to p95 under contention — the distance between them tells the real story.

Ollama · CPU warm 187 ms → p95 49 s
TEI · GPU warm 24 ms → p95 ≈ 30 ms
(latency axis: 10 ms to 50 s, log scale)

Ollama's distribution had a fat tail that swallowed agent sessions whole. Under CPU saturation — exactly when embeddings are most needed — the tail reached forty-nine seconds. TEI's entire distribution fits inside Ollama's best-case warm latency.
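If you want to reproduce the warm numbers yourself, a rough harness looks like this. The URL and payload match TEI's /embed; for Ollama, swap in the per-text /api/embeddings shape. Note that the 49-second p95 came from production logs under full CPU contention, which a quiet benchmark like this will not reproduce on its own:

```python
import statistics
import time
import requests

EMBED_URL = "http://localhost:8080/embed"   # point at whichever backend you are measuring
SAMPLES = 200

def measure(texts: list[str]) -> None:
    requests.post(EMBED_URL, json={"inputs": texts})          # warm-up request, not timed
    latencies_ms = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        requests.post(EMBED_URL, json={"inputs": texts}).raise_for_status()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]        # 95th-percentile cut point
    print(f"median {statistics.median(latencies_ms):.1f} ms · p95 {p95:.1f} ms")

measure(["a single warm-path text"])
```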

§ V

Four dimensions of waste

CPU, memory, disk, and network chatter — each drawn to scale, before against after.

CPU pressure · % of one core
Ollama 284% → TEI < 1%
Ollama held three cores continuously. TEI's HTTP+tokenisation path barely registers.
Host RAM occupied · peak resident
Ollama 10 GB → TEI < 100 MB
The model now lives in VRAM. Host memory is free for the two dozen other services on this box.
Disk footprint · model blobs
Ollama 8.8 GB → TEI 0
Two full copies of the model lived on disk — one at /var/lib/ollama, another at /root/.ollama. Both gone; TEI keeps its weights in its own container cache.
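If you want to confirm what those blob homes hold before deleting them, here is a rough du-style walk over the two paths named above; a convenience sketch, not part of qmem:

```python
from pathlib import Path

def du_gb(root: str) -> float:
    """Sum regular-file sizes under a directory tree, in GB (a rough `du -s` equivalent)."""
    path = Path(root)
    if not path.exists():
        return 0.0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

for blob_home in ("/var/lib/ollama", "/root/.ollama"):
    print(f"{blob_home}: {du_gb(blob_home):.1f} GB")
```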
HTTP round-trips per batch · 16 texts
Ollama 16 → TEI 1
Ollama's /api/embeddings accepts one text at a time. TEI's /embed accepts the whole batch.
§ VI

The throughput ceiling

How many embeddings the hardware can actually deliver, per minute.

embeddings per minute · steady-state
Ollama · one-at-a-time · 187 ms each → ≈ 320 / min
TEI · single calls · 24 ms each → 2 500 / min
TEI · batched (16) · 101 ms per 16, 6.3 ms amortised → ≈ 9 500 / min
On the batched path, TEI delivers thirty times more embeddings per minute than Ollama could have — and still mostly leaves the GPU idle, ready for whatever else you want to run there.
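Those steady-state figures fall straight out of the measured latencies; a quick recomputation:

```python
# Recompute embeddings-per-minute from the measured latencies above.
per_text_cpu_ms  = 187    # Ollama, one text per call
per_text_gpu_ms  = 24     # TEI, one text per call
per_batch_gpu_ms = 101    # TEI, 16 texts per call
batch_size       = 16

print(60_000 / per_text_cpu_ms)                  # ≈ 320 embeds/min
print(60_000 / per_text_gpu_ms)                  # = 2 500 embeds/min
print(60_000 / per_batch_gpu_ms * batch_size)    # ≈ 9 500 embeds/min, 101/16 ≈ 6.3 ms amortised
```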
§ VII

What that meant, in plain terms

Numbers are easy to skim. Here is what they described.

Your 24-thread machine was effectively single-threaded for embedding workloads. Load average hit 133 — the shell's own keystrokes stuttered.
Every agent hook fire could stall up to 49 seconds. Agents couldn't chain tool calls cleanly; long-running sessions silently timed out on memory writes.
The 3090 Ti — your biggest piece of hardware — sat at 0.7% utilisation. 23 gigabytes of VRAM doing nothing while the CPU rendered itself unusable.
For every sixteen-text embed, 16 HTTP round-trips bounced through the kernel. TEI now does it in one.
Two copies of a four-gigabyte model slept on disk at all times — one in /var/lib/ollama, another in /root/.ollama. 8.8 GB reclaimed.
Ollama's process consumed 98.2% of its wall clock in CPU time. It was not using the machine; it was holding it hostage.