Sizing GPUs for a Voice Call Center · Interactive Tutorial

01 The big idea

A call is not constant GPU work.

A 2.5-minute call is mostly silence, listening and thinking. The GPU only works during the short bursts when the AI speaks. So we never size GPUs to “number of calls” — we convert calls into a request rate, then ask how many requests one node can serve.

📞

One call → many turns

A conversation is back-and-forth. Each turn the user speaks, then the assistant replies.

turns = AHT ÷ turn length

🤖

Only assistant turns hit the LLM

User turns go to speech-to-text; the LLM runs on the assistant’s half of the turns.

assistant turns = turns × 50%

⚡

Turns become a request rate

Spread the assistant turns across the busy window to get requests per second (RPS).

RPS = assistant turns ÷ window

🖥️

RPS becomes nodes

Each node serves a measured number of requests per second. Divide and round up.

nodes = ceil( RPS ÷ node capacity )

02 Step one

From calls to request rate.

Turn the dials. Everything on the right recomputes live — this is exactly the math the calculator runs.

Peak concurrency: 90. The most simultaneous calls at the busiest moment. Size for peak, not average.

AHT (call length): 150 s. Average Handling Time: the length of one whole conversation.

Turn length: 15 s. Average seconds per conversational turn. Shorter turns → more requests.

Assistant (LLM) turns: 50%. Share of turns that call the LLM. 50% if user and assistant strictly alternate.

Peak window: 2 h. Used only for totals; it cancels out of RPS, so it never changes the GPU count.
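To see why the peak window drops out, the chain of formulas can be written in one line. The symbols here are shorthand chosen for this sketch, not names from the calculator: C is peak concurrency, A is AHT, T is turn length, s is the assistant share, and W is the window length in seconds.

```latex
\begin{align*}
\text{RPS}
  &= \frac{\text{assistant turns}}{W}
   = \frac{1}{W}
     \cdot \underbrace{\frac{C}{A}\, W}_{\text{calls in window}}
     \cdot \underbrace{\frac{A}{T}}_{\text{turns per call}}
     \cdot s
   = \frac{C \, s}{T} \\[4pt]
  &= \frac{90 \times 0.5}{15} = 3 \ \text{req/s}
\end{align*}
```

Both W and A cancel, so under these formulas the request rate depends only on concurrency, assistant share, and turn length.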

Live computation

Calls / sec (concurrency ÷ AHT): 0.60

Turns / call (AHT ÷ turn length): 10

Total turns (calls × turns/call): 43,200

Assistant turns (total × 50%): 21,600

RPS, LLM requests/sec (assistant turns ÷ window): 3.00
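The same chain can be reproduced in a few lines of Python. This is a minimal sketch of the arithmetic described above, not the calculator's actual source; the `CallLoad` class, its field names, and `llm_request_rate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class CallLoad:
    """Inputs matching the five dials (field names are illustrative)."""
    peak_concurrency: float  # simultaneous calls at the busiest moment
    aht_s: float             # average handling time, in seconds
    turn_length_s: float     # average seconds per conversational turn
    assistant_share: float   # fraction of turns that call the LLM (0..1)
    window_s: float          # peak window in seconds (affects totals only)


def llm_request_rate(load: CallLoad) -> dict:
    """Convert a call load into totals and an LLM request rate."""
    calls_per_s = load.peak_concurrency / load.aht_s       # Little's law
    turns_per_call = load.aht_s / load.turn_length_s
    total_calls = calls_per_s * load.window_s
    total_turns = total_calls * turns_per_call
    assistant_turns = total_turns * load.assistant_share
    rps = assistant_turns / load.window_s
    return {
        "calls_per_s": calls_per_s,
        "turns_per_call": turns_per_call,
        "total_calls": total_calls,
        "total_turns": total_turns,
        "assistant_turns": assistant_turns,
        "rps": rps,
    }


# Default dial positions from the tutorial: 90 calls, 150 s AHT,
# 15 s turns, 50% assistant share, 2-hour window.
result = llm_request_rate(CallLoad(90, 150, 15, 0.5, 2 * 3600))

# Halving the window changes the totals but not the request rate.
shorter = llm_request_rate(CallLoad(90, 150, 15, 0.5, 1 * 3600))

for name, value in result.items():
    print(f"{name:>16}: {value:,.2f}")
```

With the default dial positions this yields the figures shown in the panel (0.60 calls/sec, 10 turns/call, 43,200 turns, 21,600 assistant turns, 3.00 req/s), up to floating-point rounding.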

03 Step two

From request rate to GPUs.

Each service has a measured capacity in requests/sec per node. Nodes = ceil(RPS ÷ capacity). The LLM capacity comes from real benchmarks.
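The divide-and-round-up step can be sketched as follows. The function name `nodes_needed` is hypothetical; the 23.78 req/s capacity and the 8-GPU node size are the figures this tutorial quotes for GPT-OSS-120B at 4K context on one 8×H200 node. STT and TTS are left out because their capacities are entered by hand and not stated here.

```python
import math

GPUS_PER_LLM_NODE = 8  # one node = 8x H200 running one replica (per the tutorial)


def nodes_needed(rps: float, capacity_rps_per_node: float) -> int:
    """Smallest whole number of nodes whose combined capacity covers rps."""
    if capacity_rps_per_node <= 0:
        raise ValueError("capacity must be positive")
    if rps < 0:
        raise ValueError("rps must not be negative")
    return math.ceil(rps / capacity_rps_per_node)


# Tutorial figures: 3 req/s against a measured 23.78 req/s per node.
llm_nodes = nodes_needed(3.0, 23.78)
llm_gpus = llm_nodes * GPUS_PER_LLM_NODE
print(llm_nodes, llm_gpus)  # 1 8

# Rounding up matters near the boundary: a rate just above one node's
# capacity already needs a second node.
print(nodes_needed(24.0, 23.78))  # 2
```

Because nodes come in whole units, the GPU count moves in steps of 8 for the LLM tier: any rate up to 23.78 req/s costs the same single node.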

LLM model & context: a node is 8× H200 running one replica. Bigger context → lower req/s.

Services hitting GPUs

🎙️ STT (Whisper) req/s

🔊 TTS req/s

STT/TTS capacities are entered directly (per-GPU). One GPU each here.

$ per GPU-hour: $3.99. Marketplace ~$1.90–$4.00 · hyperscaler ~$10.

GPUs required · RPS = 3.00

10 GPUs total

Service	req/s/node	nodes	GPUs

$39.90 / hour · $958 / day · $29,127 / month
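The cost figures follow directly from the GPU count and the hourly price. The sketch below reproduces them; `gpu_cost` is a hypothetical helper, and the 730 hours per month (365 × 24 ÷ 12) is an assumption inferred from the monthly figure, not something the page states.

```python
HOURS_PER_DAY = 24
HOURS_PER_MONTH = 730  # assumption: 365 * 24 / 12, an average month


def gpu_cost(gpus: int, price_per_gpu_hour: float) -> dict:
    """Hourly, daily, and monthly cost of running `gpus` GPUs nonstop."""
    hourly = gpus * price_per_gpu_hour
    return {
        "hour": hourly,
        "day": hourly * HOURS_PER_DAY,
        "month": hourly * HOURS_PER_MONTH,
    }


cost = gpu_cost(10, 3.99)
print(f"${cost['hour']:,.2f} / hour")   # $39.90 / hour
print(f"${cost['day']:,.0f} / day")     # $958 / day
print(f"${cost['month']:,.0f} / month") # $29,127 / month
```

Note that this prices the GPUs as running around the clock, even though the sizing itself is driven by the peak window.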

04 Worked example

The BOC voice agent, end to end.

Scaling to 300,000 calls/month, with a peak of 90 concurrent calls and 2.5-minute calls.

Given

90 concurrent calls · AHT 150s · turn 15s

From the project proposal: peak window 10AM–12PM, 2.5-minute average handling time.

Calls per second

90 ÷ 150 = 0.6 calls/sec

Little’s law: throughput = concurrency ÷ service time.

Turns & requests

150 ÷ 15 = 10 turns · 0.6 × 7200 = 4,320 calls · ×10 = 43,200 turns

Over the 2-hour window.

Assistant turns → RPS

43,200 ÷ 2 = 21,600 · ÷ 7200 = 3 req/s

User/assistant alternate, so half the turns hit the LLM.

LLM nodes

ceil( 3 ÷ 23.78 ) = 1 node = 8 H200

GPT-OSS-120B @ 4K ctx, measured on one 8×H200 node.

Result

8 LLM + 1 STT + 1 TTS = 10 GPUs

≈ $40/hour on marketplace pricing — comfortably inside budget.
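One consequence of rounding up is that the single LLM node in this example is lightly loaded. The sketch below estimates how much headroom it has by inverting RPS = concurrency × assistant share ÷ turn length. It assumes the measured 23.78 req/s holds at any load and ignores latency targets and burstiness, which this tutorial does not model; `max_concurrency_per_node` is a hypothetical helper name.

```python
def max_concurrency_per_node(capacity_rps: float,
                             turn_length_s: float,
                             assistant_share: float) -> float:
    """Concurrent calls one node could carry before a second is needed.

    Inverts RPS = concurrency * assistant_share / turn_length_s.
    """
    if assistant_share <= 0:
        raise ValueError("assistant_share must be positive")
    return capacity_rps * turn_length_s / assistant_share


NODE_CAPACITY_RPS = 23.78  # GPT-OSS-120B @ 4K ctx, one 8xH200 node (from the tutorial)
PEAK_RPS = 3.0             # result of the worked example

utilization = PEAK_RPS / NODE_CAPACITY_RPS
headroom_calls = max_concurrency_per_node(NODE_CAPACITY_RPS, 15, 0.5)

print(f"LLM node utilization at peak: {utilization:.1%}")
print(f"Concurrent calls one node could carry: {headroom_calls:.0f}")
```

Under these assumptions the node runs at roughly an eighth of its measured capacity at peak, so the LLM tier would stay at one node until concurrency grew to around 700 calls.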
