SCICOM · GPU Sizing
Interactive Tutorial · Voice AI Infrastructure

Sizing GPUs for a voice call center.

How many H200 nodes does it take to run an AI voice agent for thousands of calls? Learn the method, step by step — and turn the dials yourself.

# the whole method in one line
RPS = concurrency ÷ AHT × turns/call × assistant %
nodes = ceil( RPS ÷ requests-per-second-per-node )
Concurrency: 90 concurrent calls
Request rate: 3 LLM req/sec
Hardware: 10 GPUs total
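
The two formulas above can be sketched as a pair of small functions. This is a minimal illustration, not the calculator's actual code: the function and parameter names are assumptions, and the inputs are the tutorial's default values (90 calls, 150 s AHT, 10 turns per call, 50% assistant share, 23.78 req/s per node).

```python
import math

def llm_rps(concurrency, aht_s, turns_per_call, assistant_share):
    """RPS = concurrency / AHT * turns per call * assistant share."""
    return concurrency / aht_s * turns_per_call * assistant_share

def nodes_needed(rps, rps_per_node):
    """nodes = ceil(RPS / requests-per-second-per-node)."""
    return math.ceil(rps / rps_per_node)

# Tutorial defaults (hypothetical variable names)
rps = llm_rps(concurrency=90, aht_s=150, turns_per_call=10, assistant_share=0.5)
nodes = nodes_needed(rps, rps_per_node=23.78)
print(f"RPS = {rps:.2f}, LLM nodes = {nodes}")
```

Rounding up with `math.ceil` matters: any fractional node requirement still needs a whole node.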
01 The big idea

A call is not constant GPU work.

A 2.5-minute call is mostly silence, listening and thinking. The GPU only works during the short bursts when the AI speaks. So we never size GPUs to “number of calls” — we convert calls into a request rate, then ask how many requests one node can serve.

📞

One call → many turns

A conversation is back-and-forth. Each turn the user speaks, then the assistant replies.

turns = AHT ÷ turn length
🤖

Only assistant turns hit the LLM

User turns go to speech-to-text; the LLM runs on the assistant’s half of the turns.

assistant turns = turns × 50%

Turns become a request rate

Spread the assistant turns across the busy window to get requests per second (RPS).

RPS = assistant turns ÷ window
🖥️

RPS becomes nodes

Each node serves a measured number of requests per second. Divide and round up.

nodes = ceil( RPS ÷ node capacity )
02 Step one

From calls to request rate.

Turn the dials. Everything on the right recomputes live — this is exactly the math the calculator runs.

Concurrency: most simultaneous calls at the busiest moment. Size for peak, not average.
AHT (Average Handling Time): the length of one whole conversation.
Turn length: average seconds per conversational turn. Shorter turns → more requests.
Assistant %: share of turns that call the LLM. 50% if user/assistant strictly alternate.
Window: length of the busy period, used only for totals. It cancels out of RPS, so it never changes the GPU count.
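
Why the window cancels can be seen by writing the chain out symbolically. The symbols are introduced here for illustration only: C is peak concurrency, W the window length in seconds, τ the turn length in seconds, and a the assistant share.

```latex
\begin{aligned}
\text{calls in window} &= \frac{C}{\mathrm{AHT}} \cdot W \\
\text{assistant turns} &= \frac{C}{\mathrm{AHT}} \cdot W \cdot \frac{\mathrm{AHT}}{\tau} \cdot a \\
\mathrm{RPS} &= \frac{\text{assistant turns}}{W} = \frac{C \cdot a}{\tau}
\end{aligned}
```

With the defaults, 90 × 0.5 ÷ 15 = 3 req/s. Note that AHT drops out as well once turns per call is defined as AHT ÷ turn length, so under these assumptions the request rate depends only on concurrency, assistant share and turn length.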
Live computation
Calls / sec = concurrency ÷ AHT = 0.60
Turns / call = AHT ÷ turn length = 10
Total turns = calls × turns/call = 43,200
Assistant turns = total × 50% = 21,600
RPS (LLM requests/sec) = assistant turns ÷ window = 3.00
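
The live computation can be reproduced step by step. The sketch below assumes a 2-hour (7,200 s) window, as in the worked example; the `CallProfile` class and its field names are hypothetical, chosen only to mirror the five dials.

```python
from dataclasses import dataclass

@dataclass
class CallProfile:
    concurrency: float      # peak simultaneous calls
    aht_s: float            # average handling time, seconds
    turn_s: float           # average seconds per conversational turn
    assistant_share: float  # fraction of turns that call the LLM
    window_s: float         # busy-window length, seconds (totals only)

def live_computation(p: CallProfile) -> dict:
    calls_per_sec = p.concurrency / p.aht_s          # Little's law
    turns_per_call = p.aht_s / p.turn_s
    total_calls = calls_per_sec * p.window_s
    total_turns = total_calls * turns_per_call
    assistant_turns = total_turns * p.assistant_share
    rps = assistant_turns / p.window_s
    return {
        "calls_per_sec": calls_per_sec,
        "turns_per_call": turns_per_call,
        "total_turns": total_turns,
        "assistant_turns": assistant_turns,
        "rps": rps,
    }

defaults = CallProfile(concurrency=90, aht_s=150, turn_s=15,
                       assistant_share=0.5, window_s=7200)
result = live_computation(defaults)
for name, value in result.items():
    print(f"{name:>16}: {value:,.2f}")

# Changing only the window changes the totals but not the request rate.
one_hour = live_computation(CallProfile(90, 150, 15, 0.5, 3600))
```

Comparing `result["rps"]` with `one_hour["rps"]` shows the window affecting only the totals, never the request rate.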
03 Step two

From request rate to GPUs.

Each service has a measured capacity in requests/sec per node. Nodes = ceil(RPS ÷ capacity). The LLM capacity comes from real benchmarks.

LLM: a node = 8× H200 running one replica. Bigger context → lower req/s.
🎙️ STT (Whisper) req/s and 🔊 TTS req/s: capacities are entered directly (per GPU). One GPU each here.
GPU price per hour: marketplace ~$1.90–$4.00 · hyperscaler ~$10.
GPUs required · RPS = 3.00
10 GPUs total
Service | req/s/node | nodes | GPUs
$39.90 / hour · $958 / day · $29,127 / month
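
The GPU count and cost figures can be approximated with the sketch below. Several inputs are assumptions rather than values stated on the page: the $3.99 per GPU-hour price is inferred from $39.90 ÷ 10 GPUs, the 730 hours per month is inferred from $29,127 ÷ $39.90, and the STT/TTS capacities of 10 req/s per GPU are placeholders picked only so that each service needs one GPU. The same RPS is applied to every service, following the "Nodes = ceil(RPS ÷ capacity)" rule above.

```python
import math

GPUS_PER_LLM_NODE = 8   # from the text: a node = 8x H200 running one replica
HOURS_PER_MONTH = 730   # assumption, inferred from the monthly total

def size_cluster(rps, llm_rps_per_node, stt_rps_per_gpu,
                 tts_rps_per_gpu, price_per_gpu_hour):
    """Return GPU counts per service and the running cost."""
    llm_nodes = math.ceil(rps / llm_rps_per_node)
    stt_gpus = math.ceil(rps / stt_rps_per_gpu)
    tts_gpus = math.ceil(rps / tts_rps_per_gpu)
    total_gpus = llm_nodes * GPUS_PER_LLM_NODE + stt_gpus + tts_gpus
    per_hour = total_gpus * price_per_gpu_hour
    return {
        "llm_gpus": llm_nodes * GPUS_PER_LLM_NODE,
        "stt_gpus": stt_gpus,
        "tts_gpus": tts_gpus,
        "total_gpus": total_gpus,
        "per_hour": per_hour,
        "per_day": per_hour * 24,
        "per_month": per_hour * HOURS_PER_MONTH,
    }

plan = size_cluster(
    rps=3.0,
    llm_rps_per_node=23.78,    # measured LLM capacity quoted in the tutorial
    stt_rps_per_gpu=10.0,      # placeholder, replace with a measured value
    tts_rps_per_gpu=10.0,      # placeholder, replace with a measured value
    price_per_gpu_hour=3.99,   # inferred, see the note above
)
print(plan["total_gpus"], "GPUs,",
      f"${plan['per_hour']:.2f}/hour,",
      f"${plan['per_day']:,.0f}/day,",
      f"${plan['per_month']:,.0f}/month")
```

For a real estimate, the two placeholder capacities and the price should be replaced with measured or quoted figures.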
04 Worked example

The BOC voice agent, end to end.

Scaling toward 300,000 calls/month, with a peak of 90 concurrent calls and 2.5-minute calls. Press play to watch the numbers cascade.

Given
90 concurrent calls · AHT 150s · turn 15s
From the project proposal: peak window 10AM–12PM, 2.5-minute average handling time.
Calls per second
90 ÷ 150 = 0.6 calls/sec
Little’s law: throughput = concurrency ÷ service time.
Turns & requests
150 ÷ 15 = 10 turns · 0.6 × 7200 = 4,320 calls · ×10 = 43,200 turns
Over the 2-hour window.
Assistant turns → RPS
43,200 ÷ 2 = 21,600 · ÷ 7200 = 3 req/s
User/assistant alternate, so half the turns hit the LLM.
LLM nodes
ceil( 3 ÷ 23.78 ) = 1 node = 8 H200
GPT-OSS-120B @ 4K ctx, measured on one 8×H200 node.
Result
8 LLM + 1 STT + 1 TTS = 10 GPUs
≈ $40/hour on marketplace pricing — comfortably inside budget.
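
One way to read this result is in terms of headroom: at 3 req/s, a single LLM node rated at 23.78 req/s is lightly loaded. The sketch below inverts the same formula (RPS = concurrency × assistant share ÷ turn length) to estimate how many concurrent calls one node could absorb. It is an illustration under the tutorial's assumptions only: it ignores STT/TTS limits, latency targets and traffic bursts, and the function names are hypothetical.

```python
def llm_utilization(rps, rps_per_node, nodes=1):
    """Fraction of measured LLM capacity in use."""
    return rps / (rps_per_node * nodes)

def max_concurrency(rps_per_node, turn_s, assistant_share, nodes=1):
    """Invert RPS = concurrency * assistant_share / turn_s for concurrency."""
    return rps_per_node * nodes * turn_s / assistant_share

utilization = llm_utilization(rps=3.0, rps_per_node=23.78)
ceiling = max_concurrency(rps_per_node=23.78, turn_s=15, assistant_share=0.5)
print(f"LLM utilization: {utilization:.1%}")
print(f"Concurrent calls one node could serve (LLM only): {ceiling:.0f}")
```

A measured capacity at a longer context length would lower both figures, since bigger context means fewer requests per second.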
05 Every parameter

What each dial means.

06 Before you start

Five numbers to bring.

The defaults are a starting point. For a real estimate, these should be your measured values.

Ready to size yours?

Open the full calculator.