How many H200 nodes does it take to run an AI voice agent for thousands of calls? Learn the method, step by step — and turn the dials yourself.
A 2.5-minute call is mostly silence, listening and thinking. The GPU only works during the short bursts when the AI speaks. So we never size GPUs by the number of calls. Instead, we convert calls into a request rate, then ask how many requests one node can serve.
A conversation is back-and-forth. In each turn, the user speaks, then the assistant replies.
User turns go to speech-to-text; the LLM runs on the assistant’s half of the turns.
Spread the assistant turns across the busy window to get requests per second (RPS).
Each node serves a measured number of requests per second. Divide and round up.
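The steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the calculator's actual code. The function and parameter names are hypothetical. It reads "the assistant's half of the turns" as meaning that half of all turns in a call trigger an LLM request. The example inputs (calls in the busy window, turns per call, window length, per-node capacity) are placeholders, not measured values.

```python
import math


def llm_requests_per_second(calls: float, turns_per_call: float, window_seconds: float) -> float:
    """Convert a call volume into LLM requests per second (RPS).

    Assumption: turns alternate between user and assistant, so half of
    all turns are assistant turns, and each one is a single LLM request.
    """
    total_turns = calls * turns_per_call
    assistant_turns = total_turns / 2
    # Spread the assistant turns evenly across the busy window.
    return assistant_turns / window_seconds


def nodes_needed(rps: float, capacity_rps_per_node: float) -> int:
    """Divide the request rate by per-node capacity and round up."""
    return math.ceil(rps / capacity_rps_per_node)


# Placeholder inputs: 10,000 calls in an 8-hour busy window, 12 turns per call.
rps = llm_requests_per_second(calls=10_000, turns_per_call=12, window_seconds=8 * 3600)

# Placeholder capacity: 1.5 requests/sec per node (use your measured value).
nodes = nodes_needed(rps, capacity_rps_per_node=1.5)

print(f"RPS: {rps:.2f}, nodes: {nodes}")
```

Rounding up matters here: a fractional node cannot be provisioned, so any remainder costs a whole extra node.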
Turn the dials. Everything on the right recomputes live — this is exactly the math the calculator runs.
Each service has a measured capacity in requests/sec per node. Nodes = ceil(RPS ÷ capacity). The LLM capacity comes from real benchmarks.
| Service | req/s/node | nodes | GPUs |
|---|---|---|---|
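To illustrate how the rows of such a table could be filled in, here is a hedged sketch that applies Nodes = ceil(RPS ÷ capacity) to each service. The service names, request rates, and capacities are hypothetical placeholders, and the figure of 8 GPUs per node is an assumption to adjust for your hardware.

```python
import math

GPUS_PER_NODE = 8  # assumption: 8 GPUs per node; change to match your hardware


def size_services(rps_by_service, capacity_by_service, gpus_per_node=GPUS_PER_NODE):
    """Return one (service, req/s/node, nodes, GPUs) row per service."""
    rows = []
    for service, rps in rps_by_service.items():
        capacity = capacity_by_service[service]
        nodes = math.ceil(rps / capacity)  # round up to whole nodes
        rows.append((service, capacity, nodes, nodes * gpus_per_node))
    return rows


# Hypothetical request rates and per-node capacities (requests/sec).
rows = size_services(
    rps_by_service={"stt": 3.6, "llm": 3.6},
    capacity_by_service={"stt": 4.0, "llm": 1.5},
)

for service, capacity, nodes, gpus in rows:
    print(f"| {service} | {capacity} | {nodes} | {gpus} |")
```

Each service is sized independently, so the slowest service per request (here the placeholder LLM capacity) dominates the node count.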
Scenario: scaling to 300,000 calls/month, with a peak of 90 concurrent calls and 2.5-minute calls. Press play to watch the numbers cascade.
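As a rough sanity check on these scenario numbers, the sketch below relates them with Little's law (concurrency = arrival rate × duration). Whether the calculator uses this relation is not stated, so treat it as one possible reading. The 30-day month and the number of assistant turns per call are assumptions introduced for illustration.

```python
CALLS_PER_MONTH = 300_000
PEAK_CONCURRENCY = 90
CALL_SECONDS = 2.5 * 60             # 2.5-minute calls
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month

# Average number of calls in progress if traffic were perfectly flat.
avg_concurrency = CALLS_PER_MONTH * CALL_SECONDS / SECONDS_PER_MONTH

# Little's law: concurrency = arrival rate * duration, so at peak
# the call-start rate is peak concurrency divided by call length.
peak_call_starts_per_second = PEAK_CONCURRENCY / CALL_SECONDS

# Hypothetical: 6 assistant replies (LLM requests) per call.
ASSISTANT_TURNS_PER_CALL = 6
peak_llm_rps = peak_call_starts_per_second * ASSISTANT_TURNS_PER_CALL

print(f"average concurrency: {avg_concurrency:.1f}")
print(f"peak call starts/sec: {peak_call_starts_per_second:.2f}")
print(f"peak LLM RPS: {peak_llm_rps:.2f}")
```

Under these assumptions the flat-traffic average is roughly 17 concurrent calls, so a peak of 90 is about five times the average. That gap is why sizing is done against the busy window rather than the monthly mean.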
The defaults are a starting point. For a real estimate, replace them with your own measured values.