NVIDIA Dynamo · DynoSim · a field guide to tuning a GPU fleet without one
Simulating the Pareto frontier
Serving a large language model well means setting a hundred knobs that all fight each other — how to split the model across GPUs, where to send each request, how to tier the cache, when to add a worker. Tuning one shifts the bottleneck to another, and a single honest experiment on a big model can cost a rack of GPUs for an answer you might not want. DynoSim's bet: build a simulator faithful enough to trust, and let it sweep thousands of configurations in seconds — then spend real silicon only on the handful worth checking.
Simulate broadly and cheaply; verify narrowly and expensively. The simulator becomes the inner loop; the real cluster becomes the judge.
A tower of knobs that fight each other
Picture what a modern LLM deployment actually is. You pick an inference engine (vLLM, SGLang, TensorRT-LLM). You choose how to shard the model across GPUs — the tensor-parallel shape. You decide whether to split the two phases of inference onto separate machines, and in what ratio. You set worker counts, a routing policy, scheduler budgets, how the key-value cache spills across memory tiers, and the thresholds that add or remove capacity under load. Each of those is a dial, and the dials are coupled: turn up one and the bottleneck slides somewhere else. The space of combinations is enormous — and for a large model, evaluating even one point in it can mean booking many GPUs or whole nodes just to learn whether the idea was worth testing.
The configuration space · why you can't just try them all
That is the wall every serving team hits. You cannot brute-force a combinatorial space when each trial costs GPU-hours and an engineer's afternoon. So the question becomes: can you find the good configurations without running most of them for real?
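To put a rough number on that wall, here is a minimal sketch that counts a small configuration grid and prices a brute-force sweep. Every knob value, the GPUs per trial, and the hours per trial are illustrative assumptions, not figures from DynoSim or Dynamo.

```python
from itertools import product
from math import prod

# Hypothetical knob values, chosen only to show how fast the grid grows.
knobs = {
    "engine": ["vllm", "sglang", "trtllm"],
    "tensor_parallel": [1, 2, 4, 8],
    "prefill_workers": [1, 2, 4, 8],
    "decode_workers": [1, 2, 4, 8],
    "router_policy": ["round_robin", "kv_aware"],
    "host_cache_tier": [False, True],
}


def count_configs(knobs):
    """Number of points in the full grid: the product of the option counts."""
    return prod(len(values) for values in knobs.values())


def brute_force_gpu_hours(n_configs, gpus_per_trial, hours_per_trial):
    """GPU-hours needed to run every configuration once on real hardware."""
    return n_configs * gpus_per_trial * hours_per_trial


total = count_configs(knobs)                  # 3 * 4 * 4 * 4 * 2 * 2 = 768
cost = brute_force_gpu_hours(total, 8, 0.5)   # assumed: 8 GPUs, 30 minutes each
print(total, cost)                            # 768 3072.0
```

Six modest knobs already give 768 combinations, and under these assumed trial costs that is about three thousand GPU-hours for a single pass. Each extra knob multiplies the total.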
First, what you're actually tuning
To follow the rest, you need the shape of how a model gets served. Every request runs in two phases with opposite appetites. Prefill reads your whole prompt at once and produces the first token; it's compute-bound, and it builds the KV cache, the stored attention state for every token so far. Then decode takes over, emitting one token at a time, each step reading and extending that cache; it's memory-bound. Two metrics fall out directly: TTFT (time to first token) is driven by prefill, plus any time the request waits before prefill starts, and TPOT (time per output token) is driven by decode.
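The two metrics can be read straight off a request's token timestamps. The sketch below assumes a hypothetical record of one request (an arrival time and the emission time of each output token, in milliseconds); the function names are illustrative.

```python
def ttft_ms(arrival_ms, token_times_ms):
    """Time to first token: first emission time minus arrival time."""
    return token_times_ms[0] - arrival_ms


def tpot_ms(token_times_ms):
    """Time per output token: mean gap between consecutive output tokens."""
    if len(token_times_ms) < 2:
        return 0.0  # a single token has no decode gaps to average
    span = token_times_ms[-1] - token_times_ms[0]
    return span / (len(token_times_ms) - 1)


# Hypothetical request: arrives at t=0, first token at 300 ms,
# then one token every 20 ms during decode.
tokens = [300, 320, 340, 360]
print(ttft_ms(0, tokens))  # 300
print(tpot_ms(tokens))     # 20.0
```

Note that TTFT is measured from arrival, so any time spent waiting in a queue before prefill is counted in it.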
The two phases of inference · prefill vs. decode
Because the phases want different hardware, the modern move is disaggregated serving: run prefill on one set of GPUs and decode on another, optimize each independently, and ship the KV cache between them. Orchestrating all of that (routing, cache transfer, placement, autoscaling) across a GPU fleet is exactly what NVIDIA Dynamo does. Dynamo (unveiled at GTC 2025, and, like DynoSim, written in Rust) doesn't replace engines like vLLM or SGLang; it sits above them as the orchestration layer, with a KV-aware Router, an SLA-driven Planner that scales workers, and the KVBM, the KV block manager, which manages the cache across memory tiers. DynoSim is a faithful simulation of that whole stack: a Dynamo twin.
An old idea, part one: discrete-event simulation
How do you simulate an hour of a busy serving system without waiting an hour? You stop thinking in ticks and start thinking in events. A discrete-event simulation keeps a virtual clock and a queue of future events — a request arrives, a scheduler steps, a forward pass finishes, a cache block transfers — each stamped with a modeled duration. The engine simply jumps the clock to the next event, updates the world, lets that change schedule new events, and jumps again. Nothing waits in real time, so idle gaps cost nothing and a busy hour collapses into milliseconds of computation.
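The clock-jumping loop described above fits in a few lines. This is a generic discrete-event sketch in Python, not DynoSim's code; the `Simulator` class, the two-second service time, and the ten-minute arrival gap are all assumptions made for illustration, and the toy worker has no queueing contention.

```python
import heapq
import itertools


class Simulator:
    """Minimal discrete-event engine: a virtual clock plus a time-ordered queue."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker for events at the same time

    def schedule(self, delay, action):
        """Register `action` to fire `delay` virtual seconds from now."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), action))

    def run(self):
        """Jump from event to event until none remain; return events processed."""
        steps = 0
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)  # an action may schedule further events
            steps += 1
        return steps


def demo():
    sim = Simulator()
    finished = []

    def arrive(req_id):
        def on_arrive(s):
            # Assumed service time of 2.0 virtual seconds per request.
            s.schedule(2.0, lambda s2: finished.append((req_id, s2.now)))
        return on_arrive

    for i in range(6):
        sim.schedule(i * 600.0, arrive(i))  # one arrival every 10 virtual minutes
    steps = sim.run()
    return sim.now, steps, finished


end_time, steps, finished = demo()
print(end_time, steps)  # 3002.0 12
```

Fifty virtual minutes pass in twelve loop iterations. The ten-minute idle gaps cost nothing, because the clock jumps over them instead of ticking through them.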
The idea is older than it looks. It crystallized around 1961–1967 in two places: Geoffrey Gordon's GPSS at IBM, and Simula, built in Oslo by Kristen Nygaard and Ole-Johan Dahl. Nygaard had been hand-running Monte Carlo simulations for nuclear-reactor and operations-research problems and wanted a language to describe them; Simula's core nouns were literally "stations" and "customers" moving through a system. In generalizing those, Nygaard and Dahl invented classes, objects, and inheritance — and so accidentally fathered object-oriented programming, a Turing Award in hand decades later. DynoSim composing Dynamo's parts as actors on a shared event timeline is, almost exactly, Simula's worldview pointed at GPUs.
The virtual clock · 60 minutes of traffic in ~2.4 seconds
An old idea, part two: the Pareto frontier
Once you can score a configuration cheaply, you need a way to compare configurations that are good at different things: one gives lower latency, another higher throughput. There is no single winner, and weighting the objectives into one number just hides the choice. The right tool is a century old. Vilfredo Pareto, an Italian engineer-turned-economist, introduced in his 1906 Manual of Political Economy the notion that an allocation is optimal when you can't make one thing better without making another worse. Carried into engineering, that gives the Pareto frontier: the set of configurations that are non-dominated, meaning no other configuration is at least as good on every axis and strictly better on at least one. Everything behind the frontier is strictly wasteful; the frontier itself is the menu of honest trade-offs.
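Non-dominance is easy to state in code. The sketch below treats each configuration as a (throughput, latency) pair where higher throughput and lower latency are better; the sample points are made up for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on both axes and strictly
    better on at least one. Points are (throughput, latency) pairs."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_frontier(points):
    """Keep only the points that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical sweep results: (requests per second, p90 latency in ms).
sweep = [(100, 50), (200, 80), (150, 90), (300, 200), (250, 210), (100, 40)]
frontier = sorted(pareto_frontier(sweep))
print(frontier)  # [(100, 40), (200, 80), (300, 200)]
```

The three dropped points are each beaten outright: for example, (150, 90) loses to (200, 80) on both throughput and latency. The quadratic scan is fine for a few thousand points; larger sweeps would want a sort-based pass.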
For serving, the axes are throughput and latency, and mapping that frontier is the whole game. Change a routing or caching choice and the frontier moves.
The Pareto frontier · throughput vs. latency
A whole sweep of configurations becomes a cloud of points; the frontier is the lower-right edge you'd actually choose from; and a better algorithm — smarter routing, a deeper cache tier — shows up as the entire frontier shifting toward more throughput at lower latency. DynoSim's job is to draw that picture in seconds instead of GPU-weeks.
DynoSim: composing Dynamo as events
The design choice that makes this work is composition. DynoSim is not one monolithic model of "a serving system"; it's the actual set of Dynamo components — a replay harness, per-worker engine simulations, and the cross-worker behaviors of Router, Planner, and cache manager — all running on one shared discrete-event timeline. A request threads through them as a chain of scheduled events, and crucially, every decision changes the future: a routing choice reshapes a worker's queue, a scaling decision delays capacity, a cache movement changes when decode can start. Step through one request's journey through the twin.
A request's journey through the twin
When a fast estimate isn't enough
A simulator is only useful if you trust its numbers, so DynoSim's hardest job is fidelity. Part of that comes from AIConfigurator (AIC), which estimates how long a given forward pass takes on real silicon — model, backend, GPU, tensor-parallel shape, pass shape — and it's strong at it, especially for throughput and per-token time. But raw engine timing isn't the whole story. TTFT depends on how requests wait, batch, chunk, and enter prefill under load — and that's the scheduler's doing, not the kernel's. So DynoSim models the scheduler itself, backend-specific: the vLLM path with its waiting/running queues, shared token budget, and preemption-recompute; the SGLang path with radix-cache-aware admission and chunked prefill. The difference shows up exactly where it hurts.
Scheduler-aware replay vs. timing estimate vs. hardware
At low concurrency, an engine-timing estimate is fine. Push concurrency up and TTFT diverges — queueing effects the estimate never sees — while the scheduler-aware replay stays glued to the hardware curve. That gap is the entire argument for simulating the scheduler instead of guessing a number.
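A toy model shows why the gap opens. Assume a single worker that runs one prefill at a time, a fixed prefill duration, and a burst of requests that all arrive together. That is a deliberate oversimplification and not the vLLM or SGLang scheduler, but it is enough to show what a timing-only estimate misses.

```python
def timing_only_ttft(prefill_ms, concurrency):
    """Engine-timing view: every request sees only its own prefill time."""
    return [prefill_ms] * concurrency


def queue_aware_ttft(prefill_ms, concurrency):
    """Queue-aware view: requests that arrive together are prefilled one
    after another (FIFO), so each one also waits for those ahead of it."""
    ttfts = []
    clock = 0.0
    for _ in range(concurrency):
        clock += prefill_ms  # this request's prefill finishes at `clock`
        ttfts.append(clock)
    return ttfts


def mean(values):
    return sum(values) / len(values)


for n in (1, 4, 16):
    est = mean(timing_only_ttft(100.0, n))
    sim = mean(queue_aware_ttft(100.0, n))
    print(n, est, sim)
# 1 100.0 100.0
# 4 100.0 250.0
# 16 100.0 850.0
```

At concurrency 1 the two views agree. At 16 the timing-only estimate is still 100 ms while the queue-aware mean is 850 ms, and that difference is pure waiting time.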
The parts that watch each other
A single worker is the easy case. Dynamo's real power lives in components that make decisions from live system feedback — and that's the hard thing to simulate honestly, because each decision changes the state the next one reads. The Router needs current cache state and decode load. The Planner needs traffic, worker health, and SLA signals. The cache manager needs transfer pressure and tier capacity. DynoSim models all of them on the same timestamp-ordered queue, so the feedback is real, not faked.
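To make the Router's inputs concrete, here is a hypothetical cost function that trades cache reuse against decode load. The formula, the `overlap_weight` parameter, and the worker fields are assumptions for illustration; this is not Dynamo's actual router code.

```python
def prefix_overlap(request_blocks, cached_blocks):
    """Count how many leading blocks of the request are already cached."""
    matched = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break
        matched += 1
    return matched


def pick_worker(request_blocks, workers, overlap_weight=1.0):
    """Choose the worker with the lowest assumed cost:
    weight * (blocks still to prefill) + (blocks already being decoded)."""
    def cost(worker):
        overlap = prefix_overlap(request_blocks, worker["cached"])
        new_prefill_blocks = len(request_blocks) - overlap
        return overlap_weight * new_prefill_blocks + worker["active_decode_blocks"]
    return min(workers, key=cost)["name"]


request = ["a", "b", "c", "d"]
busy_warm = {"name": "w0", "cached": {"a", "b", "c"}, "active_decode_blocks": 10}
idle_cold = {"name": "w1", "cached": set(), "active_decode_blocks": 2}
print(pick_worker(request, [busy_warm, idle_cold]))  # w1 (cost 6 beats 11)

calm_warm = dict(busy_warm, active_decode_blocks=3)
print(pick_worker(request, [calm_warm, idle_cold]))  # w0 (cost 4 beats 6)
```

The same request goes to a different worker depending on live load, which is why the Router has to be simulated against current state and not a static table.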
You already saw the Router at work in the Pareto picture above — cache-aware placement lifting prefix reuse and dragging the frontier toward the good corner (with the honest caveat that affinity can crowd a few hot workers and nudge decode latency up at peak). The KVBM — the KV block manager — is the other half: it decides where each cache block lives across a memory hierarchy, from GPU memory down to host RAM, SSD, and remote stores.
The KV cache memory hierarchy · and what one more tier buys
The insight is simple once you see it: a KV block you can reuse from any tier is prefill you don't have to recompute. Turning on the host-memory tier lets blocks that would have been evicted survive and be reused, and DynoSim predicts that this single change shifts the whole throughput–latency frontier toward more throughput at lower latency, with a 19.3% TTFT improvement at the busiest tested point.
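The effect of one extra tier can be sketched with two LRU caches. The block trace, the capacities, and the LRU eviction policy below are all assumptions; the real KVBM policy may differ, and this toy ignores transfer time entirely.

```python
from collections import OrderedDict


def count_recomputes(trace, gpu_blocks, host_blocks=0):
    """Replay a block-access trace and count blocks that must be recomputed.
    Blocks evicted from the GPU tier spill to a host tier when one exists."""
    gpu = OrderedDict()   # LRU order: oldest entry first
    host = OrderedDict()
    recomputes = 0
    for block in trace:
        if block in gpu:
            gpu.move_to_end(block)  # GPU hit: refresh recency
            continue
        if block in host:
            del host[block]         # host hit: promote, no recompute needed
        else:
            recomputes += 1         # miss in every tier: prefill again
        gpu[block] = True
        if len(gpu) > gpu_blocks:
            victim, _ = gpu.popitem(last=False)
            if host_blocks > 0:
                host[victim] = True
                if len(host) > host_blocks:
                    host.popitem(last=False)
    return recomputes


# Hypothetical trace: four prefix blocks requested in a repeating cycle.
trace = [0, 1, 2, 3] * 3
print(count_recomputes(trace, gpu_blocks=2))                 # 12
print(count_recomputes(trace, gpu_blocks=2, host_blocks=2))  # 4
```

With only a two-block GPU tier, the cyclic trace misses on every access. Adding a two-block host tier means each block is computed once and then served from one tier or the other.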
When to add a machine
The Planner — autoscaling — is the component DynoSim was almost built to study, for two reasons. First, its behavior is macro: it only emerges from minutes of traffic, delayed worker startup, capacity churn, and the feedback between scaling decisions, queues, and routing. No unit test exercises that. Second, testing it for real means standing up a full Kubernetes cluster and burning GPU-hours per policy tweak. Simulation lets you sweep it in seconds instead.
Static deployments vs. an SLA-targeted Planner
The first result is the clean one: a Planner told to hit a latency SLA lands in the corner no fixed deployment can reach — lower p90 latency and fewer GPU-hours than any static replica count. The second experiment sweeps how often it's allowed to act: between 1 and 10 seconds, p90 TTFT barely moves, but the number of scaling events collapses from about 1,529 to 233 — so a 5–10 second interval buys the same responsiveness with a fraction of the churn. The third is the cautionary one, and it has teeth.
The SLA cliff · when new capacity arrives too late
Adding a worker isn't instant: a new pod needs seconds to minutes before it can serve. DynoSim models that delay and finds a cliff. The Planner keeps the SLA until cold start reaches about 180 seconds, and then performance falls off the edge: by 300 seconds the system is stuck behind a traffic burst with p90 TTFT at 242 seconds. That's a concrete engineering directive you'd never get from intuition: keep cold start under ~200 seconds, or no autoscaler can save you.
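The mechanism behind the cliff can be seen in a toy backlog model. Everything here is an assumption: the traffic rates, the per-worker service rate, the one-extra-worker Planner rule, and the one-second time step. It does not reproduce DynoSim's 180-second result; it only shows how a late worker lets a burst pile up.

```python
def peak_backlog(cold_start_s, burst_start=60, burst_len=120, horizon=600,
                 base_rate=1.0, burst_rate=3.0, per_worker_rate=1.5):
    """Step a one-second fluid model and return the largest request backlog.
    A toy Planner orders one extra worker as soon as it sees a backlog;
    the worker only starts serving `cold_start_s` seconds later."""
    workers = 1
    ready_at = None  # time at which the ordered worker comes online
    backlog = 0.0
    peak = 0.0
    for t in range(horizon):
        in_burst = burst_start <= t < burst_start + burst_len
        rate = burst_rate if in_burst else base_rate
        if ready_at is not None and t >= ready_at:
            workers += 1  # cold start finished
            ready_at = None
        if backlog > 0 and ready_at is None and workers < 2:
            ready_at = t + cold_start_s  # scaling decision made now
        backlog = max(0.0, backlog + rate - workers * per_worker_rate)
        peak = max(peak, backlog)
    return peak


for cold in (10, 100, 300):
    print(cold, peak_backlog(cold))
# 10 16.5
# 100 151.5
# 300 180.0
```

With a 10-second cold start the new worker caps the backlog almost immediately. At 300 seconds it arrives after the burst has ended, so the system absorbs the whole burst on one worker and the extra capacity buys nothing.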
Replay as a scoring function
Once a workload can run through the composed twin, replay stops being a viewer and becomes a scoring function: propose a layout, run the trace, read the metrics, compare to your objective. Today the optimizer is deliberately crude — block-coordinate descent over the knobs (pick a tensor-parallel shape, then a worker split, then a router setting) — which works because the space is still small and locally smooth. As it grows, the same scoring loop bolts onto sharper black-box optimizers: Bayesian search, genetic algorithms, Vizier. And the loop isn't limited to numeric dials. In the spirit of Karpathy's "autoresearch," an agent can propose a real code change to a router cost function or cache policy, rebuild Dynamo, rerun the trace, and keep the change only if the objective improves — a bounded research loop where the simulator is the referee.
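The scoring loop with a crude optimizer on top might look like the sketch below. It is a one-knob-at-a-time coordinate descent, the simplest form of the block-coordinate idea, and the `toy_score` function is a made-up stand-in for running a trace through the simulator. Knob names and values are hypothetical.

```python
def coordinate_descent(knobs, score, start, max_rounds=10):
    """Sweep one knob at a time, keeping any value that lowers the score.
    Stops when a full pass over all knobs finds no improvement."""
    best = dict(start)
    best_score = score(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in knobs.items():
            for value in values:
                candidate = dict(best, **{name: value})
                s = score(candidate)
                if s < best_score:
                    best, best_score, improved = candidate, s, True
        if not improved:
            break
    return best, best_score


# Hypothetical search space.
knobs = {
    "tensor_parallel": [1, 2, 4, 8],
    "prefill_workers": [1, 2, 4, 8],
    "router": ["round_robin", "kv_aware"],
}


def toy_score(cfg):
    """Made-up objective (lower is better) standing in for a replay run."""
    penalty = 0 if cfg["router"] == "kv_aware" else 5
    return (cfg["tensor_parallel"] - 4) ** 2 + (cfg["prefill_workers"] - 2) ** 2 + penalty


start = {"tensor_parallel": 1, "prefill_workers": 8, "router": "round_robin"}
best, best_score = coordinate_descent(knobs, toy_score, start)
print(best, best_score)
# {'tensor_parallel': 4, 'prefill_workers': 2, 'router': 'kv_aware'} 0
```

This toy objective is separable, so one pass finds the optimum. When knobs interact strongly, a coordinate search can stall in a local optimum, which is the case for moving to the black-box optimizers mentioned above.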
Two loops · simulate broadly, verify narrowly
The loop that never really closes
The point was never to replace the real cluster — it's to aim it. Simulation is the inner loop, cheap and exhaustive; the cluster is the outer loop, the judge that validates a shortlist and calibrates the twin from telemetry. And the team's plan is to keep that loop spinning in production: a sweep that runs periodically against recently-recorded traffic, searches the configuration space under the workload you actually have right now, and recommends — or applies — a better deployment when it finds one. Because traffic shape drifts over hours and days, the right tensor-parallel shape, prefill/decode split, router policy, and Planner setting from last week may quietly stop being optimal. A continuous sweep keeps the live system tracking the moving target instead of frozen at whatever you guessed on launch day.
Step back and the shape is familiar. This is the same move as the robotics work — simulate cheaply, verify expensively — pointed at a different frontier: there, teaching a robot in a dream before it touches the world; here, tuning a GPU fleet in a dream before it bills you. It's the systems-design discipline underneath both: build a faithful enough model, make it the inner loop, and spend the expensive real thing only on what the model says is worth it. For your map, that's the bridge — the AWS side is operating systems at scale; this is operating the inference layer at scale, and the technique that makes it tractable is the oldest one in the book, a virtual clock and an event queue, now racing a hundred-knob GPU fleet a thousand times faster than the world it models.
Now — go build.