Helios-2048 Adaptive Routing Cutover
Helios-2048 is a 2,048-node system on an HDR 200 dragonfly fabric. The runtime experienced intermittent p95 latency spikes during all-to-all collectives when multiple teams shared the same global links.
Baseline context
- Cluster: Helios-2048
- Fabric: HDR 200, dragonfly topology
- Runtime: GASNet-style collectives with adaptive routing enabled
- Pain point: tail latency spikes during synchronized collectives
Observed behavior
- Queue depth oscillations coincided with global link congestion.
- p95 latency spikes appeared at message sizes of 64 KB and above under mixed workloads.
- Adaptive routing was enabled but not tuned for congestion epochs.
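The p95 figures above can be reproduced with a nearest-rank percentile over per-size latency buckets. A minimal sketch, assuming illustrative sample records rather than values from the Helios-2048 dataset:

```python
import math
from collections import defaultdict

def p95(samples):
    """95th percentile via the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def p95_by_size(records):
    """Group (msg_size_bytes, latency_us) records and compute p95 per size."""
    buckets = defaultdict(list)
    for size, latency in records:
        buckets[size].append(latency)
    return {size: p95(latencies) for size, latencies in buckets.items()}

# Illustrative records: a tight 4 KB bucket and a heavy-tailed 64 KB bucket.
records = [(4096, 10.0)] * 100 + [(65536, float(v)) for v in range(1, 101)]
tails = p95_by_size(records)
```

Bucketing by message size is what surfaces the 64 KB threshold: a single aggregate p95 would hide which sizes drive the tail.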
Interventions
- Enabled congestion control telemetry to capture ECN marks per rail.
- Adjusted adaptive routing thresholds to prefer local groups under pressure.
- Increased the message coalescing window for large collective payloads.
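The routing-bias intervention can be sketched as a decision function over per-rail ECN telemetry. This is a hedged illustration: the `RailStats` fields, the 5% mark-rate threshold, and the `prefer_local_group` helper are assumptions for exposition, not the production tunables.

```python
from dataclasses import dataclass

@dataclass
class RailStats:
    """Per-rail counters sampled from congestion-control telemetry (assumed schema)."""
    packets: int
    ecn_marked: int

    @property
    def mark_rate(self) -> float:
        return self.ecn_marked / self.packets if self.packets else 0.0

def prefer_local_group(rails, threshold=0.05):
    """During a congestion epoch (any rail's ECN mark rate above the
    threshold), bias adaptive routing toward minimal, local-group paths."""
    return any(rail.mark_rate > threshold for rail in rails)

# Illustrative telemetry: rail 1 is clean, rail 2 is marking 8% of packets.
rails = [RailStats(packets=10_000, ecn_marked=20),
         RailStats(packets=10_000, ecn_marked=800)]
bias_local = prefer_local_group(rails)
```

The key design choice is that the bias is epoch-scoped: minimal routing is preferred only while marks exceed the threshold, so non-minimal adaptive paths remain available once congestion clears.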
Results (dataset-backed)
The benchmark dataset is stored under static/data/benchmarks/helios-2048.json.
Latency tails tightened once the routing bias kept collective traffic within
each local group whenever congestion was detected.
Decision log
- Accepted: Adaptive routing bias toward local groups.
- Accepted: Larger collective chunk size to reduce global link pressure.
- Deferred: Firmware upgrade pending vendor validation.
Follow-up work
- Validate improvements with multi-tenant workloads.
- Compare ECN thresholds across rails to confirm stability.