Helios-2048 Adaptive Routing Cutover
Helios-2048 is a 2,048-node system on an HDR 200 dragonfly fabric. The runtime experienced intermittent p95 latency spikes during all-to-all collectives when multiple teams shared the same global links.
Baseline context
- Cluster: Helios-2048
- Fabric: HDR 200, dragonfly topology
- Runtime: GASNet-style collectives with adaptive routing enabled
- Pain point: tail latency spikes during synchronized collectives
Observed behavior
- Queue depth oscillations coincided with global link congestion.
- p95 latency spikes appeared at message sizes of 64 KB and above under mixed workloads.
- Adaptive routing was enabled but not tuned for congestion epochs.
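The p95 figures above can be reproduced with a nearest-rank percentile over per-size latency buckets. A minimal sketch, assuming illustrative sample records rather than values from the Helios-2048 dataset:

```python
import math
from collections import defaultdict

def p95(samples):
    """95th percentile via the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def p95_by_size(records):
    """Group (msg_size_bytes, latency_us) records and compute p95 per size."""
    buckets = defaultdict(list)
    for size, latency in records:
        buckets[size].append(latency)
    return {size: p95(latencies) for size, latencies in buckets.items()}

# Illustrative records: a tight 4 KB bucket and a heavy-tailed 64 KB bucket.
records = [(4096, 10.0)] * 100 + [(65536, float(v)) for v in range(1, 101)]
tails = p95_by_size(records)
```

Bucketing by message size is what surfaces the 64 KB threshold: a single aggregate p95 would hide which sizes drive the tail.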
Interventions
- Enabled congestion control telemetry to capture ECN marks per rail.
- Adjusted adaptive routing thresholds to prefer local groups under pressure.
- Increased the message coalescing window for large collective payloads.
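The routing-bias intervention can be sketched as a decision function over per-rail ECN telemetry. This is a hedged illustration: the `RailStats` fields, the 5% mark-rate threshold, and the `prefer_local_group` helper are assumptions for exposition, not the production tunables.

```python
from dataclasses import dataclass

@dataclass
class RailStats:
    """Per-rail counters sampled from congestion-control telemetry (assumed schema)."""
    packets: int
    ecn_marked: int

    @property
    def mark_rate(self) -> float:
        return self.ecn_marked / self.packets if self.packets else 0.0

def prefer_local_group(rails, threshold=0.05):
    """During a congestion epoch (any rail's ECN mark rate above the
    threshold), bias adaptive routing toward minimal, local-group paths."""
    return any(rail.mark_rate > threshold for rail in rails)

# Illustrative telemetry: rail 1 is clean, rail 2 is marking 8% of packets.
rails = [RailStats(packets=10_000, ecn_marked=20),
         RailStats(packets=10_000, ecn_marked=800)]
bias_local = prefer_local_group(rails)
```

The key design choice is that the bias is epoch-scoped: minimal routing is preferred only while marks exceed the threshold, so non-minimal adaptive paths remain available once congestion clears.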
Results (dataset-backed)
The benchmark dataset is stored under static/data/benchmarks/helios-2048.json.
Latency tails tightened once the routing bias kept collective traffic within
each local group whenever congestion was detected.
Decision log
- Accepted: Adaptive routing bias toward local groups.
- Accepted: Larger collective chunk size to reduce global link pressure.
- Deferred: Firmware upgrade pending vendor validation.
Follow-up work
- Validate improvements with multi-tenant workloads.
- Compare ECN thresholds across rails to confirm stability.