Version: 2.0 (Current)

Helios-2048 Adaptive Routing Cutover

Helios-2048 is a 2,048-node system on an HDR 200 dragonfly fabric. The runtime experienced intermittent p95 latency spikes during all-to-all collectives when multiple teams shared the same global links.

Baseline context

  • Cluster: Helios-2048
  • Fabric: HDR 200, dragonfly topology
  • Runtime: GASNet-style collectives with adaptive routing enabled
  • Pain point: tail latency spikes during synchronized collectives

Observed behavior

  • Queue depth oscillations coincided with global link congestion.
  • p95 latency spikes appeared at 64 KB and above under mixed workloads.
  • Adaptive routing was enabled but not tuned for congestion epochs.
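Tail spikes like these are easiest to confirm offline from per-message latency samples rather than averages, since a congested epoch inflates p95 while barely moving the median. A minimal nearest-rank percentile sketch (the sample values are illustrative, not Helios-2048 data):

```python
def percentile(samples, pct):
    # Nearest-rank percentile over latency samples (microseconds).
    ordered = sorted(samples)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

# Illustrative samples: a congestion epoch adds a small number of
# very slow messages, which moves the p95 tail but not the median.
quiet = [18, 19, 20, 21, 22] * 20          # 100 quiet-epoch samples
congested = quiet + [200] * 10             # plus 10 congested outliers

print("quiet p95:", percentile(quiet, 95))
print("congested p95:", percentile(congested, 95))
```

The same median with a 9x larger p95 is the signature that pointed at global-link congestion rather than uniformly slow rails.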

Interventions

  1. Enabled congestion control telemetry to capture ECN marks per rail.
  2. Adjusted adaptive routing thresholds to prefer local groups under pressure.
  3. Increased the message coalescing window for large collective payloads.
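The three interventions reduce to a handful of tunables. The names below are hypothetical placeholders for vendor-specific routing and runtime knobs, not a real API; the sketch only shows how the pieces relate (telemetry always on, local-group bias gated on a congestion threshold, coalescing widened for the sizes that spiked):

```python
# Hypothetical tunables for the cutover; names are illustrative and
# would map onto vendor-specific adaptive-routing and runtime settings.
ROUTING_CONFIG = {
    "ecn_telemetry": True,           # step 1: per-rail ECN mark counters
    "local_group_bias": 0.8,         # step 2: preference for minimal (local-group) routes
    "congestion_threshold_pct": 70,  # step 2: queue occupancy that triggers the bias
    "coalesce_window_us": 50,        # step 3: widened coalescing window
    "coalesce_min_msg_kb": 64,       # step 3: coalesce only at the sizes that spiked
}

def effective_bias(queue_occupancy_pct, cfg=ROUTING_CONFIG):
    """Apply the local-group bias only once congestion is detected."""
    if queue_occupancy_pct >= cfg["congestion_threshold_pct"]:
        return cfg["local_group_bias"]
    return 0.0  # unbiased adaptive routing while the fabric is quiet
```

Gating the bias on occupancy is the point of step 2: routing stays fully adaptive in quiet epochs and only prefers local groups once the fabric is actually under pressure.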

Results (dataset-backed)


The benchmark dataset backing these charts is stored under static/data/benchmarks/helios-2048.json. Latency tails tightened once the routing bias kept traffic within each local group whenever congestion was detected.
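A quick way to read the improvement out of a dataset like this is to compare p95 before and after the cutover per message size. The record layout below is an assumption for illustration; the actual schema of helios-2048.json may differ, and the numbers are made-up sample values, not measured results:

```python
import json

# Assumed record layout (illustrative only): one entry per message size,
# with p95 latency in microseconds before and after the routing cutover.
sample = json.loads("""
[
  {"msg_kb": 64,  "p95_us_before": 310, "p95_us_after": 120},
  {"msg_kb": 256, "p95_us_before": 540, "p95_us_after": 190}
]
""")

def p95_reduction(run):
    """Fractional p95 reduction for one benchmark entry."""
    return 1 - run["p95_us_after"] / run["p95_us_before"]

for run in sample:
    print(f"{run['msg_kb']} KB: p95 tightened by {p95_reduction(run):.0%}")
```

The same loop pointed at the real JSON file would summarize how much each payload size benefited from the local-group bias.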

Decision log

  • Accepted: Adaptive routing bias toward local groups.
  • Accepted: Larger collective chunk size to reduce global link pressure.
  • Deferred: Firmware upgrade pending vendor validation.

Follow-up work

  • Validate improvements with multi-tenant workloads.
  • Compare ECN thresholds across rails to confirm stability.