Orion-1024 GPU Direct RDMA Enablement
Orion-1024 runs accelerator-heavy workloads where GPU buffers dominate the data path. The initial configuration relied on host staging, creating extra copies and limiting bandwidth.
Baseline context
- Cluster: Orion-1024
- Fabric: Slingshot, fat-tree topology
- Runtime: GASNet-style RDMA path with GPU-aware support disabled
- Pain point: large-message bandwidth plateau below target
Observed behavior
- CPU utilization spiked during large transfers due to staging copies (the staged path is sketched after this list).
- Bandwidth plateaued earlier than expected in the 64 KB–1 MB range.
- GPU kernel overlap stalled because RMA operations waited on host buffers.
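The staging penalty is easiest to see in code. Below is a minimal sketch of the baseline staged path, assuming an ibverbs-style transport for readability; `staged_put` and `post_rdma_write` are hypothetical stand-ins for the runtime's internals, not actual Orion-1024 functions. Every put pays a device-to-host copy before the NIC sees any data, which accounts for both the CPU spike and the stalled kernel overlap.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport's RDMA post; the real path
 * would build a work request and hand it to the NIC here. */
static void post_rdma_write(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
}

/* Baseline staged put: GPU data bounces through a pinned host buffer
 * before the RDMA write is posted. */
static void staged_put(const void *gpu_src, void *host_staging, size_t len)
{
    /* Extra copy: the device buffer must land in host memory first,
     * costing CPU and DMA time proportional to the message size. */
    cudaMemcpy(host_staging, gpu_src, len, cudaMemcpyDeviceToHost);

    /* The NIC only ever sees the host buffer, so large transfers
     * serialize behind the copy and block overlap with GPU kernels. */
    post_rdma_write(host_staging, len);
}
```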
Interventions
- Enabled GPU Direct RDMA support in the transport stack.
- Introduced pinned staging buffers for control-plane messages only.
- Registered GPU buffers during job initialization to avoid on-demand costs (see the registration sketch after this list).
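A minimal sketch of the two registration changes, again using an ibverbs-style interface so the calls are recognizable; on Slingshot the production path would go through the fabric's native provider instead, and the `pd` handle and buffer sizes here are illustrative only. With GPUDirect RDMA enabled (e.g. via the nvidia-peermem kernel module), `ibv_reg_mr()` can pin a `cudaMalloc` buffer directly, so registering once at job start removes both the per-message registration cost and the staging copy; a small pinned pool is kept solely for control-plane traffic.

```c
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stddef.h>

#define DATA_BUF_SIZE (1UL << 26)  /* 64 MB GPU data buffer (example size) */
#define CTRL_BUF_SIZE (1UL << 16)  /* 64 KB pinned control pool (example size) */

struct job_buffers {
    void          *gpu_data;   /* device memory: RDMA source/target */
    void          *ctrl_pool;  /* pinned host memory: control plane only */
    struct ibv_mr *gpu_mr;
    struct ibv_mr *ctrl_mr;
};

int init_buffers(struct ibv_pd *pd, struct job_buffers *jb)
{
    int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;

    /* Device allocation: with GPUDirect RDMA the NIC DMAs directly
     * to/from this buffer, skipping host staging entirely. */
    if (cudaMalloc(&jb->gpu_data, DATA_BUF_SIZE) != cudaSuccess)
        return -1;

    /* Register once at job init: registration pins pages and programs
     * the NIC's address translation, which is too slow to do per message. */
    jb->gpu_mr = ibv_reg_mr(pd, jb->gpu_data, DATA_BUF_SIZE, access);
    if (!jb->gpu_mr)
        return -1;

    /* Separate pinned pool reserved for control-plane messages only. */
    if (cudaHostAlloc(&jb->ctrl_pool, CTRL_BUF_SIZE, cudaHostAllocDefault)
            != cudaSuccess)
        return -1;
    jb->ctrl_mr = ibv_reg_mr(pd, jb->ctrl_pool, CTRL_BUF_SIZE, access);
    return jb->ctrl_mr ? 0 : -1;
}
```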
Results (dataset-backed)
Bandwidth improved for large message sizes once GPU buffers were registered up front and host staging was removed from the data path.
Decision log
- Accepted: Persistent GPU buffer registration.
- Accepted: Separate control-plane staging pool.
- Deferred: GPU memory oversubscription tests.
Next measurement
Validate mixed CPU/GPU traffic to ensure the progress engine does not starve GPU kernels under heavy load.
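One rough sketch of how that probe could look, using CUDA events for timing; `busy_kernel` and `drive_transport_once` are hypothetical placeholders, the latter standing in for one turn of the runtime's progress engine under mixed CPU/GPU traffic. If progress work starves the GPU, the measured kernel time will inflate relative to an unloaded baseline run.

```c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void busy_kernel(float *buf, int iters)
{
    float v = buf[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;  /* keep the SM occupied */
    buf[threadIdx.x] = v;
}

/* Hypothetical stand-in for one turn of the runtime's progress engine
 * (posting/polling mixed CPU- and GPU-sourced RMA traffic). */
static void drive_transport_once(void) {}

int main(void)
{
    float *d_buf;
    cudaMalloc(&d_buf, 256 * sizeof(float));
    cudaMemset(d_buf, 0, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busy_kernel<<<1, 256>>>(d_buf, 1 << 22);
    cudaEventRecord(stop);

    /* Drive communication progress while the kernel runs; starvation
     * shows up as kernel time inflating versus the unloaded baseline. */
    while (cudaEventQuery(stop) == cudaErrorNotReady)
        drive_transport_once();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time under load: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return 0;
}
```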