Orion-1024 GPU Direct RDMA Enablement
Orion-1024 runs accelerator-heavy workloads where GPU buffers dominate the data path. The initial configuration relied on host staging, creating extra copies and limiting bandwidth.
Baseline context
- Cluster: Orion-1024
- Fabric: Slingshot, fat-tree topology
- Runtime: GASNet-style RDMA path with GPU-aware support disabled
- Pain point: large-message bandwidth plateau below target
Observed behavior
- CPU utilization spiked during large transfers due to staging copies (the staged path is sketched after this list).
- Bandwidth plateaued earlier than expected in the 64 KB–1 MB range.
- GPU kernel overlap stalled because RMA operations waited on host buffers.
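The staging penalty is easiest to see in code. Below is a minimal sketch of the baseline staged path, assuming an ibverbs-style transport for readability; `staged_put` and `post_rdma_write` are hypothetical stand-ins for the runtime's internals, not actual Orion-1024 functions. Every put pays a device-to-host copy before the NIC sees any data, which accounts for both the CPU spike and the stalled kernel overlap.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical stand-in for the transport's RDMA post; the real path
 * would build a work request and hand it to the NIC here. */
static void post_rdma_write(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
}

/* Baseline staged put: GPU data bounces through a pinned host buffer
 * before the RDMA write is posted. */
static void staged_put(const void *gpu_src, void *host_staging, size_t len)
{
    /* Extra copy: the device buffer must land in host memory first,
     * costing CPU and DMA time proportional to the message size. */
    cudaMemcpy(host_staging, gpu_src, len, cudaMemcpyDeviceToHost);

    /* The NIC only ever sees the host buffer, so large transfers
     * serialize behind the copy and block overlap with GPU kernels. */
    post_rdma_write(host_staging, len);
}
```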
Interventions
- Enabled GPU Direct RDMA support in the transport stack.
- Introduced pinned staging buffers for control-plane messages only.
- Registered GPU buffers during job initialization to avoid on-demand costs (see the registration sketch after this list).
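A minimal sketch of the two registration changes, again using an ibverbs-style interface so the calls are recognizable; on Slingshot the production path would go through the fabric's native provider instead, and the `pd` handle and buffer sizes here are illustrative only. With GPUDirect RDMA enabled (e.g. via the nvidia-peermem kernel module), `ibv_reg_mr()` can pin a `cudaMalloc` buffer directly, so registering once at job start removes both the per-message registration cost and the staging copy; a small pinned pool is kept solely for control-plane traffic.

```c
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stddef.h>

#define DATA_BUF_SIZE (1UL << 26)  /* 64 MB GPU data buffer (example size) */
#define CTRL_BUF_SIZE (1UL << 16)  /* 64 KB pinned control pool (example size) */

struct job_buffers {
    void          *gpu_data;   /* device memory: RDMA source/target */
    void          *ctrl_pool;  /* pinned host memory: control plane only */
    struct ibv_mr *gpu_mr;
    struct ibv_mr *ctrl_mr;
};

int init_buffers(struct ibv_pd *pd, struct job_buffers *jb)
{
    int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                 IBV_ACCESS_REMOTE_READ;

    /* Device allocation: with GPUDirect RDMA the NIC DMAs directly
     * to/from this buffer, skipping host staging entirely. */
    if (cudaMalloc(&jb->gpu_data, DATA_BUF_SIZE) != cudaSuccess)
        return -1;

    /* Register once at job init: registration pins pages and programs
     * the NIC's address translation, which is too slow to do per message. */
    jb->gpu_mr = ibv_reg_mr(pd, jb->gpu_data, DATA_BUF_SIZE, access);
    if (!jb->gpu_mr)
        return -1;

    /* Separate pinned pool reserved for control-plane messages only. */
    if (cudaHostAlloc(&jb->ctrl_pool, CTRL_BUF_SIZE, cudaHostAllocDefault)
            != cudaSuccess)
        return -1;
    jb->ctrl_mr = ibv_reg_mr(pd, jb->ctrl_pool, CTRL_BUF_SIZE, access);
    return jb->ctrl_mr ? 0 : -1;
}
```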
Results (dataset-backed)
Bandwidth improved for large message sizes once GPU buffers were registered up front and host staging was removed from the data path.
Decision log
- Accepted: Persistent GPU buffer registration.
- Accepted: Separate control-plane staging pool.
- Deferred: GPU memory oversubscription tests.
Next measurement
Validate mixed CPU/GPU traffic to ensure the progress engine does not starve GPU kernels under heavy load.
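One rough sketch of how that probe could look, using CUDA events for timing; `busy_kernel` and `drive_transport_once` are hypothetical placeholders, the latter standing in for one turn of the runtime's progress engine under mixed CPU/GPU traffic. If progress work starves the GPU, the measured kernel time will inflate relative to an unloaded baseline run.

```c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void busy_kernel(float *buf, int iters)
{
    float v = buf[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;  /* keep the SM occupied */
    buf[threadIdx.x] = v;
}

/* Hypothetical stand-in for one turn of the runtime's progress engine
 * (posting/polling mixed CPU- and GPU-sourced RMA traffic). */
static void drive_transport_once(void) {}

int main(void)
{
    float *d_buf;
    cudaMalloc(&d_buf, 256 * sizeof(float));
    cudaMemset(d_buf, 0, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    busy_kernel<<<1, 256>>>(d_buf, 1 << 22);
    cudaEventRecord(stop);

    /* Drive communication progress while the kernel runs; starvation
     * shows up as kernel time inflating versus the unloaded baseline. */
    while (cudaEventQuery(stop) == cudaErrorNotReady)
        drive_transport_once();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time under load: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return 0;
}
```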