Zephyr-512 OFI TCP Fallback Stabilization
Zephyr-512 faced intermittent RDMA transport failures during peak usage. The team opted to validate a TCP fallback path to keep workflows running while hardware diagnostics were underway.
Baseline context
- Cluster: Zephyr-512
- Fabric: OFI TCP over leaf-spine Ethernet
- Runtime: GASNet-style OFI transport in TCP mode
- Pain point: RDMA transport instability during queue bursts
Observed behavior
- RDMA queue overruns triggered job failures.
- TCP fallback kept jobs alive but reduced bandwidth.
- Latency tails widened under concurrent flows.
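One way to put a number on the widening latency tails is to compare a high percentile against the median. The sketch below is illustrative only: the function name `tail_ratio`, the choice of the 99th percentile, and the microsecond unit are assumptions, not details from the Zephyr-512 measurements.

```python
import statistics


def tail_ratio(latencies_us, tail_pct=99):
    """Return (median, tail percentile, tail/median) for latency samples.

    Hypothetical helper: the percentile and units are assumptions.
    """
    if len(latencies_us) < 2:
        raise ValueError("need at least two samples")
    if not 1 <= tail_pct <= 99:
        raise ValueError("tail_pct must be between 1 and 99")
    # quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_us, n=100, method="inclusive")
    median = statistics.median(latencies_us)
    if median <= 0:
        raise ValueError("median latency must be positive")
    tail = cuts[tail_pct - 1]
    return median, tail, tail / median
```

A ratio that grows as concurrent flows are added would match the widening described above. A ratio near 1 would indicate a tight distribution.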
Interventions
- Forced OFI provider selection to TCP for stability.
- Lowered message injection rate to avoid queue bursts.
- Added runtime health checks to detect RDMA recovery.
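The first two interventions can be sketched as follows. `FI_PROVIDER` is libfabric's environment variable for restricting provider selection, and `tcp` is one of its provider names. Whether the Zephyr-512 runtime was configured through this variable is an assumption. The source does not say how the injection rate was lowered, so the token bucket here is a stand-in technique, and `tcp_fallback_env` and `TokenBucket` are hypothetical names.

```python
import os
import time


def tcp_fallback_env(base_env=None):
    """Return a copy of the environment that pins libfabric to the tcp provider.

    Assumption: the job launcher passes this environment to the runtime.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["FI_PROVIDER"] = "tcp"
    return env


class TokenBucket:
    """Cap the message injection rate (illustrative; not the team's mechanism)."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = float(rate_per_s)    # tokens refilled per second
        self.capacity = float(burst)     # largest burst allowed at once
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def try_inject(self, n=1):
        """Return True if n messages may be injected now, else False."""
        now = self.clock()
        refill = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A sender would call `try_inject()` before each message and back off when it returns False. A smaller `burst` value limits how many messages can reach the queue at once.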
Results (dataset-backed)
TCP throughput was lower than on the RDMA path, but the fallback avoided the critical transport failures that had been ending jobs, and job completion stabilized.
Decision log
- Accepted: Temporary TCP fallback with explicit bandwidth expectations.
- Accepted: Health checks before re-enabling RDMA.
- Deferred: Switch firmware upgrades pending maintenance window.
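The "health checks before re-enabling RDMA" decision could be implemented as a gate that requires several consecutive healthy probes, so a single good reading does not bring the unstable path back. This is a minimal sketch: the class name `RdmaRecoveryGate`, the default threshold of five passes, and the idea of a boolean probe result are all assumptions. The source does not describe what the health checks measure.

```python
class RdmaRecoveryGate:
    """Allow RDMA re-enable only after N consecutive healthy probes.

    Hypothetical helper: the probe itself is supplied by the caller.
    """

    def __init__(self, required_passes=5):
        if required_passes < 1:
            raise ValueError("required_passes must be at least 1")
        self.required = required_passes
        self.streak = 0

    def record(self, probe_ok):
        """Record one probe result and return whether RDMA may be re-enabled."""
        # Any failed probe resets the streak to zero.
        self.streak = self.streak + 1 if probe_ok else 0
        return self.ready()

    def ready(self):
        return self.streak >= self.required
```

Resetting the streak on any failure keeps the gate conservative, which matches the intent of staying on TCP until RDMA is demonstrably stable.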
Follow-up work
Re-test RDMA once firmware updates are applied and compare against the TCP baseline to quantify the recovery.
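The comparison against the TCP baseline could be reduced to a small report like the one sketched below. The function name `recovery_report`, the Gb/s unit, and the metrics chosen are assumptions. The numbers in the usage note are placeholders, not measurements from Zephyr-512.

```python
def recovery_report(tcp_baseline_gbps, rdma_retest_gbps):
    """Summarize an RDMA re-test against the TCP fallback baseline.

    Hypothetical helper: inputs are throughput figures in the same unit.
    """
    if tcp_baseline_gbps <= 0:
        raise ValueError("TCP baseline must be positive")
    if rdma_retest_gbps < 0:
        raise ValueError("RDMA throughput cannot be negative")
    return {
        "speedup": rdma_retest_gbps / tcp_baseline_gbps,
        "delta_gbps": rdma_retest_gbps - tcp_baseline_gbps,
        "recovered": rdma_retest_gbps > tcp_baseline_gbps,
    }
```

For example, `recovery_report(10.0, 25.0)` with placeholder inputs gives a speedup of 2.5 and a delta of 15.0. The real inputs would come from the post-firmware re-test and the recorded TCP baseline.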