Atlas-4096 Fabric Tuning
This case study documents a tuning cycle for a 200 Gb/s HDR InfiniBand fabric connecting 4,096 nodes. The objective was to reduce active-message tail latency on small messages while preserving bandwidth at larger message sizes.
Baseline context
- Cluster: Atlas-4096
- Fabric: HDR InfiniBand (200 Gb/s), fat-tree topology
- Runtime: GASNet-style active message layer with RDMA transport
- Pain point: p95 latency spikes on small messages during mixed workloads
Observed behavior
- p50 latency stayed within budget, but p95 exceeded the SLA.
- Bandwidth plateaued at larger message sizes than expected, pointing to coalescing pressure.
- CPU utilization on progress threads peaked at 78% under fan-out collectives.
Interventions
- Adjusted the CQ polling cadence to reduce bursty completion processing (see the first sketch below).
- Tuned active-message coalescing thresholds to reduce head-of-line blocking (see the second sketch below).
- Pinned progress threads to a dedicated NUMA region (see the third sketch below).
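A minimal sketch of bounded-batch completion polling, assuming a libibverbs completion queue. `POLL_BUDGET` and `handle_completion` are illustrative names, not part of the Atlas runtime; the real polling cadence logic differs.

```c
/* Sketch: bounded-batch CQ polling to smooth bursty completion
 * processing. Assumes libibverbs; POLL_BUDGET and handle_completion()
 * are illustrative names, not the actual runtime's API. */
#include <infiniband/verbs.h>

#define POLL_BUDGET 16  /* max completions drained per progress pass */

static void handle_completion(const struct ibv_wc *wc)
{
    /* Dispatch the completed work request (illustrative stub). */
    (void)wc;
}

/* Returns the number of completions processed this pass, or <0 on error. */
int progress_pass(struct ibv_cq *cq)
{
    struct ibv_wc wc[POLL_BUDGET];
    int n = ibv_poll_cq(cq, POLL_BUDGET, wc);
    if (n < 0)
        return n;  /* polling error */
    for (int i = 0; i < n; i++)
        handle_completion(&wc[i]);
    return n;  /* caller can back off when this returns 0 */
}
```

Capping the per-pass budget spreads completion work across passes, trading a small amount of peak drain throughput for steadier progress-thread latency.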
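A sketch of the coalescing flush decision, under the assumption that each send queue buffers small messages until a byte threshold or age deadline is hit. Every name and threshold here (`coalesce_queue`, `WINDOW_BYTES_LAT`, the nanosecond deadlines) is hypothetical, chosen only to show the smaller-window policy for latency-sensitive queues.

```c
/* Sketch: per-queue coalescing flush decision. All names and values
 * are hypothetical; the real runtime's structures differ. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct coalesce_queue {
    size_t   buffered_bytes;    /* payload waiting in the coalescing buffer */
    uint64_t oldest_ns;         /* enqueue time of the oldest buffered message */
    bool     latency_sensitive; /* small-message queue with a tight p95 budget */
};

/* Smaller window for latency-sensitive queues: flush sooner so a small
 * message never waits behind a large coalesced batch (assumed values). */
#define WINDOW_BYTES_LAT   2048
#define WINDOW_BYTES_BW   16384
#define MAX_AGE_NS_LAT     2000   /* 2 us */
#define MAX_AGE_NS_BW     20000   /* 20 us */

bool should_flush(const struct coalesce_queue *q, uint64_t now_ns)
{
    size_t   window  = q->latency_sensitive ? WINDOW_BYTES_LAT : WINDOW_BYTES_BW;
    uint64_t max_age = q->latency_sensitive ? MAX_AGE_NS_LAT  : MAX_AGE_NS_BW;

    if (q->buffered_bytes >= window)
        return true;                         /* window full: send the batch */
    return q->buffered_bytes > 0 &&
           now_ns - q->oldest_ns >= max_age; /* age out stale small messages */
}
```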
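A sketch of pinning a progress thread to cores on one NUMA region using `pthread_setaffinity_np`. The core range is an assumed placeholder for whatever cores the NIC-local NUMA node actually exposes on Atlas-4096.

```c
/* Sketch: pin a progress thread to a contiguous core range on the
 * NIC-local NUMA node. Core numbers are assumed placeholders. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to cores [first_core, last_core] (assumed to
 * belong to the NUMA node closest to the HCA). Returns 0 on success. */
int pin_progress_thread(int first_core, int last_core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = first_core; c <= last_core; c++)
        CPU_SET(c, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "affinity failed: %d\n", rc);
    return rc;
}
```

Keeping progress threads on NIC-local cores avoids cross-socket memory traffic on the completion path, which is one plausible source of the p95 jitter noted in the decision log.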
Results (dataset-backed)
The results dataset for this tuning cycle is stored under static/data/benchmarks/atlas-4096.json and can be updated as new runs are produced.
Decision log
- Accepted: Dedicated progress cores (reduced p95 jitter).
- Accepted: Smaller coalescing window for latency-sensitive queues.
- Deferred: NIC firmware tuning (needs vendor support).
Next measurement
Validate against mixed GPU workloads to confirm that the p95 gains persist under accelerator traffic.