Troubleshooting Guide
This guide helps you diagnose and resolve common GASNet-related issues.
Installation errors
MPI transport not found
Symptom: configure: error: MPI transport requested but MPI compiler wrappers not found
Solutions:
- Load the MPI module before configuring:
    module load mpi/openmpi-x86_64
- Explicitly set MPI compilers:
    export MPICC=mpicc
    export MPICXX=mpicxx
- Verify MPI installation:
    which mpicc && mpicc --showme:version
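The checks above can be folded into one helper that reports which wrappers configure would see. This is a minimal sketch: the function name `check_mpi_wrappers` is hypothetical, and it assumes the wrappers are either named by `MPICC`/`MPICXX` or reachable as `mpicc`/`mpicxx` on `PATH`.

```bash
#!/usr/bin/env bash
# Report which MPI compiler wrappers are visible.
# Honors MPICC/MPICXX if set, otherwise falls back to mpicc/mpicxx.
# (check_mpi_wrappers is a hypothetical helper name, not part of GASNet.)
check_mpi_wrappers() {
    local missing=0 tool
    for tool in "${MPICC:-mpicc}" "${MPICXX:-mpicxx}"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_mpi_wrappers || echo "load an MPI module or set MPICC/MPICXX"
```

Running it before `configure` shows whether the error is a missing module or a misnamed wrapper.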
UCX transport failures
Symptom: GASNet-EX: Unable to initialize UCX transport
Common causes:
| Issue | Check | Fix |
| ------------------------ | ------------------------ | ---------------------------------------------------- |
| UCX not installed | ucx_info -v | Install UCX via package manager or build from source |
| Incompatible UCX version | ucx_info -v \| grep Ver | Ensure UCX >= 1.9.x for GASNet-EX |
| Missing IB devices | ibstat | Verify InfiniBand drivers are loaded |
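The version requirement in the table is easy to get wrong with a plain string comparison, since "1.10" sorts before "1.9" lexically. The sketch below compares dotted versions numerically. The helper name `version_ge` is hypothetical, and it assumes purely numeric components with at most three fields. Extracting the version string from `ucx_info -v` is left to the caller because the output layout differs between UCX releases.

```bash
#!/usr/bin/env bash
# Succeed if dotted version $1 is >= dotted version $2.
# Assumes numeric components only (no "-rc1" suffixes); missing fields count as 0.
version_ge() {
    local IFS=.
    local -a have=($1) want=($2)
    local i h w
    for i in 0 1 2; do
        h=${have[i]:-0}
        w=${want[i]:-0}
        # 10# forces base 10 so a field like "08" is not read as octal
        if (( 10#$h > 10#$w )); then return 0; fi
        if (( 10#$h < 10#$w )); then return 1; fi
    done
    return 0
}

# Usage: version_ge "$installed_ucx_version" 1.9.0 || echo "UCX too old"
```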
OFI/libfabric binding errors
Symptom: libfabric: fi_getinfo() failed
Debug steps:
# Check that a specific provider is available (fi_info -l lists all providers)
fi_info -p <provider>
# Test fabric connectivity
fi_pingpong -p <provider> -s <server_addr>
Common provider issues:
- verbs: Requires the rdmav2 library and working RDMA devices
- psm2: Only available on Intel Omni-Path fabrics
- tcp: Falls back to TCP but loses RDMA benefits
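Given the trade-offs listed above, a launch script might choose the best provider from whatever the node reports. The sketch below is illustrative only: `pick_provider` is a hypothetical helper, and the preference order (psm2, then verbs, then tcp) is an assumption rather than a libfabric rule. The caller passes the provider names it found as arguments.

```bash
#!/usr/bin/env bash
# Print the most preferred provider among the names given as arguments.
# Preference order is an assumption for illustration: psm2 > verbs > tcp.
pick_provider() {
    local want have
    for want in psm2 verbs tcp; do
        for have in "$@"; do
            if [[ $have == "$want" ]]; then
                echo "$want"
                return 0
            fi
        done
    done
    return 1   # none of the known providers is present
}

# Usage: provider=$(pick_provider tcp verbs) && fi_info -p "$provider"
```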
Runtime failures
GASNet initialization failed
Symptom: gasnet_init() returned error GASNET_ERR_RESOURCE
Diagnostics:
- Check node count matches allocation:
    echo $GASNET_PSHM_NODES
    echo $SLURM_JOB_NUM_NODES
- Verify shared memory segment limits:
    # Check current limits (ulimit does not report shared memory limits)
    ipcs -lm
    # Increase if needed (Linux, requires root)
    sudo sysctl -w kernel.shmmax=2147483648
- Validate transport selection:
    export GASNET_BACKTRACE=1
    export GASNET_VERBOSE=1
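The shared memory check can be scripted, with one caveat: on recent 64-bit Linux kernels the default `kernel.shmmax` is larger than the signed 64-bit range that bash arithmetic uses, so a numeric comparison can wrap. The sketch below compares the values as decimal strings instead. Both helper names are hypothetical, and reading `/proc/sys/kernel/shmmax` assumes Linux.

```bash
#!/usr/bin/env bash
# Succeed if unsigned decimal string $1 >= $2, without shell arithmetic
# on the values themselves (they may exceed the 64-bit signed range).
uint_ge() {
    local LC_ALL=C a=$1 b=$2
    a=${a#"${a%%[!0]*}"}   # strip leading zeros
    b=${b#"${b%%[!0]*}"}
    if (( ${#a} != ${#b} )); then
        (( ${#a} > ${#b} ))   # more digits means a larger number
        return
    fi
    [[ $a > $b || $a == "$b" ]]
}

# Hypothetical helper: succeed if the SysV segment limit covers need_bytes.
# The second argument overrides the limit file, which is useful for testing.
shmmax_covers() {
    local need_bytes=$1 limit_file=${2:-/proc/sys/kernel/shmmax}
    uint_ge "$(cat "$limit_file")" "$need_bytes"
}

# Usage: shmmax_covers 2147483648 || echo "raise kernel.shmmax"
```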
Endpoint discovery timeout
Symptom: Process hangs during gasnet_attach()
Checks:
# Verify firewall rules aren't blocking communication
sudo iptables -L -n | grep -i reject
# Check if ports in range are available
netstat -tuln | grep <port_range>
Set explicit timeout for debugging:
export GASNET_BOOTSTRAP_TIMEOUT=300
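Independently of the variable above, a hang can be turned into a visible failure by bounding the whole launch with the coreutils `timeout` command, which exits with status 124 when the deadline passes. A minimal sketch, assuming GNU coreutils is installed; `run_with_deadline` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
# Run a command with a wall-clock deadline and flag a probable hang.
# Relies on GNU coreutils `timeout`, which returns 124 on expiry.
run_with_deadline() {
    local seconds=$1 rc=0
    shift
    timeout --signal=TERM "$seconds" "$@" || rc=$?
    if (( rc == 124 )); then
        echo "no exit after ${seconds}s: process likely hung" >&2
    fi
    return "$rc"
}

# Usage: run_with_deadline 300 ./app
```

This keeps batch jobs from silently consuming their whole allocation while a process waits in `gasnet_attach()`.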
Segmentation faults on put/get
Symptom: Crash during gasnet_put() or gasnet_get()
Common causes:
- Unregistered memory: Always register regions before RMA operations:
    gasnet_register_local_region(ptr, size, &handle);
- Alignment violations: Ensure pointers are properly aligned:
    // Bad: unaligned pointer
    char *bad = (char *)malloc(1024) + 1;
    // Good: base address of this node's registered segment
    gasnet_seginfo_t *seginfo = malloc(gasnet_nodes() * sizeof(*seginfo));
    gasnet_getSegmentInfo(seginfo, gasnet_nodes());
    void *ptr = seginfo[gasnet_mynode()].addr;
    free(seginfo);
- Bounds checking: Enable bounds checking in debug builds:
    ./configure --enable-debug --enable-boundchecks
Performance debugging
Unexpectedly high latency
Measurement approach:
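#include <stdio.h>
#include <gasnet.h>
#include <gasnet_tools.h>
// Time one non-blocking get from issue to completion
gasnett_tick_t t_start = gasnett_ticks_now();
gasnet_handle_t geth = gasnet_get_nb(dst, node, src, nbytes);
gasnet_wait_syncnb(geth);
gasnett_tick_t t_end = gasnett_ticks_now();
printf("Latency: %.2f us\n", gasnett_ticks_to_ns(t_end - t_start) / 1000.0);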
Potential causes:
| Symptom | Likely cause | Diagnostic |
|---|---|---|
| >2us small-msg latency | Wrong transport | Check GASNET_SPMD_NODEINFO |
| Latency degrades with size | Path MTU issue | Verify ibv_devinfo -v |
| High variance | CPU frequency scaling | Disable turbo boost |
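To tell a consistently slow path from a noisy one (the "High variance" row above), it helps to summarize many latency samples rather than read a single measurement. The sketch below assumes one latency value in microseconds per line on standard input and reports the population standard deviation. `latency_stats` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Summarize latency samples (one number per line on stdin).
# Prints sample count, mean, and population standard deviation.
latency_stats() {
    awk '
        NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0   # guard against rounding error
            printf "n=%d mean=%.2f stddev=%.2f\n", n, mean, sqrt(var)
        }'
}

# Usage: ./latency_bench | latency_stats
```

A standard deviation that is large relative to the mean points toward frequency scaling or scheduling noise rather than the transport itself.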
Low bandwidth saturation
Check NIC settings:
# InfiniBand port state (ibstat prints "Port N:" followed by State, Rate, ...)
ibstat | grep -A 7 "Port [0-9]*:"
# PCIe negotiation width
lspci -vvv | grep -A 10 "Infiniband"
Common fixes:
- Disable interrupt moderation for latency-sensitive workloads:
    ethtool -C <iface> rx-usecs 0 rx-frames 0
- Increase completion queue depth:
    export IBV_CQ_DEPTH=4096
- Verify MTU matches fabric:
    # Set to 2048 or 4096 for HDR
    ip link set dev <iface> mtu 2048
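The PCIe width check from the NIC settings above can be automated by comparing the link width the device is capable of with the width it actually negotiated. This sketch assumes `lspci -vv` output for a single device on standard input, with the usual `LnkCap: ... Width xN` and `LnkSta: ... Width xN` lines. `pcie_width_ok` is a hypothetical helper name, and only the first device in the input is examined.

```bash
#!/usr/bin/env bash
# Read `lspci -vv` text on stdin and compare capable vs negotiated link width.
# Returns 0 if negotiated >= capable, 1 if downgraded, 2 if fields are missing.
pcie_width_ok() {
    local text cap sta
    text=$(cat)
    cap=$(printf '%s\n' "$text" |
        sed -n 's/.*LnkCap:.*Width x\([0-9][0-9]*\).*/\1/p' | head -n 1)
    sta=$(printf '%s\n' "$text" |
        sed -n 's/.*LnkSta:.*Width x\([0-9][0-9]*\).*/\1/p' | head -n 1)
    [[ -n $cap && -n $sta ]] || return 2
    echo "capable=x$cap negotiated=x$sta"
    (( sta >= cap ))
}

# Usage: sudo lspci -vv -s <bus:dev.fn> | pcie_width_ok
```

A downgraded width (for example x8 on an x16 card) caps bandwidth regardless of any GASNet setting.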
Progress thread starvation
Symptom: Communication stalls under high CPU load
Solutions:
- Pin progress threads to dedicated cores:
    #include <gasnetex.h>
    gasnet_set_progress_thread_affinity(core_id);
- Use automatic progress:
    export GASNET_AUTOPROGRESS=1
- Increase progress polling frequency:
    export GASNET_POLLFreq=1000
Environment-specific issues
Cray XC systems
Symptom: GASNet_ERR_PMI on Cray
Solution:
# Use Cray's PMI
module load cray-pmi
export GASNET_BOOTSTRAP=pmi
Slurm clusters
Symptom: Inconsistent node counts
Validation script:
#!/bin/bash
echo "SLURM nodes: $SLURM_JOB_NUM_NODES"
echo "GASNet nodes: ${GASNET_PSHM_NODES:-unset}"
# One task per node, so each allocated host prints exactly once
srun -N "$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 hostname | sort
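The hostname listing from the script above can also be checked mechanically instead of by eye. The sketch below counts distinct hostnames on standard input and compares the result with an expected node count. `check_host_count` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Compare the number of distinct hostnames on stdin with an expected count.
check_host_count() {
    local expected=$1 actual
    # grep -c . counts non-empty lines; it exits non-zero on a count of 0
    actual=$(sort -u | grep -c . || true)
    echo "expected=$expected unique_hosts=$actual"
    [[ $actual -eq $expected ]]
}

# Usage: srun hostname | check_host_count "$SLURM_JOB_NUM_NODES"
```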
Container environments
Symptom: RDMA device not found in containers
Fixes:
- Pass through RDMA devices:
    docker run --device=/dev/infiniband/uverbs0 ...
- Mount required sysfs:
    docker run -v /sys/class/infiniband:/sys/class/infiniband ...
- Use an HPC container runtime (Singularity, Apptainer) and bind the RDMA devices:
    singularity exec --bind /dev/infiniband container.sif ./app
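Hosts often expose several device nodes under `/dev/infiniband`, and passing only `uverbs0` can leave others such as `rdma_cm` out of the container. The sketch below prints one `--device=` argument per node found. `rdma_device_args` is a hypothetical helper name, and the directory argument exists mainly so the function can be exercised without RDMA hardware.

```bash
#!/usr/bin/env bash
# Print a docker --device argument for every node in the RDMA device directory.
# Returns 1 if the directory is missing or empty.
rdma_device_args() {
    local dir=${1:-/dev/infiniband} node found=1
    for node in "$dir"/*; do
        [[ -e $node ]] || continue   # glob matched nothing
        printf -- '--device=%s\n' "$node"
        found=0
    done
    return "$found"
}

# Usage: docker run $(rdma_device_args) ...
```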
Getting help
When reporting issues, include:
- Environment: OS, kernel, compiler versions
- Transport: Which GASNet conduit and version
- Reproducer: Minimal test case showing the failure
- Logs: Output with GASNET_VERBOSE=1 and GASNET_BACKTRACE=1
Useful debugging commands:
# Full environment dump
env | grep -i gasnet > gasnet_env.log
# Capture configuration (config.log is written by configure in the build directory)
cp config.log gasnet_config.log
# Trace library calls
LD_DEBUG=libs,symbols ./app 2>&1 | tee gasnet_trace.log
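The items a report needs can be gathered into one directory in a single step. A minimal sketch, assuming it is run from the build directory so that `config.log` is present. `collect_gasnet_report` and the output file names are hypothetical.

```bash
#!/usr/bin/env bash
# Gather basic diagnostics into a directory and print its path.
collect_gasnet_report() {
    local out=${1:-gasnet_report}
    mkdir -p "$out"
    uname -a > "$out/system.log" 2>&1
    # grep exits non-zero when no GASNET variables are set; that is not an error
    env | grep -i gasnet > "$out/gasnet_env.log" || true
    if [[ -f config.log ]]; then
        cp config.log "$out/gasnet_config.log"
    fi
    echo "$out"
}

# Usage: tar czf report.tgz "$(collect_gasnet_report)"
```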