RCCL environment variables
To mention:
- NCCL_DEBUG_FILE
- NCCL_ALGO
- NCCL_MIN_CHANNELS
- MSCCLPP_READ_ALLRED
- NCCL_TUNER_PLUGIN
Network interfaces and protocols
- NCCL_SOCKET_IFNAME: Explicitly defines the network interfaces for RCCL to use. Multiple interfaces may be provided as comma-separated values.
  On LUMI, NCCL_SOCKET_IFNAME=hsn or NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3 should be used. This is essential, as by default RCCL will try to use the management network interface.
- NCCL_NET_GDR_LEVEL: The maximum distance between a GPU and a NIC at which GPU Direct RDMA/PeerDirect should be used.
  Values (integer values are legacy and discouraged):
  - LOC or 0: Never use RDMA (always disabled).
  - PIX or 1: Use RDMA when GPU and NIC are connected to the same PCI switch.
  - PXB or 2: Use RDMA when GPU and NIC are connected through different PCI switches (potentially multiple hops).
  - PHB or 3: Use RDMA when GPU and NIC are on the same NUMA node. Traffic will go through the CPU.
  - SYS or 4: Use RDMA even across the SMP interconnect between NUMA nodes (e.g., QPI/UPI) (always enabled).
  The value needed on LUMI is PHB or 3. From ROCm 6.2 onwards this is also the default.
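As a minimal sketch of the LUMI settings described above, the two variables could be exported in a job script before the application is launched. The echo lines are only there to make the chosen values visible in the job output; they are an illustration, not something RCCL requires.

```shell
# Restrict RCCL to the four hsn interfaces named above
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3

# Use the symbolic value rather than the legacy integer 3
export NCCL_NET_GDR_LEVEL=PHB

# Record the settings in the job output (illustrative only)
echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME}"
echo "NCCL_NET_GDR_LEVEL=${NCCL_NET_GDR_LEVEL}"
```

With ROCm 6.2 or later, the second export only makes the default explicit.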
 
Debugging RCCL issues
- NCCL_DEBUG: Sets the debug information that is displayed. Possible values (assuming they are the same as on NVIDIA):
  - VERSION: Prints the NCCL version at the start of the program.
  - WARN: Prints an explicit error message whenever any NCCL call errors out.
  - INFO: Prints debug information.
  - TRACE: Prints replayable trace information on every call.
- NCCL_DEBUG_SUBSYS: To be used with NCCL_DEBUG=INFO to filter the information based on subsystem. Values seen in the AMD docs:
  - INIT: Initialisation
  - COLL: Collective operations
  - P2P: Peer-to-peer operations
  - NET: Network
  - GRAPH: Topology detection and graph search
  - TUNING: Algorithm and protocol tuning
  - ENV: Environment settings
  - ALLOC: Memory allocations
  - ALL: Include all.
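A short sketch of how the two debug variables might be combined when investigating a network start-up problem. It assumes that RCCL, like NCCL, accepts a comma-separated list of subsystems in NCCL_DEBUG_SUBSYS; the choice of INIT and NET is only an example.

```shell
# Print debug information ...
export NCCL_DEBUG=INFO

# ... but only for initialisation and the network
# (comma-separated list: assumed to work as in NCCL)
export NCCL_DEBUG_SUBSYS=INIT,NET

echo "NCCL_DEBUG=${NCCL_DEBUG} NCCL_DEBUG_SUBSYS=${NCCL_DEBUG_SUBSYS}"
```

Filtering by subsystem keeps the INFO output manageable on runs with many ranks.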
 
MSCCL/MSCCL++
See RCCL usage tips, MSCCL/MSCCL++.
RCCL integrates MSCCL and MSCCL++ to leverage these highly efficient GPU-GPU communication primitives for collective operations. Microsoft Corporation collaborated with AMD for this project.
- RCCL_MSCCL_FORCE_ENABLE: MSCCL is only enabled by default on MI300X. You can enable it on other platforms by setting RCCL_MSCCL_FORCE_ENABLE=1.
- RCCL_MSCCL_ENABLE_SINGLE_PROCESS: By default, MSCCL is only used if every rank belongs to a unique process. To disable this restriction for multi-threaded or single-threaded configurations, set RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1.
- RCCL_MSCCLPP_ENABLE: RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. To run an RCCL workload with MSCCL++ support, set RCCL_MSCCLPP_ENABLE=1.
- RCCL_MSCCLPP_THRESHOLD: Sets the message size threshold for using MSCCL++. The default is 1 MB.
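The following sketch shows how MSCCL and MSCCL++ might be switched on for a GPU other than MI300X, using only the variables listed above. Whether this actually improves performance on a given system is not established here and would need to be measured. RCCL_MSCCLPP_THRESHOLD is deliberately left at its default, since the expected value format is not stated above.

```shell
# Force-enable MSCCL on a platform where it is off by default
export RCCL_MSCCL_FORCE_ENABLE=1

# Also allow the MSCCL++ kernels for allreduce and allgather
export RCCL_MSCCLPP_ENABLE=1

echo "RCCL_MSCCL_FORCE_ENABLE=${RCCL_MSCCL_FORCE_ENABLE}"
echo "RCCL_MSCCLPP_ENABLE=${RCCL_MSCCLPP_ENABLE}"
```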
Other tuning
NCCL_ALGO with possible values Tree or Ring: Sets the algorithm for a collective operation. By default, a mix may be used depending on the message size, with a different algorithm for each collective. Setting this environment variable forces the same algorithm for all collectives and all message sizes.
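Because NCCL_ALGO overrides the per-collective, per-size choice, one way to judge its effect is to run the same workload once per algorithm and compare timings. The sketch below sets the variable for a single command only; run_with_algo is a hypothetical helper, and the inner echo is a placeholder to be replaced with the real benchmark command.

```shell
# Hypothetical helper: run one command with NCCL_ALGO set to $1.
# The echo stands in for the actual benchmark.
run_with_algo() {
    NCCL_ALGO="$1" sh -c 'echo "NCCL_ALGO=$NCCL_ALGO"'
}

# Try both algorithms in turn
for algo in Tree Ring; do
    run_with_algo "$algo"
done
```

Setting the variable per command, rather than exporting it, avoids leaving a forced algorithm in place for later runs in the same script.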