RCCL environment variables
To mention:
- NCCL_DEBUG_FILE
- NCCL_ALGO
- NCCL_MIN_CHANNELS
- MSCCLPP_READ_ALLRED
- NCCL_TUNER_PLUGIN
Network interfaces and protocols
- NCCL_SOCKET_IFNAME: Explicitly defines the network interfaces for RCCL to use. Multiple interfaces may be provided as comma-separated values.

  On LUMI, NCCL_SOCKET_IFNAME=hsn or NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3 should be used. This is essential, because by default RCCL will try to use the management network interface.

- NCCL_NET_GDR_LEVEL: Maximum distance between a GPU and a NIC at which GPU Direct RDMA/PeerDirect should be used.

  Values (integer values are legacy and discouraged):

  - LOC or 0: Never use RDMA (always disabled).
  - PIX or 1: Use RDMA when GPU and NIC are connected to the same PCI switch.
  - PXB or 2: Use RDMA when GPU and NIC are connected through different PCI switches (potentially multiple hops).
  - PHB or 3: Use RDMA when GPU and NIC are on the same NUMA node. Traffic will go through the CPU.
  - SYS or 4: Use RDMA even across the SMP interconnect between NUMA nodes (e.g., QPI/UPI) (always enabled).

  The value needed on LUMI is PHB or 3. From ROCm 6.2 onwards this is also the default.
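As a minimal sketch, the two settings above could be placed in a job script before the application is launched. This assumes a POSIX-style shell such as bash. The values are the ones given in the text for LUMI, and the final line only prints the variables that the application would inherit.

```shell
# Restrict RCCL to the four high-speed network interfaces named in the text.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3

# Allow GPU Direct RDMA up to the NUMA-node level (the value given for LUMI).
export NCCL_NET_GDR_LEVEL=PHB

# Print the settings that a subsequently launched application would see.
env | grep -E '^NCCL_(SOCKET_IFNAME|NET_GDR_LEVEL)=' | sort
```

On ROCm 6.2 and later the second export should be redundant according to the text, but setting it explicitly keeps the job script independent of the installed version.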
Debugging RCCL issues
- NCCL_DEBUG: Sets the debug information that is displayed. Possible values (assuming they are the same as on NVIDIA):

  - VERSION: Prints the NCCL version at the start of the program.
  - WARN: Prints an explicit error message whenever any NCCL call errors out.
  - INFO: Prints debug information.
  - TRACE: Prints replayable trace information on every call.

- NCCL_DEBUG_SUBSYS: To be used with NCCL_DEBUG=INFO to filter the information based on subsystem.

  Values seen in the AMD docs:

  - INIT: Initialisation
  - COLL: Collective operations
  - P2P: Peer-to-peer operations
  - NET: Network
  - GRAPH: Topology detection and graph search
  - TUNING: Algorithm and protocol tuning
  - ENV: Environment settings
  - ALLOC: Memory allocations
  - ALL: Include all.
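A possible way to combine the two debugging variables is sketched below. It assumes a bash-like shell and that RCCL accepts a comma-separated list of subsystems, as documented for NVIDIA NCCL. The choice of INIT and NET is only an illustration, picked because these are the subsystems most likely to reveal which network interface was selected.

```shell
# Enable informational debug output.
export NCCL_DEBUG=INFO

# Limit the output to initialisation and network messages.
# Assumption: a comma-separated list is accepted, as in NVIDIA NCCL.
export NCCL_DEBUG_SUBSYS=INIT,NET

# Print the settings that a subsequently launched application would see.
env | grep -E '^NCCL_DEBUG(_SUBSYS)?=' | sort
```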
MSCCL/MSCCL++
See RCCL usage tips, MSCCL/MSCCL++.
RCCL integrates MSCCL and MSCCL++ to leverage these highly efficient GPU-GPU communication primitives for collective operations. Microsoft Corporation collaborated with AMD for this project.
- RCCL_MSCCL_FORCE_ENABLE: MSCCL is only enabled by default on MI300X. You can enable it on other platforms by setting RCCL_MSCCL_FORCE_ENABLE=1.
- RCCL_MSCCL_ENABLE_SINGLE_PROCESS: By default, MSCCL is only used if every rank belongs to a unique process. To disable this restriction for multi-threaded or single-threaded configurations, set RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1.
- RCCL_MSCCLPP_ENABLE: RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. To run an RCCL workload with MSCCL++ support, set RCCL_MSCCLPP_ENABLE=1.
- RCCL_MSCCLPP_THRESHOLD: Sets the message size threshold for using MSCCL++. The default is 1 MB.
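The following sketch shows how these variables might be combined on a platform other than MI300X. It assumes a bash-like shell. It also assumes that MSCCL must be force-enabled before MSCCL++ can be used there, and that the threshold is given in bytes. Neither assumption is confirmed by the text, so both should be checked against the RCCL documentation.

```shell
# MSCCL is only enabled by default on MI300X, so force it on elsewhere.
export RCCL_MSCCL_FORCE_ENABLE=1

# Let allreduce and allgather use the MSCCL++ kernels.
export RCCL_MSCCLPP_ENABLE=1

# Assumption: the threshold is expressed in bytes; 1048576 bytes = 1 MB,
# which matches the default mentioned in the text.
export RCCL_MSCCLPP_THRESHOLD=1048576

# Print the settings that a subsequently launched application would see.
env | grep -E '^RCCL_MSCCL' | sort
```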
Other tuning
- NCCL_ALGO, with possible values Tree or Ring: Sets the algorithm for a collective operation. By default a mix may be used depending on the message size, and a different algorithm for each collective. This environment variable will force the same algorithm for all collectives and all message sizes.
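Because forcing one algorithm affects every collective and message size, it may be worth comparing both values before settling on one. The loop below is a sketch of such a comparison, assuming a bash-like shell. `run_app` is a hypothetical placeholder that only echoes the setting and would need to be replaced by the real launch command.

```shell
# Hypothetical placeholder: replace the body with the real launch command.
run_app() {
    echo "running with NCCL_ALGO=$NCCL_ALGO"
}

# Run the same workload once per algorithm value named in the text.
for algo in Tree Ring; do
    export NCCL_ALGO="$algo"
    run_app
done
```

Leaving NCCL_ALGO unset restores the default behaviour, in which the algorithm may differ per collective and message size.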