RCCL environment variables

To mention:

NCCL_DEBUG_FILE
NCCL_ALGO
NCCL_MIN_CHANNELS
MSCCLPP_READ_ALLRED
NCCL_TUNER_PLUGIN

Network interfaces and protocols

NCCL_SOCKET_IFNAME: Explicitly defines the network interfaces for RCCL to use. Multiple interfaces may be provided as comma-separated values.

On LUMI, set NCCL_SOCKET_IFNAME=hsn or NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3. This is essential, because by default RCCL will try to use the management network interface.
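
As a minimal sketch, the setting can be exported in a job script before the application starts. The launch line is only a commented placeholder, and `./my_rccl_app` is a hypothetical program name, not something from this page.

```shell
# Restrict RCCL to the four high-speed network interfaces on a LUMI node.
# The shorter form NCCL_SOCKET_IFNAME=hsn is the other option named above.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3

# Hypothetical launch line; replace with your own application and options.
# srun ./my_rccl_app
```
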
NCCL_NET_GDR_LEVEL: Sets the maximum distance between a GPU and a NIC at which GPU Direct RDMA/PeerDirect should be used.

Values (integer values are legacy and discouraged):
- LOC or 0: Never use RDMA (always disabled).
- PIX or 1: Use RDMA when GPU and NIC are connected to the same PCI switch.
- PXB or 2: Use RDMA when GPU and NIC are connected through different PCI switches (potentially multiple hops).
- PHB or 3: Use RDMA when GPU and NIC are on the same NUMA node. Traffic will go through the CPU.
- SYS or 4: Use RDMA even across the SMP interconnect between NUMA nodes (e.g., QPI/UPI) (always enabled).
The value needed on LUMI is PHB or 3. From ROCm 6.2 onwards this is also the default.
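
A short sketch of setting this explicitly, which may be useful on ROCm versions older than 6.2 where PHB is not yet the default. The string form is used here because the integer values are described above as legacy.

```shell
# Allow GPU Direct RDMA whenever the GPU and NIC are on the same NUMA node.
export NCCL_NET_GDR_LEVEL=PHB
```
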

Debugging RCCL issues

See "Troubleshooting RCCL"

NCCL_DEBUG: Sets the debug information that is displayed. Possible values (assuming they are the same as on NVIDIA):
- VERSION: Prints the NCCL version at the start of the program.
- WARN: Prints an explicit error message whenever any NCCL call errors out.
- INFO: Prints debug information.
- TRACE: Prints replayable trace information on every call.
NCCL_DEBUG_SUBSYS: To be used with NCCL_DEBUG=INFO to filter the information based on subsystem.

Values seen in AMD docs (found here):
- INIT: Initialisation
- COLL: Collective operations
- P2P: Peer-to-peer operations
- NET: Network
- GRAPH: Topology detection and graph search
- TUNING: Algorithm and protocol tuning
- ENV: Environment settings
- ALLOC: Memory allocations
- ALL: Include all subsystems.
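
The two debugging variables are typically combined. The sketch below assumes that RCCL, like NCCL, accepts a comma-separated list of subsystems in NCCL_DEBUG_SUBSYS; this page does not confirm that, so treat it as an assumption.

```shell
# Print INFO-level debug output...
export NCCL_DEBUG=INFO
# ...but only for initialisation and network messages.
# Assumption: a comma-separated list is accepted, as in NCCL.
export NCCL_DEBUG_SUBSYS=INIT,NET
```

Restricting the output to INIT and NET is a reasonable first step when checking which network interfaces RCCL actually selected.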

MSCCL/MSCCL++

See RCCL usage tips, MSCCL/MSCCL++.

RCCL integrates MSCCL and MSCCL++ to leverage these highly efficient GPU-GPU communication primitives for collective operations. Microsoft Corporation collaborated with AMD for this project.

RCCL_MSCCL_FORCE_ENABLE: MSCCL is only enabled by default on MI300X. You can enable it on other platforms by setting RCCL_MSCCL_FORCE_ENABLE=1.
RCCL_MSCCL_ENABLE_SINGLE_PROCESS: By default, MSCCL is only used if every rank belongs to a unique process. To disable this restriction for multi-threaded or single-threaded configurations, use the setting RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1.
RCCL_MSCCLPP_ENABLE: RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. To run an RCCL workload with MSCCL++ support, set RCCL_MSCCLPP_ENABLE=1.
RCCL_MSCCLPP_THRESHOLD: Sets the message size threshold for using MSCCL++. The default is 1 MB.
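
As an illustration, the following sketch turns on MSCCL and MSCCL++ on a platform other than MI300X. Whether this helps performance on a given system is not something this page states, so it is best treated as an experiment. RCCL_MSCCLPP_THRESHOLD is deliberately left at its default because the unit of its value is not given here.

```shell
# MSCCL is only enabled by default on MI300X; force it on elsewhere.
export RCCL_MSCCL_FORCE_ENABLE=1
# Allow allreduce and allgather to use MSCCL++ kernels for certain message sizes.
export RCCL_MSCCLPP_ENABLE=1
```
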

Other tuning

NCCL_ALGO: Sets the algorithm used for collective operations. Possible values are Tree or Ring. By default, a mix may be used, with the algorithm chosen per collective and per message size. Setting this variable forces the same algorithm for all collectives and all message sizes.
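
A minimal sketch of forcing a single algorithm, for example to compare Ring against the default selection in a benchmark run:

```shell
# Force the Ring algorithm for every collective and message size.
# Unset the variable again to return to the default mixed selection.
export NCCL_ALGO=Ring
```
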