RCCL environment variables

To mention:

  • NCCL_DEBUG_FILE
  • NCCL_ALGO
  • NCCL_MIN_CHANNELS
  • MSCCLPP_READ_ALLRED
  • NCCL_TUNER_PLUGIN

Network interfaces and protocols

  • NCCL_SOCKET_IFNAME: Explicitly defines the network interfaces for RCCL to use. Multiple interfaces may be provided as comma-separated values.

    On LUMI, set NCCL_SOCKET_IFNAME=hsn or NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3. This is essential, because by default RCCL will try to use the management network interface.

  • NCCL_NET_GDR_LEVEL: Sets the maximum distance between a GPU and a NIC at which GPU Direct RDMA/PeerDirect should be used.

    Values (integer values are legacy and discouraged):

    • LOC or 0: Never use RDMA (always disabled).

    • PIX or 1: Use RDMA when GPU and NIC are connected to the same PCI switch.

    • PXB or 2: Use RDMA when GPU and NIC are connected through different PCI switches (potentially multiple hops).

    • PHB or 3: Use RDMA when GPU and NIC are on the same NUMA node. Traffic will go through the CPU.

    • SYS or 4: Use RDMA even across the SMP interconnect between NUMA nodes (e.g., QPI/UPI) (always enabled).

    The value needed on LUMI is PHB or 3. From ROCm 6.2 onwards this is also the default.
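The two network settings above could be combined in a job script as in the following minimal sketch. It assumes a Bash batch script. The commented-out launch line and the application name `./my_app` are hypothetical placeholders, not something prescribed by this page.

```shell
# Use the four high-speed network interfaces rather than the
# management network interface that RCCL would pick by default.
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3

# Allow GPU Direct RDMA when GPU and NIC are on the same NUMA node.
# From ROCm 6.2 onwards this is already the default; it is set
# explicitly here for the benefit of older ROCm versions.
export NCCL_NET_GDR_LEVEL=PHB

# Hypothetical launch line, shown for context only:
# srun ./my_app
```

The variables must be exported before the application starts so that every rank inherits them.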

Debugging RCCL issues

See "Troubleshooting RCCL"

  • NCCL_DEBUG: Sets the debug information that is displayed. Possible values (assuming they are the same as on NVIDIA):

    • VERSION: Prints the NCCL version at the start of the program.

    • WARN: Prints an explicit error message whenever any NCCL call errors out.

    • INFO: Prints debug information.

    • TRACE: Prints replayable trace information on every call.

  • NCCL_DEBUG_SUBSYS: Used together with NCCL_DEBUG=INFO to filter the debug information by subsystem.

    Values seen in AMD docs (found here):

    • INIT: Initialisation
    • COLL: Collective operations
    • P2P: Peer-to-peer operations
    • NET: Network
    • GRAPH: Topology detection and graph search
    • TUNING: Algorithm and protocol tuning
    • ENV: Environment settings
    • ALLOC: Memory allocations
    • ALL: Include all.
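As an illustration of how the two debugging variables work together, the sketch below restricts the INFO output to initialisation and network messages. It assumes that RCCL, like NCCL, accepts a comma-separated list of subsystems in NCCL_DEBUG_SUBSYS. The choice of INIT and NET is only an example.

```shell
# Print debug information ...
export NCCL_DEBUG=INFO

# ... but only for the initialisation and network subsystems.
# ASSUMPTION: a comma-separated list is accepted, as in NCCL.
export NCCL_DEBUG_SUBSYS=INIT,NET
```

This combination can be a reasonable starting point when checking which network interface RCCL actually selected.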

MSCCL/MSCCL++

See RCCL usage tips, MSCCL/MSCCL++.

RCCL integrates MSCCL and MSCCL++ to leverage these highly efficient GPU-GPU communication primitives for collective operations. Microsoft Corporation collaborated with AMD for this project.

  • RCCL_MSCCL_FORCE_ENABLE: MSCCL is only enabled by default on MI300X. You can enable it on other platforms by setting RCCL_MSCCL_FORCE_ENABLE=1.

  • RCCL_MSCCL_ENABLE_SINGLE_PROCESS: By default, MSCCL is only used if every rank belongs to a unique process. To disable this restriction for multi-threaded or single-threaded configurations, use the setting RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1.

  • RCCL_MSCCLPP_ENABLE: RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. To run an RCCL workload with MSCCL++ support, set RCCL_MSCCLPP_ENABLE=1.

  • RCCL_MSCCLPP_THRESHOLD: Sets the message size threshold for using MSCCL++. The default is 1 MB.
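The MSCCL and MSCCL++ variables above might be set together as in the following sketch for a platform other than MI300X. Two points are assumptions rather than statements from the text above: that the threshold is given in bytes, and that enabling these options is worthwhile for a particular workload, which should be verified by benchmarking.

```shell
# Enable MSCCL on a platform where it is not enabled by default.
export RCCL_MSCCL_FORCE_ENABLE=1

# Enable the MSCCL++ kernels for allreduce and allgather.
export RCCL_MSCCLPP_ENABLE=1

# ASSUMPTION: the threshold is given in bytes.
# 1024 * 1024 bytes corresponds to the 1 MB default.
export RCCL_MSCCLPP_THRESHOLD=$((1024 * 1024))
```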

Other tuning

  • NCCL_ALGO (possible values: Tree or Ring): Sets the algorithm used for collective operations. By default, a mix of algorithms may be used depending on the message size, with a different algorithm for each collective. Setting this environment variable forces the same algorithm for all collectives and all message sizes.
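Because NCCL_ALGO forces one algorithm everywhere, it is mainly useful for experiments. The sketch below loops over the two values listed above so that the same run can be compared under each. The benchmark name `./my_benchmark` and its launch line are hypothetical and left commented out.

```shell
# Run the same workload once per algorithm to compare the results.
for algo in Tree Ring; do
    export NCCL_ALGO="$algo"
    echo "Running with NCCL_ALGO=$NCCL_ALGO"
    # Hypothetical benchmark launch, shown for context only:
    # srun ./my_benchmark
done
```

After the comparison, unsetting NCCL_ALGO with `unset NCCL_ALGO` lets RCCL choose the algorithm per collective and message size again.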