NCCL Basics


NCCL (NVIDIA Collective Communications Library) is a high-performance library for multi-GPU communication, written in C++. It implements collective communication primitives over several communication protocols, e.g. Broadcast (Bcast), AllReduce, Reduce, and AllGather.

It supports a variety of topologies, including single-node multi-GPU and multi-node multi-GPU setups, over high-speed interconnects such as NVLink, PCIe, and InfiniBand. Compared with MPI, it targets GPU communication specifically and focuses its optimizations on a particular set of collective operations.
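To make the host-side API concrete, here is a minimal single-process, two-GPU broadcast sketch using ncclCommInitAll / ncclBroadcast. The buffer size and device ids are illustrative assumptions, and error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  const int nDev = 2;
  int devs[2] = {0, 1};
  const size_t count = 1 << 20;          // number of float elements per GPU (illustrative)

  ncclComm_t comms[2];
  float* buf[2];
  cudaStream_t streams[2];

  ncclCommInitAll(comms, nDev, devs);    // one communicator per local GPU

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void**)&buf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Broadcast rank 0's buffer to every rank (in place: sendbuff == recvbuff).
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclBroadcast(buf[i], buf[i], count, ncclFloat, /*root=*/0, comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
    cudaFree(buf[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}

The device-side kernel that actually performs this broadcast is what the rest of this post looks at.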


Bcast


template<typename T, typename RedOp, typename Proto>
  __device__ __forceinline__ void runRing(int tid, int nthreads, struct ncclDevWorkColl* work) {
    ncclRing *ring = &ncclShmem.channel.ring;
    const int rank = ring->userRanks[0];
    const int nextRank = ring->userRanks[1];
    const int root = work->root;
    ssize_t chunkCount;
    ssize_t channelCount;
    ssize_t gridOffset;
    ncclCollCbdPart(work, ncclShmem.channelId, Proto::Id, sizeof(T), (ssize_t*)nullptr, &gridOffset, &channelCount, &chunkCount);
    size_t offset;
    int nelem;
    int workNthreads;
    bool isNetOffload = work->isOneRPN && work->netRegUsed;

    T *inputBuf = (T*)work->sendbuff;
    T *outputBuf = (T*)work->recvbuff;
    workNthreads = isNetOffload ? WARP_SIZE : nthreads;

    if (tid < workNthreads) {
      // Coverity reports that the callee treats &ring->next as an array.  However, due to the use of
      // FanSymmetric<1>, only the first element is ever accessed, so it's fine.
      // coverity[callee_ptr_arith:FALSE]
      Primitives<T, RedOp, FanSymmetric<1>, 1, Proto, 0>
        prims(tid, workNthreads, &ring->prev, &ring->next, inputBuf, outputBuf, work->redOpArg, 0, 0, 0, work);

      for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
        offset = gridOffset + elemOffset;
        nelem = min(chunkCount, channelCount - elemOffset);

        if (rank == root) {
          if (inputBuf == outputBuf || isNetOffload) {
            // In-place (or net-offload) root: just push this chunk into the ring.
            prims.directSend(offset, offset, nelem);
          } else {
            // Out-of-place root: copy into root's own output buffer while sending.
            prims.directCopySend(offset, offset, nelem);
          }
        } else if (nextRank == root) {
          // Last rank on the ring (its successor is the root): receive only.
          prims.directRecv(offset, nelem);
        } else {
          // Intermediate rank: receive the chunk and forward it to the next rank.
          prims.directRecvCopyDirectSend(offset, offset, nelem);
        }
      }
    } else if (inputBuf != outputBuf && rank == root) {
      // Net-offload path only: the single prims warp above drives the network,
      // while the remaining threads perform root's local input->output copy here.
      inputBuf = inputBuf + gridOffset;
      outputBuf = outputBuf + gridOffset;
      reduceCopy<COLL_UNROLL, RedOp, T, 0, 1, 1, 0, 1, 1, /*PreOpSrcs=*/0>
        (tid - workNthreads, nthreads - workNthreads, work->redOpArg, false, 1, (void**)&inputBuf, 1, (void**)&outputBuf, channelCount);
    }
    // Re-synchronize the prims warp with the copy threads before returning.
    if (isNetOffload) barrier_sync(14, nthreads);
  }
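The role each rank plays in the loop above depends only on its position on the ring relative to the root. A toy host-side illustration of the branch selection (not NCCL code; it assumes the logical ring order 0 -> 1 -> ... -> nranks-1 -> 0, whereas the real kernel reads the order from ring->userRanks):

// Toy illustration of the three branches above; not part of NCCL.
enum RingRole { RoleSend, RoleRecvOnly, RoleRecvForward };

static RingRole ringRole(int rank, int nranks, int root) {
  int nextRank = (rank + 1) % nranks;            // next peer on the assumed ring
  if (rank == root)     return RoleSend;         // directSend / directCopySend
  if (nextRank == root) return RoleRecvOnly;     // directRecv: last hop before the root
  return RoleRecvForward;                        // directRecvCopyDirectSend
}

With 4 ranks and root 1, for example, rank 1 sends, rank 0 only receives (its successor is the root), and ranks 2 and 3 receive and forward.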

As the listing shows, this function only contains the outer communication logic: the actual data movement is implemented in the Primitives class, which provides a different implementation for each communication protocol. Let's take NVLink as a concrete example; note first that "protocol" means different things at different layers:

Layer | Meaning | Example
Physical link protocol | NVLink's own link-level encoding | NVLink 3.0 flits / credits
Transport protocol | How the data is carried between endpoints | TCP / InfiniBand / UCX
NCCL protocol | NCCL's internal data-transfer protocol | LL / LL128 / Simple

Rule of thumb:

Message size | NCCL protocol
≤ a few KB | LL
a few KB to a few MB | LL128
≥ a few MB | Simple
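As a sketch of that rule: the cutoffs below are illustrative assumptions only, not NCCL's actual tuning model, which also weighs the algorithm, topology, and channel count (the NCCL_PROTO environment variable can restrict the choice by hand).

// Illustrative encoding of the size-based rule of thumb above; thresholds are assumptions.
typedef enum { PROTO_LL, PROTO_LL128, PROTO_SIMPLE } nccl_proto_guess_t;

static nccl_proto_guess_t guessProto(size_t msgBytes) {
  if (msgBytes <= (8u << 10)) return PROTO_LL;      // latency-bound small messages
  if (msgBytes <= (4u << 20)) return PROTO_LL128;   // medium messages on NVLink-class links
  return PROTO_SIMPLE;                              // bandwidth-bound large messages
}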
LL128 maps directly onto NVLink's characteristics:

NVLink property | LL128 counterpart
Very low latency | Warp-synchronous pipeline
High bandwidth | 128-bit aligned transfers
Direct GPU-GPU connection | No host involvement
At the stack level, NVLink and InfiniBand differ as follows:

Feature | NVLink | InfiniBand
Physical layer | NVLink flits/credits | IB link layer
Transport stack | In-GPU load/store | RDMA verbs
NCCL protocols | LL / LL128 / Simple | LL / Simple (different implementations)
Goes through the CPU? | No | Usually needs memory registration and goes through the NIC
Latency | Very low | Higher than NVLink
Bandwidth | Very high | Limited by the NIC

InfiniBand is the main vehicle for cross-node communication, and its performance and characteristics differ markedly from NVLink's:

Dimension | NVLink | InfiniBand
Single-hop GPU-GPU bandwidth | Very high (600 GB/s-class per GPU) | Limited by the NIC (~25–50 GB/s per GPU)
Latency | Very low (sub-microsecond) | Higher than NVLink (a few microseconds)
Topology scope | Single node only | Scales across nodes to tens of thousands of GPUs
Scalability | Poor (limited by the board/chassis) | Excellent (the backbone of HPC/AI supercomputers)
Reliability | High within a node | Network-level reliability
Cost/power | Expensive but localized | Expensive but scalable
Programming complexity | Transparent to the application | Requires a network stack, RDMA, and memory registration
NCCL performance (small/medium messages) | Faster | Slightly slower
NCCL performance (large-scale systems) | Not applicable | The only viable option
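
In practice the cross-node path uses the rank-based initialization: one process per GPU, a single ncclUniqueId distributed out of band (here via MPI), after which NCCL picks NVLink/PCIe inside a node and InfiniBand or sockets between nodes on its own. A minimal sketch, assuming one GPU per process and at most 8 GPUs per node:

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);                // one id for the whole job
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  cudaSetDevice(rank % 8);                            // assumption: <= 8 GPUs per node
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  float* buf;
  const size_t count = 1 << 20;
  cudaMalloc((void**)&buf, count * sizeof(float));

  // Broadcast rank 0's buffer to all ranks, across nodes, on the default stream.
  ncclBroadcast(buf, buf, count, ncclFloat, /*root=*/0, comm, /*stream=*/0);
  cudaStreamSynchronize(0);

  cudaFree(buf);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}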