Module nccl_priority

NCCL Network Priority via eBPF/tc

When the scheduler detects the NcclCollective phase, this module uses Linux traffic control (tc) with eBPF classifiers to give NCCL traffic absolute priority on network interfaces.

This reduces all-reduce tail latency, which is on the critical path in distributed training: the slowest rank determines throughput.
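
To make the tc/eBPF mechanism concrete, here is a minimal sketch of the `tc` argument lists that attach a direct-action eBPF classifier to an interface's egress hook. It is not taken from this module's source: the helper names (`argv`, `tc_attach_args`, `tc_detach_args`), the object file name, and the ELF section name are all hypothetical, and the real implementation may install its rules differently.

```rust
/// Convert a slice of string slices into an owned argument vector.
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Hypothetical helper: argument lists for the two `tc` invocations that
/// attach an eBPF classifier to the egress path of `iface`.
/// `bpf_obj` and `section` are assumed names, not values from this module.
fn tc_attach_args(iface: &str, bpf_obj: &str, section: &str) -> Vec<Vec<String>> {
    vec![
        // The clsact qdisc provides the ingress/egress classifier hooks.
        argv(&["qdisc", "add", "dev", iface, "clsact"]),
        // Load the program from the object file in direct-action (da) mode.
        argv(&[
            "filter", "add", "dev", iface, "egress", "bpf", "da", "obj", bpf_obj, "sec", section,
        ]),
    ]
}

/// Hypothetical helper: argument list that removes the clsact qdisc,
/// which also drops the filters attached to it.
fn tc_detach_args(iface: &str) -> Vec<String> {
    argv(&["qdisc", "del", "dev", iface, "clsact"])
}
```

Each argument list could be passed to `std::process::Command::new("tc").args(...)`. Running the commands needs root or CAP_NET_ADMIN, so building them as plain data keeps the logic testable without privileges.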

Constants

DEFAULT_INTERFACES 🔒
Network interface to apply NCCL priority rules to.
NCCL_DSCP 🔒
DSCP marking for high-priority NCCL traffic.
NCCL_PORT_END 🔒
NCCL_PORT_START 🔒
NCCL uses a specific port range (typically 29500-29999 for PyTorch).
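
As an illustration of how the port range and DSCP constants might be used, the sketch below assumes values that this page does not state outright: the port bounds come from the "typically 29500-29999" note, the range is assumed inclusive, and `NCCL_DSCP` is assumed to be 46, the standard code point for Expedited Forwarding (EF). The helpers `is_nccl_port` and `dscp_to_tos` are hypothetical and are not part of the module's listed API.

```rust
// Assumed values; the actual constants in this module may differ.
const NCCL_PORT_START: u16 = 29500;
const NCCL_PORT_END: u16 = 29999;
const NCCL_DSCP: u8 = 46; // Expedited Forwarding (EF)

/// Hypothetical helper: true if `port` falls in the (assumed inclusive) NCCL range.
fn is_nccl_port(port: u16) -> bool {
    (NCCL_PORT_START..=NCCL_PORT_END).contains(&port)
}

/// Hypothetical helper: place a 6-bit DSCP value in the upper six bits of
/// the IPv4 TOS byte, leaving the two ECN bits clear.
fn dscp_to_tos(dscp: u8) -> u8 {
    (dscp & 0x3f) << 2
}
```

With these assumptions, EF corresponds to a TOS byte of `0xB8`, which is the value a packet capture shows for EF-marked traffic.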

Functions

detect_nccl_interface
Detect the primary network interface for NCCL traffic.
disable_nccl_priority
Remove NCCL priority rules.
enable_nccl_priority
Apply high-priority tc rules for NCCL traffic. Requires root / CAP_NET_ADMIN.
mark_nccl_dscp
Mark outgoing NCCL packets with DSCP EF for switches/routers.
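
The function summaries above do not say how interface detection or DSCP marking is carried out. The sketch below shows one plausible approach under stated assumptions: pick the interface that carries the default route by parsing text in the `/proc/net/route` format, and build an `iptables` mangle-table rule that sets the EF class on outgoing TCP packets in the NCCL port range. Both helpers (`parse_default_iface`, `dscp_mark_args`) are hypothetical; the real `detect_nccl_interface` and `mark_nccl_dscp` may work differently, for example by using tc instead of iptables.

```rust
/// Hypothetical helper: return the interface of the first default route
/// found in text formatted like `/proc/net/route`.
fn parse_default_iface(route_table: &str) -> Option<String> {
    route_table
        .lines()
        .skip(1) // skip the header row
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            let iface = cols.next()?;
            let dest = cols.next()?;
            // A destination of 00000000 (0.0.0.0) denotes the default route.
            (dest == "00000000").then(|| iface.to_string())
        })
        .next()
}

/// Hypothetical helper: `iptables` arguments that mark outgoing TCP packets
/// whose destination port lies in `port_start..=port_end` with DSCP class EF.
fn dscp_mark_args(port_start: u16, port_end: u16) -> Vec<String> {
    let ports = format!("{port_start}:{port_end}");
    [
        "-t", "mangle", "-A", "OUTPUT", "-p", "tcp",
        "--dport", ports.as_str(),
        "-j", "DSCP", "--set-dscp-class", "EF",
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}
```

In practice the route table would be read with `std::fs::read_to_string("/proc/net/route")`, and the argument list passed to `std::process::Command::new("iptables")`, which requires root or CAP_NET_ADMIN just as the tc rules do.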