SSH-based multi-node distributed training backend.
Distributes training across multiple nodes via passwordless SSH. Each node runs torchrun with the correct --node_rank and --master_addr.
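As a rough illustration of the per-node launch described above, the sketch below builds the argument vector for one node. It is not this module's implementation: the helper name `node_command`, its parameters, the choice of the first host as master, and the use of `ssh -o BatchMode=yes` are all assumptions made for the example. Only the `torchrun` flags themselves are standard.

```rust
/// Hypothetical helper (not part of this module's API): builds the argv
/// for starting torchrun on one node over SSH.
/// Assumption: the first host in the list acts as the master.
/// Panics if `hosts` is empty or `node_rank` is out of range.
fn node_command(
    hosts: &[String],
    node_rank: usize,
    nproc_per_node: usize,
    master_port: u16,
    script: &str,
) -> Vec<String> {
    let master_addr = &hosts[0];
    // Every node gets the same master address but its own rank.
    let remote = format!(
        "torchrun --nnodes={} --node_rank={} --master_addr={} --master_port={} --nproc_per_node={} {}",
        hosts.len(),
        node_rank,
        master_addr,
        master_port,
        nproc_per_node,
        script
    );
    vec![
        "ssh".to_string(),
        // BatchMode makes ssh fail instead of prompting for a password,
        // which suits a passwordless setup.
        "-o".to_string(),
        "BatchMode=yes".to_string(),
        hosts[node_rank].clone(),
        remote,
    ]
}
```

The first element of the returned vector is the program and the rest are its arguments, so it can be handed to `std::process::Command::new(&argv[0]).args(&argv[1..])`, once per node.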
Functions
- cancel_ssh_job - Cancel an SSH job by killing processes on all hosts.
- parse_hosts - Parse a hosts specification into a list of hostnames. Accepts: "host1,host2,host3" or a path to a file with one host per line.
- run_ssh_job - Launch a distributed training job across multiple nodes via SSH.
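The two input forms accepted for a hosts specification can be handled roughly as follows. This is a minimal sketch under stated assumptions, not the actual function: the name `parse_hosts_sketch`, the `io::Result` return type, the "existing file wins" check, and the trimming of whitespace and blank entries are all hypothetical choices.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical stand-in for a hosts parser; the real signature may differ.
/// Assumption: a spec that names an existing file is read as one host per
/// line; anything else is treated as a comma-separated list.
fn parse_hosts_sketch(spec: &str) -> std::io::Result<Vec<String>> {
    let path = Path::new(spec);
    let raw: Vec<String> = if path.is_file() {
        fs::read_to_string(path)?
            .lines()
            .map(str::to_string)
            .collect()
    } else {
        spec.split(',').map(str::to_string).collect()
    };
    // Trim whitespace and drop empty entries (blank lines, trailing commas).
    Ok(raw
        .into_iter()
        .map(|h| h.trim().to_string())
        .filter(|h| !h.is_empty())
        .collect())
}
```

Checking for a file first means a hostname that happens to match a local file name would be read as a path; a real implementation might prefer an explicit flag to disambiguate.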