Module job_ssh

SSH-based multi-node distributed training backend.

Distributes training across multiple nodes via passwordless SSH. Each node runs torchrun with the correct --node_rank and --master_addr.
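As a rough illustration of the per-node launch described above, the following sketch builds the torchrun argument list for one node and wraps it in an ssh invocation. It is not this module's implementation. The helper names (`torchrun_args`, `ssh_argv`) and the parameters `master_port`, `nproc_per_node`, and `script` are hypothetical, and using the first host as the master is an assumption. The flags `--nnodes`, `--master_port`, and `--nproc_per_node` are standard torchrun options that the description above does not mention.

```rust
/// Hypothetical helper: build the torchrun argv for the node at `rank`.
/// Assumes the first entry in `hosts` acts as the rendezvous master.
/// Returns None if `hosts` is empty or `rank` is out of range.
pub fn torchrun_args(
    hosts: &[String],
    rank: usize,
    master_port: u16,
    nproc_per_node: u32,
    script: &str,
) -> Option<Vec<String>> {
    let master = hosts.first()?;
    if rank >= hosts.len() {
        return None;
    }
    Some(vec![
        "torchrun".to_string(),
        format!("--nnodes={}", hosts.len()),
        format!("--node_rank={}", rank),
        format!("--master_addr={}", master),
        format!("--master_port={}", master_port),
        format!("--nproc_per_node={}", nproc_per_node),
        script.to_string(),
    ])
}

/// Hypothetical helper: wrap a remote command in an ssh argv.
/// `BatchMode=yes` makes ssh fail instead of prompting for a password,
/// which matches the passwordless-SSH requirement.
/// The remote arguments are joined with spaces and NOT shell-quoted,
/// so this sketch assumes they contain no spaces or shell metacharacters.
pub fn ssh_argv(host: &str, remote: &[String]) -> Vec<String> {
    vec![
        "ssh".to_string(),
        "-o".to_string(),
        "BatchMode=yes".to_string(),
        host.to_string(),
        remote.join(" "),
    ]
}
```

Every node receives the same --master_addr and a distinct --node_rank equal to its index in the host list. The resulting argv could be handed to `std::process::Command` to spawn one ssh process per host.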

Functions

cancel_ssh_job
Cancel an SSH job by killing processes on all hosts.
parse_hosts
Parse a hosts specification into a list of hostnames. Accepts either a comma-separated list such as "host1,host2,host3" or a path to a file with one host per line.
run_ssh_job
Launch a distributed training job across multiple nodes via SSH.
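
The two input forms accepted by parse_hosts can be sketched as follows. This is a minimal illustration, not the module's actual code: the name `parse_hosts_sketch`, the `io::Result` return type, and the rule for telling the two forms apart (treat the spec as a file only if it names an existing file) are all assumptions. Trimming whitespace and skipping empty entries are likewise choices made for this sketch.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for `parse_hosts`.
/// If `spec` names an existing file, read one host per line;
/// otherwise treat `spec` as a comma-separated list.
pub fn parse_hosts_sketch(spec: &str) -> io::Result<Vec<String>> {
    let path = Path::new(spec);
    let raw: Vec<String> = if path.is_file() {
        fs::read_to_string(path)?
            .lines()
            .map(str::to_owned)
            .collect()
    } else {
        spec.split(',').map(str::to_owned).collect()
    };

    // Trim surrounding whitespace and drop empty entries
    // (blank lines, trailing commas).
    Ok(raw
        .into_iter()
        .map(|h| h.trim().to_string())
        .filter(|h| !h.is_empty())
        .collect())
}
```

Checking for an existing file first means a hostname that happens to match a local filename would be read as a file. A real implementation may resolve that ambiguity differently.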