Module gpu_watchdog


GPU Memory Pressure Watchdog

Continuously monitors GPU memory usage and emits early warnings before an out-of-memory (OOM) condition occurs. It can also trigger automatic checkpointing in training frameworks that handle the SIGUSR1 signal.
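The SIGUSR1 hand-off mentioned above could look roughly like the following. This is a minimal sketch, not this module's implementation: the helper names `sigusr1_command` and `request_checkpoint` are hypothetical, and it assumes a POSIX `kill` utility is available on the `PATH` rather than a direct signal syscall.

```rust
use std::io;
use std::process::Command;

/// Build (but do not run) a `kill -s USR1 <pid>` command.
/// Hypothetical helper: the real module may deliver the signal differently.
pub fn sigusr1_command(pid: u32) -> Command {
    let mut cmd = Command::new("kill");
    cmd.arg("-s").arg("USR1").arg(pid.to_string());
    cmd
}

/// Ask the process `pid` to checkpoint by sending it SIGUSR1.
/// Returns Ok(true) if `kill` reported success.
pub fn request_checkpoint(pid: u32) -> io::Result<bool> {
    let status = sigusr1_command(pid).status()?;
    Ok(status.success())
}
```

The signal only helps if the training process installed a SIGUSR1 handler that writes a checkpoint. The default action for SIGUSR1 is to terminate the process, so a watchdog like this should only target processes known to handle it.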

Structs§

GpuMemoryState
GPU memory state.
WatchdogConfig
Thresholds for memory pressure alerts.
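
The fields of `GpuMemoryState` and `WatchdogConfig` are not shown on this page. As an illustration of how a memory snapshot and a set of thresholds might combine into an alert level, here is a sketch that uses deliberately different, hypothetical types (`SketchState`, `SketchConfig`, `Pressure`) and assumed field names.

```rust
/// Hypothetical memory snapshot; field names are assumptions.
pub struct SketchState {
    pub used_mib: u64,
    pub total_mib: u64,
}

/// Hypothetical thresholds, expressed as fractions of total memory.
pub struct SketchConfig {
    pub warn_fraction: f64,
    pub critical_fraction: f64,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Pressure {
    Normal,
    Warning,
    Critical,
}

/// Classify a snapshot against the thresholds.
/// A GPU reporting zero total memory is treated as Normal to avoid dividing by zero.
pub fn classify(state: &SketchState, cfg: &SketchConfig) -> Pressure {
    if state.total_mib == 0 {
        return Pressure::Normal;
    }
    let fraction = state.used_mib as f64 / state.total_mib as f64;
    if fraction >= cfg.critical_fraction {
        Pressure::Critical
    } else if fraction >= cfg.warn_fraction {
        Pressure::Warning
    } else {
        Pressure::Normal
    }
}
```

Fractions rather than absolute byte counts let one configuration apply to GPUs of different sizes. The actual module may make a different choice.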

Functions§

get_cuda_pids 🔒
Get PIDs of processes using CUDA on a specific GPU.
poll_gpu_memory
Poll current GPU memory state.
run_watchdog
Run the GPU memory watchdog loop.
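
This page does not say how the functions above obtain their data. One common approach is to shell out to `nvidia-smi`. The sketch below assumes that approach, and the function names `parse_memory_line`, `parse_pids`, and `query_memory` are hypothetical. The module may instead use NVML bindings or another mechanism.

```rust
use std::process::Command;

/// Parse one line such as "1234, 24576" into (used_mib, total_mib).
/// Matches the output of
/// `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`.
pub fn parse_memory_line(line: &str) -> Option<(u64, u64)> {
    let mut parts = line.split(',').map(|p| p.trim().parse::<u64>());
    let used = parts.next()?.ok()?;
    let total = parts.next()?.ok()?;
    Some((used, total))
}

/// Parse one PID per line, skipping blank or malformed lines.
/// Matches the output of
/// `nvidia-smi --query-compute-apps=pid --format=csv,noheader`.
pub fn parse_pids(output: &str) -> Vec<u32> {
    output
        .lines()
        .filter_map(|l| l.trim().parse::<u32>().ok())
        .collect()
}

/// Query used/total memory (MiB) for one GPU index.
/// Returns None if `nvidia-smi` is missing, fails, or prints unexpected output.
pub fn query_memory(gpu_index: u32) -> Option<(u64, u64)> {
    let out = Command::new("nvidia-smi")
        .args([
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ])
        .arg("-i")
        .arg(gpu_index.to_string())
        .output()
        .ok()?;
    if !out.status.success() {
        return None;
    }
    let text = String::from_utf8_lossy(&out.stdout);
    parse_memory_line(text.lines().next()?)
}
```

Keeping the parsing separate from the process invocation means the parsers can be unit-tested on machines without a GPU. A watchdog loop would then call a query like this at a fixed interval and compare the result against its thresholds.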