GPU Memory Pressure Watchdog
Continuously monitors GPU memory usage and sends early warnings before an out-of-memory (OOM) condition occurs. Can trigger automatic checkpointing in training frameworks that handle the SIGUSR1 signal.
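The early-warning behaviour described above amounts to comparing current usage against configured thresholds. The following is a minimal sketch of that comparison, not the module's actual implementation: `Pressure`, `Thresholds`, and `classify` are hypothetical names, and the assumption that thresholds are fractions of total memory is ours.

```rust
/// Hypothetical pressure levels; the real module may name or split these differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pressure {
    Normal,
    Warning,
    Critical,
}

/// Hypothetical thresholds, expressed as fractions of total GPU memory (assumption).
pub struct Thresholds {
    pub warn: f64,
    pub critical: f64,
}

/// Classify memory usage (in MiB) against the thresholds.
/// A total of zero is treated as `Normal` to avoid dividing by zero.
pub fn classify(used_mib: u64, total_mib: u64, t: &Thresholds) -> Pressure {
    if total_mib == 0 {
        return Pressure::Normal;
    }
    let fraction = used_mib as f64 / total_mib as f64;
    if fraction >= t.critical {
        Pressure::Critical
    } else if fraction >= t.warn {
        Pressure::Warning
    } else {
        Pressure::Normal
    }
}
```

Under this sketch, a watchdog loop would call `classify` on each poll and, on `Critical`, deliver SIGUSR1 to the training process so that it can checkpoint before memory runs out.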
Structs§
- GpuMemoryState - GPU memory state.
- WatchdogConfig - Thresholds for memory pressure alerts.
Functions§
- get_cuda_pids 🔒 - Get PIDs of processes using CUDA on a specific GPU.
- poll_gpu_memory - Poll current GPU memory state.
- run_watchdog - Run the GPU memory watchdog loop.
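This page does not say how polling and PID lookup are implemented. One plausible approach is to parse `nvidia-smi` output; the sketch below assumes that approach. `MemorySnapshot`, `parse_memory_line`, and `parse_pids` are hypothetical names and do not reflect the signatures of `poll_gpu_memory` or `get_cuda_pids`.

```rust
/// Hypothetical snapshot; field names are illustrative, not those of `GpuMemoryState`.
#[derive(Debug, PartialEq, Eq)]
pub struct MemorySnapshot {
    pub used_mib: u64,
    pub total_mib: u64,
}

/// Parse one line such as "10240, 16384", as printed by
/// `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`.
/// Returns `None` if the line does not hold exactly two integers.
pub fn parse_memory_line(line: &str) -> Option<MemorySnapshot> {
    let mut parts = line.split(',').map(|s| s.trim().parse::<u64>());
    let used_mib = parts.next()?.ok()?;
    let total_mib = parts.next()?.ok()?;
    if parts.next().is_some() {
        return None; // unexpected extra column
    }
    Some(MemorySnapshot { used_mib, total_mib })
}

/// Parse one PID per line, as printed by
/// `nvidia-smi --query-compute-apps=pid --format=csv,noheader`.
/// Lines that are not valid integers are skipped.
pub fn parse_pids(output: &str) -> Vec<u32> {
    output
        .lines()
        .filter_map(|line| line.trim().parse().ok())
        .collect()
}
```

Keeping the parsing separate from the process invocation makes it testable on machines without a GPU; the caller would obtain the text with `std::process::Command` and pass it in.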