Copyright (C) 2026 Dyber, Inc.
Zernel ships a custom sched_ext CPU scheduler written in Rust, purpose-built for ML workloads. It leverages the sched_ext framework (merged into Linux 6.12) to load a fully custom, pluggable CPU scheduler into the kernel at runtime -- without forking the kernel.
Standard Linux schedulers (CFS, EEVDF) are designed for general-purpose workloads. ML training on GPUs has a fundamentally different profile: it cycles through distinct phases -- data loading, GPU compute, collective communication, optimizer steps -- each with very different CPU demands. The standard scheduler has no concept of these phases. Zernel's scheduler does.
The scheduler maintains per-task state and classifies each ML process into one of five phases:
```rust
enum WorkloadPhase {
    DataLoading,    // High I/O, many threads, CPU-intensive
    GpuCompute,     // CPU idle, waiting on GPU
    NcclCollective, // Network coordination for collectives
    OptimizerStep,  // Short CPU burst after GPU sync
    Unknown,        // Non-ML or unclassified
}
```
| Phase | Signals |
|---|---|
| DataLoading | io_wait_fraction > 0.3 AND gpu_utilization < 10% |
| GpuCompute | gpu_utilization > 80% AND cpu_burst_duration == 0 |
| NcclCollective | nccl_active == true AND futex_wait_count > 0 |
| OptimizerStep | cpu_burst < 5ms AND last_gpu_sync > 0 |
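The table above can be sketched as a classification function. This is an illustrative sketch, not Zernel's actual implementation: the `TaskSignals` struct and the evaluation order (most specific signals first) are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WorkloadPhase {
    DataLoading,
    GpuCompute,
    NcclCollective,
    OptimizerStep,
    Unknown,
}

// Hypothetical per-task signal snapshot; field names are illustrative.
struct TaskSignals {
    io_wait_fraction: f64, // fraction of recent time spent in I/O wait
    gpu_utilization: u32,  // percent, 0..=100
    cpu_burst_ns: u64,     // duration of the most recent CPU burst
    nccl_active: bool,     // NCCL communicator detected
    futex_wait_count: u64, // recent futex waits (collective rendezvous)
    last_gpu_sync_ns: u64, // time since the last GPU synchronization
}

fn classify(s: &TaskSignals) -> WorkloadPhase {
    // Assumed evaluation order: check the most specific signals first.
    if s.nccl_active && s.futex_wait_count > 0 {
        WorkloadPhase::NcclCollective
    } else if s.gpu_utilization > 80 && s.cpu_burst_ns == 0 {
        WorkloadPhase::GpuCompute
    } else if s.io_wait_fraction > 0.3 && s.gpu_utilization < 10 {
        WorkloadPhase::DataLoading
    } else if s.cpu_burst_ns < 5_000_000 && s.last_gpu_sync_ns > 0 {
        WorkloadPhase::OptimizerStep // cpu_burst < 5ms after a GPU sync
    } else {
        WorkloadPhase::Unknown
    }
}
```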
To prevent rapid phase flapping, the scheduler uses a stability tracker. A phase transition only commits after phase_stability_count consecutive identical classifications (default: 3).
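A minimal sketch of that debouncing logic, assuming a simple consecutive-count scheme (struct and method names are illustrative):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase { DataLoading, GpuCompute, Unknown }

struct StabilityTracker {
    committed: Phase, // phase currently in effect
    candidate: Phase, // most recent raw classification
    streak: u32,      // consecutive identical raw classifications
    threshold: u32,   // phase_stability_count (default 3)
}

impl StabilityTracker {
    fn new(threshold: u32) -> Self {
        Self { committed: Phase::Unknown, candidate: Phase::Unknown, streak: 0, threshold }
    }

    // Feed one raw classification; returns the committed (debounced) phase.
    fn observe(&mut self, raw: Phase) -> Phase {
        if raw == self.candidate {
            self.streak += 1;
        } else {
            self.candidate = raw;
            self.streak = 1; // a different classification resets the streak
        }
        if raw != self.committed && self.streak >= self.threshold {
            self.committed = raw; // commit only after a stable streak
        }
        self.committed
    }
}
```

A single divergent classification resets the streak but leaves the committed phase untouched, so one-off misclassifications cannot cause a transition.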
| Phase | CPU Priority | Preemption | Latency Target | Rationale |
|---|---|---|---|---|
| DataLoading | +10 (high) | Aggressive | 100us | More CPU = faster pipeline = GPU never starves |
| GpuCompute | -5 (low) | Immediate yield | -- | CPU doing nothing useful, give it to data loaders |
| NcclCollective | +10 (high) | Low latency | 50us | Collectives are on the critical path |
| OptimizerStep | +10 (high) | Preemptive | 100us | Minimize GPU idle time between steps |
| Unknown | 0 (normal) | Standard | -- | CFS-equivalent behavior |
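The policy table maps naturally onto a lookup function. A sketch, with the numeric values taken from the table and the `SchedPolicy` struct shape assumed for illustration:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WorkloadPhase { DataLoading, GpuCompute, NcclCollective, OptimizerStep, Unknown }

struct SchedPolicy {
    priority_boost: i32,            // relative CPU priority adjustment
    latency_target_us: Option<u64>, // None => no explicit wakeup latency target
}

fn policy_for(phase: WorkloadPhase) -> SchedPolicy {
    match phase {
        WorkloadPhase::DataLoading    => SchedPolicy { priority_boost: 10, latency_target_us: Some(100) },
        WorkloadPhase::GpuCompute     => SchedPolicy { priority_boost: -5, latency_target_us: None },
        WorkloadPhase::NcclCollective => SchedPolicy { priority_boost: 10, latency_target_us: Some(50) },
        WorkloadPhase::OptimizerStep  => SchedPolicy { priority_boost: 10, latency_target_us: Some(100) },
        WorkloadPhase::Unknown        => SchedPolicy { priority_boost: 0,  latency_target_us: None },
    }
}
```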
On multi-socket servers, the scheduler reads NUMA topology from /sys/devices/system/node/ and GPU-to-NUMA mappings from PCI sysfs:
```
NUMA Node 0: CPUs 0-31,  GPUs 0-3
NUMA Node 1: CPUs 32-63, GPUs 4-7
```
When scheduling a task bound to GPU 2, the scheduler prefers CPUs 0-31 (same NUMA node). Within that set, it picks the CPU with the lowest current load.
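The selection logic can be sketched as follows. The topology mirrors the example above; the struct shape, the `load` callback, and the fallback to any CPU when no node matches are illustrative assumptions:

```rust
// Hypothetical NUMA node descriptor: which CPUs and GPUs it hosts.
struct NumaNode {
    cpus: std::ops::Range<u32>, // CPU IDs on this node
    gpus: std::ops::Range<u32>, // GPU IDs attached to this node
}

// Pick the least-loaded CPU on the NUMA node local to `gpu`.
// `load` maps CPU id -> current load; falls back to all CPUs if no node matches.
fn select_cpu(topology: &[NumaNode], gpu: u32, load: &dyn Fn(u32) -> u64) -> Option<u32> {
    let local = topology.iter().find(|n| n.gpus.contains(&gpu));
    let candidates: Vec<u32> = match local {
        Some(node) => node.cpus.clone().collect(),
        None => topology.iter().flat_map(|n| n.cpus.clone()).collect(),
    };
    // min_by_key returns the first minimal element, so ties break toward lower CPU IDs.
    candidates.into_iter().min_by_key(|&cpu| load(cpu))
}
```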
On shared GPU servers, the scheduler enforces GPU-proportional CPU allocation:
Training > Inference > Interactive > Background

| Class | Base Priority | Use Case |
|---|---|---|
| Training | +5 | Training jobs -- highest resource allocation |
| Inference | +3 | Inference serving -- latency-sensitive |
| Interactive | +1 | Jupyter notebooks -- responsive but best-effort |
| Background | -5 | Batch preprocessing -- lowest priority |
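The class base priorities from the table can be combined with the per-phase boost. Whether the two combine additively is an assumption made here for illustration; only the base values come from the table:

```rust
#[derive(Debug, Clone, Copy)]
enum PriorityClass { Training, Inference, Interactive, Background }

// Base priority per class, as listed in the table above.
fn base_priority(class: PriorityClass) -> i32 {
    match class {
        PriorityClass::Training => 5,
        PriorityClass::Inference => 3,
        PriorityClass::Interactive => 1,
        PriorityClass::Background => -5,
    }
}

// Hypothetical combination rule: class base plus any per-phase boost.
fn effective_priority(class: PriorityClass, phase_boost: i32) -> i32 {
    base_priority(class) + phase_boost
}
```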
The scheduler reads from /etc/zernel/scheduler.toml:
```toml
[general]
phase_eval_interval_ms = 100
gpu_poll_interval_ms = 500
max_tracked_tasks = 65536

[phase_detection]
io_wait_threshold = 0.3
optimizer_burst_max_ns = 5000000
gpu_idle_threshold = 10
gpu_active_threshold = 80
phase_stability_count = 3
nccl_detection_enabled = false

[numa]
enabled = true
gpu_affinity = true
memory_affinity = true

[multi_tenant]
enabled = false
default_priority_class = "normal"

[telemetry]
metrics_port = 9093
push_interval_ms = 1000
```
Generate a default config:
```sh
zernel-scheduler --dump-config > /etc/zernel/scheduler.toml
```
The scheduler exports metrics in Prometheus format:
```
zernel_scheduler_tasks{type="total"} 42
zernel_scheduler_tasks{type="ml"} 8
zernel_scheduler_decisions_total 1847293
zernel_scheduler_phase_transitions_total 4821
zernel_scheduler_phase_tasks{phase="data_loading"} 3
zernel_scheduler_phase_tasks{phase="gpu_compute"} 4
zernel_scheduler_phase_time_pct{phase="gpu_compute"} 72.30
zernel_scheduler_phase_time_pct{phase="data_loading"} 18.50
```
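As a sketch, the `phase_time_pct` gauges could be derived from accumulated per-phase runtime counters like this (the function name and aggregation shape are assumptions, not Zernel's actual code):

```rust
use std::collections::HashMap;

// Convert accumulated per-phase runtime (ns) into percentages of total runtime.
fn phase_time_pct(accum_ns: &HashMap<&'static str, u64>) -> HashMap<&'static str, f64> {
    let total: u64 = accum_ns.values().sum();
    accum_ns
        .iter()
        .map(|(&phase, &ns)| {
            // Guard against division by zero before any time has been recorded.
            let pct = if total == 0 { 0.0 } else { ns as f64 * 100.0 / total as f64 };
            (phase, pct)
        })
        .collect()
}
```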
The scheduler uses two components:

- A userspace daemon that classifies workload phases and writes scheduling decisions into BPF maps
- A BPF program that implements the sched_ext ops, reads decisions from the BPF maps, and applies them at scheduling decision points

The BPF program implements these hooks:
- ops.select_cpu() -- NUMA-aware CPU selection
- ops.enqueue() -- priority-based enqueue ordering
- ops.dispatch() -- phase-aware dispatch
- ops.running() / ops.stopping() -- phase timing
Note: Full BPF implementation requires the `bpf` Cargo feature and Linux 6.12+. The userspace logic works on any platform for development and testing.