Copyright (C) 2026 Dyber, Inc.
Endpoint: GET http://localhost:9091/metrics
Returns metrics in Prometheus text exposition format (v0.0.4).
# GPU Memory
zernel_gpu_memory_used_bytes{pid="<pid>", gpu_id="<id>"}
zernel_gpu_memory_peak_bytes{pid="<pid>", gpu_id="<id>"}
# CUDA Kernel Launch Latency
zernel_cuda_launch_latency_seconds{pid="<pid>", quantile="0.5"}
zernel_cuda_launch_latency_seconds{pid="<pid>", quantile="0.99"}
zernel_cuda_launch_latency_seconds_count{pid="<pid>"}
# NCCL Collective Duration
zernel_nccl_collective_duration_seconds{op="<op>", quantile="0.5"}
zernel_nccl_collective_duration_seconds{op="<op>", quantile="0.99"}
# DataLoader Wait
zernel_dataloader_wait_seconds{pid="<pid>", quantile="0.5"}
zernel_dataloader_wait_seconds{pid="<pid>", quantile="0.99"}
zernel_scheduler_tasks{type="total|ml"}
zernel_scheduler_decisions_total
zernel_scheduler_phase_transitions_total
zernel_scheduler_phase_tasks{phase="data_loading|gpu_compute|nccl_collective|optimizer_step|unknown"}
zernel_scheduler_phase_time_pct{phase="..."}
Endpoint: GET http://localhost:9091/health
Returns 200 OK with body ok when zerneld is healthy.
Endpoint: GET http://localhost:9091/json
Returns the full aggregated metrics as JSON. Useful for debugging.
Endpoint: ws://localhost:9092
Pushes JSON snapshots at a configurable interval (default: 1 second).
{
"gpu_utilization": [
{
"key": "<pid>:<gpu_id>",
"current_bytes": 83886080000,
"peak_bytes": 83886080000
}
],
"cuda_latency_p50_us": 142.0,
"cuda_latency_p99_us": 891.0,
"nccl_allreduce_p50_ms": 34.0,
"nccl_allreduce_p99_ms": 67.0,
"dataloader_wait_p50_ms": 8.0,
"last_update_ms": 1711800000000
}
import asyncio
import websockets
import json
async def watch():
async with websockets.connect("ws://localhost:9092") as ws:
async for message in ws:
data = json.loads(message)
print(f"CUDA p50: {data['cuda_latency_p50_us']:.0f}us")
asyncio.run(watch())
import zernel
# Track an experiment programmatically
with zernel.experiment("my-run") as exp:
exp.log_param("lr", 0.001)
for step in range(1000):
loss = train_step()
exp.log_metric("loss", loss, step=step)
Install: pip install zernel
Copyright © 2026 Dyber, Inc.