eBPF Observability Layer

zerneld is the Zernel observability daemon. It loads eBPF probes into the kernel to instrument ML workloads in real time, with no application code changes. Think of it as perf, but purpose-built for ML.

Probe Architecture

+-------------------+    +-------------------+    +-------------------+
| gpu_mem.bpf.c     |    | cuda_trace.bpf.c  |    | nccl.bpf.c        |
| uprobe: libcuda   |    | uprobe: libcuda   |    | uprobe: libnccl   |
+--------+----------+    +--------+----------+    +--------+----------+
         |                         |                         |
         v                         v                         v
+--------+-------------------------+-------------------------+--------+
|                    BPF Ring Buffers                                   |
+--------+-------------------------+-------------------------+--------+
         |                         |                         |
         v                         v                         v
+--------+----------+    +--------+----------+    +--------+----------+
| GpuMemConsumer    |    | CudaTraceConsumer |    | NcclConsumer      |
+--------+----------+    +--------+----------+    +--------+----------+
         |                         |                         |
         +------------+------------+------------+------------+
                      |                         |
                      v                         v
              +-------+--------+    +-----------+-----------+
              | AggregatedMetrics|    | AlertEngine          |
              +-------+--------+    +-----------+-----------+
                      |
         +------------+------------+
         |                         |
         v                         v
+--------+----------+    +--------+----------+
| Prometheus HTTP   |    | WebSocket Server  |
| :9091/metrics     |    | :9092             |
+-------------------+    +-------------------+
         |                         |
         v                         v
  Grafana / Alertmanager     zernel watch (CLI)
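The aggregation stage in the diagram turns raw probe events into the p50/p99 figures that are exported downstream. How zerneld actually computes them is not described here; the following is only a minimal sketch, assuming a nearest-rank percentile over an in-memory window. The `LatencyAggregator` class and its method names are hypothetical.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    a fraction q of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


class LatencyAggregator:
    """Hypothetical stand-in for the consumer -> AggregatedMetrics step."""

    def __init__(self):
        self.samples_us = []

    def on_event(self, latency_us):
        # Called once per event drained from a ring buffer.
        self.samples_us.append(latency_us)

    def snapshot(self):
        # Summary values of the kind exposed over Prometheus and WebSocket.
        return {
            "p50_us": percentile(self.samples_us, 0.5),
            "p99_us": percentile(self.samples_us, 0.99),
        }


agg = LatencyAggregator()
for latency in (10.0, 20.0, 30.0, 40.0):
    agg.on_event(latency)
print(agg.snapshot())
```

A production aggregator would more likely use a bounded window or a streaming sketch rather than keeping every sample; the unbounded list here is for illustration only.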

What Gets Instrumented

GPU Memory (zernel-gpumem)

CUDA Kernel Launch Latency (zernel-cuda-trace)

NCCL Collective Bottlenecks (zernel-nccl)

Dataset I/O Pipeline (zernel-dataload)

Distributed Synchronization (zernel-dist)

Prometheus Metrics Reference

# GPU Memory
zernel_gpu_memory_used_bytes{pid, gpu_id}
zernel_gpu_memory_peak_bytes{pid, gpu_id}

# CUDA Kernel Latency
zernel_cuda_launch_latency_seconds{pid, quantile="0.5"}
zernel_cuda_launch_latency_seconds{pid, quantile="0.99"}
zernel_cuda_launch_latency_seconds_count{pid}

# NCCL Collectives
zernel_nccl_collective_duration_seconds{op, quantile="0.5"}
zernel_nccl_collective_duration_seconds{op, quantile="0.99"}

# DataLoader
zernel_dataloader_wait_seconds{pid, quantile="0.5"}
zernel_dataloader_wait_seconds{pid, quantile="0.99"}
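To consume these metrics from a script rather than through Prometheus, the text exposition format can be parsed directly. This is a minimal sketch, assuming zerneld serves the standard Prometheus text format; the sample payload and its values are invented for illustration, and `parse_metrics` is a hypothetical helper, not part of Zernel.

```python
import re

# Hypothetical sample of a /metrics response; label values and numbers are invented.
SAMPLE = """\
# TYPE zernel_gpu_memory_used_bytes gauge
zernel_gpu_memory_used_bytes{pid="1000",gpu_id="0"} 83886080000
zernel_cuda_launch_latency_seconds{pid="1000",quantile="0.5"} 0.000142
zernel_cuda_launch_latency_seconds{pid="1000",quantile="0.99"} 0.000891
"""

# metric_name{label="value",...} value   (optional timestamps are not handled)
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

For anything beyond a quick check, a maintained Prometheus client library is a safer choice than a hand-written parser.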

WebSocket Protocol

zerneld pushes JSON snapshots to connected clients at a configurable interval (default: 1s):

{
  "gpu_utilization": [
    {"key": "1000:0", "current_bytes": 83886080000, "peak_bytes": 83886080000}
  ],
  "cuda_latency_p50_us": 142.0,
  "cuda_latency_p99_us": 891.0,
  "nccl_allreduce_p50_ms": 34.0,
  "nccl_allreduce_p99_ms": 67.0,
  "dataloader_wait_p50_ms": 8.0,
  "last_update_ms": 1711800000000
}

Connect with any WebSocket client, for example websocat:

websocat ws://localhost:9092
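Each snapshot is plain JSON, so a client can decode it with any standard parser. The sketch below works on the sample snapshot shown in this section and assumes that `key` encodes `<pid>:<gpu_id>`, which the sample suggests but the protocol description does not state. `summarize` is a hypothetical helper; receiving frames from the socket would need a WebSocket client library and is not shown.

```python
import json

# Sample snapshot copied from the protocol description in this section.
SNAPSHOT = """
{
  "gpu_utilization": [
    {"key": "1000:0", "current_bytes": 83886080000, "peak_bytes": 83886080000}
  ],
  "cuda_latency_p50_us": 142.0,
  "cuda_latency_p99_us": 891.0,
  "nccl_allreduce_p50_ms": 34.0,
  "nccl_allreduce_p99_ms": 67.0,
  "dataloader_wait_p50_ms": 8.0,
  "last_update_ms": 1711800000000
}
"""


def summarize(raw):
    """Decode one snapshot and return a few derived values."""
    snap = json.loads(raw)
    gpu_used_gb = {}
    for entry in snap["gpu_utilization"]:
        # Assumption: "key" has the form "<pid>:<gpu_id>".
        pid, gpu_id = entry["key"].split(":")
        gpu_used_gb[(int(pid), int(gpu_id))] = entry["current_bytes"] / 1e9
    return {
        "gpu_used_gb": gpu_used_gb,
        "cuda_p99_us": snap["cuda_latency_p99_us"],
        "nccl_allreduce_p99_ms": snap["nccl_allreduce_p99_ms"],
    }


print(summarize(SNAPSHOT))
```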

Running zerneld

# Development mode (simulated telemetry, no BPF)
zerneld --simulate

# Production mode (requires Linux + root)
sudo zerneld

# Debug logging
ZERNEL_LOG=debug zerneld --simulate

Grafana Integration

Import the Zernel dashboard from docs/grafana-dashboard.json (coming soon), or build your own: configure Prometheus to scrape http://zernel-host:9091/metrics, then add that Prometheus server as a data source in Grafana.

Alert Configuration

GPU OOM Warning: triggers when gpu_memory_used_pct > 95%
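The rule above can be read as a simple threshold check. The sketch below assumes `gpu_memory_used_pct` is used bytes divided by total device memory; the total is not among the metrics listed in this document, so it is passed in as a parameter here. `gpu_oom_warning` is a hypothetical helper, not the AlertEngine implementation.

```python
def gpu_oom_warning(used_bytes, total_bytes, threshold_pct=95.0):
    """Return True when memory usage is strictly above the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    return used_pct > threshold_pct


# Example with invented numbers: 78 GiB used on an 80 GiB device.
GIB = 1024 ** 3
print(gpu_oom_warning(78 * GIB, 80 * GIB))
```

The comparison is strict, matching the `>` in the rule: usage of exactly 95% does not trigger the warning.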
