
How Stealthium Built a Faster GPU Monitoring Layer

March 2026 · By Branislav Brzak, Stealthium Security Research Team

Managing GPUs at scale requires constant communication between software and hardware. NVML, the NVIDIA Management Library, is the standard tool for that: it's how most software asks a GPU what it's doing, how hot it is, how much power it's drawing, and hundreds of other questions. It's widely adopted, well understood, and for many use cases, good enough.

But good enough has limits. When you're running GPU infrastructure at scale, with tight monitoring loops and real-time observability, the overhead of every individual call starts to matter. NVML wasn't built with that environment as its primary target.

Stealthium GPU Monitor is our answer to that. It's a GPU monitoring layer we built and own, designed from the ground up for deep observability at production scale. In most cases it surfaces more information than NVML, at lower overhead, and it's a codebase we can improve on our own timeline.

The Problem We Went After

Modern GPU infrastructure demands tighter observability than ever before. Teams are running continuous telemetry polling, real-time scheduling decisions, anomaly detection, automated remediation, and dynamic policy enforcement, all simultaneously, all at scale. Per-call overhead compounds fast, and at that scale it becomes real cluster-level cost with no clear path to fix it if you don't own the stack.
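To see how that compounding works, here's a back-of-the-envelope sketch. All the numbers in it are hypothetical, chosen only to illustrate the shape of the problem, not measured from any real cluster:

```python
# Back-of-the-envelope: how per-call monitoring overhead compounds
# across a fleet. All inputs below are hypothetical.

def cluster_overhead_cpu_seconds(per_call_us: float,
                                 calls_per_gpu_per_sec: float,
                                 gpus: int) -> float:
    """CPU-seconds burned per wall-clock second on monitoring calls alone."""
    return per_call_us * 1e-6 * calls_per_gpu_per_sec * gpus

# e.g. 200 us per call, 50 calls/s per GPU, 10,000 GPUs:
# ~100 CPU-seconds of overhead every second, the equivalent of
# 100 cores doing nothing but servicing monitoring calls.
busy_cores = cluster_overhead_cpu_seconds(200, 50, 10_000)
```

Shave the per-call cost and that whole product shrinks with it, which is why per-call latency is the lever worth owning.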

We decided to own the stack.

How We Measured It

We benchmarked the same monitoring API surface across two backend modes: Stealthium GPU Monitor and an NVML-compatible path. Same host, same GPU, same workload profile, same iteration counts, same warmup strategy. No shortcuts.

To reflect realistic production conditions, 5 CUDA programs running at high compute intensity were active in the background throughout the benchmark. This isn't a quiet-machine test; the numbers reflect how both paths perform under actual load.

This measures what matters in production: end-to-end behavior through the product API surface, not isolated microbenchmarks with no connection to real usage.

What the Numbers Show

1) Startup / Initialization: Stealthium GPU Monitor Wins Clearly

Frontend initialization on Stealthium GPU Monitor completed in roughly 2.78 ms, compared with 16.47 ms on the NVML-compatible path, a ~5.9x improvement.

For monitoring components that initialize frequently, such as worker startup, failover recovery, and short-lived monitoring agents, this is an immediate operational gain.

2) High-Frequency Telemetry: Where the Gap Opens Up

The biggest wins are on calls that run in tight production loops:

  • Clock domains query: ~899x faster
  • Process utilization sample query: ~386x faster
  • PCIe data query: ~31x faster
  • Power limit query: ~27x faster

Consistent low-latency behavior across hot-path calls means faster decision loops, lower cumulative CPU pressure, and a platform that scales without proportionally increasing resource cost.
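What a "decision loop" over those hot-path calls looks like can be sketched as a single monitoring tick with a latency budget. The reader functions below are hypothetical stand-ins returning canned values, not real backend calls:

```python
import time

# `read_clocks`, `read_utilization`, and `read_power_limit` are
# stand-ins for the real hot-path queries; they return canned values.
def read_clocks():      return {"sm": 1410, "mem": 1215}   # MHz, canned
def read_utilization(): return 0.83                        # canned
def read_power_limit(): return 350.0                       # watts, canned

def monitoring_tick(budget_ms: float = 1.0) -> dict:
    """One loop iteration: sample hot-path metrics, flag budget misses."""
    t0 = time.perf_counter()
    sample = {
        "clocks": read_clocks(),
        "util": read_utilization(),
        "power_limit_w": read_power_limit(),
    }
    sample["elapsed_ms"] = (time.perf_counter() - t0) * 1e3
    sample["over_budget"] = sample["elapsed_ms"] > budget_ms
    return sample

tick = monitoring_tick()
```

When each query costs microseconds instead of milliseconds, the whole tick fits comfortably inside the budget, and the scheduler acting on it sees fresher data.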

3) Where NVML Leads

On a subset of operations, the NVML-compatible path is faster, and it's worth explaining why.

In some cases, NVML returns results from hardcoded userspace lookup tables, values baked directly into the library that are served without ever touching the driver. Stealthium GPU Monitor doesn't carry those tables; we go to the driver to get that information, which ensures we're always serving accurate, live data. That does add per-call cost on those specific reads.

In other cases, NVML is faster because it returns less data. Less to process means a faster result, but also a less complete one. Stealthium GPU Monitor surfaces richer information, and that depth has a small per-call cost that shows up in the benchmarks.

Both are deliberate tradeoffs. Driver queries and full introspection are foundational to what Stealthium GPU Monitor does. As we introduce targeted caching where it makes sense, these gaps will narrow.
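One shape that targeted caching could take is a TTL cache over effectively-static values (board name, bus ID), while live telemetry always goes to the driver. This is a sketch of the idea only; `query_driver` is a hypothetical stand-in, not our implementation:

```python
import time

def query_driver(key: str):
    """Stand-in for a real driver round-trip."""
    return f"driver-value-for-{key}"

class TTLCache:
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._store = {}                 # key -> (value, expiry)

    def get(self, key: str):
        value, expiry = self._store.get(key, (None, 0.0))
        if time.monotonic() < expiry:
            return value                 # userspace hit, no driver trip
        value = query_driver(key)        # miss or stale: go to the driver
        self._store[key] = (value, time.monotonic() + self.ttl_s)
        return value

cache = TTLCache(ttl_s=60.0)
board = cache.get("board_name")          # first call hits the driver
board_again = cache.get("board_name")    # served from cache until expiry
```

The TTL keeps the cache honest: unlike a hardcoded table, a stale entry can only survive for one interval before the driver is consulted again.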

4) Memory Footprint: Stealthium GPU Monitor Runs Much Leaner

Peak process RSS during benchmark flows:

  • Stealthium GPU Monitor: ~2.3 to 2.7 MB
  • NVML-compatible path: ~22.1 to 22.7 MB

Roughly 19 to 20 MB lower peak footprint per process. For customers running dense agent or sidecar deployments, that translates directly to better node density, less memory pressure, and lower infrastructure cost over time.
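For readers who want to reproduce peak-RSS numbers like these, the standard in-process source on Unix is `getrusage`. A minimal sketch, assuming Linux (where `ru_maxrss` is reported in kilobytes; on macOS it's bytes):

```python
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB (Linux units)."""
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss_kb / 1024.0   # ru_maxrss is KB on Linux

mb = peak_rss_mb()
```

Sampling this at the end of each benchmark flow is enough to produce the per-operation peak-RAM columns in the table below.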

5) Data Depth: Stealthium GPU Monitor Exposes More Than NVML Does

Performance aside, Stealthium GPU Monitor gives us access to GPU data that NVML simply doesn't expose. NVML was designed around a specific set of metrics and has stayed largely within that boundary. Because we query the driver directly and own the full stack, we can surface information that would otherwise be invisible to any monitoring system built on top of NVML.

This matters for observability. More data means better diagnostics, more accurate decisions, and a clearer picture of what's happening on the hardware at any given moment. In a number of cases, we surface information that wasn't accessible at all before.

What Owning the Stack Actually Means

The performance numbers are real, but the more important point is architectural.

Our product uses Stealthium GPU Monitor internally in place of NVML. That means every GPU monitoring call our platform makes (telemetry, power and thermal data, utilization, process-level visibility) goes through a layer we built and control, and can improve without waiting on anyone else's roadmap. When we find a bottleneck, we fix it.

Where Stealthium GPU Monitor doesn't yet have driver coverage, we fall back to NVML automatically rather than fail. That fallback is intentional and shrinks as our driver support expands. NVML stays in the stack, but increasingly as a safety net rather than the default.
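The fallback behavior described above can be sketched as a simple dispatcher: try the native backend first, and route to an NVML-style path only when the native layer reports no coverage. The backend functions and the `NotCovered` error below are hypothetical illustrations, not our actual API:

```python
class NotCovered(Exception):
    """Raised when the native backend lacks driver support for a query."""

def native_query(metric: str):
    if metric == "exotic_metric":      # pretend this one lacks coverage
        raise NotCovered(metric)
    return ("native", metric)

def nvml_query(metric: str):
    return ("nvml", metric)            # stand-in for the NVML path

def monitor_query(metric: str):
    try:
        return native_query(metric)    # preferred: owned, faster path
    except NotCovered:
        return nvml_query(metric)      # safety net, shrinks over time

result = monitor_query("power_limit")      # served natively
fallback = monitor_query("exotic_metric")  # routed to the NVML path
```

The important property is that callers never see the seam: the same query surface answers either way, and expanding native coverage changes only which branch runs.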

What This Means in Practice

If you're running our platform on GPU infrastructure with frequent telemetry polling, tight observability requirements, or dense agent deployments, these improvements are already working for you under the hood: faster initialization, lower memory pressure, and more complete GPU data, with no configuration changes on your end.

Stealthium GPU Monitor isn't a feature you opt into. It's how our platform is built.

Our Position on NVML

NVML is a solid piece of engineering and remains an important part of the GPU software ecosystem. Our goal with Stealthium GPU Monitor isn't to dismiss it, but to go further where GPU observability demands it. Within our supported driver range, Stealthium GPU Monitor handles the vast majority of operations, with NVML available as a fallback where we don't yet have full coverage. That balance will continue to shift as our driver support grows.

The Bottom Line

Stealthium GPU Monitor powers the observability layer inside our product. It's faster at startup, faster on high-frequency telemetry, uses a fraction of the memory, and in most cases gives us richer data than NVML does. Where we still fall back to NVML, there's a clear technical reason, and we're closing those gaps.

The result is a monitoring platform that doesn't have to trade accuracy for speed, or performance for compatibility.

Full Benchmark Results

All operations benchmarked under identical conditions on the same host and GPU, with 5 high-compute-intensity CUDA programs running in the background throughout.

| Operation | Stealthium Average | NVML Average | Faster | Speedup | Stealthium Peak RAM | NVML Peak RAM | RAM Increase (NVML vs Stealthium) |
|---|---|---|---|---|---|---|---|
| Frontend Initialization | 2.78 ms | 16.47 ms | Stealthium | 5.92x | 2.27 MB | 7.29 MB | +5.02 MB |
| GPU Count Lookup | 2.42 ns | 1.73 ns | NVML | 1.40x | 2.28 MB | 21.73 MB | +19.45 MB |
| Driver Version Lookup | 22.18 ns | 16.18 ns | NVML | 1.37x | 2.28 MB | 21.73 MB | +19.45 MB |
| NVML Version Lookup | 22.91 ns | 15.97 ns | NVML | 1.43x | 2.28 MB | 21.73 MB | +19.45 MB |
| CUDA Driver Version Lookup | 2.77 ns | 1.92 ns | NVML | 1.44x | 2.28 MB | 21.73 MB | +19.45 MB |
| SM Version Query | 8.39 us | 0.94 us | NVML | 8.97x | 2.28 MB | 21.73 MB | +19.45 MB |
| Device Name Query | 3.35 us | 1.11 us | NVML | 3.02x | 2.28 MB | 21.73 MB | +19.45 MB |
| PCI Bus ID Query | 200.18 ns | 1.26 us | Stealthium | 6.27x | 2.28 MB | 21.73 MB | +19.45 MB |
| OEM Board Info Query | 2.82 us | 61.62 ns | NVML | 45.81x | 2.28 MB | 21.73 MB | +19.45 MB |
| Temperature Query | 6.17 us | 0.96 us | NVML | 6.40x | 2.28 MB | 21.79 MB | +19.51 MB |
| Clock Domains Query | 6.18 us | 5.55 ms | Stealthium | 898.78x | 2.34 MB | 21.91 MB | +19.57 MB |
| Utilization Query | 6.32 us | 2.34 us | NVML | 2.70x | 2.34 MB | 22.04 MB | +19.69 MB |
| Current P-State Query | 6.17 us | 1.00 us | NVML | 6.19x | 2.34 MB | 22.08 MB | +19.74 MB |
| Power Limit Query | 6.21 us | 167.87 us | Stealthium | 27.04x | 2.34 MB | 22.14 MB | +19.80 MB |
| Average Power Query | 6.38 us | 1.10 us | NVML | 5.82x | 2.34 MB | 22.14 MB | +19.80 MB |
| Instant Power Query | 6.22 us | 1.04 us | NVML | 5.98x | 2.34 MB | 22.14 MB | +19.80 MB |
| PCIe Data Query | 6.24 us | 191.98 us | Stealthium | 30.75x | 2.34 MB | 22.14 MB | +19.80 MB |
| Process Utilization Info | 3.38 ms | 2.69 ms | NVML | 1.26x | 2.59 MB | 22.15 MB | +19.56 MB |
| Process Memory Info | 99.82 us | 162.50 us | Stealthium | 1.63x | 2.59 MB | 22.15 MB | +19.56 MB |
| Process Utilization Sample Query | 6.23 us | 2.41 ms | Stealthium | 385.93x | 2.59 MB | 22.15 MB | +19.56 MB |