
GPUBreach: A Root Shell Through GPU Abuse — and How Stealthium Detects It

APR 2026. By André Brandão, Branislav Brzak, and Bartosz Szczepanek, Stealthium Core Team.

Summary

A University of Toronto research team led by Chris S. Lin and Prof. Gururaj Saileshwar recently disclosed GPUBreach (https://gpubreach.ca), a new class of attack targeting NVIDIA GPU drivers. The work shows how fault-injection techniques such as Rowhammer can be combined with GPU memory-management behavior to achieve privilege escalation, even in environments with protections like the IOMMU enabled.

While the full technical paper is not yet public, the initial disclosure provides enough insight to understand the broader attack surface and its implications for modern GPU-accelerated systems.

In this post, we examine what this class of vulnerability means from a defensive perspective, focusing on how it manifests within the NVIDIA driver stack and, critically, what signals it leaves behind at runtime.

Our analysis is based solely on publicly available information and NVIDIA's open-source kernel modules, and is intended to identify observable indicators of exploitation rather than reconstruct the full exploit chain.

We then show how these signals can be detected in practice using Stealthium's GPU runtime telemetry.

The Attack

The NVIDIA kernel driver allocates a shared memory region (pSharedMemDesc) in host RAM that the GPU's onboard ARM processor (GSP) uses as a bidirectional RPC message queue. The IOMMU explicitly permits GPU DMA access to this region so that the GSP firmware is able to write its responses there.

Lin et al. use Rowhammer on GDDR6 to flip aperture bits in a GPU PTE, redirecting a framebuffer write into this IOMMU-permitted system memory. The GPU doesn't know anything changed. The IOMMU sees a legitimate DMA to a permitted address. But the write lands in the driver's status queue, where a single unvalidated field (elemCount) drives a heap buffer overflow into adjacent kernel function pointers.

The IOMMU is not bypassed. It is used exactly as intended. That is what makes this attack work.
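
The fault-injection step can be pictured with a toy model. The sketch below is purely schematic: the real NVIDIA GPU PTE layout is not public in this form, and APERTURE_SHIFT and the aperture values are hypothetical, chosen only to show how a single flipped bit can retarget an otherwise-valid PTE from video memory to IOMMU-permitted system memory.

```c
#include <assert.h>
#include <stdint.h>

/* Schematic only: the real GPU PTE layout is not public in this form.
 * APERTURE_SHIFT and the aperture values below are hypothetical. */
#define APERTURE_SHIFT  1
#define APERTURE_VIDMEM 0x0ULL  /* hypothetical: target is GPU framebuffer */
#define APERTURE_SYSMEM 0x1ULL  /* hypothetical: target is host RAM via DMA */

/* Extract the (hypothetical) 2-bit aperture field from a PTE. */
static uint64_t pte_aperture(uint64_t pte)
{
    return (pte >> APERTURE_SHIFT) & 0x3ULL;
}

/* A Rowhammer-induced fault is, logically, an XOR with one bit position. */
static uint64_t flip_bit(uint64_t pte, unsigned bit)
{
    return pte ^ (1ULL << bit);
}
```

The point of the model: nothing else in the PTE changes, so every downstream check that inspects the entry still sees a valid mapping.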

A Note on Exploitability

Under normal circumstances, this path is relatively harmless. Triggering it requires either:

  • Compromise of the GSP firmware itself, which is signed by NVIDIA and verified at load time, making tampering computationally infeasible without a separate firmware-signing vulnerability; or
  • Corruption of the communication channel between GSP and the driver, specifically the status queue pages inside pSharedMemDesc.

In practice, the GSP does not write elemCount > 16 — the driver's own maximum RPC size caps legitimate messages at 16 elements. But the ring buffer supports up to 63, and there is no enforcement on the receiver's side. The driver simply trusts what GSP writes. That implicit trust is a potential vulnerability: not a flaw that manifests under normal conditions, but one that could become exploitable the moment an attacker can write into the status queue.

Lin et al. appear to achieve just that with Rowhammer. By flipping aperture bits in a GPU PTE, they redirect a GPU write, without the GPU knowing, into the status queue pages inside pSharedMemDesc. The firmware is not compromised. The signing checks pass. The shared memory is written to from an unexpected source, and the driver has no way to distinguish it from a legitimate GSP response. The "trusted data" becomes attacker-controlled, and the missing bounds check on elemCount becomes the entry point for a kernel heap overflow.

Our Theory of Root Cause: The Trusted Driver State

Repository: https://github.com/NVIDIA/open-gpu-kernel-modules Commit/Tag: db0c4e65c8e34c678d745ddb1317f53f90d1072b / 595.58.03

The shared memory block is allocated at boot as system memory and DMA-mapped for GPU access:

File: src/nvidia/src/kernel/gpu/gsp/message_queue_cpu.c
// allocated in host RAM
236     NV_ASSERT_OK_OR_GOTO(nvStatus,
237         memdescCreate(&pMQCollection->pSharedMemDesc, pGpu, sharedBufSize,
238             RM_PAGE_SIZE, NV_MEMORY_NONCONTIGUOUS, ADDR_SYSMEM, NV_MEMORY_CACHED,
239             flags),
240         error_ret);
...
// IOMMU maps it for GPU DMA access
244     memdescSetPageSize(pMQCollection->pSharedMemDesc, AT_GPU, RM_PAGE_SIZE_HUGE);

Its layout is three contiguous regions:

pSharedMemDesc (single SYSMEM allocation)
├── [0 .. pageTableSize)          page table  — IOVAs for GSP to locate the buffer
├── [pageTableSize ..)            command queue — CPU -> GSP
└── [pageTableSize + cmdSize ..)  status queue  — GSP -> CPU (attacker's target)
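
The carve-up reduces to plain offset arithmetic. In the sketch below the region sizes are assumptions (one 4 KiB page of IOVAs for the page table, a command queue matching the 256 KiB status queue size we observed); the helper and enum names are ours, not the driver's.

```c
#include <assert.h>
#include <stddef.h>

/* Region sizes are assumptions, not values taken from the driver. */
#define RM_PAGE_SIZE    4096u
#define PAGE_TABLE_SIZE RM_PAGE_SIZE   /* hypothetical: one page of IOVAs */
#define CMD_QUEUE_SIZE  262144u        /* assumed equal to statusQueueSize */

typedef enum { REGION_PAGE_TABLE, REGION_CMD_QUEUE, REGION_STATUS_QUEUE } region_t;

static size_t cmd_queue_offset(void)    { return PAGE_TABLE_SIZE; }
static size_t status_queue_offset(void) { return PAGE_TABLE_SIZE + CMD_QUEUE_SIZE; }

/* Classify an offset within pSharedMemDesc into one of the three regions. */
static region_t region_of(size_t off)
{
    if (off < cmd_queue_offset())    return REGION_PAGE_TABLE;
    if (off < status_queue_offset()) return REGION_CMD_QUEUE;
    return REGION_STATUS_QUEUE;
}
```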

The status queue is where GSP writes its RPC responses. After Rowhammer corrupts a GPU PTE, the GPU's next write to what it believes is framebuffer memory lands here instead.

The Overflow: elemCount with No Bounds Check

When the CPU driver processes a status queue response in GspMsgQueueReceiveStatus, it:

  1. Reads the first status queue page into a staging buffer (pCmdQueueElement, a separate CPU-only heap allocation of exactly 16 × 4096 = 65536 bytes)
  2. Reads elemCount from that staging area, a value the GSP wrote, now potentially attacker-controlled
  3. Loops, copying one 4096-byte page per iteration, with no check that elemCount is within bounds
File: src/nvidia/src/kernel/gpu/gsp/message_queue_cpu.c
// elemCount read from GPU-written memory, no validation
669                 nElements = pMQI->pCmdQueueElement->elemCount;
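
Reduced to its essentials, the receive path follows the pattern below. This is our simplification for illustration; the names, types, and constants are ours, not the driver's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ELEM_SIZE      4096u  /* one ring element = one page */
#define MAX_SAFE_ELEMS 16u    /* staging buffer capacity */

/* elemCount arrives in GPU-written shared memory; nothing below checks it
 * against the staging capacity -- that is the vulnerable pattern. */
static size_t bytes_copied(uint32_t elem_count)
{
    size_t total = 0;
    for (uint32_t i = 0; i < elem_count; i++)
        total += ELEM_SIZE;   /* one page per iteration, no bounds check */
    return total;
}

/* The missing receiver-side validation, expressed as a predicate. */
static int is_overflow(uint32_t elem_count)
{
    return elem_count > MAX_SAFE_ELEMS;
}
```

The contrast is the whole bug: the number of bytes written scales linearly with a field the receiver never validates.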

The staging buffer holds 16 elements. What makes the overflow dangerous is the work area layout. Both pCmdQueueElement and pMetaData are carved out of the same single portMemAllocNonPaged allocation (pWorkArea), with pMetaData placed immediately after the staging buffer — no padding, no gap:

File: src/nvidia/src/kernel/gpu/gsp/message_queue_cpu.c
148     pMQI->pCmdQueueElement = (GSP_MSG_QUEUE_ELEMENT *)
149         NV_ALIGN_UP((NvUPtr)pMQI->pWorkArea, 1 << pMQI->queueElementAlign);
150     pMQI->pMetaData = (void *)((NvUPtr)pMQI->pCmdQueueElement + pMQI->queueElementSizeMax);
//                                                                   ^^^^^^^^^^^^^^^^
//                                                           exactly 16 × 4096 = 65536 bytes

So pMetaData sits at pCmdQueueElement + 65536. It is the msgqMetadata struct, which contains the queue's function pointer table:

File: src/nvidia/inc/libraries/msgq/msgq_priv.h
 67 // Internal tracking structure (handle)
 68 typedef struct
 69 {
...
100     // notifications
101     msgqFcnNotifyRemote   fcnNotify;        // function pointer
102     void                 *fcnNotifyArg;
103     msgqFcnBackendRw      fcnBackendRw;     // function pointer
104     void                 *fcnBackendRwArg;
105     msgqFcnCacheOp        fcnInvalidate;    // function pointer
106     msgqFcnCacheOp        fcnFlush;         // function pointer
107     msgqFcnCacheOp        fcnZero;          // function pointer
108     msgqFcnBarrier        fcnBarrier;       // function pointer
109 } msgqMetadata;

Iterations 0–15 fill the staging buffer exactly. The 17th iteration (i=16) writes directly onto these function pointers, with content copied from attacker-controlled status queue pages.
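
The offset arithmetic is worth making explicit. In this sketch (helper names ours, for illustration), copy-loop iteration i writes at i × 4096 bytes from the staging buffer base, so i = 16 lands exactly at the 65536-byte mark where pMetaData begins:

```c
#include <assert.h>
#include <stddef.h>

#define ELEM_SIZE           4096u
#define QUEUE_ELEM_SIZE_MAX (16u * ELEM_SIZE)  /* 65536: staging buffer size */

/* Offset written by copy-loop iteration i, relative to pCmdQueueElement. */
static size_t iteration_offset(unsigned i)
{
    return (size_t)i * ELEM_SIZE;
}

/* pMetaData sits at exactly QUEUE_ELEM_SIZE_MAX, so the first out-of-bounds
 * iteration (i = 16) writes straight onto its function pointer table. */
static int lands_on_metadata(unsigned i)
{
    return iteration_offset(i) >= QUEUE_ELEM_SIZE_MAX;
}
```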

As the full paper is not yet public, we don't have visibility into the exact exploit internals. From the code alone, fcnNotify appears to be the most immediately useful target: it is called at the end of every msgqTxSubmitBuffers, which fires on every message the CPU sends to GSP, so any GPU operation suffices to trigger it. The pre-release PoC documentation specifically names nvidia-smi as the trigger, though that is the PoC's chosen method, not necessarily what the full exploit uses. Either way, the moment the corrupted fcnNotify pointer is called, control could transfer to attacker-supplied code.

The loop isn't truly unbounded: there is an implicit ceiling, just not one the driver enforces. msgqRxGetReadBuffer returns NULL when the number of elements requested exceeds msgCount, the ring buffer capacity. The concrete values came from our kernel instrumentation (covered in the next section):

msgqRxLink success: size=262144 msgSize=4096 entryOff=4096 msgCount=63

msgCount = (statusQueueSize - entryOff) / msgSize = (262144 - 4096) / 4096 = 63. The effective maximum is msgCount - 1 = 62, the standard ring-buffer invariant where one slot is kept empty to distinguish full from empty. So the loop runs at most 62 iterations before msgqRxGetReadBuffer naturally returns NULL. With a staging buffer of 16, that leaves (62 - 16) × 4096 = 188,416 bytes (roughly 188 KB) of controlled heap corruption past pMetaData and into the surrounding heap.
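
The bound and the resulting corruption budget reduce to two formulas; the sketch below simply encodes the arithmetic from the paragraph above (function names are ours):

```c
#include <assert.h>

/* msgCount: how many elements fit in the status queue ring. */
static unsigned msg_count(unsigned status_queue_size, unsigned entry_off,
                          unsigned msg_size)
{
    return (status_queue_size - entry_off) / msg_size;
}

/* Bytes of controlled corruption past the staging buffer: the effective
 * maximum (msgCount - 1, one ring slot kept empty) minus the 16 staged
 * elements, times the element size. */
static unsigned overflow_bytes(unsigned effective_max, unsigned staging_elems,
                               unsigned msg_size)
{
    return (effective_max - staging_elems) * msg_size;
}
```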

We Verified The Behavior With Instrumentation

We instrumented the receive path with kernel-level logging and measured elemCount values and loop counts on a live system:

INSTRUMENT msgqRxLink success: size=262144 msgSize=4096
  entryOff=4096 msgCount=63 (overflow loop bound = msgCount-1 = 62)

INSTRUMENT GspStatusQueueInit: msgqRxLink succeeded after 91766 retries.
  statusQueueSize=262144 queueElementSizeMin=4096
  implicit msgCount (approx)=64

INSTRUMENT call #219526 retry 0: elemCount=1 maxSafe=16

Under normal operation, elemCount=1. The staging buffer capacity is 16. The ring buffer's effective maximum is 62. Any value above 16 overflows pCmdQueueElement into pMetaData. The gap between the safe limit and the implicit bound is 46 elements, roughly 188 KB of controlled writes.

Is This Fixed?

No. The elemCount field is read from GPU-written memory with no bounds check in the current open-source release, 595.58.03 (commit db0c4e65).
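
For illustration, a receiver-side check of the kind that would close the hole could look like the sketch below. This is our sketch, not an NVIDIA patch, and the driver types are simplified stand-ins:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for driver types -- ours, not NVIDIA's. */
typedef int NV_STATUS;
#define NV_OK                0
#define NV_ERR_INVALID_STATE 1

#define MAX_SAFE_ELEMS 16u   /* queueElementSizeMax / queueElementSizeMin */

/* Reject any elemCount the staging buffer cannot hold, before the copy
 * loop runs. A zero count is also malformed. */
static NV_STATUS validate_elem_count(uint32_t elem_count)
{
    if (elem_count == 0 || elem_count > MAX_SAFE_ELEMS)
        return NV_ERR_INVALID_STATE;
    return NV_OK;
}
```

One comparison before the loop is enough to make a corrupted status queue unexploitable through this path.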

Detection: How Stealthium Catches GPUBreach Today

This class of attack has a clear, observable signature at the kernel driver level. GspMsgQueueReceiveStatus is called on every synchronous RPC that requires a GSP response (memory allocation, context creation, and similar control-plane operations) as well as from the GPU interrupt bottom-half handler, which fires independently of what userspace is doing. Our instrumentation recorded over 219,000 calls during a normal session. Under normal operation, elemCount is typically 1. Values above 16 are never legitimate.

The Stealthium sensor continuously monitors the elemCount field read from the staging buffer on every invocation. Any value exceeding pMQI->queueElementSizeMax / pMQI->queueElementSizeMin (16) triggers an alert. Because the sensor validates elemCount in real time, a malformed message is flagged as it is read, before the copy loop consumes it, giving you both protection and immediate notification if a malicious program attempts to execute GPUBreach.
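
In isolation, the detection rule is a small predicate. The sketch below illustrates the logic only (it is not Stealthium's sensor code), with the threshold derived from the queue geometry rather than hard-coded:

```c
#include <assert.h>
#include <stdint.h>

/* Derive the alert threshold from queue geometry instead of hard-coding 16:
 * queueElementSizeMax / queueElementSizeMin = 65536 / 4096 = 16. */
static uint32_t alert_threshold(uint32_t elem_size_max, uint32_t elem_size_min)
{
    return elem_size_max / elem_size_min;
}

/* Flag any elemCount a legitimate GSP response could never produce. */
static int should_alert(uint32_t elem_count, uint32_t elem_size_max,
                        uint32_t elem_size_min)
{
    return elem_count > alert_threshold(elem_size_max, elem_size_min);
}
```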

Because Stealthium's sensor continuously monitors GPU RPC telemetry, this is not only a forward-looking detection. If you are an existing customer, you can query your historical data right now for anomalous elemCount values, before this vulnerability was public, before you knew to look for it. If this class of exploit has been used against your systems, the signal is already there.

Conclusion

GPU kernel drivers occupy a privileged and under-scrutinized position in the software stack. The NVIDIA GSP architecture introduces an ARM co-processor running firmware that the host kernel trusts unconditionally, including the message metadata it writes into shared memory. When that trust is violated—whether through Rowhammer or any future technique that corrupts GSP-written memory—the absence of input validation in the receive path turns a single field into a kernel code execution primitive.

The vulnerability is not in the signed firmware. It is not in the IOMMU configuration. It is in the assumption that shared memory is never written by anyone other than the firmware that owns it, and in the lack of even a single bounds check that would have rendered that assumption irrelevant. Vulnerabilities like this one demonstrate the expanding attack surface of GPU-accelerated computing infrastructure: as AI workloads become increasingly critical to business operations, the security implications of GPU driver flaws grow correspondingly severe.

Traditional security approaches focused solely on patching are insufficient given the lag between vulnerability discovery and patch deployment, the complexity of GPU driver ecosystems across multiple branches, the sophistication of modern exploitation techniques, and the multi-tenant nature of cloud GPU environments.

Acknowledgement

"It's encouraging to see Stealthium taking attacks like GPUBreach seriously and developing telemetry-driven mechanisms to detect and mitigate them. GPUBreach is a significant vulnerability, providing adversaries a powerful entry point for system compromise and privilege escalation that protections like IOMMU cannot prevent. I hope cloud service providers adopt best practices, such as enabling ECC, and deploy robust detection-based mitigations to safeguard their systems against such attacks."

  • Prof. Gururaj Saileshwar, University of Toronto