One of us was on conference floors in San Jose and San Francisco reading the macro. The other was in customer environments running POCs and seeing what actually breaks. What we both found pointed to the same gap.
I. Two Conferences, One Uncomfortable Truth
The US government has committed $500 billion to AI infrastructure. The hyperscalers are spending another $300 billion this year alone on GPU buildouts. Enterprises are dropping eight figures on H100 clusters without a single tool that tells them what is actually executing inside those accelerators at runtime. That is not a security gap. That is a business risk sitting at the center of the most expensive infrastructure build in the history of enterprise technology.
The week of March 17th, I was on the floor at GTC in San Jose. The week after, I was at RSA in San Francisco. Ahmed was not at either. He was in customer environments running live proofs of concept on real GPU infrastructure. We were comparing notes every night. The contrast was difficult to ignore.
At GTC, Jensen Huang unveiled Vera Rubin, a next-generation inference architecture delivering up to 10x higher throughput per watt, built explicitly for agentic AI at industrial scale. AI factories were reframed as token-production plants. NemoClaw and OpenClaw introduced what amounts to an "Android for agents" operating layer. Physical AI, humanoid robotics, and Omniverse-powered digital twins filled the demo floor. AI sovereignty was positioned not as compliance theater but as strategic infrastructure for nations and the hyperscalers building for them. The message was clear: the infrastructure era of AI is here, it is industrial, and it is moving faster than anyone's roadmap anticipated.
A week later at RSA, the conversation was different in almost every way. The stages were packed with agentic AI for security operations, autonomous SOCs, shadow AI governance, identity as the new perimeter and platform consolidation. The booths promised compliance automation, responsible AI programs and governance frameworks built on policies, committees and checklists. Every conversation assumed AI was just another application layer sitting on top of familiar infrastructure that existing tools already understood.
Two of the most important technology conferences in the world. Running back to back. Talking past each other completely.
And then there was what Ahmed was actually finding on the ground.
The infrastructure teams he was working with, real operators running real GPU clusters, had almost no visibility into what their accelerators were doing at runtime. No one could tell which workloads were sharing which GPUs, which CUDA kernels were executing, or whether anything anomalous had occurred in the past week. The conference world was debating how to govern AI applications and racing to scale AI factories. The actual environments had no visibility at the layer that mattered most.
That is the gap this article is about.
II. What the Conferences Said vs. What the POCs Showed
From the conference floor (Sherif):
The dominant RSA framing was that security teams need to get ahead of AI risk at the application and governance layer. The assumption baked into almost every session was that the infrastructure running AI workloads is handled. Teams were being sold frameworks for auditing AI models, detecting prompt injection at the API layer, and governing who can use which AI tools inside the enterprise. Reasonable concerns, but they presuppose a monitoring foundation that largely does not exist one layer down.
GTC told a different story about the infrastructure itself. The scale Jensen described was not abstract. Vera Rubin architecture, NVLink-connected GPU fabrics spanning entire data centers, Blackwell clusters running inference at previously unimaginable throughput. The ambition was real and the investment behind it is enormous. What was absent from both conferences was any serious discussion of what it actually takes to see and secure that infrastructure while it runs.
From the customer environments (Ahmed):
The pattern across engagements is consistent. Security and infrastructure teams are confident about their CPU-side coverage. They have EDR deployed, eBPF-based kernel sensors running, SIEM correlated and tuned. When we start asking about GPU-layer visibility, the conversation shifts. Most teams cannot tell us which processes are running inside their GPU workloads, what is happening in GPU memory between jobs, whether their multi-tenant environments are actually isolated at the hardware level, or when a workload's behavior drifted from its baseline.
In one engagement, a customer with a mature security posture and a well-staffed SOC had 45 GPUs running deprecated CUDA versions with known vulnerabilities. They had no alert for it. In another, we found a GPU that had been running continuously for over five days with 18 critical-severity anomalies in the prior 24 hours. The SOC had no visibility. The GPU was shared across tenants. No one knew.
The research validates what we see in the field. In July 2025, the University of Toronto demonstrated GPUHammer, a RowHammer-style attack targeting NVIDIA A6000 GPUs with GDDR6 memory. Bit flips induced in floating-point weights caused model accuracy to collapse from 80 percent to below 1 percent. NVIDIA's mitigation reduces performance by up to 10 percent and decreases memory capacity by 6.25 percent. Operators are forced to choose between security and economics.
III. Why the Endpoint Playbook Is Failing
From the field (Sherif):
My background before Stealthium was incident response on the CPU side: EDR, CNAPP, kernel-level telemetry, eBPF, SIEM, threat intelligence. If you gave our teams process trees, syscall traces, network flows, container runtime logs, and solid threat intel, we could forensically reconstruct an intrusion with surgical precision. Who did what, when, from where, and how bad it really was.
That entire mental model rests on one assumption: the CPU and kernel are where interesting behavior lives, and where you can observe it. In the AI world being built today, that assumption is breaking fast.
From the POCs (Ahmed):
When we deploy Stealthium into a new environment, the first thing we do is map what the existing security stack can actually see. EDR agents on the hosts: yes. Kernel-level syscall tracing: yes. Network flows between containers: yes. VRAM allocation patterns across a shared A100: no. CUDA kernel execution sequences: no. Cross-tenant GPU access events: no. Driver-layer behavior when a job terminates abnormally: no.
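As a concrete starting point, even before GPU-native tooling, a team can at least enumerate which processes are resident on which accelerators from nvidia-smi's query interface. The sketch below parses that CSV output to surface GPUs shared by more than one process; the sample records and the miner path are illustrative, not drawn from a real engagement:

```python
import csv
import io

# Sample output of:
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory \
#              --format=csv,noheader,nounits
# (values are illustrative, not from a real fleet)
SAMPLE = """\
GPU-1a2b3c4d, 4117, /usr/bin/python3, 38912
GPU-1a2b3c4d, 9021, /opt/miner/xmrig, 1024
GPU-9f8e7d6c, 5530, /usr/bin/python3, 40960
"""

def compute_processes(text):
    """Parse per-GPU compute processes from nvidia-smi CSV output."""
    rows = []
    for gpu_uuid, pid, name, mem_mib in csv.reader(io.StringIO(text),
                                                   skipinitialspace=True):
        rows.append({"gpu": gpu_uuid, "pid": int(pid),
                     "name": name, "mem_mib": int(mem_mib)})
    return rows

procs = compute_processes(SAMPLE)
shared = {}
for p in procs:
    shared.setdefault(p["gpu"], []).append(p["pid"])
# GPUs hosting more than one process are candidates for tenant-sharing review
print({gpu: pids for gpu, pids in shared.items() if len(pids) > 1})
```

This answers "which PIDs are on which GPU" at a snapshot in time; it says nothing about CUDA kernel sequences, memory behavior between jobs, or driver-layer events, which is exactly the gap described above.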
The gap is not theoretical. In one POC, we traced a resource-hijacking pattern that had been running for 11 days before we instrumented the environment. The CPU-side tools saw nothing because nothing unusual happened at the CPU layer. The process looked legitimate from the kernel's perspective. The abuse was entirely within the GPU execution context.
The threat surface that has shifted to the GPU layer:
- Adversarial inputs at inference time: prompt injection against agents, adversarial images and audio designed to mislead models, crafted query patterns for model extraction, and GPU cycle theft for unauthorized workloads including cryptocurrency mining.
- Multi-tenant abuse on shared GPU clusters: lateral movement through high-speed interconnects like NVLink, data leakage across tenants sharing the same accelerator, and inter-GPU behavior patterns that CPU-oriented tools do not see.
- Compliance and governance pressure to demonstrate operational control over AI systems in production, not just publish model cards and risk frameworks.
Three principles from the CPU era, applied to a layer where almost no one is enforcing them:
Logs lie or go missing. Runtime behavior tells the truth. Real-time telemetry from the GPU execution layer is the only reliable ground truth.
Defense in depth is meaningless if runtime telemetry stops at the kernel. Layered controls built on incomplete visibility create a false sense of security while attackers operate in blind spots.
Governance only becomes real when you can show operational evidence. Auditors, regulators, and boards increasingly demand proof of continuous monitoring and control, not policy documents.
IV. The Research That Should Concern You
The security research community has spent the past year sounding alarms about GPU security. What is new is the direct applicability to production AI infrastructure.
Hardware-Level Exploits
GPUHammer (2025) is the first successful RowHammer attack against GPUs, targeting GDDR6 memory in NVIDIA A6000 cards. Bit flips induced in the exponent portion of a floating-point weight caused model accuracy to collapse from 80 percent to 0.1 percent. NVIDIA's mitigation forces a performance and capacity penalty. Operators choose between security and economics.
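To see why a single flipped exponent bit is so destructive, consider the IEEE-754 encoding directly. The Python sketch below is purely illustrative (it is not the GPUHammer exploit, which requires physical RowHammer access patterns against GDDR6): it flips the top exponent bit of a float32 weight and shows the value exploding by dozens of orders of magnitude, which is why one corrupted weight can destroy model accuracy.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip a single bit in the IEEE-754 float32 encoding of `value`."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped

weight = 0.5
# Bit 30 is the most significant exponent bit of a float32
corrupted = flip_bit(weight, 30)
print(weight, "->", corrupted)  # 0.5 -> ~1.7e38
```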
CUDA Toolkit Vulnerabilities
In January 2026, NVIDIA disclosed four high-severity vulnerabilities in the CUDA Toolkit (CVE-2025-33228 through CVE-2025-33231). Attackers can exploit command injection flaws in installation paths, inject OS commands through malicious input strings, or abuse uncontrolled DLL search paths to execute arbitrary code with escalated privileges. All CUDA Toolkit versions prior to 13.1 are vulnerable. Ahmed has found versions well below 13.1 running in production across multiple POC environments.
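A minimal audit for this class of exposure reduces to comparing each node's reported CUDA Toolkit version against the 13.1 fix line named above. The fleet inventory in this sketch is illustrative; in practice the versions would come from `nvcc --version` or package metadata:

```python
def parse_version(v: str):
    """Turn a dotted version string like '12.4' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def cuda_toolkit_vulnerable(installed: str, fixed: str = "13.1") -> bool:
    """True if the installed CUDA Toolkit predates the fixed release."""
    return parse_version(installed) < parse_version(fixed)

# Illustrative node-to-version inventory, not a real environment
fleet = {"node-a": "12.4", "node-b": "13.1", "node-c": "11.8"}
exposed = [node for node, ver in fleet.items() if cuda_toolkit_vulnerable(ver)]
print(exposed)  # ['node-a', 'node-c']
```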
Multi-Tenant Isolation Failures
Research confirms that soft isolation strategies — Kubernetes namespaces and vClusters — are fundamentally inadequate for GPU workloads. Only hard isolation provides real protection: dedicated Kubernetes clusters, MIG-based GPU partitioning, VPCs, VxLAN, VRFs, KVM virtualization, InfiniBand P-KEYs, and NVLink partitioning. In practice, MIG adoption is low. Most shared GPU environments Ahmed encounters still rely on soft isolation and are unaware of the exposure.
Driver and Kernel Module Exploits
An October 2025 analysis of NVIDIA's Linux Open GPU Kernel Modules revealed exploitable use-after-free bugs allowing local unprivileged processes to achieve kernel read and write primitives. A February 2026 GPU driver vulnerability (CVE-2025-47397) stems from unchecked IOMMU mapping errors during scatter-gather DMA operations, which can enable privilege escalation, unauthorized data access, or system instability.
Side-Channel Attacks
- GPU dynamic voltage and frequency scaling (DVFS) creates detectable electromagnetic signatures that can fingerprint websites, infer keystroke timing, and identify which neural networks are executing, even through walls and at a distance.
- Intensive GPU processes induce detectable power fluctuations in USB and HDMI ports, leaking information about matrix multiplications and neural network execution.
- By reverse-engineering NVIDIA GPU scheduling parameters, attackers can carry out timing-based side channels across both graphics and compute workloads.
V. What the Dashboards Reveal: The POC Reality
The following dashboards are drawn from active Stealthium deployments. This is not a demo environment. These are signals we are seeing in real GPU fleets during current engagements.
GPU-Native Incident Response: Reconstructing the Kill Chain
When an attacker exploits shared GPU buffers or manipulates clock domains to bypass integrity checks, nothing lights up in kernel sensors, SIEM, or EDR. Alert sequencing views reconstruct attack narratives directly from GPU telemetry: which CUDA kernel triggered the exploit, which memory buffer was abused, which tenant's workload initiated the malicious behavior, and how the attack chain progressed through GPU memory regions and driver layers.
The correlated alerts view spans the severity levels Critical (interrupt handler manipulation in AI/ML training), High (GPU clock domain manipulation, memory buffer overflow, anomalous power consumption), and Medium (suspicious texture access, unauthorized context switching). Each alert carries precise timestamps, affected resources, GPU allocation details, PID, execution path, and behavioral context.
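A simplified sketch of what alert sequencing looks like mechanically: order a GPU's alerts by timestamp to recover the escalation from Medium to Critical. The record schema and events below are illustrative, not Stealthium's actual format:

```python
from datetime import datetime

# Illustrative alert records; the schema is hypothetical
alerts = [
    {"ts": "2026-02-03T14:02:11", "gpu": "GPU-0", "severity": "Medium",
     "event": "unauthorized context switching"},
    {"ts": "2026-02-03T14:03:02", "gpu": "GPU-0", "severity": "Critical",
     "event": "interrupt handler manipulation"},
    {"ts": "2026-02-03T14:02:45", "gpu": "GPU-0", "severity": "High",
     "event": "GPU memory buffer overflow"},
]

def kill_chain(alerts, gpu):
    """Order one GPU's alerts by timestamp to reconstruct the attack narrative."""
    chain = [a for a in alerts if a["gpu"] == gpu]
    chain.sort(key=lambda a: datetime.fromisoformat(a["ts"]))
    return [(a["ts"], a["severity"], a["event"]) for a in chain]

for ts, sev, event in kill_chain(alerts, "GPU-0"):
    print(f"{ts}  [{sev:8}] {event}")
```

The real work, of course, is producing those alerts from GPU telemetry in the first place; the sequencing step itself is straightforward once the signal exists.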
Asset Inventory and Posture Management: CTEM for AI Factories
In Ahmed's engagements, the answer to "what do I own and is it monitored?" is consistently incomplete. A typical fleet snapshot from a recent deployment:
- GPUs running unsupported CUDA versions with known vulnerabilities
- GPUs on driver versions predating critical security patches
- GPUs allocated but idle for more than seven days
- Multiple misconfigured nodes with mixed frameworks and missing nvidia-smi
None of these were visible to the existing security stack. Asset coverage in one current engagement: 394 GPUs at 98 percent coverage, 30 nodes at 65 percent coverage, and an active visibility and security gap the customer did not know existed.
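Posture checks of this kind reduce to simple rules over an inventory feed. A hedged sketch, with an assumed driver patch floor and illustrative records (neither reflects a real baseline or fleet):

```python
from datetime import date

MIN_DRIVER = (535, 216)   # assumed patch floor, not an official NVIDIA baseline
MAX_IDLE_DAYS = 7
TODAY = date(2026, 3, 1)  # fixed for the example

fleet = [  # illustrative inventory records
    {"gpu": "GPU-0", "driver": (535, 230), "last_active": date(2026, 2, 28)},
    {"gpu": "GPU-1", "driver": (525, 105), "last_active": date(2026, 2, 27)},
    {"gpu": "GPU-2", "driver": (535, 230), "last_active": date(2026, 2, 10)},
]

findings = []
for g in fleet:
    if g["driver"] < MIN_DRIVER:
        findings.append((g["gpu"], "driver predates security patch floor"))
    if (TODAY - g["last_active"]).days > MAX_IDLE_DAYS:
        findings.append((g["gpu"], "allocated but idle > 7 days"))

for gpu, issue in findings:
    print(gpu, "->", issue)
```

The point is not the rules, which are trivial, but the feed: without a continuously updated GPU-layer inventory, there is nothing to run them against.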
Workload Analytics and Inference Economics: Security Meets Tokens Per Watt
What Ahmed finds in the field is that economic waste and security exposure are often the same signal. Workloads that never terminate are either stalled legitimate jobs or stealth workloads. Sudden churn rate drops signal resource hijacking. Unusual spikes in persistence ratio indicate hidden crypto-miners or data-hoarding tasks consuming GPU capacity without business justification.
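One way to operationalize the persistence-ratio signal described above: compute the fraction of jobs outliving a runtime horizon and flag drift against a baseline. The horizon, threshold, and job runtimes here are all illustrative:

```python
def persistence_ratio(job_runtimes_hours, horizon_hours=24):
    """Fraction of jobs still running past the horizon; a spike suggests
    stalled jobs or stealth workloads squatting on GPU capacity."""
    long_lived = sum(1 for r in job_runtimes_hours if r > horizon_hours)
    return long_lived / len(job_runtimes_hours)

baseline = persistence_ratio([2, 5, 3, 30, 4, 6, 1, 2])    # typical day
today = persistence_ratio([40, 35, 3, 50, 48, 6, 41, 29])  # suspicious day

# Flag when the ratio drifts well above baseline (thresholds are illustrative)
if today > 2 * baseline and today > 0.3:
    print(f"persistence ratio {today:.2f} vs baseline {baseline:.2f}: investigate")
```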
GPU Usage and Multi-Tenancy Risk
The per-GPU telemetry for a single A100 in a current engagement: 5 days, 18 hours of continuous runtime, 18 critical alerts in the prior 24 hours, confirmed cross-tenant usage, 93 percent utilization baseline. None of this was surfaced by the existing security or infrastructure tooling before we deployed. The CPU-side stack saw a busy server running containers. We saw an A100 operating well outside safe parameters with active exposure across tenant boundaries.
Risk Mitigation in Numbers
- Abuse of shared GPU buffers: several incidents caught in a recent engagement window, each consuming significant GPU hours, with potential breach costs avoided.
VI. Closing the Gap
What the conferences missed (Sherif):
RSA and GTC were not parallel events that happened to share a week on the calendar. They were two halves of the same structural problem. RSA builds governance frameworks for AI applications that run on infrastructure no one is monitoring. GTC scales that infrastructure to a point where the blind spot becomes an existential liability.
The organizations that win the next phase of AI will not just build the most powerful models or the largest clusters. They will be the ones who can look at a wall of GPUs, across data centers, regions, and sovereign boundaries, and answer a deceptively simple question: what are they doing right now, and should they be doing it?
What the POCs tell us (Ahmed):
The teams we work with are not negligent. They are applying the right mental model to the wrong layer. The tooling they have is good tooling for the CPU world. The problem is that the CPU world is no longer where the risk lives.
Four shifts that close the gap:
- Secure the accelerator runtime itself, not just the hosts around it. GPU memory regions, inter-GPU links via NVLink and NVSwitch, driver-level behavior, CUDA kernel execution patterns, tensor core utilization, framework-level telemetry, and workload lifecycle from allocation through execution to teardown.
- Map tenants, models, and workloads to specific GPUs. Which tenant's workload is running on which GPU at any given moment, which models and inference pipelines are running, and when behavior drifts from baseline.
- Build narrative reconstruction for GPU-based incidents. Alert sequences that trace attacks through GPU memory, kernels, and workflows. Forensic timelines showing who did what, when, at the GPU layer.
- Move toward hard isolation for multi-tenant environments. MIG-based partitioning for hardware-level separation, dedicated clusters, VPCs, VxLAN, VRFs, KVM virtualization, InfiniBand P-KEYs, and NVLink partitioning. Soft isolation is inadequate, and most environments still rely on it.
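For the MIG piece specifically, hardware-level partitioning is driven from nvidia-smi. The following is a sketch of the workflow on an A100, not a runnable recipe for every environment: it requires root and a GPU reset, and profile IDs vary by GPU model and driver, so treat the ID used below as illustrative.

```shell
# Enable MIG mode on GPU 0 (takes effect after a GPU reset; run as root)
nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this card supports (IDs vary by model/driver)
nvidia-smi mig -lgip

# Example: carve GPU 0 into two 3g.20gb instances and create compute
# instances on them (-C); profile ID 9 is assumed here for 3g.20gb on A100
nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify the resulting MIG devices
nvidia-smi -L
```

Each resulting MIG device has its own memory, cache, and compute slices, which is what makes this hard isolation rather than the scheduler-level soft isolation critiqued above.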
VII. What This Means for You
Founders and CTOs Building AI Infrastructure
- Competitive advantage depends on inference economics, tokens per watt, cost per query, and latency. Undetected abuse and misconfiguration are quietly destroying those margins.
- Investors and customers increasingly demand proof of AI governance and security, not pitch decks about responsible AI.
- The GPU fleet being scaled today will either become a strategic asset or a liability, depending on whether it can be seen and secured.
Heads of Infrastructure
- The observability stack inherited from the CPU era was built for servers and containers. SRE and platform teams are operating blind at the layer where most operational risk now lives.
- Capacity planning, cost optimization, and uptime SLAs all depend on GPU-level visibility that most organizations currently lack.
Security Leaders (CISOs and VP Security)
- EDR, CNAPP, and SIEM investments stop at the kernel boundary, exactly where AI workloads start.
- Compliance frameworks, including SOC 2, ISO 27001, GDPR, and emerging AI-specific regulations, increasingly require demonstrable control over AI systems in production.
- The next breach that matters will not come from a phished credential. It will come from GPU-level exploitation that current tools cannot detect.
VIII. The Closing Gap
The speed of AI workload growth and GPU scale-out has outpaced our ability to observe and secure the accelerated runtime. RSA was debating application-level governance. GTC was announcing the next generation of infrastructure. In between, in the actual environments Ahmed and his team are working in right now, the compute fabric running those agents and producing those tokens is operating without the monitoring it needs.
Bridging that gap is not a future problem. It is a current one. The question is whether organizations treat GPU observability and security as a late-stage compliance chore, or as a design constraint built into how they scale AI from the beginning.
In a world where AI factories are measured in tokens per watt, every unobserved GPU is a liability you are paying for but cannot control.
