Deploying large language models in production has become primarily an engineering challenge. The models work – the hard part lies in serving them at scale without compromising the security of proprietary model weights or sensitive user data.
In practice, large-scale LLM deployments must balance three competing requirements: performance, scalability, and confidentiality.
Consider the following scenario. You have fine-tuned MedGemma 27B for a clinical decision-support system and validated its outputs. Now you need to move it to production.
MedGemma 27B in FP16 requires ~54 GB of VRAM for weights alone. On a single NVIDIA H100 GPU (80GB), this supports only a limited number of concurrent requests with a 32K token context window, leaving minimal headroom for KV cache growth or sustained batching – well below realistic production threshold.
To serve real traffic, you need more than just raw GPU power. You need
In other words, you need a high-performance inference engine running on cloud-scale GPU infrastructure.
However, once execution moves to infrastructure you do not control, scalability introduces a trust dependency.
Your model weights represent months of training on confidential medical data, and your core IP now resides on third-party hardware. Each inference request may contain sensitive inputs: patient records, diagnostic queries, test results. Encryption protects data at rest and in transit, but during inference both weights and inputs must be decrypted in GPU memory – precisely at the moment they are being executed. This creates a data-in-use exposure that TLS and disk encryption cannot address.
This is the Inference Trust Gap:
Traditional cloud deployments reliably provide the first two – but not all three simultaneously. The third is typically implemented within the provider’s security boundary rather than through independently verifiable, owner-controlled execution.
Closing the Inference Trust Gap requires solving all three dimensions simultaneously – performance, scalability and confidentiality – without one degrading the others. To achieve this, we need a high-performance engine that can scale across cloud infrastructure while remaining completely isolated from the infrastructure provider.
Several production-grade inference engines exist today: HuggingFace TGI, NVIDIA TensorRT-LLM, and vLLM among them. TGI offers a solid out-of-the-box experience but lacks the memory efficiency needed for large-context workloads. TensorRT-LLM delivers excellent raw speed through NVIDIA-specific kernel optimizations, but requires model compilation and tightly couples the deployment to a specific hardware stack.

vLLM hits the right balance for confidential inference. Its PagedAttention mechanism eliminates memory fragmentation, multiplying the number of concurrent requests a single GPU can handle. Continuous batching ensures zero idle cycles between requests. The architecture is hardware-flexible and containerizes cleanly – a critical property when the runtime environment must remain strictly isolated from the underlying infrastructure. In practice, this turns the typical two-concurrent-request bottleneck into production-level throughput.
By choosing vLLM, we satisfy the performance requirements of large-scale inference. Scalability at the infrastructure layer is provided by cloud GPU fleets. This leaves the remaining dimension: confidentiality.
Confidentiality requires more than access policies or network isolation; it requires attested hardware-based guarantees that can be independently verified. Trusted Execution Environments (TEEs) provide this foundation by isolating workloads at the processor level.
Modern CPU-level TEE technologies such as Intel TDX and AMD SEV-SNP allow workloads to run inside a hardware-isolated execution environment where memory is encrypted by the processor itself. The key distinction from traditional cloud security is that the confidentiality boundary sits below the hypervisor.
This means the cloud provider's administrators, the host operating system, and even a compromised hypervisor cannot read or tamper with the plaintext memory inside the execution environment during runtime. Model weights and user prompts remain encrypted in memory and are decrypted only within TEE during computation. For GPU-intensive workloads like LLM inference, this isolation must extend beyond the CPU to include GPU memory and the encrypted data path between CPU and GPU. This requires compatible CPU and GPU support operating in confidential mode. See our GPU+CPU TEE requirements guide for supported configurations.
However, hardware isolation alone does not define who is authorized to execute workloads or how execution policies are enforced.
A TEE isolates data at the hardware level and enables cryptographic verification of the execution environment. However, hardware isolation alone does not address execution governance and operational control. Critical questions remain:
These concerns extend beyond hardware isolation into orchestration and governance – domains that TEE hardware was never designed to control.
In cloud-based TEE deployments, memory is protected by hardware, but execution authority remains tied to infrastructure control. This is where an additional execution layer becomes necessary – one that governs execution through verifiable authorization and policy enforcement.
Super Protocol implements this execution layer on top of TEE-enabled infrastructure. The cloud provider continues to supply GPU capacity and operate hardware. What changes is how execution is authorized, verified, and enforced.
Before execution begins, the environment is cryptographically verified across hardware, runtime, and policy layers. Only attested TEE infrastructure running the intended workload image under approved execution policies is permitted to proceed. Decryption keys are released exclusively to such verified environments, ensuring that model weights and user prompts are exposed only inside a cryptographically validated runtime.
Execution authorization is automatically gated by cryptographic verification rather than discretionary administrative approval. Crucially, execution authorization is decoupled from infrastructure ownership: hardware providers supply compute resources, but they cannot influence when, how, or under which conditions a workload executes. Once defined, execution policies cannot be overridden by any party – including the cloud provider or Super Protocol itself.
Model owners therefore retain verifiable control over when and where their models run without sacrificing performance or scalability. At the same time, execution governance is separated from data visibility: model owners define execution conditions, while user inputs remain confined to the verified execution environment and are not accessible to model owners, cloud providers, their administrators, or Super Protocol.
Super adds a verifiable execution layer on top of TEE isolation, enabling confidential inference that remains policy-bound even on third-party infrastructure. The developer experience remains familiar – containers, APIs, cloud GPUs – but the trust model is fundamentally different.
At the core of the stack, vLLM provides the performance engine. Super Protocol ensures that this performance operates inside a verifiable confidential runtime.
PagedAttention: Solving the KV Cache Problem
Traditional inference systems allocate KV cache memory (the keys and values stored for attention computation) in contiguous blocks, reserving space for the maximum context length regardless of actual prompt size. This can waste 60–80% of memory due to internal fragmentation – a critical inefficiency when every gigabyte of VRAM matters.
vLLM implements PagedAttention, an algorithm inspired by virtual memory in operating systems. Instead of reserving one large contiguous block per request, memory is divided into small fixed-size "pages" that can be stored in non-contiguous physical locations and allocated on demand. This results in near-zero fragmentation, significantly improving effective memory utilization.
For MedGemma 27B (~54GB weights in FP16), this improvement resulted in an order-of-magnitude increase in concurrent requests compared to naive allocation strategies. However, improved allocation efficiency does not increase physical VRAM capacity.
A standard cloud H100 (80GB) leaves approximately 26GB of effective headroom after loading model weights, while H100 NVL (94GB) leaves roughly 40GB. Under sustained 32K-context confidential batching at production-level concurrency, both configurations remained memory-constrained. Additional VRAM headroom was therefore required for stable high-concurrency confidential inference.
In our MedGemma 27B confidential inference deployment (see Phase 4 of the case study), the required headroom was achieved on H200 GPUs (141GB) operating in TEE mode.
In a separate MedGemma 27B deployment using vLLM in TEE mode, performance was validated on a single NVIDIA Blackwell B200 (192GB) GPU. The B200 GPU was hosted inside an HGX 8× B200 system, without aggregating multiple GPUs for a single inference stream.
Using vLLM, MedGemma-27B reached 64.2 tokens/sec at 128K context, confirming production-grade throughput for large-context inference on Blackwell hardware in TEE mode.

Static batching – the traditional approach – groups incoming requests into a fixed batch and waits for the entire batch to finish generation before accepting new ones. Since different prompts produce outputs of different lengths, shorter responses leave GPU cycles idle while the batch completes.
vLLM uses continuous batching (iteration-level scheduling) instead. As soon as one request in the batch finishes generation, a new request from the queue immediately takes its slot. This keeps GPU utilization consistently high and makes throughput linear and predictable.
Multi-GPU and Distributed Inference
For models or context windows that exceed the capacity of a single GPU, vLLM supports tensor parallelism (splitting model weights across GPUs) and pipeline parallelism (distributing layers across devices). Combined with Ray integration for cluster management, this allows teams to scale inference horizontally when a single card is not enough – while maintaining a single API access point.
In Super Protocol deployments, every parallel inference node is launched inside a verified TEE-enabled environment. Super orchestrates attestation, workload authorization, and policy enforcement across all nodes, ensuring that horizontal scaling preserves execution integrity and does not expand the trusted computing base.
vLLM handles what happens inside the GPU, but inference does not end there. The results need to reach your application, and the connection itself must maintain the same confidentiality guarantees as TEE. This is where Super Protocol confidential tunnels come in.
Confidential tunnels extend TEE guarantees beyond the compute boundary, establishing an encrypted and attested communication path between the inference runtime and external applications. Each tunnel consists of two components: a tunnel client, which sits alongside vLLM inside the TEE and has no public IP, and a tunnel server, which provides the public HTTPS endpoint your application connects to. Both components run inside their own TEEs, so the data remains encrypted and verified at every hop – from model output to API response. Since the tunnel client does not have a public address, the actual inference server is completely hidden from external threats: network-level attacks, such as DDoS, are absorbed by the tunnel server rather than exposing the inference workload itself. Tunnel servers and clients can be deployed with different cloud providers and scaled independently of each other via DNS, with round-robin load balancing distributing traffic across tunnel clients, providing fault tolerance and decentralization without changing the application code.
Together, these components form a unified confidential inference runtime: vLLM provides performance mechanics, while Super Protocol enforces hardware-level isolation, execution governance, and secure networking across the entire deployment surface.
The Inference Trust Gap has long forced teams to balance performance and cloud scalability against cryptographic confidentiality.
vLLM has become the default choice for production LLM inference for a reason: it delivers production-grade performance at scale. But this does not solve the deployment challenge for teams working with sensitive data and proprietary weights.
Super Protocol turns cloud GPU fleet into verifiable confidential AI cloud. Execution integrity is enforced through TEE-based isolation and protocol-level verification.
Performance (vLLM), Scalability (cloud GPU fleets), and Confidentiality (TEE-based isolation with protocol-level verification) are unified within a single architecture rather than treated as trade-offs.
This architecture applies wherever sensitive models and proprietary data must operate at scale.
Whether you already deploy vLLM in production or are evaluating your inference stack, explore how to run it with verifiable confidential execution on Super Protocol.
Start with the deployment guide and review the resources below: