DeepSeek-V4 Tackles Million-Token Context on NVIDIA HGX B200
DeepSeek-V4, launched by Together AI, is reshaping how AI handles ultra-long context windows by introducing a 1-million-token capacity. Rather than treating this purely as a model-architecture breakthrough, V4 frames ultra-long context as a systems-level challenge centered on efficient inference and memory management. The model runs on NVIDIA HGX B200 hardware, leveraging techniques such as compressed Key-Value (KV) layouts, prefix caching, and hybrid attention to address the bottlenecks of long-sequence processing.
Architectural Shifts: Compressing the Token Axis
At the core of DeepSeek-V4's advancements is a hybrid attention mechanism that compresses the token axis before KV storage. Key techniques include Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and Sliding Window Attention (SWA). This approach reduces the size of the KV cache—a critical factor for managing long-context workloads.
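The sketch below is not DeepSeek-V4's actual CSA/HCA/SWA math, which is not described in detail here; it only illustrates the general pattern of pairing an exact recent-window path with a compressed global path so that the stored KV grows sublinearly with context length. The shapes, the mean-pooling used for compression, and the window and block sizes are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(q, k, v, window=128, block=16):
    """Illustrative hybrid attention for a single decode step.

    q:      (d,)   query for the current token
    k, v:   (T, d) full key/value history
    window: number of recent tokens attended to exactly (SWA-like path)
    block:  pooling factor for the compressed global path (HCA-like path)
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)

    # Exact attention over the recent window (what an SWA cache would hold).
    k_win, v_win = k[-window:], v[-window:]

    # Coarse global view: mean-pool keys/values into blocks, so this path's
    # "cache" holds T/block entries instead of T. (Overlap with the recent
    # window is ignored here for simplicity.)
    T = k.shape[0]
    n_blocks = T // block
    k_glob = k[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    v_glob = v[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)

    # Score against both paths jointly, then mix values with one softmax.
    keys = np.concatenate([k_glob, k_win], axis=0)
    vals = np.concatenate([v_glob, v_win], axis=0)
    weights = softmax(keys @ q * scale)
    return weights @ vals

# Toy usage: 4096 tokens of history, 64-dim heads.
rng = np.random.default_rng(0)
T, d = 4096, 64
k, v, q = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=d)
out = hybrid_attention(q, k, v)
print(out.shape)  # (64,)
```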
For context, a traditional 70-billion-parameter model in BF16 precision needs on the order of hundreds of kilobytes of KV cache per token, which becomes unmanageable at million-token lengths. V4's compression techniques shrink this footprint substantially, making 1M-token contexts feasible without overwhelming memory or bandwidth: in testing, the compressed cache allowed NVIDIA HGX B200 hardware to hold up to 3.7 million tokens, well beyond prior limits.
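To make that footprint concrete, here is a back-of-the-envelope calculation. The layer and head counts are illustrative, Llama-70B-class numbers (80 layers, 8 KV heads under grouped-query attention, head dimension 128), not anything published about V4:

```python
# Back-of-the-envelope KV cache size for an uncompressed 70B-class model.
# The layer/head counts below are illustrative (Llama-70B-style), not V4's.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2          # BF16
tokens = 1_000_000          # 1M-token context

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total = per_token * tokens

print(f"{per_token / 1024:.0f} KiB per token")       # ~320 KiB
print(f"{total / 1024**3:.0f} GiB for 1M tokens")    # ~305 GiB per request
```

At roughly 305 GiB of cache for a single 1M-token request, an uncompressed layout would exceed the HBM of a single B200 before weights and activations are even counted, which is why compressing the token axis before storage is the load-bearing idea.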
Serving Challenges: Multiple Cache Layouts
DeepSeek-V4’s design requires the inference engine to manage three distinct cache types (CSA, HCA, and SWA), each with its own size, read pattern, and lifetime, which calls for sophisticated memory management. CSA provides fine-grained sparse access to compressed regions, HCA enables a coarse global read over the entire context, and SWA preserves exact recent context but incurs higher storage costs for long sequences.
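A rough sketch of how a serving engine might describe these caches internally is below; the per-token sizes and lifetimes are invented placeholders that only capture the qualitative differences named above, not DeepSeek-V4's actual layout:

```python
from dataclasses import dataclass
from enum import Enum, auto

class ReadPattern(Enum):
    SPARSE_FINE = auto()    # scattered reads into compressed regions
    DENSE_GLOBAL = auto()   # one coarse pass over the whole context
    DENSE_RECENT = auto()   # exact reads over the most recent window

@dataclass
class CacheSpec:
    name: str
    bytes_per_token: int        # footprint after any compression (assumed values)
    read_pattern: ReadPattern
    lifetime: str               # how long entries stay useful

# Purely illustrative numbers: the point is that the three caches differ in
# size, access pattern, and lifetime, so a single eviction policy won't fit all.
CACHES = [
    CacheSpec("CSA", bytes_per_token=64,  read_pattern=ReadPattern.SPARSE_FINE,
              lifetime="whole request"),
    CacheSpec("HCA", bytes_per_token=16,  read_pattern=ReadPattern.DENSE_GLOBAL,
              lifetime="whole request"),
    CacheSpec("SWA", bytes_per_token=512, read_pattern=ReadPattern.DENSE_RECENT,
              lifetime="recent window (or full history if kept for prefix reuse)"),
]

for c in CACHES:
    print(f"{c.name}: {c.bytes_per_token} B/token, {c.read_pattern.name}, {c.lifetime}")
```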
The serving engine must juggle these cache objects, balancing eviction policies and batching strategies to maintain decode throughput. Together AI’s early implementation opts for storing the full SWA cache to simplify prefix reuse, though this increases memory pressure. Future iterations may explore recompute-on-hit strategies to further optimize efficiency.
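The store-everything versus recompute-on-hit trade-off can be illustrated with a toy prefix cache. The hashing scheme, block size, and policy flag here are hypothetical and are not Together AI's implementation:

```python
import hashlib

def prefix_key(token_ids: list[int], block: int = 256) -> str:
    """Hash whole blocks of the prompt so shared prefixes map to the same key."""
    usable = len(token_ids) - len(token_ids) % block
    return hashlib.sha256(str(token_ids[:usable]).encode("utf-8")).hexdigest()

def handle_prompt(token_ids, cache, store_full_swa=True):
    """Toy prefix-reuse policy.

    store_full_swa=True mirrors the 'store everything' choice: a hit skips
    prefill entirely at the cost of memory. With it off, a hit still has to
    recompute the SWA portion (recompute-on-hit), trading FLOPs for HBM.
    """
    key = prefix_key(token_ids)
    if key in cache:
        return "reuse cached prefix" if store_full_swa else "recompute SWA, reuse compressed caches"
    cache[key] = {"tokens": len(token_ids)}   # placeholder for real KV blocks
    return "full prefill"

cache = {}
prompt = list(range(1024))
print(handle_prompt(prompt, cache))                        # full prefill
print(handle_prompt(prompt, cache))                        # reuse cached prefix
print(handle_prompt(prompt, cache, store_full_swa=False))  # recompute SWA, reuse compressed caches
```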
Workload-Specific Gains
DeepSeek-V4's benefits show up most strongly in long-context, decode-heavy workloads, such as coding agents and research models that accumulate state over extended tasks. These use cases benefit directly from the smaller KV cache, which improves throughput and concurrency. Short-context applications like chatbots see fewer immediate gains; they tend to expose latency and kernel-maturity issues rather than gain much from cache compression.
For workloads like reinforcement learning (RL) rollouts, where cost per trajectory is the key metric, V4’s architecture could redefine economic efficiency. Developers are advised to benchmark specific workloads before transitioning to V4, as workload shape heavily influences performance outcomes.
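A workload-shaped benchmark does not need to be elaborate. The sketch below times decode throughput and a per-task cost over a sample of your own prompts; generate() is a placeholder for whatever serving client you use, and the price constant is an assumed figure, not a published rate:

```python
import time

PRICE_PER_1K_OUTPUT_TOKENS = 0.005   # assumed placeholder price, not a real quote

def benchmark(prompts, generate, max_new_tokens=512):
    """Measure decode throughput and cost per task for a workload sample.

    `generate(prompt, max_new_tokens)` stands in for your serving client;
    it should return the number of tokens generated.
    """
    start = time.perf_counter()
    total_tokens = 0
    for p in prompts:
        total_tokens += generate(p, max_new_tokens)
    elapsed = time.perf_counter() - start

    throughput = total_tokens / elapsed                      # output tokens/sec
    cost_per_task = (total_tokens / len(prompts)) / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    return {"tokens_per_sec": throughput, "cost_per_task_usd": cost_per_task}

# Example with a dummy generator so the sketch runs standalone.
def fake_generate(prompt, max_new_tokens):
    time.sleep(0.01)          # pretend to decode
    return max_new_tokens

print(benchmark(["task-1", "task-2", "task-3"], fake_generate))
```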
NVIDIA HGX B200: The Hardware Backbone
NVIDIA HGX B200 serves as the launch platform for DeepSeek-V4, providing native support for the model's compressed KV layouts and its MXFP4 precision format. The hardware is well suited to the memory-intensive demands of long-context decode, letting multiple concurrent requests fit in HBM and keep batches in an efficient serving regime. The partnership between Together AI and NVIDIA also highlights co-design efforts to match the serving software to the hardware, improving cost-per-token efficiency.
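For readers unfamiliar with MXFP4: it is the OCP microscaling format in which blocks of 32 values share one power-of-two scale and each element is a 4-bit FP4 (E2M1) value. The snippet below is a simplified, reference-style quantizer meant to show why the format roughly quarters memory and bandwidth versus BF16; it is not how the B200 tensor cores implement it.

```python
import numpy as np

# Representable magnitudes of FP4 (E2M1), the element type used by MXFP4.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(block: np.ndarray):
    """Simplified MXFP4-style quantization of one 32-element block.

    A shared power-of-two scale is chosen so the largest magnitude maps near
    FP4's max value (6.0); each element is then rounded to the nearest FP4
    level. This mirrors the format's idea, not NVIDIA's hardware rounding.
    """
    assert block.size == 32
    max_abs = np.max(np.abs(block))
    scale = 2.0 ** np.ceil(np.log2(max_abs / 6.0)) if max_abs > 0 else 1.0
    scaled = block / scale
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_LEVELS[None, :]), axis=1)
    quantized = np.sign(scaled) * FP4_LEVELS[idx]
    return quantized * scale, scale          # dequantized values + shared scale

rng = np.random.default_rng(0)
x = rng.normal(size=32).astype(np.float32)
xq, s = mxfp4_quantize(x)
print("max abs error:", np.max(np.abs(x - xq)))
# 32 x 4-bit elements + one 8-bit scale ≈ 4.25 bits/value, vs 16 for BF16.
```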
Next Steps: Measurement and Optimization
While DeepSeek-V4 lays the groundwork for million-token contexts, its full potential depends on further optimization. Together AI is focusing on refining cache policies, kernel maturity, and endpoint configurations for different traffic profiles. Developers should evaluate their workloads across metrics like cache hit rate, decode throughput, and cost per task before migrating to V4.
This marks a significant step forward in AI serving systems, turning the promise of ultra-long context windows into a practical reality—provided the inference stack is up to the task.