

MiniMax Launches M3, a 1M-Token Model With Sparse Attention

Rebeca Moen   Jun 02, 2026 19:58


MiniMax has unveiled its latest AI model, M3, which offers a 1 million-token context window and native multimodality. Together AI, MiniMax’s preferred cloud partner, has optimized the model for large-scale deployment, with performance improvements that cut costs and reduce latency. The model is set to be released with open weights in the coming days, and developers will have direct access through Together AI’s endpoints.

M3 introduces MiniMax Sparse Attention (MSA), a new attention architecture designed to address the bottlenecks of long-context processing. MSA is reported to deliver a 9x speedup in the prefill stage and a 15x speedup in decoding. MiniMax-M3 is engineered for efficient real-world use, from processing long documents and codebases to handling multimodal tasks that include images and video.

What Makes MiniMax-M3 Different?

The M3 model integrates MSA, which caps the number of tokens each query attends to. With a fixed cap, the attention cost of long-context inference over N tokens falls from quadratic (N²) to linear in N. This represents a significant leap over its predecessor, MiniMax M2.7. The model also includes enhanced multimodal capabilities that combine vision and text data.
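To make the quadratic-versus-linear claim concrete, the sketch below counts how many query-key pairs get scored under dense causal attention and under a per-query cap. It is only an illustration: the function names and any cap value are hypothetical, the article does not describe how MSA selects tokens, and the cost of that selection step is not counted here.

```python
def dense_causal_pairs(n: int) -> int:
    """Query-key pairs scored by dense causal attention.

    Query i (1-based) attends to all i tokens up to and including itself,
    so the total is 1 + 2 + ... + n, which grows like n^2 / 2.
    """
    return n * (n + 1) // 2


def capped_causal_pairs(n: int, cap: int) -> int:
    """Query-key pairs scored when each query attends to at most `cap` tokens.

    `cap` is a hypothetical budget, not a published MSA setting.
    The first `cap` queries still see their full (short) prefix; every
    later query scores exactly `cap` pairs, so the total grows like n * cap.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    if n <= cap:
        return dense_causal_pairs(n)
    return dense_causal_pairs(cap) + (n - cap) * cap


if __name__ == "__main__":
    cap = 2048  # assumed value for illustration only
    for n in (10_000, 100_000, 1_000_000):
        dense = dense_causal_pairs(n)
        sparse = capped_causal_pairs(n, cap)
        print(f"N={n:>9,}  dense={dense:,}  capped={sparse:,}  ratio={dense / sparse:.1f}x")
```

With the cap held fixed, doubling N roughly doubles the capped count but roughly quadruples the dense count, which is why the gap widens as the context grows.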

However, supporting a 1M-token context comes with its own challenges. Together AI tackled these through several optimizations, such as re-architecting the kernel to reduce memory overhead and integrating sparse attention with paged attention for more efficient KV cache management. The result is a scalable deployment that improves throughput by more than 80% under typical workloads.
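The general idea behind combining paged attention with sparse attention can be sketched as follows. In a paged KV cache, each sequence keeps a block table that maps logical token blocks to physical blocks in a shared pool; if sparse attention selects only some token positions, only the physical blocks holding those positions need to be read. This is a simplified illustration, not Together AI’s implementation: `BlockPool`, `Sequence`, and `blocks_for` are hypothetical names, and real systems manage GPU memory rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class BlockPool:
    """A shared pool of fixed-size physical KV blocks (hypothetical)."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self) -> None:
        # Reverse order so that pop() hands out block ids 0, 1, 2, ...
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()


@dataclass
class Sequence:
    """One request's view of the cache: a block table plus a token count."""
    pool: BlockPool
    block_size: int
    block_table: list = field(default_factory=list)  # logical block -> physical block id
    num_tokens: int = 0

    def append_token(self) -> None:
        # A new physical block is claimed only when the current one is full.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def blocks_for(self, positions: list) -> list:
        """Physical blocks that hold the token positions chosen by sparse attention."""
        for p in positions:
            if not 0 <= p < self.num_tokens:
                raise IndexError(f"position {p} is outside the cached range")
        return sorted({self.block_table[p // self.block_size] for p in positions})
```

For example, if two sequences share a pool and grow in alternation, their blocks interleave in physical memory, and a query that attends to only a few positions touches only the blocks the block table points to for those positions.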

Performance in Context

The M3 launch aligns with a broader trend in AI inference optimization. Innovations like NVIDIA’s Guess-Verify-Refine algorithm (May 2026) and Sakana AI’s TwELL sparse kernels (May 2026) have demonstrated how sparse attention can dramatically speed up large-model inference. KV-block-major architectures, like the one used in M3, are increasingly seen as essential for scaling long-context models efficiently.

For instance, Google’s TurboQuant (March 2026) compresses KV caches to 3 bits, achieving up to 8x computational speedups. Similarly, NVIDIA’s Blackwell GPUs are now optimized for sparse attention workloads, offering nearly 2x faster decoding. These advancements are helping models like MiniMax-M3 meet the growing demand for efficient long-context AI applications without compromising accuracy.
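A back-of-the-envelope calculation shows why KV cache precision matters at this context length. The sketch below estimates raw cache size from token count, layer count, KV heads, head dimension, and bits per stored value. The model dimensions are placeholders chosen for illustration, not M3’s published configuration, and the estimate ignores any per-block scale factors or other metadata a real quantization scheme would store. It covers memory footprint only, not the compute speedups mentioned above.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bits_per_value: float) -> float:
    """Raw KV cache size in bytes for one sequence.

    Each token stores one key vector and one value vector per layer
    and per KV head, hence the leading factor of 2.
    """
    values = 2 * tokens * layers * kv_heads * head_dim
    return values * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical dimensions, NOT the real MiniMax-M3 configuration.
    dims = dict(tokens=1_000_000, layers=60, kv_heads=8, head_dim=128)

    fp16 = kv_cache_bytes(**dims, bits_per_value=16)
    q3 = kv_cache_bytes(**dims, bits_per_value=3)

    gib = 1024 ** 3
    print(f"16-bit cache: {fp16 / gib:.1f} GiB")
    print(f" 3-bit cache: {q3 / gib:.1f} GiB")
    print(f"reduction:    {fp16 / q3:.2f}x")
```

Whatever dimensions are plugged in, moving from 16-bit to 3-bit values shrinks the raw cache by a factor of 16/3, a little over 5x, before metadata overhead.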

Why It Matters

Long-context models are becoming critical as enterprises demand AI solutions capable of handling complex, context-heavy tasks. From summarizing legal documents to processing multimodal data for autonomous systems, the ability to efficiently manage extended context windows is a competitive differentiator.

MiniMax-M3’s combination of sparse attention, multimodality, and efficient inference positions it as a frontrunner in this space. For developers, its open-weight availability through Together AI could lower barriers to entry, while enterprise users stand to benefit from reduced compute costs and improved performance.

Looking Ahead

The open weights for MiniMax-M3 are expected to be available in the next few days, marking an important milestone for the AI community. Together AI’s infrastructure optimizations will play a key role in enabling widespread adoption of the model, particularly in applications requiring high concurrency and long-context workloads.

As the AI landscape evolves, the focus is shifting from merely scaling model size to optimizing inference efficiency. MiniMax-M3 exemplifies this paradigm shift, setting a new benchmark for what’s possible in long-context and multimodal AI.

