MiniMax M3 makes a million-token context cheap — and ships the weights

// METECH angle

On 1 June MiniMax released M3, an open-weight model that pairs a claimed frontier-level coding score with a million-token context window and native multimodality. Treat the benchmark claim with the usual scepticism until weights and an independent harness confirm it. The part that survives the marketing is the attention rewrite underneath: a sparse design that cuts the per-token cost of very long context by roughly an order of magnitude, in a model you can host yourself.

Source (The Decoder): MiniMax M3 challenges proprietary leaders
Source (VentureBeat): MiniMax teases M3's sparse attention and 15.6x speed boost
Source (MarkTechPost): MiniMax M3 with MSA architecture and 1M-token context
Source (Hugging Face): Decoding M3's sparse attention from a single diagram

01 Thesis

The interesting question with a model like M3 is not whether it edges out a proprietary leader on one benchmark this week. It is whether the architecture changes the economics of the thing you actually want to build. On MiniMax's numbers, it does: a long-context, multimodal agent stops being a budget line you flinch at, because the model both runs cheaper at length and ships under open weights you can deploy on your own infrastructure.

02 What shipped

MiniMax, the Shanghai lab, released M3 on 1 June. It is positioned as the first open-weight model to combine frontier-level coding, a one-million-token context window, and native multimodality in a single architecture, trained on interleaved image and text data from step zero rather than bolted together afterwards. MiniMax reports 59.0% on SWE-Bench Pro, which it claims edges past GPT-5.5 and Gemini 3.1 Pro, and 70.06% on OSWorld-Verified for computer use. Weights and a technical report were slated to follow within about ten days of launch.

The number to hold loosely is the benchmark; the number to hold onto is the speed claim. M3's headline coding score is vendor-reported, and several outlets flagged it as unverified pending the public weights and an independent harness. The architecture, by contrast, is concrete: MiniMax says its new attention design delivers roughly 15.6x faster decoding and 9.7x faster prefill at a one-million-token context than standard attention, cutting per-token compute at that length to around a twentieth of the previous generation's.
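To see what those multipliers would mean for a single request, here is a back-of-envelope sketch in Python. Only the 15.6x and 9.7x figures come from MiniMax's claim; the baseline timings and the `RequestLatency` and `project_latency` names are hypothetical placeholders you would replace with your own dense-attention measurements.

```python
from dataclasses import dataclass

# Vendor-claimed speedups at a one-million-token context (unverified).
CLAIMED_PREFILL_SPEEDUP = 9.7
CLAIMED_DECODE_SPEEDUP = 15.6


@dataclass
class RequestLatency:
    prefill_s: float  # seconds spent ingesting the prompt
    decode_s: float   # seconds spent generating output tokens

    @property
    def total_s(self) -> float:
        return self.prefill_s + self.decode_s


def project_latency(baseline: RequestLatency,
                    prefill_speedup: float = CLAIMED_PREFILL_SPEEDUP,
                    decode_speedup: float = CLAIMED_DECODE_SPEEDUP) -> RequestLatency:
    """Scale a measured baseline by the given prefill and decode speedups."""
    return RequestLatency(
        prefill_s=baseline.prefill_s / prefill_speedup,
        decode_s=baseline.decode_s / decode_speedup,
    )


# Hypothetical baseline numbers, chosen only to make the arithmetic easy.
baseline = RequestLatency(prefill_s=97.0, decode_s=156.0)
projected = project_latency(baseline)
overall_speedup = baseline.total_s / projected.total_s
print(f"baseline {baseline.total_s:.1f}s, projected {projected.total_s:.1f}s, "
      f"overall {overall_speedup:.2f}x")
```

The end-to-end gain is a blend of the two multipliers and depends on how a request splits between prefill and decode, so a prompt-heavy workload would sit closer to the prefill figure than the decode one.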

03 The architecture is the story, not the leaderboard

What makes this more than another release is how the long context is paid for. M3 uses a sparse attention design MiniMax calls MSA: a standard grouped-query backbone that does block-level top-k selection over real, uncompressed key-values, rather than the linear or lightning-attention route the lab took in earlier models. Crucially, it does not throw away the KV cache to win speed. It picks the blocks that matter and skips the rest, keeping full-precision attention where it counts. That is the difference between a million-token window that is a demo and one you can afford to use on every request.
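The general idea of block-level top-k selection can be sketched in a few lines of plain Python. This is an illustration of the technique, not MiniMax's implementation: scoring each block by its mean key is an assumption, as are the function name and parameters, and the sketch omits multiple heads, grouped queries, and causal masking.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_topk_attention(query, keys, values, block_size, top_k):
    """Attend from one query to the top_k highest-scoring KV blocks only.

    Block scoring uses the mean key of each block. That choice is an
    assumption for illustration; the source does not say how M3 scores blocks.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)

    # 1. Split the cached positions into contiguous blocks.
    blocks = [range(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    # 2. Score each block with one dot product against its mean key.
    #    (A real system would precompute these summaries.)
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # 3. Run ordinary softmax attention over the uncompressed keys and
    #    values of the selected blocks; all other blocks are skipped.
    positions = [i for b in selected for i in blocks[b]]
    weights = softmax([dot(query, keys[i]) * scale for i in positions])
    output = [sum(w * values[i][d] for w, i in zip(weights, positions))
              for d in range(len(values[0]))]
    return output, selected


# Toy cache: 8 tokens in 4 blocks of 2; each value holds its token position.
query = [1.0, 0.0]
keys = [[0.1, 0.0], [0.0, 0.1], [2.0, 0.0], [1.5, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.5, 0.0], [0.7, 0.0]]
values = [[float(i)] for i in range(len(keys))]

sparse_out, selected = block_topk_attention(query, keys, values, block_size=2, top_k=2)
print("selected blocks:", selected)
```

With `top_k` equal to the number of blocks, the function reduces to dense attention over the whole cache, which is a convenient sanity check. The saving comes from step 3 touching only `top_k * block_size` positions instead of every cached token.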

This matters because long context is where most real agent and RAG work quietly breaks the budget. The honest failure mode is not that the model cannot read 200 pages — it is that doing so on every turn of a multi-step agent is too slow and too expensive to ship. An attention design that makes length cheap attacks exactly that constraint, and it does so in weights you can run behind your own perimeter rather than metering through someone else's API.

04 What it signals

Two trends are converging. Open-weight models keep arriving within striking distance of the proprietary frontier, and the architectural work is increasingly aimed at the operating cost of context rather than raw capability on a short prompt. Put together, they make the self-hosted, long-context agent a more defensible default — not because open weights are automatically better, but because the gap is now small enough that owning the deployment, the data path, and the cost curve can outweigh a few points on a leaderboard.

For teams, the move is to test against your own workload, not the press release. Once the weights are public, run M3 on a real long-context task (your codebase, your document set, your tool loop) and measure latency and cost at the context length you would actually use, not at 4k tokens. If the sparse-attention speedup holds outside the vendor's harness, it reshapes what a private long-context agent costs to operate, and that is worth knowing before the next model cycle resets the conversation.
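A measurement like that needs very little tooling. The sketch below times repeated calls on one realistic prompt and converts the median latency into a rough per-request cost. The `benchmark` helper, the `gpu_cost_per_hour` parameter, and the `fake_generate` stand-in are all hypothetical; in practice you would pass a function that calls your own serving stack with your own long prompt.

```python
import statistics
import time
from typing import Callable


def benchmark(generate: Callable[[str], str], prompt: str, runs: int = 5,
              gpu_cost_per_hour: float = 0.0) -> dict:
    """Time repeated calls to `generate` on a single prompt."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - start)

    median_s = statistics.median(latencies)
    return {
        "runs": runs,
        "median_s": median_s,
        "max_s": max(latencies),
        # Rough cost if the hardware serves only this request while it runs.
        "cost_per_request": median_s / 3600.0 * gpu_cost_per_hour,
    }


def fake_generate(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs without a server.
    time.sleep(0.01)
    return prompt[:10]


report = benchmark(fake_generate, prompt="x" * 1000, runs=3, gpu_cost_per_hour=2.0)
print(report)
```

Running the same harness at several prompt lengths, including the longest you expect in production, shows whether latency and cost grow the way the vendor's figures suggest.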

05 What to do next

Open weights and cheap long context only pay off if they survive contact with your real data and latency budget. If your team is trying to move an AI use case from demo to deployment, METECH helps scope, build, and validate the first working system in 2-3 weeks.