Jun 1, 2026

MiniMax M3: The First Open-Weight Model With Frontier Coding, 1M Context, and Native Multimodality

TLDR

MiniMax just released M3, the first open-weight model to combine frontier-level coding, a 1-million-token context window, and native multimodality in a single package. In MiniMax's own testing it outperforms GPT-5.5 and Gemini 3.1 Pro on SWE-Bench Pro. VentureBeat estimates it costs 5–10% of competing proprietary models, and it runs on the same platform as MiniMax Speech 2.6 Turbo for voice workloads. For audio pipelines that use Onepin to orchestrate TTS across 100+ voice models, M3 changes what agentic voice production can do.

What is MiniMax M3?
The MSA Architecture: How M3 Scales to 1M Tokens
Frontier Coding and Agentic Benchmarks
Native Multimodality from Step Zero
What Does M3 Mean for AI Voice Production?
MiniMax Speech 2.6 Turbo: The Voice Side of the Stack
Pricing and Access
How Onepin Fits into This Picture
Frequently Asked Questions

What is MiniMax M3?

MiniMax M3, released June 1, 2026, is the latest flagship model from Shanghai-based AI lab MiniMax. It is the first open-weight model to deliver three capabilities that were previously exclusive to closed-source frontier systems:

Frontier coding and agentic performance — 59.0% on SWE-Bench Pro, surpassing GPT-5.5 and Gemini 3.1 Pro
1-million-token context window — powered by a new sparse attention architecture called MSA
Native multimodality — trained on interleaved text, image, and video data from the ground up

M3 is available today via the MiniMax API at $0.30 per million input tokens, a launch price that is 50% off the regular $0.60 rate for the first 7 days. Open weights are planned for release within 10 days of launch. By VentureBeat's estimate, that puts M3 at 5–10% of the cost of comparable proprietary alternatives.

The MSA Architecture: How M3 Scales to 1M Tokens

Most large language models use full attention — a mechanism where every token attends to every other token in the context. That approach carries quadratic computational cost, so context windows beyond 128K tokens become prohibitively expensive.
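
To make the quadratic growth concrete, here is a back-of-the-envelope calculation. It counts query-key pairs only and ignores constants, attention heads, and layer count, so treat it as an illustration rather than a cost model. The function name is ours, not MiniMax's.

```python
def attention_pairs(context_tokens: int) -> int:
    """Query-key pairs scored by full attention over one sequence."""
    return context_tokens * context_tokens


base = attention_pairs(128_000)        # 128K-token context
long_ctx = attention_pairs(1_000_000)  # 1M-token context

# The context grows about 7.8x, but pairwise work grows by the square of that.
print(f"context growth: {1_000_000 / 128_000:.1f}x")   # context growth: 7.8x
print(f"attention work growth: {long_ctx / base:.1f}x")  # attention work growth: 61.0x
```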

MiniMax's answer is MiniMax Sparse Attention (MSA): a new architecture that partitions the key-value cache into blocks and selects only the relevant blocks for each query. Compared to alternatives like DSA and MoBA, MSA achieves higher effective context coverage because its block partitioning is more precise.

The practical gains are significant:

Per-token compute at 1M context is 1/20th that of the previous-generation model
Prefill speed improves more than 9×
Decoding speed improves more than 15×
Memory access is contiguous — each KV block is read exactly once
Performance across ablations matches full attention on the vast majority of tasks
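
The block-selection idea can be sketched in a few lines of plain Python. This is a toy, single-query illustration and not MiniMax's implementation: the article does not describe how MSA scores blocks, so the mean-pooled block score, the function name, and the `block_size` and `top_k` parameters are all assumptions made for the example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(
    query: Vector,
    keys: List[Vector],
    values: List[Vector],
    block_size: int,
    top_k: int,
) -> Vector:
    """Attend over only the top_k highest-scoring KV blocks for one query."""
    if not keys or top_k < 1:
        raise ValueError("need at least one key and top_k >= 1")

    # 1. Partition the KV cache into contiguous blocks of (start, end) indices.
    blocks = [(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    # 2. Score each block by the query against the block's mean key.
    #    (Assumed scoring rule; the source does not specify MSA's.)
    def block_score(span: Tuple[int, int]) -> float:
        lo, hi = span
        mean_key = [sum(k[d] for k in keys[lo:hi]) / (hi - lo)
                    for d in range(len(query))]
        return dot(query, mean_key)

    selected = sorted(blocks, key=block_score, reverse=True)[:top_k]

    # 3. Run ordinary softmax attention over the selected blocks only.
    idx = [i for lo, hi in sorted(selected) for i in range(lo, hi)]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(len(values[0]))]
```

In this sketch the softmax step touches `top_k * block_size` keys instead of all of them, which is where the per-token savings come from. A real system would also cache the per-block summaries rather than recompute them for every query.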

For audio production workflows involving long documents — think full-length audiobook manuscripts, entire video scripts, or large-scale dubbing projects — a 1M context window means the entire source file fits in one call. No chunking, no context-loss bugs, no stitching artifacts from broken continuity.

Frontier Coding and Agentic Benchmarks

MiniMax trained M3 specifically for real-world developer collaboration, not just single-turn code completion. An interactive user-simulator framework exposes the model, during both training and evaluation, to multi-turn scenarios: requirement clarification, solution iteration, task switching, and continuous feedback loops.

The benchmark results from MiniMax's internal testing:

Benchmark	M3 Score
SWE-Bench Pro	59.0%
Terminal-Bench 2.1	66.0%
SWE-fficiency	34.8%
KernelBench Hard	28.8%
MCP Atlas	74.2%

On SWE-Bench Pro, M3 surpasses GPT-5.5 and Gemini 3.1 Pro and approaches Opus 4.7. On SVG-Bench and Claw-Eval (end-to-end autonomous agent evaluation), M3 scores above all tested alternatives.

The CUDA kernel optimization task illustrated what long-horizon agentic behavior looks like in practice: M3 ran for 24 hours, made 1,959 tool calls and 147 benchmark submissions, and improved hardware peak utilization on an NVIDIA Hopper GPU from 7.6% to 71.3% — a 9.4× speedup — with zero human intervention. Unlike most other models that stopped progressing after 30 submissions, M3's best solution appeared on submission 145.

For voice production engineers building agentic systems on top of TTS APIs, this class of long-horizon capability is directly relevant: automated script pipelines, audio quality validation loops, retry-on-failure flows, and multi-provider orchestration all benefit from a model that does not give up mid-task.

Native Multimodality from Step Zero

Many models add multimodal adapters on top of a text-only foundation. M3 is different: it was trained on interleaved text, image, and video data from the beginning.

MiniMax found in extensive experiments that interleaved data — sequences where text and images are naturally interwoven, as they are in real documents — improves performance more than commonly assumed. Their rebuilt data pipeline scaled training data to 100 trillion tokens using this approach.

Current modality support:

Text input and output
Image input (screenshots, charts, figures, documents)
Video input (clip-level understanding)
Computer use — M3 can operate a desktop application autonomously

On OmniDocBench (multimodal document understanding), M3 scores above Gemini 3.1 Pro. This matters for voice workflows that need to parse visual assets: slide decks, formatted scripts, annotated PDFs, or subtitles embedded in video frames.

What Does M3 Mean for AI Voice Production?

MiniMax is not only an LLM lab. Its product stack includes MiniMax Speech 2.6 Turbo, one of the top-ranked TTS models on the Artificial Analysis Speech Arena and the Hugging Face TTS Arena leaderboards. That voice stack runs on the same API and token plan as M3.

For teams building production audio pipelines, the combination creates new possibilities:

Long-document voice production: A podcast transcript, full audiobook chapter, or dubbed film script can now fit entirely in context. M3 can reason across the full document before instructing downstream TTS models, preserving character consistency, emotional arc, and pacing cues across hundreds of thousands of tokens.

Multimodal script extraction: Got a PDF slide deck, a scanned screenplay, or a subtitle file embedded in video? M3 reads it directly and converts it to a clean TTS-ready script — no separate OCR pipeline required.

Agentic retry loops: When a TTS render fails a quality check, M3 can diagnose the issue (wrong emotion, pacing mismatch, pronunciation error) and rewrite the prompt before retrying — closing the validation loop without human review.

Multilingual dubbing at scale: MiniMax Speech supports 40+ languages. M3's 1M context window holds a full feature-length script in one pass, making coherent multilingual dubbing across an entire production more tractable.
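
The agentic retry loop described above can be sketched as a small control loop. This is a minimal sketch under stated assumptions: `render`, `check`, and `revise` are hypothetical callables standing in for a TTS call, a quality check, and an LLM prompt rewrite. They are not MiniMax or Onepin APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QualityReport:
    passed: bool
    issue: Optional[str] = None  # e.g. "pacing mismatch"


def render_with_retries(
    prompt: str,
    render: Callable[[str], bytes],            # hypothetical TTS call
    check: Callable[[bytes], QualityReport],   # hypothetical quality check
    revise: Callable[[str, str], str],         # hypothetical LLM rewrite
    max_attempts: int = 3,
) -> bytes:
    """Render audio, rewriting the prompt after each failed quality check."""
    for attempt in range(1, max_attempts + 1):
        audio = render(prompt)
        report = check(audio)
        if report.passed:
            return audio
        if attempt < max_attempts:
            # Feed the diagnosed issue back into the prompt rewrite.
            prompt = revise(prompt, report.issue or "unspecified issue")
    raise RuntimeError(f"quality check still failing after {max_attempts} attempts")
```

Bounding the loop with `max_attempts` is a design choice for the sketch: it keeps a persistently failing render from retrying forever and gives the caller a clear point at which to escalate.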

MiniMax Speech 2.6 Turbo: The Voice Side of the Stack

MiniMax Speech 2.6 Turbo is the current production-grade TTS model in the MiniMax ecosystem. Key specs:

Sub-250ms latency (TTFB) for real-time streaming
40+ languages supported
Zero-shot voice cloning from a short reference audio
300+ voices out of the box
Compatible with OpenAI-standard REST API calls
MP3, PCM, WAV, and FLAC output formats

The Speech 2.6 architecture uses an autoregressive Transformer with a hybrid Flow-VAE module for audio quality. The learnable speaker encoder extracts timbre from a reference clip without requiring its transcription, which is what enables zero-shot voice cloning with high speaker similarity.

Speech 2.6 Turbo and M3 share the same MiniMax Token Plan pool: $20/month for ~1.7B tokens, $50/month for ~5.1B tokens, $120/month for ~9.8B tokens. Text, image, speech, and music credits are unified.

Pricing and Access

M3 is available via the MiniMax API starting today:

Tier	Input	Output
Standard, ≤512K context (50% off, first 7 days)	$0.30/M tokens	$1.00/M tokens
Standard, ≤512K context (after promo)	$0.60/M tokens	$2.00/M tokens
Long-context (>512K)	Higher rate	Higher rate
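
As a worked example of the rates in the table above, the following sketch estimates the cost of a single request. It assumes that "512K" means 512,000 tokens and that the threshold applies to input tokens. Both are our assumptions, and the function and constant names are hypothetical.

```python
CONTEXT_LIMIT = 512_000  # assumption: "512K" read as 512,000 input tokens

# USD per million tokens, taken from the pricing table.
RATES = {
    "promo": {"input": 0.30, "output": 1.00},     # first 7 days
    "standard": {"input": 0.60, "output": 2.00},  # after the promotion
}


def estimate_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> float:
    """Estimate the USD cost of one request at the published standard-tier rates."""
    if input_tokens > CONTEXT_LIMIT:
        # The long-context rate is not stated in the table, so refuse to guess.
        raise ValueError("long-context pricing is not specified")
    rate = RATES[tier]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000


# A 400K-token manuscript with a 20K-token response:
print(f"${estimate_cost(400_000, 20_000):.2f}")                # $0.28
print(f"${estimate_cost(400_000, 20_000, tier='promo'):.2f}")  # $0.14
```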

For subscription access:

Plus — $20/month, ~1.7B tokens
Max — $50/month, ~5.1B tokens
Ultra — $120/month, ~9.8B tokens

Thinking mode can be toggled on or off at request time — useful for switching between deep reasoning tasks and low-latency completions without changing the model endpoint.

Open weights are expected within 10 days of the June 1 launch, at which point enterprise self-hosting becomes available at no per-token cost.

How Onepin Fits into This Picture

Onepin is an AI voice production agent — a meta-orchestration and validation layer that runs on top of 100+ TTS models worldwide, including MiniMax Speech. It plans, runs, validates, retries, and ships publish-ready audio.

M3 changes what an orchestration layer like Onepin can do:

Full-script context: Rather than chunking a long audiobook into segments, Onepin can pass the full manuscript to M3 for planning — then break it into optimally sized TTS calls that preserve emotional continuity from beginning to end.
Smarter model selection: M3's multimodal understanding means it can read a visual brief (a PDF, a video storyboard, an annotated slide deck) and use that context to select the right voice model and parameters for each section.
Fewer human review cycles: M3's agentic loop closes quality feedback automatically. When a rendered clip fails a voice-match or pacing check, M3 rewrites the TTS prompt and retries — rather than flagging for manual intervention.
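
The "plan on the full script, then split into TTS calls" step can be illustrated with a simple paragraph-boundary chunker. This is a sketch only: the article does not describe how Onepin segments scripts, and the `chunk_script` name and the `max_chars` budget are hypothetical.

```python
from typing import List


def chunk_script(script: str, max_chars: int = 2000) -> List[str]:
    """Group paragraphs into chunks of about max_chars, never splitting a paragraph.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for para in (p.strip() for p in script.split("\n\n")):
        if not para:
            continue
        # +2 accounts for the blank line used to rejoin paragraphs.
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only at paragraph boundaries keeps each TTS call aligned with a natural pause in the text. Per-chunk direction, such as emotion or pacing notes produced during planning, could be attached to each chunk before rendering.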

For content creators, video producers, and localization teams who want broadcast-quality audio without the overhead of managing individual model APIs, Onepin combined with M3-class intelligence is the next step.

Frequently Asked Questions

What makes MiniMax M3 different from other open-weight models?

M3 is the first open-weight model to combine all three of: frontier coding performance, a 1-million-token context window, and native multimodality. Previous open-weight models delivered one or two of these, but not all three in a single production-ready release.

Is MiniMax M3 available now?

Yes. The API is live at platform.minimax.io as of June 1, 2026. Open weights are expected within 10 days.

How does MiniMax M3 compare to GPT-5.5 on coding?

On SWE-Bench Pro, M3 scores 59.0% in MiniMax's internal testing, above GPT-5.5 and Gemini 3.1 Pro and close to Opus 4.7. VentureBeat estimates it costs approximately 5–10% of what those proprietary models charge per token.

Can MiniMax M3 handle voice and TTS workflows?

M3 is an LLM, not a TTS model. But it runs on the same platform as MiniMax Speech 2.6 Turbo, which is one of the top-ranked TTS models globally. Used together — or through an orchestration layer like Onepin — they form a complete AI voice production stack.

What is MiniMax Sparse Attention (MSA)?

MSA is a new attention architecture that partitions the KV cache into blocks and selects only the relevant blocks per query. At a 1M-token context, M3's per-token compute is 1/20th that of the previous-generation model, with prefill more than 9× faster and decoding more than 15× faster.

When will MiniMax M3 open weights be released?

MiniMax confirmed the weights and technical report will be released within 10 days of the June 1, 2026 launch — expected by approximately June 10–11, 2026.
