DeepSeek V4 Preview: The 1M-Token Model That Makes Long Context Cheap

DeepSeek V4 Preview is a launch with an argument inside it. The argument: the next long-context race will not be won by whoever advertises the largest window, but by whoever makes that window cheap enough, stable enough, and boring enough to use every day.

On April 24, 2026, DeepSeek announced two open-weight preview models, DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both expose a one-million-token context window across official DeepSeek services. That is the headline. It is not the story.

The story is cost. A million tokens are worth very little if every request gets priced like a physics experiment. A long window only becomes a product capability when prefill, attention compute, KV cache memory, and repeated serving all stop exploding at the same time. The technical report frames V4 exactly that way: not a bigger checkpoint, but an architecture and systems release built around efficient million-token intelligence.

Paper at a glance

DeepSeek V4 Preview: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-AI

DeepSeek technical report and preview release · 2026

DeepSeek presents V4-Pro and V4-Flash as preview MoE language models with one-million-token context support, hybrid CSA/HCA attention, mHC residual connections, Muon optimization, FP4-aware deployment work, and post-training modes for different reasoning budgets.

The product surface, at least, is simple: deepseek-v4-pro and deepseek-v4-flash are available through the API, both support Thinking and Non-Thinking usage, and the legacy names deepseek-chat and deepseek-reasoner are scheduled to disappear after July 24, 2026, 15:59 UTC. The open weights are distributed through the DeepSeek V4 Hugging Face collection.

One framing note before the details. This article treats V4 as what it says on the label — a preview — and leans on DeepSeek's own release note, technical report, and Hugging Face collection as primary sources. The benchmark numbers are useful signals, but most of them are official or internal evaluations, so the right posture is interest with discipline: test it, measure it, and do not outsource judgment to a launch chart.

What Changed Since Launch Day

As of April 27, 2026, the picture has settled enough to separate the durable facts from the launch-day noise:

DeepSeek positions V4 Preview as the start of "cost-effective 1M context length," not merely a bigger checkpoint.
The public family has two operating points: V4-Pro for maximum reasoning and agentic work, V4-Flash for fast and economical production usage.
The API migration path is deliberately small: keep the same base_url, change the model ID to deepseek-v4-pro or deepseek-v4-flash.
DeepSeek says both models support 1M context and both Thinking / Non-Thinking modes.
The older aliases deepseek-chat and deepseek-reasoner are temporary compatibility routes, gone after July 24, 2026, 15:59 UTC.

Which means the interesting question is no longer "does DeepSeek have a 1M-token model?" It does. The question is whether V4 makes that context cheap and reliable enough for agents, large-codebase work, compliance review, long document analysis — the workloads where repeated prefix cost actually shows up on an invoice.

The Release in One Sentence

DeepSeek V4 is a two-model MoE family designed to make one-million-token inference less absurdly expensive.

That sentence is worth holding onto, because it sidesteps the usual launch-day trap. This is not "bigger model, bigger context, better scores." It is a coordinated set of choices that only make sense together:

V4-Pro for the hardest reasoning, agentic, coding, and long-context workloads.
V4-Flash for lower-cost production paths that still need serious context length.
Hybrid CSA/HCA attention to compress long history instead of attending to everything at full resolution.
mHC residual connections and Muon optimization to improve training stability and depth.
FP4/QAT and serving work to make deployment economics part of the model story.
1M context as a service default, not a lab demo.

If this holds up in real deployments, it changes how people design agents. More transcript stays live. More documents sit inside the working set. More logs, plans, pull requests, and file summaries remain reachable without bolting a retrieval stack onto every workflow.

That does not make retrieval obsolete. It changes what retrieval is for.

The Model Family

V4-Pro is the flagship: 1.6T total parameters, 49B active. V4-Flash is the smaller serving-oriented model: 284B total, 13B active. Both are mixture-of-experts models, so the active parameter count is the number that actually tracks per-token compute — the total is mostly a storage problem.
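To make the active-versus-total distinction concrete, here is a small back-of-envelope sketch using the parameter counts quoted above. The "2 FLOPs per active parameter per token" figure is a common rule of thumb that I am assuming for illustration; it is not a number from the report, and it ignores attention cost over the context.

```typescript
// Parameter counts as quoted in the article.
interface MoEModel {
  name: string;
  totalParams: number;  // all experts; drives storage
  activeParams: number; // parameters used per token; drives compute
}

const models: MoEModel[] = [
  { name: "V4-Pro", totalParams: 1.6e12, activeParams: 49e9 },
  { name: "V4-Flash", totalParams: 284e9, activeParams: 13e9 },
];

// Share of the weights that participate in one token's forward pass.
function activeFraction(m: MoEModel): number {
  return m.activeParams / m.totalParams;
}

// Assumed rule of thumb: roughly 2 FLOPs per active parameter per token,
// ignoring attention over the context.
function approxForwardFlopsPerToken(m: MoEModel): number {
  return 2 * m.activeParams;
}

const summary = models.map((m) => ({
  name: m.name,
  activePercent: Math.round(activeFraction(m) * 1000) / 10,
  gflopsPerToken: approxForwardFlopsPerToken(m) / 1e9,
}));
// V4-Pro: about 3.1% of weights active; V4-Flash: about 4.6%.
```

Under that assumption, Pro spends a little under four times Flash's per-token compute while storing more than five times the weights.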

The split is smart. A single giant checkpoint creates a benchmark story. A Pro/Flash family creates a deployment story.

Pro is what you reach for when failure is expensive: multi-step reasoning, large-codebase sessions, agentic coding, difficult synthesis, and long-context tasks where the answer hangs on subtle connections buried deep in the prompt.

Flash is what you test when the workflow is frequent, latency-sensitive, or cost-sensitive: extraction, classification, routine coding assistance, customer-facing assistants — long-context workloads that live behind validators, tests, schemas, or human review.

Architecturally, V4 keeps the DeepSeek lineage — DeepSeekMoE in the feed-forward layers, Multi-Token Prediction inherited from the V3 family — but the center of gravity has moved: long-context attention, residual-stream routing, optimizer choice, quantization-aware deployment, and cache reuse are where the new work lives.

The Real Breakthrough Is KV Cache Economics

Long context has always had an unglamorous problem: the bill arrives in compute and memory.

Every generated token has to interact with the prefix. Every request needs a KV cache. Every shared-prefix workload risks paying the prefill cost again and again. Somewhere between thousands of tokens and hundreds of thousands, the marketing number stops mattering and the systems problem takes over.
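A rough sketch of why memory becomes the bill: for a conventional uncompressed KV cache, size grows linearly with sequence length. The layer count, head count, and head dimension below are hypothetical values chosen for illustration; they are not V4's configuration, which uses compressed attention rather than this plain layout.

```typescript
interface KvShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  bytesPerValue: number;
}

// Uncompressed KV cache size for a single sequence.
function kvCacheBytes(shape: KvShape, tokens: number): number {
  // 2 = one key vector plus one value vector per token, per layer, per KV head.
  return 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue * tokens;
}

// Hypothetical dense-attention shape, NOT V4's real configuration.
const hypotheticalShape: KvShape = { layers: 60, kvHeads: 8, headDim: 128, bytesPerValue: 2 };

const GIB = 1024 * 1024 * 1024;
const gibAt8k = kvCacheBytes(hypotheticalShape, 8192) / GIB;    // 1.875 GiB
const gibAt1m = kvCacheBytes(hypotheticalShape, 1000000) / GIB; // roughly 229 GiB
```

With these made-up dimensions, one million-token request would need more cache memory than a single accelerator typically holds, which is the pressure that compression is meant to relieve.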

DeepSeek's answer is hybrid attention. The report describes two interleaved mechanisms:

Compressed Sparse Attention (CSA) compresses KV entries and selects relevant compressed blocks.
Heavily Compressed Attention (HCA) compresses more aggressively where coarse access is acceptable.
A sliding-window branch keeps recent context in higher resolution.

This is a practical compromise, and an honest one. A coding agent three hours into a session does not need full-resolution attention over every token it has ever seen. It needs sharp access to the current turn, reliable access to old decisions, and enough structure to recover plans, file summaries, logs, and constraints without drowning the serving stack.

That is why KV cache compression matters more than the raw context number. A one-million-token window is only valuable if the model can afford to keep looking back.

CSA, HCA, and the Shape of Memory

CSA and HCA are not just sparse attention with nicer names. They encode a theory of agent memory.

Recent context is fragile. The model needs exact local detail: the current function, the latest tool output, the last instruction, the active bug. Compress that too aggressively and the agent starts making small, expensive mistakes.

Older context is different. There the model usually needs a recoverable representation — what was decided, which files were touched, what failed, what remains of the plan — and that history compresses well, as long as the model still has a way to route attention back to the right pieces.

V4's hybrid design is built around exactly that distinction: local detail stays close, long history gets stored cheaply, sparse selection keeps the distant parts reachable.
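The shape of that idea can be sketched in a few lines. This is a toy model, not DeepSeek's CSA/HCA algorithm: I assume each block of older tokens collapses into a single compressed entry, that a query keeps a full-resolution recent window, and that it picks the top-k older blocks by some relevance score. The window size, block size, k, and the scoring function are all hypothetical.

```typescript
interface AttentionPlan {
  recentTokens: number;     // attended at full resolution
  selectedBlocks: number[]; // indices of compressed blocks chosen for this query
  entriesAttended: number;  // total KV entries touched
}

function planAttention(
  contextLength: number,
  windowSize: number,
  blockSize: number,
  topK: number,
  blockScore: (blockIndex: number) => number,
): AttentionPlan {
  const recentTokens = Math.min(windowSize, contextLength);
  const olderTokens = contextLength - recentTokens;
  // Toy assumption: every block of older tokens becomes one compressed entry.
  const blockCount = Math.ceil(olderTokens / blockSize);
  const ranked = Array.from({ length: blockCount }, (_, i) => i)
    .sort((a, b) => blockScore(b) - blockScore(a));
  const selectedBlocks = ranked.slice(0, topK).sort((a, b) => a - b);
  return {
    recentTokens,
    selectedBlocks,
    entriesAttended: recentTokens + selectedBlocks.length,
  };
}

// Hypothetical settings and an arbitrary deterministic score.
const plan = planAttention(1000000, 4096, 64, 512, (i) => (i * 37) % 101);
const reduction = 1000000 / plan.entriesAttended; // roughly 217x fewer entries
```

The point of the sketch is the budget, not the scoring: the query touches a few thousand entries instead of a million, and everything depends on whether the selection step finds the right blocks.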

The report also discusses on-disk KV cache reuse, and this deserves more attention than it will get. In production, long-context requests share prefixes constantly: documentation corpora, repository snapshots, policy manuals, logs, long conversations, agent workspaces. If compressed CSA/HCA cache entries can be stored and reused, the system stops paying the same prefill tax on every request.

That is the difference between "the model supports 1M tokens" and "the product can afford 1M tokens."
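A minimal sketch of the accounting behind prefix reuse, assuming a cache keyed by some identifier for the shared prefix. In a real serving stack the key would be derived from the token sequence and the stored value would be the compressed cache itself; here the key is a hypothetical label and only token counts are tracked.

```typescript
class PrefixCache {
  private seen = new Set<string>(); // prefixes whose cache entries already exist
  prefillTokensPaid = 0;
  prefillTokensSaved = 0;

  // Returns the number of tokens that had to be prefilled for this request.
  serve(prefixKey: string, prefixTokens: number, suffixTokens: number): number {
    const cached = this.seen.has(prefixKey);
    const paid = cached ? suffixTokens : prefixTokens + suffixTokens;
    if (cached) {
      this.prefillTokensSaved += prefixTokens;
    } else {
      this.seen.add(prefixKey);
    }
    this.prefillTokensPaid += paid;
    return paid;
  }
}

// Three questions against the same hypothetical 800K-token repository snapshot.
const cache = new PrefixCache();
const firstRequest = cache.serve("repo-snapshot", 800000, 500);  // pays the full prefill
const secondRequest = cache.serve("repo-snapshot", 800000, 500); // pays only the suffix
const thirdRequest = cache.serve("repo-snapshot", 800000, 500);
```

After three requests the sketch has paid for 801,500 prefill tokens and skipped 1,600,000. Whether a real deployment sees that ratio depends on hit rates and eviction, which this toy ignores.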

mHC and Muon: The Quiet Parts of the Release

The attention story will get most of the attention. It should not get all of it.

First, DeepSeek introduces mHC — Manifold-Constrained Hyper-Connections — as an upgrade to conventional residual connections. Deep models are not only about parameter count; they are about signal routing. The residual stream has to carry information through depth without collapsing into noise or losing useful transformations along the way. mHC is DeepSeek's attempt to make that path more stable and more expressive.

Second, DeepSeek trains most parameters with the Muon optimizer, keeping AdamW for selected modules — embeddings, output heads, mHC modules, RMSNorm weights. The report frames this as a convergence and stability decision at V4 scale.
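The split described above amounts to a parameter-grouping rule. The sketch below shows one way such a rule could look; the name fragments and parameter names are hypothetical, since the article does not give DeepSeek's naming scheme or training code.

```typescript
type Optimizer = "muon" | "adamw";

// Hypothetical name fragments for the modules the article says stay on AdamW:
// embeddings, output heads, mHC modules, and RMSNorm weights.
const ADAMW_MARKERS = ["embed", "lm_head", "mhc", "norm"];

function optimizerFor(paramName: string): Optimizer {
  const lower = paramName.toLowerCase();
  return ADAMW_MARKERS.some((marker) => lower.includes(marker)) ? "adamw" : "muon";
}

function groupParams(names: string[]): Record<Optimizer, string[]> {
  const groups: Record<Optimizer, string[]> = { muon: [], adamw: [] };
  for (const name of names) {
    groups[optimizerFor(name)].push(name);
  }
  return groups;
}

// Illustrative parameter names only.
const groups = groupParams([
  "tok_embed.weight",
  "layers.0.attn.q_proj.weight",
  "layers.0.attn_norm.weight",
  "layers.0.mhc.mix",
  "layers.0.moe.experts.3.up_proj.weight",
  "lm_head.weight",
]);
```

With these names, the attention projection and the expert weights land in the Muon group and the other four land in AdamW.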

Neither of these makes a keynote slide. Together, they are what makes V4 read less like a context-window patch and more like a full-stack redesign around long reasoning trajectories — because million-token context stresses everything at once: attention, residual flow, optimizer behavior, quantization, cache management, serving, post-training.

Training Scale and Reasoning Modes

DeepSeek reports more than 32T pre-training tokens for the series: Flash on 32T tokens, Pro on 33T. Training sequence length is extended in stages through 16K, 64K, and 1M.

After pre-training, both models are post-trained into several reasoning modes.

This is not cosmetic. Reasoning mode changes what the product actually is.

Non-Think is for fast direct answers. High is the normal deliberate mode. Max is the mode you use to go hunting for the score, with a larger reasoning budget and an evaluation setup to match.

So "V4" is not one thing in practice. V4-Flash Non-Think inside a high-volume product and V4-Pro Max inside a benchmark are different operating points on the same family curve. Any serious evaluation needs to name the model, the mode, the context size, the output budget, the latency, and the cache conditions.

Otherwise the comparison is theatre.
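One way to enforce that discipline is to record the conditions next to every result and refuse to compare runs that differ on them. The field names and example values below are my own, hypothetical choices; the mode labels follow the article's Non-Think, High, and Max.

```typescript
type ReasoningMode = "non-think" | "high" | "max";

interface EvalConditions {
  model: "deepseek-v4-pro" | "deepseek-v4-flash";
  mode: ReasoningMode;
  contextTokens: number;
  maxOutputTokens: number;
  latencyBudgetMs: number;
  cacheWarm: boolean;
}

// Lists the conditions that differ between two runs. The model is excluded
// on purpose: it is the thing being compared.
function mismatchedConditions(a: EvalConditions, b: EvalConditions): string[] {
  const issues: string[] = [];
  if (a.mode !== b.mode) issues.push("mode");
  if (a.contextTokens !== b.contextTokens) issues.push("contextTokens");
  if (a.maxOutputTokens !== b.maxOutputTokens) issues.push("maxOutputTokens");
  if (a.latencyBudgetMs !== b.latencyBudgetMs) issues.push("latencyBudgetMs");
  if (a.cacheWarm !== b.cacheWarm) issues.push("cacheWarm");
  return issues;
}

// Hypothetical runs: a production-style Flash setup versus a benchmark-style Pro setup.
const flashRun: EvalConditions = {
  model: "deepseek-v4-flash", mode: "non-think", contextTokens: 200000,
  maxOutputTokens: 2000, latencyBudgetMs: 5000, cacheWarm: true,
};
const proRun: EvalConditions = {
  model: "deepseek-v4-pro", mode: "max", contextTokens: 200000,
  maxOutputTokens: 32000, latencyBudgetMs: 5000, cacheWarm: false,
};
const issues = mismatchedConditions(flashRun, proRun);
// issues: ["mode", "maxOutputTokens", "cacheWarm"]
```

A non-empty list means the two scores describe different operating points and should not sit in the same column.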

Benchmarks: Good Signals, Not Gospel

DeepSeek's report presents V4-Pro-Max as a leading open model across several categories and competitive with closed frontier systems on selected reasoning, coding, long-context, and agentic evaluations.

The long-context results are the ones that bear on the architecture claim. On MRCR 1M and CorpusQA 1M, Pro Max reports 83.5 and 62.0 respectively, ahead of Flash Max. That matches the expected hierarchy: Flash punches above its active-parameter budget, but Pro remains the safer choice when the task depends on subtle retrieval and synthesis across a very large prompt.

The grouped comparison above uses DeepSeek report Table 6 rather than the YouTube screenshot, because those values are traceable to the technical report. It compares DeepSeek-V4-Pro-Max with Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini-3.1-Pro High on benchmarks where the report lists all four values.

For agentic work, the gap opens up. Terminal Bench 2.0 comes in at 67.9 for Pro Max against 56.9 for Flash Max. SWE Verified is much tighter: 80.6 versus 79.0.
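The size of those two gaps is easier to see as arithmetic. The scores are the ones quoted above; the relative-gap calculation is just my framing of them.

```typescript
interface ScorePair {
  benchmark: string;
  proMax: number;
  flashMax: number;
}

// Scores as quoted in the article.
const agentic: ScorePair[] = [
  { benchmark: "Terminal Bench 2.0", proMax: 67.9, flashMax: 56.9 },
  { benchmark: "SWE Verified", proMax: 80.6, flashMax: 79.0 },
];

// Round to one decimal place to hide floating-point noise.
const round1 = (x: number): number => Math.round(x * 10) / 10;

const gaps = agentic.map((p) => ({
  benchmark: p.benchmark,
  pointGap: round1(p.proMax - p.flashMax),
  relativeGapPercent: round1(((p.proMax - p.flashMax) / p.flashMax) * 100),
}));
// Terminal Bench 2.0: 11 points, about 19.3% above Flash Max.
// SWE Verified: 1.6 points, about 2% above Flash Max.
```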

My read: Flash is the impressive product model. Pro is the model I would trust first when the task is ambiguous, high-value, long-running, or hard to verify.

What Developers Should Actually Do

The API migration is intentionally small. DeepSeek says to keep the same base_url and update the model name:

const model = "deepseek-v4-pro"; // or "deepseek-v4-flash"

The release note also says the API accepts both the OpenAI Chat Completions format and the Anthropic API format, which makes V4 cheap to trial inside existing agents, IDE tools, retrieval systems, evaluation harnesses, and backend services.
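As a sketch of how small the change is, the helper below builds an OpenAI-style Chat Completions request with the new model ID. The base URL and API key are placeholders for whatever you already use, and the `/chat/completions` path follows the OpenAI convention, which I am assuming applies here. Nothing is sent over the network.

```typescript
type V4Model = "deepseek-v4-pro" | "deepseek-v4-flash";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds the request only; sending it is left to the caller.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: V4Model,
  messages: ChatMessage[],
): ChatRequest {
  return {
    // Assumed OpenAI-style path; trailing slashes on the base URL are trimmed.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}

// Placeholder base URL and key; substitute the values you already use.
const request = buildChatRequest(
  "https://api.example.invalid/",
  "sk-placeholder",
  "deepseek-v4-flash",
  [{ role: "user", content: "Summarize this repository." }],
);
```

The resulting object can be passed to any HTTP client as a POST. Typing the model as a union of the two new IDs also turns a lingering legacy name into a compile-time error.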

The urgent part is the retirement timeline. deepseek-chat and deepseek-reasoner currently route to V4-Flash non-thinking/thinking, but DeepSeek says they become inaccessible after July 24, 2026, 15:59 UTC. If those names are hard-coded anywhere, migrate them now — not the week of the shutdown.
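A small audit helper makes the deadline harder to forget. The routing table and cutoff come from the release note as described above; the function names and the idea of scanning a list of configured model IDs are my own illustration. The `thinking` flag is descriptive only, since the article does not say how a mode is selected in a request.

```typescript
// Current routing of the legacy names, as described in the release note.
const LEGACY_ROUTES = new Map<string, { model: string; thinking: boolean }>([
  ["deepseek-chat", { model: "deepseek-v4-flash", thinking: false }],
  ["deepseek-reasoner", { model: "deepseek-v4-flash", thinking: true }],
]);

// July 24, 2026, 15:59 UTC. Months are zero-based, so July is 6.
const RETIREMENT = new Date(Date.UTC(2026, 6, 24, 15, 59));

// Returns the configured model IDs that will stop working at the cutoff.
function findLegacyModelIds(configuredIds: string[]): string[] {
  return configuredIds.filter((id) => LEGACY_ROUTES.has(id));
}

// Whole days remaining before the cutoff; negative once it has passed.
function daysUntilRetirement(now: Date): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor((RETIREMENT.getTime() - now.getTime()) / msPerDay);
}

const stale = findLegacyModelIds(["deepseek-chat", "deepseek-v4-pro", "deepseek-reasoner"]);
const daysLeft = daysUntilRetirement(new Date(Date.UTC(2026, 3, 27))); // from April 27, 2026
```

Run against real configuration, a check like this belongs in CI, so that a hard-coded legacy name fails a build rather than a production request.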

For evaluation, I would run four tracks:

| Track | What to measure |
| --- | --- |
| Long-context retrieval | Does it find the right facts inside real messy documents, not just benchmark needles? |
| Agentic coding | Does it keep plans, file edits, tests, and tool traces coherent over long sessions? |
| Cost and latency | What happens at 50K, 200K, 500K, and 1M tokens with realistic output lengths? |
| Cache reuse | Do shared-prefix workloads actually get cheaper in your serving path? |
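For the cost and cache tracks, it helps to lay out the measurement grid before running anything. The sketch below enumerates the four context sizes against cold and warm cache conditions and attaches a cost estimate. The prices are arbitrary placeholder units, not DeepSeek's price list, and the model assumes the entire context is either cached or not, which real workloads will only approximate.

```typescript
const CONTEXT_SIZES = [50000, 200000, 500000, 1000000];

// Placeholder pricing in arbitrary units per million tokens.
interface Pricing {
  inputPerMTok: number;
  cachedInputPerMTok: number;
  outputPerMTok: number;
}

interface SweepCell {
  contextTokens: number;
  cacheHit: boolean;
  estimatedCost: number;
}

function sweep(pricing: Pricing, outputTokens: number): SweepCell[] {
  const cells: SweepCell[] = [];
  for (const contextTokens of CONTEXT_SIZES) {
    for (const cacheHit of [false, true]) {
      // Simplification: the whole context is billed at one rate.
      const inputRate = cacheHit ? pricing.cachedInputPerMTok : pricing.inputPerMTok;
      const estimatedCost =
        (contextTokens * inputRate + outputTokens * pricing.outputPerMTok) / 1e6;
      cells.push({ contextTokens, cacheHit, estimatedCost });
    }
  }
  return cells;
}

// Hypothetical rates: cached input at one tenth of uncached input.
const grid = sweep({ inputPerMTok: 1, cachedInputPerMTok: 0.1, outputPerMTok: 2 }, 2000);
```

Replace the placeholder rates with your actual billing data, then record measured latency beside each cell. The estimate is only the hypothesis; the invoice is the result.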

Do not evaluate a million-token model only on short prompts. And do not conclude that a million-token window means you should paste everything. The win is optionality: the system can carry more context when the workflow genuinely benefits — and skip the ceremony when it does not.

Limits and Caveats

This is a preview release, and adoption decisions should be shaped accordingly.

The strongest performance claims are DeepSeek claims, not independent replications.
Many benchmark results come from DeepSeek's own evaluation setup.
The current public scope is text-first; this is not a multimodal release.
A 1M-token window does not remove the need for retrieval, summarization, memory design, or good agent state.
Real cost depends on reasoning mode, output length, prefix reuse, cache behavior, and latency targets.
Long-context benchmark success does not automatically mean reliable reasoning over chaotic repositories, logs, tickets, PDFs, and tool traces.

V4 makes million-token workflows more plausible. It does not make them solved.

The distinction matters, because the teams that benefit most will not be the ones that paste the largest prompt. They will be the ones that design the cleanest long-running workspace: structured files, explicit plans, compact summaries, stable tool traces, and measured cache reuse.

Practical Assessment

DeepSeek V4 Preview is best understood as an efficiency release wearing the clothes of a frontier-model release. The parameter counts are large, but the load-bearing argument is economic: make extreme context stop being exotic.

Use V4-Pro where a bad answer is expensive: deep reasoning, multi-document synthesis, large codebase sessions, long-running agents, complex planning, high-value automation.

Use V4-Flash where throughput matters and the workflow has guardrails: production assistants, structured extraction, classification, routine code help, customer support, long-context pipelines protected by validators or human review.

The preview label is real, and independent replication still matters. But the direction is hard to argue with: 2024 and 2025 made context windows larger. V4 is a bet that 2026 is about making them economically usable.

That is the right problem to solve.

Sources: DeepSeek release note, DeepSeek V4 technical report, DeepSeek V4 Hugging Face collection, and the supporting YouTube discussion.
