DeepSeek V4 Preview: The 1M-Token Model That Makes Long Context Cheap
DeepSeek V4 Preview is not just another model-card launch. It is DeepSeek making a specific engineering bet: the next long-context race will not be won by whoever advertises the largest window, but by whoever makes that window cheap enough, stable enough, and boring enough to use every day.
On April 24, 2026, DeepSeek announced two open-weight preview models, DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both expose a one-million-token context window across official DeepSeek services. That is the headline. It is not the story.
The story is cost. A million tokens are not useful if a single request behaves like a research-lab event. A long window only becomes a product capability when prefill, attention compute, KV cache memory, and repeated serving all stop exploding at the same time. The technical report frames V4 exactly that way: as an architecture and systems release around efficient million-token intelligence.
Paper at a glance
DeepSeek V4 Preview: Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-AI
DeepSeek technical report and preview release · 2026
DeepSeek presents V4-Pro and V4-Flash as preview MoE language models with one-million-token context support, hybrid CSA/HCA attention, mHC residual connections, Muon optimization, FP4-aware deployment work, and post-training modes for different reasoning budgets.
The product surface is simple: deepseek-v4-pro and deepseek-v4-flash are available through the API, both support Thinking and Non-Thinking usage, and the legacy names deepseek-chat and deepseek-reasoner are scheduled to disappear after July 24, 2026, 15:59 UTC. The open weights are distributed through the DeepSeek V4 Hugging Face collection.
This article treats V4 as a preview release, not as a final verdict. DeepSeek's own release note, technical report, and Hugging Face model collection are the primary sources here. The benchmark numbers are useful signals, but most of them are official or internal evaluations, so the right posture is interest with discipline: test it, measure it, and do not outsource judgment to a launch chart.
What Changed Since Launch Day
As of April 27, 2026, the official picture is clear enough to separate the durable facts from the launch-day noise:
- DeepSeek positions V4 Preview as the start of "cost-effective 1M context length," not merely a bigger checkpoint.
- The public family has two operating points: V4-Pro for maximum reasoning and agentic work, V4-Flash for fast and economical production usage.
- The API migration path is deliberately small: keep the same base_url, change the model ID to deepseek-v4-pro or deepseek-v4-flash.
- DeepSeek says both models support 1M context and both Thinking / Non-Thinking modes.
- The older aliases deepseek-chat and deepseek-reasoner are temporary compatibility routes and are scheduled to be retired after July 24, 2026, 15:59 UTC.
That means the right question is no longer "does DeepSeek have a 1M-token model?" It does. The right question is whether V4 makes that context cheap and reliable enough for agents, large-codebase work, compliance review, long document analysis, and production systems where repeated prefix cost actually matters.
The Release in One Sentence
DeepSeek V4 is a two-model MoE family designed to make one-million-token inference less absurdly expensive.
That sentence matters because it avoids the usual launch-day trap. The release is not just "bigger model, bigger context, better scores." It is a coordinated set of choices:
- V4-Pro for the hardest reasoning, agentic, coding, and long-context workloads.
- V4-Flash for lower-cost production paths that still need serious context length.
- Hybrid CSA/HCA attention to compress long history instead of attending to everything at full resolution.
- mHC residual connections and Muon optimization to improve training stability and depth.
- FP4/QAT and serving work to make deployment economics part of the model story.
- 1M context as a service default, not just a lab demo.
If this works in real deployments, it changes how people design agents. More transcript can stay live. More documents can sit inside the working set. More logs, plans, pull requests, and file summaries can remain reachable without rebuilding a retrieval stack for every workflow.
That does not make retrieval obsolete. It changes what retrieval is for.
The Model Family
V4-Pro is the flagship model: 1.6T total parameters, 49B active parameters. V4-Flash is the smaller serving-oriented model: 284B total parameters, 13B active parameters. Because both are mixture-of-experts models, the active parameter count is the more useful number for thinking about per-token compute.
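To make the active-versus-total distinction concrete, here is a small arithmetic sketch using only the parameter counts quoted above. The interface and variable names are mine, and the percentages are rounded to two decimals.

```typescript
// Parameter counts as quoted in this article, in billions.
interface MoeModel {
  name: string;
  totalParamsB: number;
  activeParamsB: number;
}

const family: MoeModel[] = [
  { name: "deepseek-v4-pro", totalParamsB: 1600, activeParamsB: 49 },
  { name: "deepseek-v4-flash", totalParamsB: 284, activeParamsB: 13 },
];

// Share of the weights that take part in one token's forward pass, in percent.
function activePercent(m: MoeModel): number {
  return (m.activeParamsB / m.totalParamsB) * 100;
}

const activeShares = family.map((m) => ({
  name: m.name,
  percent: Math.round(activePercent(m) * 100) / 100,
}));

for (const share of activeShares) {
  console.log(share.name + ": " + share.percent + "% of parameters active per token");
}
```

Pro activates about 3.06% of its weights per token and Flash about 4.58%. Pro stores far more parameters, but its per-token compute is only about 3.8 times Flash's (49B against 13B active).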
The split is smart. A single giant checkpoint creates a benchmark story. A Pro/Flash family creates a deployment story.
Pro is the model you reach for when failure is expensive: multi-step reasoning, large-codebase sessions, agentic coding, difficult synthesis, high-value automation, and long-context tasks where the answer depends on subtle connections across the prompt.
Flash is the model you test when the workflow is frequent, latency-sensitive, or cost-sensitive: extraction, classification, routine coding assistance, customer-facing assistants, and long-context workloads protected by validators, tests, schemas, or human review.
Architecturally, V4 keeps the DeepSeek lineage: DeepSeekMoE in the feed-forward layers and Multi-Token Prediction inherited from the V3 family. The new center of gravity is elsewhere: long-context attention, residual-stream routing, optimizer choice, quantization-aware deployment, and cache reuse.
The Real Breakthrough Is KV Cache Economics
Long context has always had an unglamorous problem: the bill arrives in compute and memory.
Every generated token has to interact with the prefix. Every request needs a KV cache. Every shared-prefix workload risks paying the prefill cost again and again. Once the prompt moves from thousands of tokens to hundreds of thousands, the system problem becomes more important than the marketing number.
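A rough sketch of why the memory bill grows so fast. This models a plain, uncompressed KV cache with grouped KV heads. The layer count, head count, head dimension, and precision below are hypothetical round numbers chosen for illustration. They are not V4's published dimensions.

```typescript
// Hypothetical attention shape for illustration only; NOT V4's actual configuration.
interface AttentionShape {
  layers: number;
  kvHeads: number;
  headDim: number;
  bytesPerValue: number; // 2 for BF16 or FP16
}

const exampleShape: AttentionShape = { layers: 60, kvHeads: 8, headDim: 128, bytesPerValue: 2 };

// Uncompressed cache: one key vector and one value vector per layer, per KV head, per token.
function kvCacheBytes(shape: AttentionShape, tokens: number): number {
  const perToken = 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue;
  return perToken * tokens;
}

function toGiB(bytes: number): number {
  return bytes / (1024 * 1024 * 1024);
}

const contextSizes = [8000, 128000, 1000000];
for (const tokens of contextSizes) {
  const gib = toGiB(kvCacheBytes(exampleShape, tokens));
  console.log(tokens + " tokens -> " + gib.toFixed(2) + " GiB of KV cache per request");
}
```

With these made-up dimensions, each token costs 245,760 bytes of cache, so a single one-million-token request needs roughly 229 GiB before any compression. The exact figure depends entirely on the assumed shape. The point is that the cost scales linearly with context length and is paid per concurrent request.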
DeepSeek's answer is hybrid attention. The report describes two interleaved mechanisms:
- Compressed Sparse Attention (CSA) compresses KV entries and selects relevant compressed blocks.
- Heavily Compressed Attention (HCA) compresses more aggressively where coarse access is acceptable.
- A sliding-window branch keeps recent context in higher resolution.
This is a practical compromise. A long-running coding agent does not need full-resolution attention over every token of a three-hour session. It needs sharp access to the recent turn, reliable access to old decisions, and enough structure to recover plans, file summaries, logs, and constraints without drowning the serving stack.
That is why KV cache compression matters more than the raw context number. A one-million-token window is only valuable if the model can afford to keep looking back.
CSA, HCA, and the Shape of Memory
CSA and HCA are not just "sparse attention with nicer names." They encode a theory of agent memory.
Recent context is fragile. The model needs exact local detail: the current function, the latest tool output, the last instruction, the active bug. Compress that too aggressively and the agent starts making small, expensive mistakes.
Older context is different. The model usually needs a recoverable representation: what was decided, what files were touched, what constraints were discovered, what failed, what plan remains. That history can be compressed if the model still has a way to route attention toward the right pieces.
V4's hybrid design is built around that distinction. It keeps local detail close, stores long history more cheaply, and uses sparse selection to make distant context reachable.
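The idea of "local detail close, long history compressed, sparse selection for reach" can be sketched with a toy. The code below is a generic stand-in, not the CSA/HCA algorithm from the report: it keeps a sliding window of recent keys at full resolution, mean-pools older keys into fixed-size blocks, and picks the top-k blocks by dot-product score. Mean pooling, dot-product scoring, and every name here are my assumptions.

```typescript
type Vec = number[];

function dot(a: Vec, b: Vec): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Mean-pool consecutive keys into one summary vector per block (toy compression).
function compressBlocks(keys: Vec[], blockSize: number): Vec[] {
  const summaries: Vec[] = [];
  for (let start = 0; start < keys.length; start += blockSize) {
    const block = keys.slice(start, start + blockSize);
    const mean: Vec = [];
    for (let d = 0; d < block[0].length; d++) {
      let sum = 0;
      for (let i = 0; i < block.length; i++) sum += block[i][d];
      mean.push(sum / block.length);
    }
    summaries.push(mean);
  }
  return summaries;
}

interface ContextSelection {
  recent: number[]; // token positions kept at full resolution
  blocks: number[]; // indices of compressed blocks selected for this query
}

function selectContext(query: Vec, keys: Vec[], windowSize: number, blockSize: number, topK: number): ContextSelection {
  const recentStart = Math.max(0, keys.length - windowSize);
  const recent: number[] = [];
  for (let i = recentStart; i < keys.length; i++) recent.push(i);

  // Everything older than the window is only reachable through block summaries.
  const summaries = compressBlocks(keys.slice(0, recentStart), blockSize);
  const blocks = summaries
    .map((summary, index) => ({ index, score: dot(query, summary) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((ranked) => ranked.index)
    .sort((a, b) => a - b);
  return { recent, blocks };
}

// Ten toy keys: four pointing along x, four along y, then two recent mixed keys.
const keys: Vec[] = [];
for (let i = 0; i < 4; i++) keys.push([1, 0]);
for (let i = 0; i < 4; i++) keys.push([0, 1]);
keys.push([0.5, 0.5], [0.5, 0.5]);

const selection = selectContext([0, 1], keys, 2, 4, 1);
console.log(JSON.stringify(selection));
```

In this toy run the query attends to two recent positions plus one compressed block, three entries instead of ten. The saving at ten tokens is trivial. The same shape at a million tokens is the whole argument.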
The report also discusses on-disk KV cache reuse. That is not a side detail. In production, many long-context requests share prefixes: documentation corpora, repository snapshots, policy manuals, logs, long conversations, agent workspaces. If compressed CSA/HCA cache entries can be stored and reused, the system can avoid paying the same prefill tax repeatedly.
That is the difference between "the model supports 1M tokens" and "the product can afford 1M tokens."
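Here is a minimal sketch of what shared-prefix reuse buys, counting prefill tokens only. It is an in-memory stand-in for the on-disk store the report discusses. A real system would persist compressed KV entries, whereas this class only remembers which prefixes it has seen. The class name, the prefix key, and the token counts are all hypothetical.

```typescript
// Toy prefix cache: remembers which prefixes have already been prefilled.
class PrefixCache {
  private seen: Record<string, boolean> = {};

  // Returns how many tokens must be prefilled for this request.
  prefillTokens(prefixKey: string, prefixTokens: number, suffixTokens: number): number {
    if (Object.prototype.hasOwnProperty.call(this.seen, prefixKey)) {
      return suffixTokens; // prefix already cached, only the new part is computed
    }
    this.seen[prefixKey] = true;
    return prefixTokens + suffixTokens; // first sight: pay for the whole prompt
  }
}

// Hypothetical workload: three questions against the same 400K-token repository snapshot.
const repoSnapshotTokens = 400000;
const questionTokens = [2000, 1500, 3000];

const cache = new PrefixCache();
let withCache = 0;
let withoutCache = 0;
for (const q of questionTokens) {
  withCache += cache.prefillTokens("repo@snapshot-1", repoSnapshotTokens, q);
  withoutCache += repoSnapshotTokens + q;
}

console.log("prefill tokens without reuse: " + withoutCache);
console.log("prefill tokens with reuse:    " + withCache);
```

In this example, reuse cuts prefill work from 1,206,500 tokens to 406,500. Whether your serving path actually realizes that saving is something to measure, not assume.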
mHC and Muon: The Quiet Parts of the Release
The attention story will get the most attention, but V4 is broader than attention.
First, DeepSeek introduces mHC, or Manifold-Constrained Hyper-Connections, as an upgrade to conventional residual connections. Deep models are not only about parameter count; they are about signal routing. The residual stream has to carry information through depth without collapsing into noise or losing useful transformations. mHC is DeepSeek's attempt to make that path more stable and expressive.
Second, DeepSeek uses the Muon optimizer for most parameters while keeping AdamW for selected modules such as embeddings, output heads, mHC modules, and RMSNorm weights. The report frames this as a convergence and stability decision at V4 scale.
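The optimizer split described above amounts to a routing rule over parameter names. A minimal sketch follows, assuming a naming scheme I made up for illustration. The substrings "embed", "lm_head", "mhc", and "norm" are hypothetical markers, not identifiers from DeepSeek's code.

```typescript
type OptimizerName = "muon" | "adamw";

// Hypothetical name fragments for the modules the report keeps on AdamW:
// embeddings, output heads, mHC modules, and RMSNorm weights.
const ADAMW_MARKERS = ["embed", "lm_head", "mhc", "norm"];

function optimizerFor(paramName: string): OptimizerName {
  for (const marker of ADAMW_MARKERS) {
    if (paramName.indexOf(marker) !== -1) return "adamw";
  }
  return "muon"; // everything else, such as attention and expert projections
}

// Hypothetical parameter names, only to exercise the rule.
const exampleParams = [
  "embed_tokens.weight",
  "layers.0.attn.q_proj.weight",
  "layers.0.mhc.mix.weight",
  "layers.0.input_norm.weight",
  "layers.0.moe.experts.3.up_proj.weight",
  "lm_head.weight",
];

for (const name of exampleParams) {
  console.log(name + " -> " + optimizerFor(name));
}
```

The sketch only shows the grouping. It says nothing about learning rates, Muon's update rule, or how DeepSeek actually partitions its parameter tree.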
Together, these choices make V4 feel less like a context-window patch and more like a full-stack redesign around long reasoning trajectories. Million-token context stresses everything: attention, residual flow, optimizer behavior, quantization, cache management, serving infrastructure, and post-training.
Training Scale and Reasoning Modes
DeepSeek reports more than 32T pre-training tokens for the series: Flash on 32T tokens and Pro on 33T tokens. Training sequence length is gradually extended through 16K, 64K, and 1M.
After pre-training, both models are post-trained into several reasoning modes.
This is not cosmetic. Reasoning mode changes the product behavior.
Non-Think is for fast direct answers. High is the normal deliberate mode. Max is the capability-seeking mode, where the model is allowed to spend more reasoning budget and the evaluation setup changes accordingly.
That means "V4" is not one thing in practice. V4-Flash Non-Think inside a high-volume product and V4-Pro Max inside a benchmark are different operating points on the same family curve. Any serious evaluation needs to name the model, mode, context size, output budget, latency, and cache conditions.
Otherwise the comparison is theatre.
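One way to enforce that discipline is to make every evaluation run carry its conditions and to refuse silent comparisons across them. The record below is a sketch. The field names and the lowercase mode labels are my own, and the example values are placeholders rather than measurements.

```typescript
type ReasoningMode = "non-think" | "high" | "max";
type CacheState = "cold" | "warm";

interface EvalRun {
  model: "deepseek-v4-pro" | "deepseek-v4-flash";
  mode: ReasoningMode;
  contextTokens: number;
  maxOutputTokens: number;
  cacheState: CacheState;
  latencyMs?: number; // measured result, filled in after the run
}

// Lists the conditions that differ between two runs, apart from the model itself.
function conditionDiffs(a: EvalRun, b: EvalRun): string[] {
  const diffs: string[] = [];
  if (a.mode !== b.mode) diffs.push("mode");
  if (a.contextTokens !== b.contextTokens) diffs.push("contextTokens");
  if (a.maxOutputTokens !== b.maxOutputTokens) diffs.push("maxOutputTokens");
  if (a.cacheState !== b.cacheState) diffs.push("cacheState");
  return diffs;
}

const productRun: EvalRun = {
  model: "deepseek-v4-flash",
  mode: "non-think",
  contextTokens: 200000,
  maxOutputTokens: 2000,
  cacheState: "warm",
};

const benchmarkRun: EvalRun = {
  model: "deepseek-v4-pro",
  mode: "max",
  contextTokens: 200000,
  maxOutputTokens: 32000,
  cacheState: "cold",
};

const mismatches = conditionDiffs(productRun, benchmarkRun);
console.log("conditions that differ: " + mismatches.join(", "));
```

If the list is not empty, the two scores are answering different questions, and the report should say so.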
Benchmarks: Good Signals, Not Gospel
DeepSeek's report presents V4-Pro-Max as a leading open model across several categories and competitive with closed frontier systems on selected reasoning, coding, long-context, and agentic evaluations.
The long-context results are the most relevant to the architecture claim. On MRCR 1M and CorpusQA 1M, Pro Max reports 83.5 and 62.0 respectively, ahead of Flash Max. That supports the expected hierarchy: Flash is strong for its active parameter budget, but Pro remains the safer choice when the task depends on subtle retrieval and synthesis across a very large prompt.
The grouped comparison above uses DeepSeek report Table 6 rather than the YouTube screenshot, because those values are traceable to the technical report. It compares DeepSeek-V4-Pro-Max with Claude Opus 4.6 Max, GPT-5.4 xHigh, and Gemini-3.1-Pro High on benchmarks where the report lists all four values.
For agentic work, the split is clearer. Terminal Bench 2.0 is reported at 67.9 for Pro Max and 56.9 for Flash Max. SWE Verified is tighter: 80.6 versus 79.0.
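The size of those two gaps is worth spelling out. This snippet uses only the four scores quoted above and rounds differences to one decimal to avoid floating-point noise.

```typescript
interface BenchmarkPair {
  name: string;
  proMax: number;
  flashMax: number;
}

// Scores as quoted in this article from DeepSeek's report.
const agenticScores: BenchmarkPair[] = [
  { name: "Terminal Bench 2.0", proMax: 67.9, flashMax: 56.9 },
  { name: "SWE Verified", proMax: 80.6, flashMax: 79.0 },
];

// Absolute gap in points, rounded to one decimal place.
function gapPoints(pair: BenchmarkPair): number {
  return Math.round((pair.proMax - pair.flashMax) * 10) / 10;
}

const gaps = agenticScores.map((pair) => ({ name: pair.name, points: gapPoints(pair) }));
for (const gap of gaps) {
  console.log(gap.name + ": Pro Max leads Flash Max by " + gap.points + " points");
}
```

An 11-point lead on terminal-style agent tasks against a 1.6-point lead on SWE Verified suggests the models diverge most where long, multi-step tool use dominates.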
My read: Flash is the impressive product model. Pro is the model I would trust first when the task is ambiguous, high-value, long-running, or hard to verify.
What Developers Should Actually Do
The API migration is intentionally small. DeepSeek says to keep the same base_url and update the model name:
    const model = "deepseek-v4-pro"; // or "deepseek-v4-flash"
The release note also says the API supports OpenAI Chat Completions and Anthropic APIs, which makes V4 easy to test inside existing agents, IDE tools, retrieval systems, evaluation harnesses, and backend services.
The urgent part is the retirement timeline. deepseek-chat and deepseek-reasoner currently route to V4-Flash non-thinking/thinking, but DeepSeek says they will become inaccessible after July 24, 2026, 15:59 UTC. If those names are hard-coded anywhere, migrate them now. Do not wait for the week of the shutdown.
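A small helper can make the migration auditable. The sketch below encodes the alias routing and the retirement time stated above. The function names and the `thinking` flag are my own bookkeeping, not API parameters. How Thinking mode is actually requested is something to confirm in DeepSeek's API documentation.

```typescript
// Retirement time from the release note: July 24, 2026, 15:59 UTC (month index 6 = July).
const RETIREMENT_UTC_MS = Date.UTC(2026, 6, 24, 15, 59, 0);
const MS_PER_DAY = 24 * 60 * 60 * 1000;

interface Migration {
  model: string;
  thinking: boolean; // local note on which mode the alias mapped to, not an API field
}

// Legacy aliases and the V4-Flash modes they currently route to.
const LEGACY_ROUTES: Record<string, Migration> = {
  "deepseek-chat": { model: "deepseek-v4-flash", thinking: false },
  "deepseek-reasoner": { model: "deepseek-v4-flash", thinking: true },
};

// Returns the replacement for a legacy alias, or null if the ID is not a legacy alias.
function migrateModelId(id: string): Migration | null {
  if (Object.prototype.hasOwnProperty.call(LEGACY_ROUTES, id)) {
    return LEGACY_ROUTES[id];
  }
  return null;
}

// Whole days remaining before the legacy aliases stop working (never negative).
function daysUntilRetirement(nowMs: number): number {
  return Math.max(0, Math.floor((RETIREMENT_UTC_MS - nowMs) / MS_PER_DAY));
}

const configuredModels = ["deepseek-chat", "deepseek-reasoner", "deepseek-v4-pro"];
for (const id of configuredModels) {
  const target = migrateModelId(id);
  if (target !== null) {
    console.log(id + " -> " + target.model + (target.thinking ? " (thinking)" : " (non-thinking)"));
  }
}
console.log("days left: " + daysUntilRetirement(Date.now()));
```

Running something like this over configuration files and environment variables is a cheap way to find hard-coded aliases before the deadline does it for you.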
For evaluation, I would test V4 with four tracks:
| Track | What to measure |
| --- | --- |
| Long-context retrieval | Does it find the right facts inside real messy documents, not just benchmark needles? |
| Agentic coding | Does it keep plans, file edits, tests, and tool traces coherent over long sessions? |
| Cost and latency | What happens at 50K, 200K, 500K, and 1M tokens with realistic output lengths? |
| Cache reuse | Do shared-prefix workloads actually get cheaper in your serving path? |
Do not evaluate a million-token model only on short prompts. Also do not assume a million-token window means you should paste everything. The win is optionality: the system can carry more context when the workflow genuinely benefits from it.
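For the cost track, a sweep like the following is enough to start. The prices are placeholders I invented to show the shape of the calculation. They are not DeepSeek's price list, so substitute the current published rates before drawing conclusions. The cold case assumes no cached prefix and the warm case assumes the whole prompt is a cache hit.

```typescript
// Placeholder prices in USD per million tokens; NOT DeepSeek's actual pricing.
interface Pricing {
  inputMissPerM: number;
  inputHitPerM: number;
  outputPerM: number;
}

const placeholderPricing: Pricing = { inputMissPerM: 1.0, inputHitPerM: 0.1, outputPerM: 2.0 };

function requestCostUsd(p: Pricing, promptTokens: number, cachedTokens: number, outputTokens: number): number {
  const missTokens = promptTokens - cachedTokens;
  const total = missTokens * p.inputMissPerM + cachedTokens * p.inputHitPerM + outputTokens * p.outputPerM;
  return total / 1000000;
}

// Context sizes from the evaluation table above, with a fixed 2K-token answer.
const sweepSizes = [50000, 200000, 500000, 1000000];
const outputTokens = 2000;

const sweep = sweepSizes.map((promptTokens) => ({
  promptTokens,
  coldUsd: requestCostUsd(placeholderPricing, promptTokens, 0, outputTokens),
  warmUsd: requestCostUsd(placeholderPricing, promptTokens, promptTokens, outputTokens),
}));

for (const row of sweep) {
  console.log(row.promptTokens + " tokens: cold $" + row.coldUsd.toFixed(3) + ", warm $" + row.warmUsd.toFixed(3));
}
```

Even with invented prices the structure is informative: at large prompts the input side dominates, so the cache-hit ratio matters more than the output budget.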
Limits and Caveats
This is still a preview release, and that should shape how teams adopt it.
- The strongest performance claims are DeepSeek claims, not independent replications.
- Many benchmark results come from DeepSeek's own evaluation setup.
- The current public scope is text-first; this is not a multimodal release.
- A 1M-token window does not remove the need for retrieval, summarization, memory design, or good agent state.
- Real cost depends on reasoning mode, output length, prefix reuse, cache behavior, and latency targets.
- Long-context benchmark success does not automatically mean reliable reasoning over chaotic repositories, logs, tickets, PDFs, and tool traces.
V4 makes million-token workflows more plausible. It does not make them solved.
That distinction matters. The teams that benefit most will not be the ones that paste the largest prompt. They will be the ones that design the cleanest long-running workspace: structured files, explicit plans, compact summaries, stable tool traces, and measured cache reuse.
Practical Assessment
DeepSeek V4 Preview is best understood as an efficiency release wearing the clothes of a frontier-model release. The model sizes are large, but the important argument is economic: make extreme context less exotic.
Use V4-Pro for the work where a bad answer is expensive: deep reasoning, multi-document synthesis, large codebase sessions, long-running agents, complex planning, and high-value automation.
Use V4-Flash where throughput matters and the workflow has guardrails: production assistants, structured extraction, classification, routine code help, customer support, and long-context pipelines protected by validators or human review.
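That guidance compresses into a very small routing rule. The sketch below is my reading of it, not an official recommendation. The `Workload` fields are hypothetical, and defaulting unguarded work to Pro is my own conservative assumption.

```typescript
type V4Model = "deepseek-v4-pro" | "deepseek-v4-flash";

interface Workload {
  badAnswerExpensive: boolean;
  hasGuardrails: boolean; // validators, tests, schemas, or human review
}

function pickModel(w: Workload): V4Model {
  if (w.badAnswerExpensive) return "deepseek-v4-pro";
  if (w.hasGuardrails) return "deepseek-v4-flash";
  // Assumption: cheap-to-fail but unguarded work still starts on Pro until guardrails exist.
  return "deepseek-v4-pro";
}

const extraction: Workload = { badAnswerExpensive: false, hasGuardrails: true };
const longRunningAgent: Workload = { badAnswerExpensive: true, hasGuardrails: true };

console.log("structured extraction -> " + pickModel(extraction));
console.log("long-running agent    -> " + pickModel(longRunningAgent));
```

A real router would also weigh latency targets and measured quality on your own tasks. This only captures the two criteria named above.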
The preview label is real. Independent replication still matters. But the direction is hard to ignore: 2024 and 2025 made context windows larger. V4 is a bet that 2026 is about making those windows economically usable.
That is the right problem to solve.
Sources: DeepSeek release note, DeepSeek V4 technical report, DeepSeek V4 Hugging Face collection, and the supporting YouTube discussion.