DeepSeek V4: A domestic trillion-parameter model, deeply adapted to Ascend chips

1、 Core breakthrough: trillion parameters + dynamic activation, optimizing both efficiency and cost

DeepSeek V4 adopts a Mixture of Experts (MoE) architecture with a total of 1 trillion parameters. Thanks to dynamic parameter activation, however, only about 37 billion parameters are activated per inference, an activation ratio of roughly 3.7% (under 4%). This "very large parameter pool + sparse activation" design breaks the industry assumption that "the larger the parameter count, the higher the cost": it keeps the inference cost of a trillion-parameter model on par with the previous-generation V3, so that scale increases while cost stays unchanged.
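The arithmetic behind the activation ratio, and the general idea of sparse activation, can be illustrated with a small sketch. The parameter counts come from the article; the top-k router, the 16 toy experts, and k=2 are hypothetical choices made only for illustration, and nothing here is claimed to match DeepSeek V4's actual routing.

```python
import math
import random

TOTAL_PARAMS = 1_000_000_000_000  # 1 trillion total parameters (from the article)
ACTIVE_PARAMS = 37_000_000_000    # ~37 billion activated per inference (from the article)


def activation_ratio(active, total):
    """Fraction of the parameter pool that is used for one inference."""
    return active / total


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(router_logits, k):
    """Generic top-k MoE gating (an assumption, not V4's documented method):
    keep the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


ratio = activation_ratio(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"activation ratio: {ratio:.1%}")

# Hypothetical toy setup: 16 experts, 2 active per token.
rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(16)]
gates = top_k_route(logits, k=2)
print(f"experts used: {sorted(gates)} out of {len(logits)}")
```

Only the selected experts would run their feed-forward computation, which is why compute cost follows the active parameter count rather than the total.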

The model comes in two versions, Pro and Flash. The Pro version targets high-performance scenarios, while the Flash version focuses on maximum cost-effectiveness. Both versions support an ultra-long context of 1 million tokens and can process entire technical documents, large contracts, or long video content in a single pass. Their long-text understanding and reasoning capabilities are at the leading level in China.

2、 Autonomous computing power: end-to-end adaptation and upgrade, eliminating external dependencies

As the first large model in China to be deeply adapted to Huawei Ascend chips, DeepSeek V4 has completed the full migration of its underlying code from the Nvidia CUDA ecosystem to Huawei's CANN architecture. Engineers rewrote core operators and refactored communication protocols and deployment toolchains, achieving full-stack native adaptation across frameworks, operator libraries, and inference engines, and closing the "last mile" for domestic computing power.

At present, V4 fully supports the Ascend 950, the A3 supernode, and other product lines. Kernel fusion and multi-stream parallelism significantly reduce the overhead of attention computation and memory access. In an 8K-input scenario, single-card decode throughput on the Ascend 950 can reach 4,700 tokens per second (TPS), a significant improvement in inference performance over the previous generation. This means that domestic large models no longer have to rely on high-end overseas chips and can be trained and deployed efficiently on Ascend computing power, setting an industry benchmark for "domestic models + domestic computing power".
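To put the quoted throughput in perspective, the following back-of-the-envelope sketch converts 4,700 TPS into decode time and daily capacity. It assumes the figure is sustained, aggregate single-card decode throughput; the helper names are hypothetical, and real deployments will vary with batch size, input length, and load.

```python
DECODE_TPS = 4700          # single-card decode throughput quoted in the article
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_decode(num_tokens, tps=DECODE_TPS):
    """Time to emit num_tokens at a constant aggregate throughput (assumption)."""
    return num_tokens / tps


def tokens_per_day(tps=DECODE_TPS):
    """Upper bound on tokens decoded in 24 hours at constant throughput."""
    return tps * SECONDS_PER_DAY


one_million = seconds_to_decode(1_000_000)
daily = tokens_per_day()
print(f"1M output tokens: about {one_million / 60:.1f} minutes on one card")
print(f"daily ceiling: {daily:,} tokens per card")
```

These are ceilings under idealized conditions, not measured results.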

3、 Industry significance: reshaping the AI industry ecosystem and accelerating independent, controllable development

The release of DeepSeek V4 is a key step in putting the domestic AI industry's "independent and controllable" strategy into practice. Against a backdrop of increasingly strict external restrictions on computing power, V4 has shown through innovation in the underlying technology that a trillion-parameter model can run efficiently on domestic chips without overseas computing power, providing a replicable technical path for domestic AI companies.

At the same time, V4 continues the open-source approach: the preview version was open-sourced at the same time as the release. Developers can deploy and fine-tune it directly on Ascend devices, lowering the barrier to using trillion-parameter models. Its high cost-effectiveness (API input pricing as low as 1 yuan per million tokens) will promote large-scale adoption of AI in fields such as finance, healthcare, and industrial manufacturing, accelerating the intelligent upgrading of industry.
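As a rough illustration of the quoted pricing, the sketch below estimates input-side API cost at 1 yuan per million tokens. The article gives this as a floor ("as low as") and does not state an output price, so the function name is hypothetical and the estimate covers input tokens only.

```python
INPUT_PRICE_YUAN_PER_MILLION = 1.0  # lowest input price quoted in the article


def input_cost_yuan(num_input_tokens, price_per_million=INPUT_PRICE_YUAN_PER_MILLION):
    """Estimated input cost in yuan; output tokens are not priced here."""
    return num_input_tokens / 1_000_000 * price_per_million


full_context = input_cost_yuan(1_000_000)  # one prompt filling the 1M-token context
quarter = input_cost_yuan(250_000)
print(f"full 1M-token prompt: {full_context:.2f} yuan")
print(f"250K-token prompt: {quarter:.2f} yuan")
```

At this rate, a single prompt that fills the entire 1 million token context would cost about 1 yuan on the input side.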