DeepSeek, a Chinese AI startup known for challenging major AI vendors with innovative open source technology, today released a new supersized model, DeepSeek-V3.
Available via Hugging Face under the company's license agreement, the new model comes with 671B parameters but uses a mixture-of-experts architecture to activate only a subset of them for any given task, allowing it to handle tasks accurately and efficiently. According to benchmarks shared by DeepSeek, the model already tops the charts, outperforming leading open-source models such as Meta's Llama 3.1-405B and closely matching the performance of closed models from Anthropic and OpenAI.
This release marks another major development bridging the gap between closed and open-source AI. DeepSeek, which started as an offshoot of the Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way toward artificial general intelligence (AGI), where models can understand or learn any intellectual task that a human can.
What does DeepSeek-V3 bring to the table?
Like its predecessor DeepSeek-V2, the new super-sized model uses the same basic architecture built around multi-head latent attention (MLA) and DeepSeekMoE. This approach keeps training and inference efficient by relying on dedicated as well as shared "experts" (individual, smaller neural networks within the larger model) and activating only 37B of the 671B parameters for each token.
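To make the sparse-activation idea concrete, here is a minimal Python sketch of a mixture-of-experts layer with shared and routed experts. The layer sizes, expert counts and top-k value are made up for illustration; this is not DeepSeek-V3's actual configuration or code.

```python
# Minimal sketch of mixture-of-experts routing, illustrating why only a
# fraction of the model's parameters is active for each token. All sizes
# here are illustrative, not DeepSeek-V3's real values.
import torch
import torch.nn as nn


class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_routed=8, n_shared=1, top_k=2):
        super().__init__()
        # "shared" experts run for every token; "routed" experts fire only when selected
        self.shared = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_shared))
        self.routed = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        out = sum(expert(x) for expert in self.shared)
        probs = self.router(x).softmax(dim=-1)          # routing probabilities per token
        weights, idx = probs.topk(self.top_k, dim=-1)   # keep only the top-k experts
        for t in range(x.size(0)):                      # token loop kept simple for clarity
            for w, i in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out


print(TinyMoELayer()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

Only the selected experts' weights participate in each token's computation, which is why a 671B-parameter model can run with a much smaller active parameter count per token.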
While the basic architecture ensures DeepSeek-V3’s robust performance, the company has also introduced two innovations that raise the bar even further.
The first is an auxiliary-loss-free load-balancing strategy. It dynamically monitors and adjusts the load on each expert so that all experts are utilized in a balanced way, without compromising overall model performance. The second is multi-token prediction (MTP), which allows the model to predict multiple future tokens simultaneously. This innovation not only improves training efficiency but also lets the model run three times faster, generating 60 tokens per second.
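The snippet below is a rough sketch of the general idea behind auxiliary-loss-free balancing: rather than adding a balancing term to the loss, a per-expert bias nudges the routing decision toward under-used experts. The update rule, step size and function names are illustrative assumptions, not DeepSeek's published algorithm.

```python
# Hedged sketch of bias-based, auxiliary-loss-free load balancing: the bias is
# used only when choosing which experts receive a token, and is nudged up for
# under-used experts and down for over-used ones. Constants are arbitrary.
import torch

n_experts, top_k, step = 8, 2, 0.01
bias = torch.zeros(n_experts)  # adjusted online during training


def route(scores: torch.Tensor):
    """scores: (tokens, n_experts) raw router affinities."""
    global bias
    _, idx = (scores + bias).topk(top_k, dim=-1)   # selection uses biased scores
    load = torch.zeros(n_experts)
    load.scatter_add_(0, idx.flatten(), torch.ones(idx.numel()))
    target = idx.numel() / n_experts               # perfectly balanced load per expert
    bias = bias + step * torch.sign(target - load) # boost under-used experts next time
    return idx


for _ in range(3):
    print(route(torch.randn(16, n_experts)).shape)  # torch.Size([16, 2])
```

Because the bias only affects expert selection, not the gating weights applied to expert outputs, balancing pressure does not distort the model's predictions the way an auxiliary loss term can.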
"During pre-training, we trained DeepSeek-V3 on 14.8T high-quality and diverse tokens… Next, we conducted a two-stage context length extension for DeepSeek-V3," the company writes in the technical paper detailing the new model. "In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. This is followed by supervised fine-tuning (SFT) and reinforcement learning (RL) to align the model with human preferences and further unlock its potential. The post-training stage also distills reasoning capabilities from a series of models while carefully maintaining the balance between model accuracy and generation length."
In particular, during the training phase, DeepSeek used multiple hardware and algorithm optimizations to reduce the cost of the process, including the FP8 mixed-precision training framework and the DualPipe algorithm for pipeline parallelism.
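To illustrate the general pattern behind low-precision training, here is a toy simulation of scaled quantization around a matrix multiply. It only mimics the quantize-with-scale, accumulate-in-higher-precision idea; it is not DeepSeek's FP8 framework, which relies on hardware FP8 kernels and finer-grained scaling.

```python
# Rough illustration of mixed-precision compute: scale operands into a
# low-precision range, round them coarsely, multiply, then undo the scales.
# The FP8_E4M3_MAX constant is the largest magnitude representable in the
# e4m3 format; everything else here is a simplification for demonstration.
import torch

FP8_E4M3_MAX = 448.0


def fake_low_precision(x: torch.Tensor):
    """Scale into the FP8 range, round coarsely, and remember the scale."""
    scale = FP8_E4M3_MAX / x.abs().max().clamp(min=1e-12)
    return torch.round(x * scale), scale


def low_precision_matmul(a, b):
    qa, sa = fake_low_precision(a)
    qb, sb = fake_low_precision(b)
    return (qa @ qb) / (sa * sb)  # accumulate in FP32, then rescale


a, b = torch.randn(4, 8), torch.randn(8, 3)
print((low_precision_matmul(a, b) - a @ b).abs().max())  # small rounding error
```

The payoff of doing this with real FP8 hardware is that the bulk of the matrix math runs in 8-bit formats, cutting memory traffic and compute cost while higher-precision accumulation keeps the error manageable.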
Overall, DeepSeek claims to have completed the entire DeepSeek-V3 training run in approximately 2.788 million H800 GPU hours, or roughly $5.57 million, assuming a rental price of $2 per GPU hour. That is far lower than the hundreds of millions of dollars typically spent on pre-training large language models.
For example, Llama 3.1 is estimated to have been trained with an investment of over $500 million.
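A quick back-of-the-envelope check of the claimed figure, using the article's assumed $2-per-GPU-hour rental price:

```python
# Sanity check: 2.788 million H800 GPU hours at an assumed $2 per GPU hour.
gpu_hours = 2_788_000
cost = gpu_hours * 2.0
print(f"${cost:,.0f}")  # $5,576,000, i.e. roughly $5.6 million
```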
The most powerful open source model available today
Despite its economical training, DeepSeek-V3 has emerged as the most powerful open source model on the market.
The company ran multiple benchmarks comparing the model's performance and noted that it consistently outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even beats the closed-source GPT-4o on most benchmarks, except the English-focused SimpleQA and FRAMES, where the OpenAI model leads with scores of 38.2 and 80.5 (versus 24.9 and 73.3), respectively.
Notably, DeepSeek-V3's performance stood out on the Chinese- and math-centric benchmarks, where it scored better than every other model. On the Math-500 test, it scored 90.2, with Qwen's 80 the next-highest score.
The only model that managed to challenge DeepSeek-V3 was Anthropic's Claude 3.5 Sonnet, which outperformed it with higher scores on MMLU-Pro, IF-Eval, GPQA-Diamond, SWE-bench Verified and Aider-Edit.
🚀 Introducing DeepSeek-V3!

Biggest leap forward yet:

⚡ 60 tokens/second (3x faster than V2!)
💪 Enhanced capabilities
🛠 API compatibility intact
🌍 Fully open-source models & papers

🐋 1/n pic.twitter.com/p1dV9gJ2Sd

— DeepSeek (@deepseek_ai) December 26, 2024
This work shows that open source is closing in on closed-source models, promising nearly equivalent performance across a variety of tasks. The development of such systems is good for the industry, as it potentially reduces the chance of a single large AI player dominating the game. It also gives enterprises multiple options to choose from when orchestrating their stacks.
The code for DeepSeek-V3 is available on GitHub under an MIT license, while the model itself is provided under the company's model license. Enterprises can also test the new model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use. DeepSeek is offering the API at the same price as DeepSeek-V2 until February 8. After that, it will charge $0.27 per million input tokens ($0.07 per million tokens on cache hits) and $1.10 per million output tokens.
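For developers, a minimal sketch of calling the model through the API is shown below. It assumes the OpenAI-compatible endpoint and "deepseek-chat" model name DeepSeek has used for DeepSeek-V2 carry over to V3, in line with the "API compatibility intact" note in the announcement; check DeepSeek's API documentation for the authoritative details.

```python
# Minimal sketch of calling DeepSeek's API via the OpenAI-compatible client.
# The base_url and model name are assumptions based on DeepSeek's existing
# V2 API; confirm them against the official docs before relying on this.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued from the DeepSeek platform
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize what DeepSeek-V3 changes."}],
)
print(response.choices[0].message.content)
```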