DeepSeek, a Chinese AI startup, has released DeepSeek-V3, a new ultra-large AI model with 671B parameters that outperforms leading open-source competitors while approaching the capabilities of prominent closed-source models.
Key innovations: DeepSeek-V3 employs a mixture-of-experts architecture that selectively activates only 37B of its 671B parameters for each task, enabling efficient processing while maintaining high performance.
- The model introduces an auxiliary loss-free load-balancing strategy that optimizes expert utilization without compromising performance
- A new multi-token prediction feature allows the model to generate 60 tokens per second, three times faster than previous versions
- The system uses multi-head latent attention (MLA) and DeepSeekMoE architectures for efficient training and inference
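The auxiliary-loss-free idea can be illustrated with a toy routing sketch. This is a simplification with hypothetical names, not DeepSeek's actual code: a per-expert bias shifts which experts win the top-k selection, while the gating weights come from the raw scores, so the balancing signal stays out of the gradient path.

```python
import numpy as np

def route_tokens(scores, expert_bias, k=2):
    """Toy top-k expert routing with bias-based load balancing.

    The per-expert bias shifts which experts win the top-k selection
    (an overloaded expert can be biased down), but the gating weights
    are computed from the raw scores, so the bias carries no gradient
    signal -- the core idea behind an auxiliary-loss-free strategy."""
    biased = scores + expert_bias                    # (tokens, experts)
    topk = np.argsort(-biased, axis=-1)[:, :k]       # chosen expert ids
    raw = np.take_along_axis(scores, topk, axis=-1)  # raw scores of winners
    gates = np.exp(raw) / np.exp(raw).sum(-1, keepdims=True)
    return topk, gates
```

In the real model the bias would be updated between steps based on observed expert load; here it is simply an input, which is enough to show the selection/gating split.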
Technical specifications: The model was trained on 14.8T high-quality tokens and supports a long context window.
- DeepSeek-V3’s context length was extended in two stages, first to 32K and then to 128K
- The training process included supervised fine-tuning and reinforcement learning to align with human preferences
- The company implemented various optimizations, including FP8 mixed precision training and the DualPipe algorithm
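The full FP8 recipe is beyond a summary, but the per-tensor scaling at the heart of any FP8 scheme can be sketched. In this toy version, integer rounding stands in for the actual FP8 cast; the constant is the e4m3 format's largest finite value.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_per_tensor(x):
    """Scale a tensor so its largest magnitude fits the FP8 range.

    Rounding here stands in for the real FP8 cast; the point is the
    scale factor, which is divided back out after the low-precision
    computation."""
    scale = FP8_E4M3_MAX / np.abs(x).max()
    return np.round(x * scale), scale

def dequantize(q, scale):
    return q / scale
```

The benefit of mixed precision is that matrix multiplies run on the small quantized values while a master copy of the weights stays in higher precision.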
Cost efficiency: DeepSeek achieved remarkable cost savings in the training process compared to industry standards.
- The entire training process required approximately 2.788 million H800 GPU hours, costing about $5.57 million
- This represents a significant reduction from typical training costs, such as the estimated $500 million spent on Llama-3.1
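The headline cost figure is simple arithmetic once a rental rate is assumed; the $2-per-GPU-hour rate below is the one implied by the reported totals, not an official price list.

```python
gpu_hours = 2_788_000           # ~2.788M H800 GPU hours, as reported
usd_per_gpu_hour = 2.0          # assumed rate implied by the reported total
total_usd = gpu_hours * usd_per_gpu_hour
print(f"${total_usd / 1e6:.3f}M")  # → $5.576M, i.e. the ~$5.57M figure
```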
Performance benchmarks: DeepSeek-V3 demonstrates superior performance across multiple evaluation metrics.
- The model outperforms open-source competitors like Llama-3.1-405B and Qwen 2.5-72B
- It shows particular strength in Chinese language and mathematical tasks, scoring 90.2 on the Math-500 test
- While matching or exceeding GPT-4o in most areas, it falls behind in specific English-focused tests like SimpleQA and FRAMES
Accessibility and pricing: The model is available through multiple channels with a competitive pricing structure.
- The code is accessible via GitHub under an MIT license
- Users can access the model through DeepSeek Chat or via API for commercial applications
- API pricing is set at $0.27 per million input tokens and $1.10 per million output tokens once promotional pricing ends after February 8
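At those rates, estimating a workload's cost is a one-line calculation; the function name and example volumes below are illustrative, not part of any official SDK.

```python
def api_cost_usd(input_tokens, output_tokens,
                 in_rate=0.27, out_rate=1.10):
    """Cost at the listed per-million-token rates."""
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# e.g. a workload with 1M input tokens and 200K output tokens:
cost = api_cost_usd(1_000_000, 200_000)  # 0.27 + 0.22 = 0.49 USD
```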
Market implications: The emergence of DeepSeek-V3 signals a significant shift in the competitive landscape between open-source and closed-source AI models, potentially democratizing access to advanced AI capabilities while challenging the dominance of established players in the field.