Why DeepSeek is cheap at scale but expensive to run locally
SMRTR summary
Serving a large language model such as DeepSeek-V3 involves a tradeoff between throughput and latency. AI providers typically batch tokens from many concurrent requests so that each inference step runs as one large matrix multiplication, the kind of work GPUs handle most efficiently. This raises overall throughput but adds latency, because an individual request may wait in a queue for a batch to form.
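A rough way to see why batching helps is a roofline-style estimate: each decoding step has to read the model weights from memory once, however many tokens are in the batch, while the arithmetic grows with batch size. The sketch below is illustrative only. The function names and all hardware and model figures (parameter count, FLOP rate, memory bandwidth) are made-up round numbers chosen for the example, not values from the article or measurements of any real system.

```python
# Illustrative roofline-style estimate. Every number below is a made-up round
# figure, not a measurement of DeepSeek-V3 or of any specific GPU.

def step_time_s(batch_size, active_params=1e10, flops_per_s=1e15,
                bytes_per_s=1e12, bytes_per_param=2):
    """Estimated time for one decoding step over a batch of tokens."""
    # Weights are read from memory once per step, regardless of batch size.
    memory_time = active_params * bytes_per_param / bytes_per_s
    # Roughly 2 FLOPs (multiply + add) per parameter per token.
    compute_time = 2 * active_params * batch_size / flops_per_s
    # Simplifying assumption: the slower of the two dominates the step.
    return max(memory_time, compute_time)


def tokens_per_s(batch_size, **hw):
    """Aggregate throughput: tokens produced per second across the batch."""
    return batch_size / step_time_s(batch_size, **hw)


if __name__ == "__main__":
    for b in (1, 64, 1024, 4096):
        print(f"batch={b:5d}  step={step_time_s(b) * 1e3:7.2f} ms  "
              f"throughput={tokens_per_s(b):10.0f} tok/s")
```

Under these assumed figures the step time stays flat while the batch is small, so every extra request in the batch is nearly free. Only once the arithmetic outweighs the weight read does the step time start to grow. A single local user sits at batch size 1 and pays the full weight read for every token.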
Mixture-of-experts models and models with many layers benefit most from large batch sizes. The same dependence on batching makes them inefficient for single-user local deployment: without concurrent requests there is little to batch, so the hardware runs at low utilization.
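For mixture-of-experts models, one way to picture the effect is to count how many tokens each expert gets to process per step. The sketch below assumes tokens are routed uniformly and independently, which real routers only approximate. The function names are hypothetical, and the expert count and top-k value are illustrative, not taken from the article.

```python
# Sketch under a uniform, independent routing assumption. Real routers are
# learned and unevenly loaded. All figures here are illustrative.

def tokens_per_expert(batch_size, num_experts, top_k):
    """Average number of tokens each expert processes in one step."""
    return batch_size * top_k / num_experts


def expected_active_experts(batch_size, num_experts, top_k):
    """Expected number of experts that receive at least one token."""
    # Probability that a given expert is skipped by every token in the batch.
    p_idle = (1 - top_k / num_experts) ** batch_size
    return num_experts * (1 - p_idle)


if __name__ == "__main__":
    num_experts, top_k = 256, 8  # illustrative values
    for b in (1, 32, 1024):
        print(f"batch={b:5d}  "
              f"tokens/expert={tokens_per_expert(b, num_experts, top_k):8.2f}  "
              f"active experts={expected_active_experts(b, num_experts, top_k):6.1f}")
```

With a single request, each activated expert handles just one token, so its weights are loaded for very little work. With a large batch, every expert receives many tokens and the cost of loading its weights is spread across all of them.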
This explains why some AI models appear slow to start but fast once running: they are served for high-throughput batch processing rather than low-latency individual responses. How strongly this tradeoff shows up depends on the model's architecture and on how the provider serves it.
SMRTR provides this summary for quick context. The original article belongs to Hacker News.