DeepSeek open days: the finance day

All last week, DeepSeek held "open days": each day they published a repository with code that is in some way used to build and serve their advanced LLMs. Each mini-release was quite technical, focused on specific engineering optimizations, but today's is more high-level: they shared how online inference for DeepSeek-V3/R1 works and how much they earn on the markup. Here are some interesting details:

• Inference is divided into two stages: Prefilling and Decoding. First the model processes the entire prompt in one parallel pass (tokenizing it, forming embeddings, and filling the KV cache), and then it generates output tokens one by one. Prefilling parallelizes easily across multiple GPUs, so there are no issues there.

However, decoding is more complex: delays arise from data exchange between the GPUs hosting different MoE experts. This is where the DualPipe system comes into play, overlapping those data transfers with computation. A toy sketch of the two stages follows.
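
To make the two stages concrete, here is a minimal sketch in Python. It is purely my own illustration, not DeepSeek's code: the "model" is a stand-in where attention is reduced to a cache lookup, so only the control flow of prefill vs. decode is on display.

```python
class ToyModel:
    def prefill(self, prompt_tokens):
        # Prefill: the whole prompt is processed in one parallel pass.
        # Every position is available up front, so this shards cleanly
        # across GPUs, hence "no issues there".
        kv_cache = list(prompt_tokens)  # stand-in for the per-layer K/V tensors
        return kv_cache

    def decode_step(self, kv_cache):
        # Decode: each new token depends on everything generated so far,
        # so steps are inherently sequential. This is the stage where
        # overlapping communication with computation pays off.
        next_token = sum(kv_cache) % 100  # dummy "logits to token" step
        kv_cache.append(next_token)
        return next_token, kv_cache

model = ToyModel()
cache = model.prefill([3, 1, 4, 1, 5])  # one parallel pass over the prompt
out = []
for _ in range(4):                      # four strictly sequential steps
    tok, cache = model.decode_step(cache)
    out.append(tok)
print(out)
```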

• The stages run on physically separate servers: the first on units of 4 nodes (a node is one server with 8 GPUs), the second on units of 18 nodes.

• Across all GPUs, there are 32 more experts (the sub-networks each model layer is split into) than strictly necessary. These are redundant copies, but they help when certain GPUs are overloaded (i.e., when one expert receives more work than the others → it becomes slower → the entire batch slows down). A toy illustration follows.
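
Here is a toy version of the redundant-experts idea, with routing logic I made up for illustration (DeepSeek's real placement algorithm is certainly more sophisticated): give the hottest experts an extra replica and split their traffic, so the slowest expert no longer stalls the whole batch.

```python
from collections import Counter

def load_with_redundancy(token_expert_ids, extra_replicas):
    """Replicate the most-loaded experts and report per-replica load."""
    load = Counter(token_expert_ids)                 # tokens routed to each expert
    replicas = {e: 1 for e in load}                  # every expert exists once
    for e, _ in load.most_common(extra_replicas):    # hottest experts first
        replicas[e] += 1                             # one redundant copy each
    return {e: load[e] / replicas[e] for e in load}  # traffic split across copies

routing = [0] * 900 + [1] * 500 + [2] * 100 + [3] * 100   # expert 0 is a hot spot
print(load_with_redundancy(routing, extra_replicas=2))
# {0: 450.0, 1: 250.0, 2: 100.0, 3: 100.0} -> per-replica load evens out
```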

Now about the economics:

• The day before yesterday DeepSeek had ~275 nodes running, or about 2,200 GPUs dedicated to inference. All of them run during peak hours (~16 hours per day), but when demand is lower the count drops to around 60 nodes. This explains why the company started offering 50–75% off-peak discounts this week.

• At a cost of $2 per GPU per hour (a normal market rate), the total daily expenses come to $87,072. A quick sanity check follows.
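
That figure checks out from the numbers in this post alone (the hourly rate and the daily cost; everything else is derived):

```python
GPU_HOUR_RATE = 2.00      # $ per GPU per hour, the assumed market rate
DAILY_COST = 87_072       # $ per day, as reported

avg_gpus = DAILY_COST / (GPU_HOUR_RATE * 24)
print(f"implied average GPUs online: {avg_gpus:,.0f}")        # ~1,814
print(f"implied average nodes (8 GPUs): {avg_gpus / 8:.2f}")  # ~226.75
```

So $87,072 implies an average of roughly 1,814 GPUs (~227 nodes) online over the day, which is in the right ballpark for ~275 nodes at peak for ~16 hours tapering toward ~60 nodes off-peak.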

• In one day, the system processed 608 billion input tokens, of which 56.3% (342 billion) hit the on-disk KV cache. It generated 168 billion output tokens.

• The average generation speed is 20–22 tokens per second, significantly lower than competitors'. This is because the system processes huge batches of requests at once, which buys (a) high GPU utilization and (b) low prices.

• However, the per-request response speed is slow, which hurts latency-sensitive use cases. A toy illustration of the tradeoff follows.
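
A purely illustrative curve of that tradeoff; the numbers below are invented, not measurements of DeepSeek's system. The assumption is that total throughput saturates as the batch grows and is then shared across all requests in the batch:

```python
def speeds(batch_size, peak_total_tps=20_000, half_point=20):
    # Total tokens/s climbs toward a hardware ceiling as batching amortizes
    # fixed costs; each request then gets a 1/batch_size share of it.
    total = peak_total_tps * batch_size / (batch_size + half_point)
    return total, total / batch_size

for b in (1, 16, 256, 1024):
    total, per_req = speeds(b)
    print(f"batch {b:>5}: total {total:>8.0f} tok/s, per-request {per_req:6.1f} tok/s")
# Total throughput (and cost efficiency per token) rises with batch size,
# while each individual request gets slower: exactly the tradeoff above.
```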

• Under the standard R1 pricing, input tokens cost $0.14 per million with cache hits or $0.55 without cache, while output tokens cost $2.19 per million.

• This results in theoretical daily revenue of $562,027 and a 545% margin over cost (reconstructed below).
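
The headline number reconstructs directly from the figures above (the small differences from the reported $562,027 and 545% come from the rounded token counts):

```python
M = 1e6
cached_input = 342e9            # input tokens served from the KV cache
uncached_input = 608e9 - 342e9  # input tokens that missed the cache
output = 168e9                  # generated tokens

revenue = (cached_input / M * 0.14      # $0.14 per 1M cached input tokens
           + uncached_input / M * 0.55  # $0.55 per 1M uncached input tokens
           + output / M * 2.19)         # $2.19 per 1M output tokens
cost = 87_072

print(f"revenue: ${revenue:,.0f}")               # ~$562,100
print(f"margin: {(revenue - cost) / cost:.0%}")  # ~546% (reported: 545%)
```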

The hype around this number:

• This margin figure is being widely circulated in media and on Twitter (as always, for the hype). Of course, it’s inaccurate, and DeepSeek themselves acknowledge this, but who actually reads that part?

• The figure is inflated because:

1. The chat model (v3) is significantly cheaper.

2. The model is free to use in the browser and mobile app.

3. The new night-time discounts are not accounted for.

• The real margin is much lower, but it’s hard to estimate without knowing the ratio of chat vs. non-chat and paid vs. free usage.

• Most likely, DeepSeek is:

(a) Happy with their financials.

(b) Profitable rather than operating at a loss.

• According to SemiAnalysis, OpenAI's gross margin on inference is 65–75%.

• For GPT-4.5 and o1, it’s probably much higher.

The bottom line:

Competition will continue to work in our favor.