DeepSeek V4 Pro has about 25GB worth of active parameters, so if you can fit the...

sylware · 2026-05-02T19:29:50 1777750190

Let's say I get 32GB of RAM, with a lean ELF (glibc)/Linux system, for which 7GB is already far more than it needs to run.

Let's book 8/16 cores/threads to run a prompt.

What are the timing figures I am looking at to run an "average" coding prompt?

zozbot234 · 2026-05-02T20:27:42 1777753662

The basic bottleneck with 32GB RAM would be your storage, since whatever part of the ~25GB of active parameters isn't already in RAM has to be read from disk for each token. So for a baseline estimate you'd be looking at anything from ~2 secs per token (if you had a really high-performance PCIe 5.0 SSD at ~14 GB/s max) to ~5 secs per token (for an average PCIe 4.0 SSD, ~7 GB/s max). This would then be improved by keeping the shared model layers in RAM, since these are part of the 25GB of active parameters. I'm not sure what fraction of the active params they make up for DeepSeek V4 Pro, but in a typical MoE it's about half, so you could approximately halve those secs-per-token figures. That's acceptable if you care about unattended inference for testing purposes or simple Q&A (leveraging the model's vast world knowledge); it doesn't look very good for interactive use. But the flip side is that you can batch a large number of model queries together, since the KV cache for very short prompts is quite negligible. AIUI, that's basically unique to this series of models and a huge selling point.
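To make the arithmetic behind those figures explicit, here is a minimal sketch. It assumes the only cost per token is streaming the non-resident part of the active parameters from the SSD at its peak rate; the function name `secs_per_token` and the `resident_fraction` parameter are made up for illustration, and the 0.5 value is just the "about half" guess from above, not a measured number for DeepSeek V4 Pro.

```python
def secs_per_token(active_gb: float, ssd_gbps: float, resident_fraction: float = 0.0) -> float:
    """Rough lower bound on seconds per token when storage is the bottleneck.

    active_gb:         size of the active parameters (GB)
    ssd_gbps:          SSD read bandwidth (GB/s), peak figure
    resident_fraction: share of the active parameters kept in RAM (assumption)
    """
    streamed_gb = active_gb * (1.0 - resident_fraction)
    return streamed_gb / ssd_gbps


ACTIVE_GB = 25.0  # "about 25GB worth of active parameters"

for label, bandwidth in [("PCIe 5.0 SSD, ~14 GB/s", 14.0), ("PCIe 4.0 SSD, ~7 GB/s", 7.0)]:
    cold = secs_per_token(ACTIVE_GB, bandwidth)
    warm = secs_per_token(ACTIVE_GB, bandwidth, resident_fraction=0.5)
    print(f"{label}: {cold:.1f} s/token streaming everything, "
          f"{warm:.1f} s/token with half kept in RAM")
```

At peak bandwidth this gives about 1.8 and 3.6 secs per token. The ~5 secs quoted for an average PCIe 4.0 drive would correspond to roughly 5 GB/s sustained rather than the 7 GB/s max, which is my reading and not something stated above.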

sylware · 2026-05-03T11:11:31 1777806691

Alright, I don't understand anything, but you said ~5 secs per token, so for prompts with hundreds to a thousand tokens we are on the order of tens of minutes to hours. I would be targeting coding prompts.

Well, it means one day I would have to get into the real thing: the real inference code, and actually run the inference of a small model.
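For reference, the "tens of minutes to hours" estimate can be checked with a few lines. This is only a sketch: it treats every token as costing the same ~5 secs, which is a simplification, and the helper name `total_time` is made up for illustration.

```python
def total_time(n_tokens: int, secs_per_token: float) -> str:
    """Format the total wall-clock time for n_tokens at a fixed per-token cost."""
    seconds = int(round(n_tokens * secs_per_token))
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


# Hundreds to a thousand tokens at ~5 secs per token
for n in (100, 500, 1000):
    print(f"{n:>5} tokens: {total_time(n, 5.0)}")
```

So at that rate 100 tokens take a bit over 8 minutes and 1000 tokens a bit under an hour and a half.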