DeepSeek V4 Pro has about 25GB worth of active parameters, so if you can fit the whole ~870GB of weights plus cache in RAM, your tok/s is bounded above by your system memory bandwidth in GB/s divided by 25GB. If you can't fit the whole model in RAM, you'll be bottlenecked to some degree by storage bandwidth, which is in the single or low double digits of GB/s.
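To make that bound concrete, here is a rough back-of-the-envelope sketch. The 25GB figure is the one quoted above; the 100 GB/s memory bandwidth in the example is purely an assumed number for illustration, not a measurement of any particular machine, and the function name is made up for this sketch.

```python
def max_tokens_per_second(mem_bandwidth_gb_s: float, active_gb: float = 25.0) -> float:
    """Upper bound on decode speed when every token must read all active weights from RAM.

    mem_bandwidth_gb_s: system memory bandwidth in GB/s (assumed, plug in your own).
    active_gb: size of the active parameters per token in GB (25GB per the comment above).
    """
    if mem_bandwidth_gb_s <= 0 or active_gb <= 0:
        raise ValueError("bandwidth and active size must be positive")
    return mem_bandwidth_gb_s / active_gb


if __name__ == "__main__":
    # Hypothetical machine with 100 GB/s of memory bandwidth.
    print(max_tokens_per_second(100.0))  # 4.0
```

This is only a ceiling: compute, cache misses and anything else competing for memory bandwidth will push the real figure lower.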

Mind you, it's an absolutely sensible setup either way if you are just testing a few queries and are willing to run them unattended/overnight. Especially since the KV-cache size is apparently really low (~10GB is said to be typical) so you get a lot of batching potential even in consumer setups, which amortizes the cost of fetching weights.



Let's say I get 32GB of RAM, with a lean ELF (glibc)/Linux system, for which 7GB is far more than it needs to run.

Let's book 8/16 cores/threads to run a prompt.

What are the timing figures I am looking at to run an "average" coding prompt?


The basic bottleneck with 32GB RAM would be your storage, so for a baseline estimate you'd be looking at anything from ~2 secs per token (if you had a really high-performance PCIe 5.0 SSD at ~14 GB/s max) to roughly 4-5 secs per token (for an average PCIe 4.0 SSD, ~7 GB/s at best), i.e. the 25GB of active parameters divided by the drive's read bandwidth. This would then be improved by keeping the shared model layers in RAM, since these are part of the 25GB of active parameters. I'm not sure what fraction of the active params they make up in DeepSeek V4 Pro, but in a typical MoE it's about half, so you could approximately halve those secs-per-token figures. That's acceptable if you care about unattended inference for testing purposes or simple Q&A (leveraging the model's vast world knowledge); it doesn't look very good for interactive use. But the flip side is that you can batch a large number of model queries together, since the KV cache for very short prompts is negligible. AIUI, that's basically unique to this series of models and a huge selling point.
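For anyone who wants to plug in their own drive, a minimal sketch of that estimate follows. It assumes the per-token cost is just the non-resident part of the 25GB of active parameters streamed at the SSD's peak read rate; the function name and the `resident_fraction` parameter are made up for illustration, and the "about half" value is the typical-MoE guess from above, not a confirmed figure for this model.

```python
def secs_per_token(ssd_gb_s: float, active_gb: float = 25.0,
                   resident_fraction: float = 0.0) -> float:
    """Estimated seconds per token when active weights are streamed from storage.

    ssd_gb_s: sequential read bandwidth of the drive in GB/s.
    active_gb: active parameters per token in GB (25GB per the thread).
    resident_fraction: share of the active params kept in RAM (0.0 to 1.0, assumed).
    """
    if ssd_gb_s <= 0:
        raise ValueError("ssd_gb_s must be positive")
    if not 0.0 <= resident_fraction <= 1.0:
        raise ValueError("resident_fraction must be between 0 and 1")
    streamed_gb = active_gb * (1.0 - resident_fraction)
    return streamed_gb / ssd_gb_s


if __name__ == "__main__":
    for name, bw in [("PCIe 5.0, ~14 GB/s", 14.0), ("PCIe 4.0, ~7 GB/s", 7.0)]:
        baseline = secs_per_token(bw)
        shared_in_ram = secs_per_token(bw, resident_fraction=0.5)
        print(f"{name}: {baseline:.2f} s/token, {shared_in_ram:.2f} s/token with half resident")
```

Peak SSD numbers are rarely sustained, so treat the output as a best case.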


Alright, I don't understand any of this, but you said ~5 secs per token, so for prompts of hundreds to a thousand tokens, we are on the order of tens of minutes to hours. I would be targeting coding prompts.
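Spelling out that arithmetic as a tiny sketch: it simply multiplies a token count by the ~5 secs-per-token figure, treating every token as costing the same, which is a simplification. The function name is made up for illustration.

```python
def wall_time_minutes(tokens: int, secs_per_token: float) -> float:
    """Total wall-clock time in minutes, assuming a flat cost per token."""
    if tokens < 0 or secs_per_token < 0:
        raise ValueError("tokens and secs_per_token must be non-negative")
    return tokens * secs_per_token / 60.0


if __name__ == "__main__":
    for n in (100, 500, 1000):
        print(f"{n} tokens at 5 s/token: {wall_time_minutes(n, 5.0):.1f} min")
```

At 5 secs per token that works out to roughly 8 minutes for 100 tokens and a bit under an hour and a half for 1000.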

Well, it means that one day I'll have to get into the real thing: read the real inference code and actually run inference on a small model.



