ColonelPhantom · 2026-06-06T20:45:48 1780778748

Deepseek V4 Flash still has 13B active params though? That is about half as many as Qwen3.6-27B (and much more than Qwen3.6-35B-A3B). Given that RAM (even on a base M4 or 'regular' Intel/AMD system) is like an order of magnitude faster than an SSD, even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading. And the MoE will be much faster still.

Qwen 27B is also small enough to completely fit in a high-end consumer or mid-end pro GPU, like an RTX 5090 or Radeon PRO R9700. I found results claiming 30 tokens per second generation for 27B(-Q4_K_XL) on an R9700. I doubt you get more than 5 tokens per second doing SSD MoE streaming.

Even for relatively short contexts, I honestly already find the ~30B class MoE models to be only borderline acceptable in terms of speed on my laptop (Ryzen 7 7840U, 64 GB LPDDR5-6400), though I use Gemma 4 26B-A4B more than Qwen3.6 35B-A3B.

zozbot234 · 2026-06-06T21:05:09 1780779909

> even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading.

If you have reasonable amounts of RAM to cache the most likely experts, that's not true at all. Qwen 27B is marginally faster on a nearly empty context, then falls behind as context length increases due to the different attention mechanisms. Prefill for Qwen is much faster, but you're still comparing vastly different model sizes and capabilities. DeepSeek Flash is the best deal overall.

> completely fit in a high-end consumer or mid-end pro GPU

Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.

ColonelPhantom · 2026-06-06T21:42:49 1780782169

> the most likely experts

Is that how MoEs work? I thought that an important constraint for MoEs is that experts need to be uniformly used to make sure they can be used effectively. If there is a 'common subset', that, if anything, sounds like a symptom of undertraining (i.e. the same trick will not work as well for Deepseek V4.1).

Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
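
To make that estimate concrete, here is a rough sketch of the arithmetic. It assumes the SSD is 10x slower than RAM (the "order of magnitude" figure from earlier in the thread), that decoding is purely bandwidth-bound, and that SSD reads are not overlapped with anything else. The function names are made up for illustration.

```python
def ssd_wait_fraction(hit_rate: float, ssd_slowdown: float) -> float:
    """Fraction of weight-loading time spent waiting on the SSD.

    hit_rate:     share of expert weights served from RAM (0..1)
    ssd_slowdown: how many times slower the SSD is than RAM (assumed, e.g. 10)
    Time is measured in units of "loading all active weights from RAM".
    """
    ram_time = hit_rate
    ssd_time = (1.0 - hit_rate) * ssd_slowdown
    return ssd_time / (ram_time + ssd_time)


def relative_speed(hit_rate: float, ssd_slowdown: float) -> float:
    """Throughput relative to serving every expert from RAM."""
    return 1.0 / (hit_rate + (1.0 - hit_rate) * ssd_slowdown)


if __name__ == "__main__":
    hit_rate, slowdown = 0.9, 10.0  # assumed values from the discussion
    print(f"SSD wait fraction: {ssd_wait_fraction(hit_rate, slowdown):.2f}")
    # 13B active params at this slowdown cost about as much as this many
    # params read entirely from RAM (same quantization assumed):
    print(f"RAM-equivalent active params: {13 / relative_speed(hit_rate, slowdown):.1f}B")
```

Under these assumptions a 90% hit rate leaves slightly more than half the time on SSD reads, and the 13B active parameters behave roughly like a ~25B dense model served from RAM. Any overlap of SSD reads with compute would shift the numbers.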

Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal; but I would like to try what you say with llama.cpp which uses mmap to also potentially do SSD streaming. (I can maybe try the large Qwen3.5 MoEs?)

> as context length increases

What kind of context length do you consider reasonable, though? From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens. So realistically, limiting context size might even improve quality, especially if you use token-efficient harnesses.

> Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.

Your point about consumer hardware was that it would be "borderline unusable" when running Qwen 3.6 27B. However, you need much less hardware to run a 27B than DSv4 Flash. In addition, you can do the same 'trick' with low-end GPUs and small MoEs: my desktop with 32 GB DDR4-3200 and an RTX 2070 8GB can run the ~30B class MoEs at 20-30 tokens per second, similar to the speeds I get on my laptop.

zozbot234 · 2026-06-06T22:01:23 1780783283

> Is that how MoEs work?

For any given workload/session? Empirically, yes, that's what has been found across different models. There's quite a bit of predictability that makes caching helpful.

> Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!

There are ways of masking some of that latency, though it requires some architecture-specific cleverness which is less directly applicable to a generic engine like llama.cpp.

> Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal

The llama.cpp folks are working on adding support, and the ds4 project is working on CUDA support for streaming inference, targeting the DGX Spark.

> From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens.

DeepSeek V4 seems to do quite well on recall tasks even with large context. That's one plausible benefit of its compressed attention mechanism, compared to earlier models. Some degradation will likely still be there, but it's not necessarily obvious.

As for why people are calling Qwen 27B "borderline unusable", that may have to do with it being a dense model, which makes for an increased compute intensity and pushes users towards discrete GPU platforms, since those tend to have the most compute overall as far as consumer hardware is concerned. I might agree that Qwen 27B is quite ideally tailored towards these platforms, but that does come with some limitations.

ColonelPhantom · 2026-05-10T08:32:08 1778401928

Carp is memory safe via linear types + references, similar to Rust, so I would not describe it as C-like but rather Rust-like.

ColonelPhantom · 2026-05-04T05:40:08 1777873208

But what _is_ a "Text User Interface"? Google Images just returns what is being discussed here: "GUIs" that run in some kind of text mode. And to me, that's also what a TUI is.

A more textually oriented environment (like a normal Unix shell) is, in my experience, usually referred to as a CLI: Command Line Interface.

I did find an interesting hybrid in the Pi coding agent: it seems to leverage the normal terminal scrollback, while still enhancing it with things like transient input fields and status lines, so that it can display those without cluttering scrollback.

bitwize · 2026-05-04T08:36:12 1777883772

That's pretty much the distinction. A CLI is stream-oriented, may accept user input from standard in and/or write to standard out with buffered I/O; a TUI is full-screen, interactive (responds immediately to keystrokes), and uses text characters to represent visual elements.

MBCook · 2026-05-04T19:52:00 1777924320

And that’s where my comment came in.

I see TUI as a more general term: a program based in a terminal.

In my mind it does not imply it has to be an ASCII GUI style thing and not just a stream oriented non-interactive program.

ColonelPhantom · 2026-04-21T19:34:09 1776800049

You mentioned Strix Halo, which also has off-die memory. Strix Halo does have a real advantage from its wider memory bus (four channels for 256 bit instead of 128 bit), but Strix Point is equivalent-ish to Intel's platforms like Panther Lake or Arrow Lake in terms of memory setup.

In fact, Intel also had Lunar Lake, which had on-package memory. However, it was still limited to 128-bit dual-channel, so there weren't really many performance benefits; it did however help with power efficiency.

ColonelPhantom · 2026-04-21T19:30:58 1776799858

Hilariously, those AMD chips are way behind the Intels in terms of memory.

First off, I believe that Intel has its memory far more "unified". AMD typically has a stricter VRAM/RAM 'tradeoff' setting that does not exist on Intel in the same way to my knowledge. (See how on Strix Halo systems, there is a thing about "allocating" 96 GB to the GPU, which seems to be needed sometimes but prevents the CPU from accessing that memory.)

Secondly, the Panther Lake board has LPDDR5X LPCAMM2 memory at 7467 MT/s, while the AMD boards are stuck with DDR5 SODIMMs at a meagre 5600 MT/s. In other words, the Intel board gets a third more memory bandwidth!
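
As a quick check of the "a third more" figure: theoretical peak bandwidth is the transfer rate times the bus width in bytes. The sketch below assumes both platforms use a 128-bit bus, which is an assumption for illustration and not something the board specs here confirm.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    transfer_rate_mts: memory speed in megatransfers per second
    bus_width_bits:    total memory bus width in bits (assumed 128 below)
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    intel = peak_bandwidth_gbs(7467, 128)  # LPDDR5X LPCAMM2
    amd = peak_bandwidth_gbs(5600, 128)    # DDR5 SODIMM
    print(f"Intel: {intel:.1f} GB/s, AMD: {amd:.1f} GB/s, ratio: {intel / amd:.2f}")
```

With equal bus widths the ratio only depends on the transfer rates, 7467 / 5600, which is about 1.33.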

blm126 · 2026-04-21T19:48:07 1776800887

I’ve got the Framework Desktop with Strix Halo. You can reserve memory for the GPU, but it’s straightforward, at least on Linux, to have the GPU dynamically grab memory as needed. I’ve got my VRAM set to 512MB and regularly use 120GB+ for AI stuff.

ColonelPhantom · 2026-04-08T02:34:23 1775615663

Nvidia Turing (RTX 20) definitely marked a major shift IMO.

- It was the first card to enable real-time ray-traced effects.
- Mesh shaders are a significant overhaul of the geometry pipeline that's only recently getting real traction.
- Its tensor cores enabled a new generation of AI-driven upscaling/antialiasing. DLSS 2, FSR 4 and XeSS are all some variation of "TAA + neural networks", and these all rely on specialized matrix hardware to get optimal performance.

Obviously all of these features are now supported across all vendors. Intel Arc Alchemist has all of these features as well, and AMD got RT and mesh shader support with RDNA2 along with slowly building up to tensor cores with RDNA3/4. But Turing clearly debuted these features, which have majorly changed the landscape of realtime 3D graphics.

ColonelPhantom · 2026-03-26T20:44:05 1774557845

838 seems to be the real INT8 TOPS number for the 5090; going from 800 to 3400 takes an x2 speedup for sparsity (so skipping ops) and another x2 speedup for FP4 over INT8.
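
Spelled out as a small sketch (the two x2 factors are the ones described above; the function name is just for illustration):

```python
def headline_tops(dense_int8_tops: float,
                  sparsity_factor: float = 2.0,
                  fp4_over_int8_factor: float = 2.0) -> float:
    """Marketing-style peak figure derived from a dense INT8 number.

    sparsity_factor:      speedup claimed for structured sparsity (skipped ops)
    fp4_over_int8_factor: speedup of FP4 relative to INT8
    """
    return dense_int8_tops * sparsity_factor * fp4_over_int8_factor


if __name__ == "__main__":
    print(headline_tops(838))  # dense INT8 TOPS for the 5090 as quoted above
```

838 x 2 x 2 gives 3352, which lines up with the roughly 3400 headline number.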

So it's closer to half the speed than a tenth. Intel also seems to be positioning this card against the RTX PRO 4000 Blackwell, not the 5090, and that one gets more like 300 INT8 TOPS. It also has less memory but at a slightly higher bandwidth. The 5090 is much faster and IIRC priced similarly to the PRO 4000, but is also decidedly a consumer product which, especially for Nvidia, comes with limitations (e.g. no server-friendly form factor cards available, and there are or used to be driver license restrictions that prevented using a consumer card in a data center setup).

jauntywundrkind · 2026-03-26T20:52:20 1774558340

Thank you for the correction. That seemed way too lopsided to be believed. This assessment balances the memory-to-TOPS ratio much, much more evenly, which is to be expected! I was low-key hoping someone would help me make sense of how wildly disparate the figures were, since I wasn't seeing it myself.

AMD R9700 is 378/766 tops int8 dense/sparse. 644GB/s of 32GB memory. ~$1400. To throw one more card into the mix. Intel undercutting that nicely here.

You're right that for companies, the pro grade matters. For us mere mortals, much less so. Features like SR-IOV, however, are just fantastic to see! Good job Intel. AMD has been trickling out such capabilities for a decade (cards fused for "MxGPU" capability) & it makes it such an easier buy to just offer it straight up across the models.

ColonelPhantom · 2026-03-25T19:01:05 1774465265

Aren't Intel Xeon Rapids and Intel Xeon Forest just different target markets? Rapids has fewer but faster cores in general, and more special-purpose accelerators (e.g. AMX, QAT), while Forest is focused on maximum compute density (just pack in as many fast-enough cores as you can).

IIRC Granite Rapids is also not _that_ old, and either current or a single generation behind. (Has its successor landed yet? IIRC GNR is the same generation as Sierra Forest).

ColonelPhantom · 2026-03-23T12:49:38 1774270178

Very cool! I am wondering one thing: how fast is it? Much of the "secret sauce" of the Voodoo is its high speed: a first-gen Verite or (God forbid) any ViRGE takes many more cycles for common operations like, say, Z-buffered pixels.

I'm guessing this isn't fully cycle-accurate, but is it at least somewhat "IPC-accurate"? I'm guessing yes? But much of that was also derived from Voodoo's (for the time) crazy high memory bandwidth AFAIK.

mmustapic · 2026-03-23T13:46:30 1774273590

The Voodoo was fast but also expensive, and you needed an additional VGA card. I think it was around USD 300 back then, that's more than USD 600 today and you'll still need another card.

rasz · 2026-03-24T11:08:35 1774350515

$299 release price, down to ~$199 in 1997 when Glide games started dropping. Consider that the ViRGE was also $300 and offered pathetic performance.

ColonelPhantom · 2026-03-23T00:14:40 1774224880

GPT-OSS is tailored to be extremely memory efficient. Not only is it natively using the 4.25 bit per parameter MXFP4 format, but it also uses sliding window attention for half of its layers. It also doesn't have that many layers, only 36 for the 120B version and 24 for the 20B version. (The 120B is also much, much sparser than the 20B.)

I found a Reddit comment claiming only 36 KiB per token. With that, half a million tokens fits in 18 GB, which is less than one GPU. And three GPUs fit the parameters with room to spare (64 out of 72 GB).
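
A rough sketch of that sizing, for anyone who wants to plug in other numbers. It takes the 36 KiB per token figure at face value, assumes all 120B parameters are stored at 4.25 bits each (a simplification), and assumes three 24 GB GPUs to match the 72 GB total.

```python
KIB = 1024
GB = 1e9  # decimal gigabytes


def kv_cache_gb(tokens: int, kib_per_token: float) -> float:
    """KV cache size in GB for a given context length."""
    return tokens * kib_per_token * KIB / GB


def weights_gb(params: float, bits_per_param: float) -> float:
    """Weight storage in GB at a given average bits per parameter."""
    return params * bits_per_param / 8 / GB


if __name__ == "__main__":
    cache = kv_cache_gb(500_000, 36)   # half a million tokens of context
    weights = weights_gb(120e9, 4.25)  # MXFP4, assumed for every parameter
    total_vram = 3 * 24                # assumed: three 24 GB GPUs
    print(f"KV cache: {cache:.1f} GB")
    print(f"Weights:  {weights:.1f} GB of {total_vram} GB")
```

This gives about 18.4 GB for the cache and about 64 GB for the weights, consistent with the figures above.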