
Yes... I could see that approach being desirable on a CPU; it would make all context switches cheaper. Are we re-inventing register windows?

Of course, we are on a GPU here. Skimming through the RDNA 4 ISA document [1], it looks like waves that use the dynamic VGPR feature are expected to keep track of how many VGPRs they have requested in a scalar register (SGPR). Each wave always has 106 SGPRs allocated (32 bits each), so the overhead of a single SGPR isn't an issue. You wouldn't even need the jump table, as AMD allows you to index into the VGPR register file. A small loop would be enough to push all active VGPRs onto the stack.

Shame they don't allow you to index into the SGPRs too.
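The save loop mentioned above could be modelled roughly like this. It is a toy Python sketch, not RDNA 4 code: plain lists stand in for the VGPR file and the stack, each "register" holds one value rather than one value per lane, and the names `save_active_vgprs` and `restore_active_vgprs` are made up for illustration.

```python
# Toy model only: Python lists stand in for the VGPR file and the stack.
# Function names are hypothetical, not RDNA 4 instructions.

def save_active_vgprs(vgpr_file, allocated_count, stack):
    """Push the first `allocated_count` VGPRs onto `stack` using indexed reads."""
    # `allocated_count` plays the role of the SGPR that tracks the allocation.
    for index in range(allocated_count):
        stack.append(vgpr_file[index])  # indexed access, so no jump table
    return stack


def restore_active_vgprs(vgpr_file, allocated_count, stack):
    """Pop the saved values back into the VGPR file in reverse order."""
    for index in reversed(range(allocated_count)):
        vgpr_file[index] = stack.pop()
    return vgpr_file


if __name__ == "__main__":
    registers = [10, 20, 30, 40]
    saved = save_active_vgprs(registers, 2, [])
    print(saved)
```

The point is only that the loop bound comes from a tracked count, so the save cost scales with what the wave actually allocated.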

But GPUs aren't really designed with coroutines or context switching in mind. Modern GPUs can do it, but they carry deep historical assumptions that every single wave on every single core will be running the exact same code, and that 99.99% of waves will run to completion without needing a context switch. When preemptive context switches are needed (no more than once every few ms, and only with long-running compute shaders), they can probably get away with simply DMAing the whole register file to memory.

Though, part of the reason this feature was implemented is that the historic assumptions are no longer true. DXR allows each material to register a unique hit shader, so modern GPUs need to dispatch dynamically based on the result of ray hits.

[1] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...



> DXR allows each material to register a unique hit shader, so modern GPUs need to dispatch dynamically based on the result of ray hits.

That's not how it works in practice, even with hardware-accelerated raytracers (like Intel Arc).

AMD systems push the hits/misses onto various buffers and pass them around.

Intel systems push the entire call stack and shuffle it around.

Let's say your 256-thread group has 30% "metallic hits", 15% "diffuse hits", and the remaining 55% are misses. You cannot "build up" a new full group from just one thread group (!!!!).

To run things efficiently, you'll need ~4 thread groups (aka 1024 rays) before you can run a full 256-thread group again for metallic hits, and ~2 thread groups (aka 512 rays) before you get a full 256-thread group again for misses. And finally, you'll need ~7 thread groups (aka 1792 rays) to pass through before you have the 256 diffuse hits needed to fill up a SIMD unit.
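The arithmetic behind those numbers can be checked with a few lines of Python. This is just a sketch of the back-of-the-envelope estimate, assuming the 30/55/15 split holds on average across groups; `groups_needed` is a made-up helper name.

```python
GROUP_SIZE = 256  # threads (rays) per group, as in the example above


def groups_needed(percent):
    """Whole thread groups of rays needed, on average, to collect
    GROUP_SIZE rays of a kind that makes up `percent` of all rays."""
    return -(-100 // percent)  # integer ceiling of 100 / percent


if __name__ == "__main__":
    for kind, percent in [("metallic", 30), ("miss", 55), ("diffuse", 15)]:
        groups = groups_needed(percent)
        print(kind, groups, groups * GROUP_SIZE)
    # metallic 4 1024
    # miss 2 512
    # diffuse 7 1792
```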

In all cases, you need to dynamically grow a buffer and "build up" enough parallelism before running the recursive hit (or miss) handlers. The devil is in the details.
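As a rough illustration of that "build up" step, here is a CPU-side Python sketch that bins rays by the shader they need and only releases a batch once a bin holds a full group. It is not any vendor's actual implementation; `bin_and_dispatch` and the per-kind bins are hypothetical names chosen for the example.

```python
from collections import defaultdict

GROUP_SIZE = 256  # rays needed before a handler runs at full occupancy


def bin_and_dispatch(ray_kinds, group_size=GROUP_SIZE):
    """Append each ray to the bin for its handler; whenever a bin reaches
    `group_size`, hand it off as one full batch.

    Returns (dispatched, leftovers): the list of (kind, ray_ids) batches
    and a dict of bins that never filled up.
    """
    bins = defaultdict(list)
    dispatched = []
    for ray_id, kind in enumerate(ray_kinds):
        bins[kind].append(ray_id)  # grow the buffer for this handler
        if len(bins[kind]) == group_size:
            dispatched.append((kind, bins.pop(kind)))  # consume a full batch
    return dispatched, dict(bins)


if __name__ == "__main__":
    kinds = ["miss", "miss", "miss", "hit", "hit", "miss"]
    batches, leftovers = bin_and_dispatch(kinds, group_size=2)
    print(batches)
    print(leftovers)
```

Rays left in partially filled bins either wait for more traffic or run at reduced occupancy, which is where the trade-offs start.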

Intel has very dedicated and specialized accelerators that move the stacks around (!!!!) so that all groups remain fully utilized. I believe AMD's implementation is "just" an append buffer followed by a consume buffer, simple enough really. Probably inside of shared memory, but who knows what the full implementation details are. (The AMD systems have documented ISAs, so we know what instructions are available. AMD's "raytracers" are BVH tree traversal accelerators, but they don't seem to have stack manipulation or stack movement like Intel's raytracing implementation.)



