I always felt that the neural engine was wasted silicon, they could add more gpu cores in that die space and redirect the neural processing api to the gpu as needed. But I'm no expert, so if anyone here has a different opinion I'd love to learn from it.
I'm not a ML guy, but when I needed to train a NN I thought that the my Mac's ANE would help. But actually, despite it being way easier to setup tensorflow + metal + M1 on Mac than to setup tensorflow + cuda + nvidia on Linux, the neural engine cores are not used. Not even for classification, which are their main purpose. I wouldn't say they are wasted silicon, but they are way less useful than what we expect
Eyeballing 3rd party annotated die shots [1], it’s about the size of two GPU cores, but achieves 15.8 tflops. Which is more than the reported 14.7 tflops of the 32-core GPU in the binned M4 Max.
Not really. That's 15.8 fp16 ops compared to 14.7 fp32 ops (that are actually useful outside AI). It would be interesting to see if you can configure the ANE to recover fp32 precision at lower throughput [1].
It seems intuitive that if they design hardware very specifically for these applications (beyond just fast matmuls on a GPU), they could squeeze out more performance.
I was trying to figure the same thing out a couple months ago, and didn't find much information.
It looked like even ANEMLL provides limited low level access to specifically direct processing toward the Apple Neural Engine, because Core ML still acts as the orchestrator. Instead, flags during conversion of a PyTorch or TensorFlow model can specify ANE-optimized operations, quantization, and parameters hinting at compute targets or optimization strategies. For example `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` during conversion would disfavor the GPU cores.
Anyway, I didn't actually experiment with this, but at the time I thought maybe there could be a strategy of creating a speculative execution framework, with a small ANE-compatible model to act as the draft model paired with a larger target model running on GPU cores. The idea being that the ANE's low latency and high efficiency could accelerate results.
However, I would be interested to hear the perspective of people who actually know something about the subject.
If you did that, you'd stumble into the Apple GPU's lack of tensor acceleration hardware. For an Nvidia-like experience you'd have to re-architecture the GPU to subsume the NPU's role, and if that was easy then everyone would have done it by now.