I always felt that the neural engine was wasted silicon, they could add more gpu...

lucasoshiro · on May 3, 2025

I'm not an ML guy, but when I needed to train a NN I thought that my Mac's ANE would help. But actually, despite it being way easier to set up tensorflow + metal + M1 on Mac than to set up tensorflow + cuda + nvidia on Linux, the neural engine cores are not used. Not even for classification, which is their main purpose. I wouldn't say they are wasted silicon, but they are way less useful than we'd expect.

mr_toad · on May 3, 2025

Does Apple care about third party use of the ANE? There are many iOS/iPadOS features that use it.

sroussey · on May 3, 2025

Not really. Apple software uses the neural engine all over the place, but third-party software rarely does. Maybe this will change [1].

There was a guy using it for live video transformations and it almost caused the phones to “melt”. [2]

[1] https://machinelearning.apple.com/research/neural-engine-tra...

[2] https://x.com/mattmireles/status/1916874296460456089

brigade · on May 3, 2025

Eyeballing 3rd party annotated die shots [1], the ANE is about the size of two GPU cores but achieves 15.8 tflops, which is more than the reported 14.7 tflops of the 32-core GPU in the binned M4 Max.

[1] https://vengineer.hatenablog.com/entry/2024/10/13/080000
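A back-of-the-envelope version of that area comparison, using only the figures quoted above. The "two GPU cores" area is an eyeballed estimate from a die shot, so the resulting ratio is rough at best; the variable names are purely illustrative.

```python
# Figures as quoted in the comment; the area figure is an eyeballed estimate.
ane_tflops = 15.8          # ANE throughput as quoted
ane_area_gpu_cores = 2     # ANE area, measured in GPU-core-sized units (rough)
gpu_tflops = 14.7          # throughput of the 32-core GPU as quoted
gpu_cores = 32

# Throughput per GPU-core-sized patch of silicon.
ane_per_area = ane_tflops / ane_area_gpu_cores
gpu_per_area = gpu_tflops / gpu_cores
ratio = ane_per_area / gpu_per_area

print(f"ANE: {ane_per_area:.2f} tflops per core-area")  # 7.90
print(f"GPU: {gpu_per_area:.2f} tflops per core-area")  # 0.46
print(f"ratio: {ratio:.1f}x")                           # 17.2x
```

The two throughput figures are not necessarily quoted at the same numeric precision, so this compares headline numbers only.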

Archit3ch · on May 3, 2025

Not really. That's 15.8 fp16 tflops compared to 14.7 fp32 tflops (which are actually useful outside AI). It would be interesting to see if you can configure the ANE to recover fp32 precision at lower throughput [1].

[1] https://arxiv.org/abs/2203.03341
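To make "recover precision at lower throughput" concrete, here is a minimal sketch of the general split-precision idea: represent each value as a half-precision "hi" part plus a half-precision residual "lo", then do three half-precision-input multiplies per product instead of one. This is an illustration in pure Python (half precision emulated with the `struct` module's `e` format), not a claim about what the linked paper or the ANE actually does; all function names are made up for the example.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def split_fp16(x: float) -> tuple:
    """Split x into a half-precision value plus a half-precision residual."""
    hi = to_fp16(x)
    lo = to_fp16(x - hi)
    return hi, lo

def dot_fp16(a, b) -> float:
    """Dot product with inputs rounded to fp16, accumulated at higher precision."""
    return sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))

def dot_split(a, b) -> float:
    """Same dot product, but each input carries an fp16 residual term.

    Costs three multiplies per element instead of one; the lo*lo term is
    dropped because it is far below the precision being recovered.
    """
    total = 0.0
    for x, y in zip(a, b):
        xh, xl = split_fp16(x)
        yh, yl = split_fp16(y)
        total += xh * yh + xh * yl + xl * yh
    return total

a = [1 / 3, 2 / 3, 1 / 7]
b = [1 / 3, 1 / 3, 1 / 7]
exact = sum(x * y for x, y in zip(a, b))
print("plain fp16 error:", abs(dot_fp16(a, b) - exact))
print("split fp16 error:", abs(dot_split(a, b) - exact))
```

The split version trades roughly 3x the multiply work for a much smaller error, which is the "lower throughput" side of the bargain.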

brigade · on May 3, 2025

Apple GPUs run fp16 at the same rate as fp32 except on phones, so it is comparable for ML. No one runs inference from fp32 weights.

But the point was about area efficiency.

xiphias2 · on May 3, 2025

I guess it's a hard choice as it's 5x more energy efficient than the GPU because it uses a systolic array.

For laptops, 2x GPU cores would make more sense; for phones/tablets, energy efficiency is everything.
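For readers unfamiliar with the term, a systolic array is a grid of multiply-accumulate cells where operands flow from neighbor to neighbor instead of being re-fetched from memory, which is where the energy savings are usually said to come from. The sketch below is a timing-level simulation of an output-stationary array in plain Python. It is only an illustration of the dataflow, and it assumes nothing about how the ANE is actually built.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    Cell (i, j) holds the accumulator for C[i][j]. Row i of A enters from
    the left delayed by i cycles, column j of B enters from the top delayed
    by j cycles, so A[i][k] and B[k][j] meet at cell (i, j) on cycle i + j + k.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    total_cycles = n + m + p - 2  # last operands meet on cycle n + m + p - 3
    for t in range(total_cycles):
        for i in range(n):
            for j in range(p):
                k = t - i - j  # which operand pair reaches cell (i, j) now
                if 0 <= k < m:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

Each operand is read from memory once and then reused as it moves across the grid, in contrast to a design where every multiply fetches its inputs from a register file or cache.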

1W6MIC49CYX9GAP · on May 3, 2025

You're completely right: if you already have a GPU in a system, adding tensor cores to it gives you better performance per area.

GPU + dedicated AI HW is virtually always the wrong approach compared to GPU + tensor cores.

ks2048 · on May 3, 2025

At least one link/benchmark I saw said the ANE can be 7x faster than GPU (Metal / MPS):

https://discuss.pytorch.org/t/apple-neural-engine-ane-instea...

It seems intuitive that if they design hardware very specifically for these applications (beyond just fast matmuls on a GPU), they could squeeze out more performance.

astrange · on May 3, 2025

Performance doesn't matter. Nothing is ever about performance.

It's about performance/power ratios.

rz2k · on May 3, 2025

I was trying to figure the same thing out a couple months ago, and didn't find much information.

It looked like even ANEMLL provides limited low-level access to specifically direct processing toward the Apple Neural Engine, because Core ML still acts as the orchestrator. Instead, flags during conversion of a PyTorch or TensorFlow model can specify ANE-optimized operations, quantization, and parameters hinting at compute targets or optimization strategies. For example, setting `MLModelConfiguration.computeUnits = .cpuAndNeuralEngine` when loading the model would disfavor the GPU cores.
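On the Python side, a rough counterpart to that Swift setting is, as far as I know, the `compute_units` argument in coremltools. The sketch below assumes coremltools 6 or later on a recent macOS and a model file that already exists; `load_preferring_ane` and the path in the usage note are hypothetical names for illustration. Core ML treats the setting as a restriction on which units it may use, not a guarantee that any given layer runs on the ANE.

```python
def load_preferring_ane(model_path: str):
    """Load a Core ML model restricted to the CPU and Neural Engine.

    coremltools is a third-party package, imported lazily so this module
    can be imported on machines that do not have it installed.
    """
    import coremltools as ct

    # CPU_AND_NE excludes the GPU; Core ML still decides per layer
    # whether the CPU or the Neural Engine actually runs it.
    return ct.models.MLModel(
        model_path,
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )
```

Usage would look like `model = load_preferring_ane("MyModel.mlpackage")`, where the file name is a placeholder.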

Anyway, I didn't actually experiment with this, but at the time I thought maybe there could be a strategy of creating a speculative execution framework, with a small ANE-compatible model to act as the draft model paired with a larger target model running on GPU cores. The idea being that the ANE's low latency and high efficiency could accelerate results.

However, I would be interested to hear the perspective of people who actually know something about the subject.
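The draft/target pairing described above is usually called speculative decoding. A minimal greedy sketch with toy stand-in models follows: the draft (imagined here as the small ANE-resident model) proposes a few tokens, and the target (imagined as the larger GPU model) checks them and keeps the longest agreeing prefix. The `Model` type, the function names, and the toy models are all hypothetical; real systems verify the whole proposal in one batched target pass and often use a sampling-based acceptance rule instead of exact match.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its next token.
Model = Callable[[List[int]], int]

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding, one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(draft: Model, target: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches the target's greedy decode."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the agreeing prefix plus the target's correction.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every proposed token was accepted
    return out

# Toy models: the target counts upward mod 10; the draft is sometimes wrong.
target = lambda p: (p[-1] + 1) % 10
draft = lambda p: (p[-1] + 1) % 10 if p[-1] % 3 else 0

print(speculative_decode(draft, target, [1], n_new=8))
```

The speedup in a real system depends on how often the draft agrees with the target and on how cheap the draft is relative to one target pass.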

bigyabai · on May 3, 2025

If you did that, you'd stumble into the Apple GPU's lack of tensor acceleration hardware. For an Nvidia-like experience you'd have to re-architect the GPU to subsume the NPU's role, and if that were easy then everyone would have done it by now.

sroussey · on May 3, 2025

M1/M2 shared a GPU design, same with M3/M4. So maybe M5 will have a new design that includes tensor cores in the GPU.