This is exactly why single-model evaluation is dangerous.
Benchmarks are gamed, but disagreement between models is harder to fake.
Multi-model consensus catches what individual benchmarks miss.
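To make the disagreement idea concrete, here is a minimal sketch, assuming each model's answer has already been reduced to a short comparable string. The `consensus` helper, the `min_agreement` cutoff, and the model names are all hypothetical, picked only for illustration.

```python
from collections import Counter


def consensus(answers: dict[str, str], min_agreement: float = 0.6):
    """Return (majority_answer, agreement_ratio, flagged).

    `answers` maps a model name to its answer. `flagged` is True when the
    share of models backing the majority answer falls below `min_agreement`
    (an arbitrary illustrative cutoff, not a recommended value).
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize lightly so "Yes" and "yes " count as the same vote.
    counts = Counter(a.strip().lower() for a in answers.values())
    top, votes = counts.most_common(1)[0]
    ratio = votes / len(answers)
    return top, ratio, ratio < min_agreement


# Two of three hypothetical models agree, so nothing is flagged.
agree = consensus({"model_a": "Yes", "model_b": "yes", "model_c": "No"})

# Every model says something different, so the result is flagged for review.
split = consensus({"model_a": "x", "model_b": "y", "model_c": "z"})
```

Exact string matching is the crudest possible comparison; free-form outputs would need some kind of semantic comparison instead, which this sketch does not attempt.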
Thanks for this project.
Prioritizing MoE models and adding an intelligent NVMe cache could improve efficiency, especially on the M4 Max, where the available bandwidth makes this kind of usage more realistic.
Agreed. The practical implications are often
more interesting than the math anyway — smaller
models running locally means you can afford to
run multiple models in parallel for cross-validation,
which changes how you approach tasks like code
analysis or bug detection.
Nice work on the scheduler. Have you benchmarked
parallel inference across multiple models?
Running GPT, Claude and Gemini simultaneously
on the same input is where latency becomes
a real constraint.
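A rough way to see the latency point is to fan one prompt out to several models concurrently and time the whole batch. The sketch below uses `asyncio.gather` with a hypothetical `fake_model` coroutine that just sleeps; it is a stand-in for real API calls, and the names and delays are made up.

```python
import asyncio
import time


async def fake_model(name: str, delay_s: float, prompt: str) -> str:
    # Hypothetical stand-in for a real model call: the sleep simulates
    # network plus inference latency.
    await asyncio.sleep(delay_s)
    return f"{name}: reply to {prompt!r}"


async def fan_out(prompt: str, delays: dict[str, float]) -> tuple[dict[str, str], float]:
    """Send the same prompt to every model at once and time the batch."""
    start = time.perf_counter()
    replies = await asyncio.gather(
        *(fake_model(name, delay, prompt) for name, delay in delays.items())
    )
    elapsed = time.perf_counter() - start
    # gather() returns results in the order the coroutines were passed in.
    return dict(zip(delays, replies)), elapsed


def run_demo() -> tuple[dict[str, str], float]:
    delays = {"model_a": 0.05, "model_b": 0.10, "model_c": 0.15}
    return asyncio.run(fan_out("find the bug", delays))


if __name__ == "__main__":
    replies, elapsed = run_demo()
    for reply in replies.values():
        print(reply)
    print(f"wall time: {elapsed:.2f}s")
```

With concurrent calls the wall time tracks the slowest model rather than the sum of all of them, so the tail latency of one provider sets the floor for the whole comparison.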
As of now there is no threshold, but one is planned.
For example, if I search "cybertruck" in my indexed dashcam footage, which has no Cybertrucks in it, it returns a clip of the next-best match: a big truck, but not a Cybertruck.
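For illustration only, a score threshold on top of nearest-match search could look roughly like this. It assumes clips and queries are already embedded as plain vectors and compared by cosine similarity; the `best_match` helper, the `min_score` parameter, and the toy vectors are hypothetical and not taken from the project.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vec, clips, min_score=None):
    """Return (clip_name, score) for the closest clip, or None.

    With min_score=None this mirrors the current behavior described above:
    the next-best clip always comes back. With a threshold set, a weak best
    match is reported as "no result" instead.
    """
    if not clips:
        return None
    name, score = max(
        ((n, cosine(query_vec, v)) for n, v in clips.items()),
        key=lambda pair: pair[1],
    )
    if min_score is not None and score < min_score:
        return None
    return name, score


# Toy 2-D embeddings, invented for the example.
clips = {"big_truck": [0.8, 0.6], "sedan": [0.0, 1.0]}
cybertruck_query = [1.0, 0.0]

no_threshold = best_match(cybertruck_query, clips)          # big_truck, score about 0.8
with_threshold = best_match(cybertruck_query, clips, 0.9)   # None: best match is too weak
```

Picking the cutoff is the hard part, since a sensible value depends on the embedding model and would need tuning against real footage.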
One pattern I've noticed: the apps that work best
combine multiple models rather than relying on one.
Single-model outputs have too much variance for
production use cases.