dev_tools_lab's comments

This is exactly why single-model evaluation is dangerous. Benchmarks are gamed, but disagreement between models is harder to fake. Multi-model consensus catches what individual benchmarks miss.
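
To make the consensus idea concrete, here is a minimal sketch of a majority vote with an agreement score. The model names and answers are made up for illustration, and the "anything short of unanimous gets reviewed" rule is an assumption, not something from the thread.

```python
from collections import Counter


def consensus(answers: dict[str, str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of models that gave it.

    Assumes at least one answer; ties go to the answer seen first.
    """
    counts = Counter(answers.values())
    top_answer, votes = counts.most_common(1)[0]
    return top_answer, votes / len(answers)


# Hypothetical outputs from three models on a single benchmark item.
answers = {"model_a": "B", "model_b": "B", "model_c": "D"}
winner, agreement = consensus(answers)

# Illustrative rule: any disagreement marks the item for a closer look.
needs_review = agreement < 1.0
```

Items where the agreement score drops are the ones worth inspecting by hand, whatever the benchmark score says.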


Thanks for this project. Prioritizing MoE models and adding an intelligent NVMe cache could improve efficiency, especially on the M4 Max, where the memory bandwidth makes this kind of usage more realistic.


Agreed. The practical implications are often more interesting than the math anyway — smaller models running locally means you can afford to run multiple models in parallel for cross-validation, which changes how you approach tasks like code analysis or bug detection.


Nice work on the scheduler. Have you benchmarked parallel inference across multiple models? Running GPT, Claude and Gemini simultaneously on the same input is where latency becomes a real constraint.
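
For a rough sense of the latency point, here is a sketch that fans one prompt out to several models concurrently with `asyncio.gather`. `fake_model` is a stand-in I made up (it just sleeps); real API clients would replace it. The takeaway is that wall-clock time tracks the slowest model, not the sum.

```python
import asyncio
import time


async def fake_model(name: str, delay: float, prompt: str) -> str:
    # Hypothetical stand-in for a real inference call; the sleep simulates latency.
    await asyncio.sleep(delay)
    return f"{name}: reviewed {len(prompt)} chars"


async def run_all(prompt: str, delays: dict[str, float]) -> tuple[list[str], float]:
    """Send the same prompt to every model at once and time the whole batch."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_model(name, delay, prompt) for name, delay in delays.items())
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    outputs, elapsed = asyncio.run(
        run_all("def f(): pass", {"a": 0.1, "b": 0.2, "c": 0.3})
    )
    print(outputs, f"{elapsed:.2f}s")
```

This overlap only holds when the calls are I/O-bound, as with remote APIs. Models sharing one local GPU would contend for the same hardware and not parallelize this cleanly.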


GPT-OSS exists but Claude and Gemini aren't available locally, lol.


True, Claude and Gemini aren’t local yet — I mostly meant running all available local models in parallel.

Even with just open-source LLMs, you can see interesting differences in flagged issues when cross-validating outputs.
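
A small sketch of what I mean by comparing flagged issues: split them into ones every model agrees on and ones only some models raised. The model names and issue labels are invented for the example, and it assumes the issues have already been normalized to comparable strings.

```python
def compare_flags(flags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Split flagged issues into those every model raised and the rest."""
    all_sets = list(flags.values())
    if not all_sets:
        return {"agreed": set(), "disputed": set()}
    agreed = set.intersection(*all_sets)
    everything = set.union(*all_sets)
    return {"agreed": agreed, "disputed": everything - agreed}


# Hypothetical review output from three local models on the same file.
flags = {
    "model_a": {"sql-injection", "unused-variable"},
    "model_b": {"sql-injection"},
    "model_c": {"sql-injection", "off-by-one"},
}
report = compare_flags(flags)
```

The "disputed" bucket is usually the interesting one: either a real bug only one model caught, or a false positive.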


Good reminder to pin dependency versions and verify checksums. SHA256 verification should be standard for any tool that makes network calls.
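
For anyone who wants the checksum step spelled out, a minimal sketch using only the standard library. The function names are mine, and the expected digest is assumed to come from a source you trust (a signed release page, a lockfile), not from the same place as the download.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA-256 against a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Refuse to install or run the artifact when `verify` returns False rather than just logging a warning.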


Nice use of native video embedding. How do you handle cases where Gemini's response confidence is low? Do you have a fallback or threshold?


As of now there's no threshold, but that's planned.

For example, if I search "cybertruck" in my indexed dashcam footage, which has no Cybertrucks in it, it returns the next-best match: a clip of a big truck, but not a Cybertruck.


Makes sense for now. Thresholding becomes critical at scale though — good luck with the next iteration!
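
In case it helps, a sketch of the kind of threshold I had in mind. I don't know how the project actually scores matches, so this assumes cosine similarity over embedding vectors, and the 0.8 cutoff is an arbitrary placeholder that would need tuning on real footage.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query, clips, min_score=0.8):
    """Return (clip_id, score) for the closest clip, or None if nothing is close enough."""
    if not clips:
        return None
    clip_id, score = max(
        ((cid, cosine(query, vec)) for cid, vec in clips.items()),
        key=lambda pair: pair[1],
    )
    return (clip_id, score) if score >= min_score else None


# Toy 2-D vectors standing in for real embeddings.
clips = {"big_truck": [1.0, 0.0], "sedan": [0.0, 1.0]}
hit = best_match([1.0, 0.1], clips)
miss = best_match([1.0, 1.0], clips, min_score=0.9)
```

Returning None lets the UI say "no good match" instead of showing the next-best clip as if it were a hit.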


One pattern I've noticed: the apps that work best combine multiple models rather than relying on one. Single-model outputs have too much variance for production use cases.

