dev_tools_lab's comments

dev_tools_lab · 2026-04-14T10:26:15 1776162375

This is exactly why single-model evaluation is dangerous. Benchmarks are gamed, but disagreement between models is harder to fake. Multi-model consensus catches what individual benchmarks miss.

dev_tools_lab · 2026-03-26T11:21:27 1774524087

Thanks for this project. Prioritizing MoE models and adding an intelligent NVMe cache could improve efficiency, especially on the M4 Max where bandwidth makes usage more realistic.

dev_tools_lab · 2026-03-25T10:10:03 1774433403

Agreed. The practical implications are often more interesting than the math anyway — smaller models running locally means you can afford to run multiple models in parallel for cross-validation, which changes how you approach tasks like code analysis or bug detection.

dev_tools_lab · 2026-03-25T10:06:35 1774433195

Nice work on the scheduler. Have you benchmarked parallel inference across multiple models? Running GPT, Claude and Gemini simultaneously on the same input is where latency becomes a real constraint.

zozbot234 · 2026-03-25T11:41:57 1774438917

GPT-OSS exists but Claude and Gemini aren't available locally, lol.

dev_tools_lab · 2026-03-25T13:34:04 1774445644

True, Claude and Gemini aren’t local yet — I mostly meant running all available local models in parallel.

Even with just open-source LLMs, you can see interesting differences in flagged issues when cross-validating outputs.
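
A rough sketch of what I mean by cross-validating, assuming each model's output has already been reduced to a set of issue labels. The model names, the labels, and the `min_votes` parameter are all made up for illustration; this is not from any particular tool.

```python
from collections import Counter

def cross_validate(flags_by_model, min_votes=2):
    """Split flagged issues into consensus and disputed sets.

    flags_by_model: dict mapping a (hypothetical) model name to the
    iterable of issue labels that model flagged.
    """
    votes = Counter()
    for issues in flags_by_model.values():
        # set() so a model that repeats an issue only counts once
        votes.update(set(issues))
    consensus = {issue for issue, n in votes.items() if n >= min_votes}
    disputed = set(votes) - consensus
    return consensus, disputed

# Hypothetical outputs from three local models on the same input
flags = {
    "model_a": ["null-deref", "unused-var"],
    "model_b": ["null-deref", "sql-injection"],
    "model_c": ["null-deref", "unused-var"],
}
consensus, disputed = cross_validate(flags)
```

The disputed set is usually the interesting part: it is where the models disagree and a human should probably look.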

dev_tools_lab · 2026-03-24T16:14:39 1774368879

Good reminder to pin dependency versions and verify checksums. SHA256 verification should be standard for any tool that makes network calls.
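
For anyone who wants the checksum step spelled out, here is a minimal sketch using only Python's standard library. The function names and the idea of comparing against a hex digest published by the project are my assumptions, not how any specific tool does it.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_sha256(path, expected_hex):
    """True if the file's digest matches the expected hex string."""
    actual = sha256_of_file(path)
    # Normalize the expected value; compare_digest avoids timing leaks
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

Refuse to install or run the download when `verify_sha256` returns False, rather than just logging a warning.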

dev_tools_lab · 2026-03-24T16:11:48 1774368708

Nice use of native video embedding. How do you handle cases where Gemini's response confidence is low? Do you have a fallback or threshold?

sohamrj · 2026-03-24T16:37:37 1774370257

as of now, no threshold but that is planned in the future.

for example, if i search "cybertruck" in my indexed dashcam footage right now, there are no cybertrucks in it, so it returns the next best match, which is a clip of a big truck, not a cybertruck

dev_tools_lab · 2026-03-25T09:37:42 1774431462

Makes sense for now. Thresholding becomes critical at scale though — good luck with the next iteration!
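
The kind of threshold I had in mind could be as small as this sketch. It assumes the search returns (clip, similarity score) pairs where higher is better; the clip names, the scores, and the `min_score` cutoff are hypothetical and would need tuning on real footage.

```python
from typing import Optional, Sequence, Tuple

def best_match(
    results: Sequence[Tuple[str, float]],
    min_score: float,
) -> Optional[Tuple[str, float]]:
    """Return the top-scoring result, or None if it is below min_score."""
    if not results:
        return None
    top = max(results, key=lambda r: r[1])
    return top if top[1] >= min_score else None

# Hypothetical scores for a "cybertruck" query with no real match indexed
hits = [("clip_big_truck", 0.21), ("clip_sedan", 0.10)]
match = best_match(hits, min_score=0.30)  # None: report "no match" instead
```

Returning None lets the UI say "nothing found" instead of showing the big truck.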

dev_tools_lab · 2026-03-24T16:09:07 1774368547

One pattern I've noticed: the apps that work best combine multiple models rather than relying on one. Single-model outputs have too much variance for production use cases.