Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

How do you run it? vllm? llama.cpp?

Can you share some parameters you enable tool calling and agentic usage?

Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?

I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.

It concocts some misleading paths, but the code often compiles, and I consider that a victory.

You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.



I run it with Llama.cpp on my RTX 3090. Also using the same Unsloth model.

My config is similar to: https://github.com/noonghunna/club-3090/blob/master/docs/eng...

I need to try out some of the other set ups mentioned in this repo for increased TPS.


Both. I usse modified jinja template that optimized toolcall , tested on production , none of them works.

Both 27b and A3B done all my production works pbeautifuly (At Q8) i dont think any model are good for Q4.

Qwen 3.5 122b surpasses both of them tho.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: