I've just hit a milestone on my project: moving away from AWS (for budget reasons) to self-hosted, and the local models are so much faster than they used to be. Beyond LLMs, having embeddings and image, video, and audio generation available locally is crazy.
Running locally is the bar; it's hard to make these things a service which scales.
HTTP content negotiation was a good idea which decouples content from form, but only as far as format selection.
Generative models are able to transform content between media types, and it feels like the original intention can be completed -- the server generates the appropriate form at request time, rather than serving a pre-rendered one.
A concrete example: an Image2Depth model estimates the depth of a scene from a standard image, encodes that information in the response, and returns it to clients capable of rendering depth -- 3D displays, VR headsets, and so on. The content is the same; the form is specialised to the client capabilities.
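A rough sketch of how that could look on the server side, assuming a made-up media type `application/x-depth-map` and a stub `estimate_depth` standing in for the actual Image2Depth model (both are placeholders of mine, not real registrations or APIs). The server parses `Accept`, walks the client's preferences in order, and generates the first form it knows how to produce:

```python
from typing import Callable, Dict, List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, highest q first."""
    prefs: List[Tuple[str, float]] = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal-q entries keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def estimate_depth(image: bytes) -> bytes:
    """Stub for an Image2Depth model call (hypothetical)."""
    return b"DEPTH:" + image


# Media type -> function that generates that form from the source image.
GENERATORS: Dict[str, Callable[[bytes], bytes]] = {
    "image/png": lambda image: image,            # the plain image, as-is
    "application/x-depth-map": estimate_depth,   # hypothetical media type
}


def negotiate(accept: str, image: bytes) -> Optional[Tuple[str, bytes]]:
    """Return (media_type, body) for the best form, or None (i.e. a 406)."""
    for media_type, q in parse_accept(accept):
        if q <= 0:
            continue
        if media_type in ("*/*", "image/*"):
            media_type = "image/png"  # wildcard falls back to the plain image
        generate = GENERATORS.get(media_type)
        if generate is not None:
            return media_type, generate(image)
    return None
```

A headset would send something like `Accept: application/x-depth-map, image/png;q=0.8` and get the depth form; a regular browser sending `image/png` gets the plain image from the same URL.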
I've been thinking of LLM prompting as execution, which makes context building a compile step -- different build systems, different output quality. Chunking is the simplest compiler; artefact construction that infers schema and resolves contradictions is a much richer one. The article maps the current spectrum.
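For reference, the "simplest compiler" end of that spectrum is roughly this: a minimal sketch of fixed-size character chunking with overlap. The function name and default sizes are my own placeholders, not anything taken from the article:

```python
from typing import List


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

It knows nothing about schema or contradictions; it just slices, which is the point of the comparison.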