Self hosting LLMs on a remote VPS

EmbarrassedDrum@lemmy.dbzer0.com · 5 days ago

Self hosting LLMs on a remote VPS

just_another_person@lemmy.world · 5 days ago

Do you have lots of money? Cuz that’s going to cost lots of money. Just get a cheap GPU and run it locally.

EmbarrassedDrum@lemmy.dbzer0.com · 5 days ago

No, but I have a free instance on Oracle Cloud, and that’s where I’ll run it. If it’s too slow or no good I’ll stop using it, but there’s no harm in trying.

ddh@lemmy.sdf.org · 1 day ago

I’d be interested to see how it goes. I’ve deployed Ollama plus Open WebUI on a few hosts and small models like Llama3.2 run adequately (at least as fast as I can read) on even an old i5-8500T with no GPU. Oracle Cloud free tier might work OK.
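For anyone who wants to test a setup like this from a script rather than through Open WebUI, here is a minimal sketch of calling Ollama's HTTP API. It assumes Ollama is listening on its default local port (11434) and that the `llama3.2` model has already been pulled. The helper names `build_generate_request` and `ask` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama runs on the same host on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3.2", url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(prompt, model="llama3.2", url=OLLAMA_URL, timeout=300):
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, model, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Why is the sky blue?")` against a running instance returns the model's reply as a string. The generous timeout is deliberate, since CPU-only hosts can take a while on longer answers.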

EmbarrassedDrum@lemmy.dbzer0.com · 1 day ago

Then I’ll let you know when I deploy it. Haven’t done it yet; might do it today, maybe later.

hendrik@palaver.p3x.de · edit-2 5 days ago

That depends on the use case. Renting an RTX 4090 costs about $0.69 per hour, while buying the card costs around $1,600, plus the rest of the computer and the electricity bill. I’d say you need to use it like 4000h+ to break even. I’m not doing that much gaming and AI stuff, so I’m better off renting some cloud GPU by the hour. Of course you can optimize that: buy an AMD card, use smaller AI models and pay for less VRAM. But each of those options has its own break-even point that you need to pass.
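The break-even arithmetic can be sketched like this. The $1,600 card price and $0.69 hourly rate come from the comment above. The $800 for the rest of the computer, the 450 W power draw and the $0.15/kWh electricity price are purely illustrative assumptions, not quoted figures.

```python
def break_even_hours(fixed_cost, rental_per_hour, local_running_cost_per_hour=0.0):
    """Hours of use after which buying is cheaper than renting by the hour."""
    saving_per_hour = rental_per_hour - local_running_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("renting is never more expensive per hour; no break-even")
    return fixed_cost / saving_per_hour


# Card only, ignoring the rest of the computer and electricity.
card_only = break_even_hours(1600.00, 0.69)

# Assumed extras: $800 for the rest of the machine, 0.45 kW at $0.15/kWh.
electricity_per_hour = 0.45 * 0.15
with_extras = break_even_hours(1600.00 + 800.00, 0.69, electricity_per_hour)

print(f"card only:   {card_only:.0f} h")
print(f"with extras: {with_extras:.0f} h")
```

With the card alone the break-even is roughly 2,300 hours. Under the assumed extras it moves close to the 4000h ballpark, and it shifts a lot depending on what you plug in.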

just_another_person@lemmy.world · 5 days ago

Yes, but running an LLM isn’t an on-demand workload, it’s always on. You’re paying for a 24/7 GPU instance if going that route over CPU.

hendrik@palaver.p3x.de · edit-2 5 days ago

Well, there’s both. I’m with runpod and they bill me for each second I run that cloud instance. I can have it running 24/7, or 30 minutes on demand, or just 20 seconds if I want to generate a single reply or image. Behind the scenes, it’s Docker containers. And one of the services is an API that you can hook into. Upon request, it’ll start a container, do the compute, and then, at your option, either shut down immediately, meaning you’d have paid like 2ct for that single request, or keep listening for more requests until an arbitrary timeout is reached. Other services offer similar things. Some (ready-made) services instead charge a fixed price per ingested or generated token.
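To make the per-second billing and idle timeout concrete, here is a rough model of what gets billed. It is a sketch under simplifying assumptions: requests are handled one at a time, cold-start time is ignored, and the $0.69/h rate is borrowed from the RTX 4090 figure earlier in the thread rather than from any actual price list. The function names are hypothetical.

```python
def billed_seconds(arrivals, compute_s, idle_timeout_s):
    """Seconds billed when a container starts on demand and stops after idling.

    arrivals: request arrival times in seconds; each request needs compute_s
    seconds. The container stays up idle_timeout_s seconds after its last job.
    """
    total = 0.0
    start = None        # when the current container was started
    busy_until = 0.0    # when the current container finishes its queued work
    for t in sorted(arrivals):
        if start is not None and t <= busy_until + idle_timeout_s:
            # Container is still up: queue the request behind current work.
            busy_until = max(t, busy_until) + compute_s
        else:
            if start is not None:
                # Previous container timed out; bill its whole lifetime.
                total += busy_until + idle_timeout_s - start
            start = t
            busy_until = t + compute_s
    if start is not None:
        total += busy_until + idle_timeout_s - start
    return total


def cost(seconds, rate_per_hour=0.69):
    """Cost in dollars at an assumed hourly rate, billed per second."""
    return seconds * rate_per_hour / 3600


single_reply = cost(billed_seconds([0], compute_s=20, idle_timeout_s=0))
always_on_month = cost(30 * 24 * 3600)
print(f"one 20 s request: ${single_reply:.4f}")
print(f"30 days always on: ${always_on_month:.2f}")
```

The gap between those two numbers is the whole argument: with light, bursty use you pay cents, while an always-on instance at the same rate costs hundreds of dollars a month.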

just_another_person@lemmy.world · 5 days ago

Runpod is a container service. OP asked about remote server.

hendrik@palaver.p3x.de · edit-2 5 days ago

What’s the difference regarding this task? You can rent it 24/7 as a crude webserver. Or run a Linux desktop inside. Pretty much everything you could do with other kinds of servers. I don’t think the exact technology matters. It could be a VPS, virtualized with KVM, or a container. And for AI workloads, these containers have several advantages. Like you can spin them up within seconds. Scale them etc. I mean you’re right. This isn’t a bare-metal server that you’re renting. But I think it aligns well with OP’s requirements?!

just_another_person@lemmy.world · 5 days ago

Well I think the difference is what they asked about.

ddh@lemmy.sdf.org · 1 day ago

Running an LLM can certainly be an on-demand service. Apart from training, which I don’t think we are discussing, GPU compute is only used while responding to prompts.