Really? When I was trying to get it to run a little while ago, I kept running out of memory with my 3060 12GB running 20B models, but perhaps I had it configured wrong.
You can offload them into RAM. The response time gets way slower once this happens, but you can do it. I've run a 70b llama model on my 3060 12GB at 2 bit quantisation (I do have plenty of RAM, so no offloading from RAM to disk at least lmao). It took like 6-7 minutes to generate replies, but it did work.
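If you're going through llama.cpp (e.g. via llama-cpp-python), the split is just the n_gpu_layers knob. Rough sketch below; the model path and layer count are placeholders, so lower n_gpu_layers until the OOM errors stop:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b.Q2_K.gguf",  # hypothetical 2-bit GGUF quant
    n_gpu_layers=20,  # how many layers fit in 12GB VRAM; the rest stay in system RAM
    n_ctx=2048,       # context window
)

out = llm("Q: Why is generation so slow with CPU offload?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```

The slowdown comes from the offloaded layers running on the CPU every token, which is why replies take minutes instead of seconds.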