• Arkthos@pawb.social · 2 days ago

    You can offload them into RAM. Response time gets way slower once that happens, but it does work. I’ve run a 70B Llama model on my 3060 12 GB at 2-bit quantisation (I do have plenty of RAM, so at least no offloading from RAM to disk lmao). It took like 6-7 minutes to generate replies, but it worked.
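
    For anyone wanting to try it, here’s a rough sketch using the llama-cpp-python bindings (not my exact setup, just the general idea). The model path is a placeholder: you’d point it at a 2-bit GGUF quant (e.g. a Q2_K file) and raise n_gpu_layers until your 12 GB of VRAM is full, with the remaining layers running from RAM on the CPU:

    ```python
    # Sketch: partial GPU offload with llama-cpp-python.
    # Model path and layer count are placeholders, not exact values.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-70b.Q2_K.gguf",  # hypothetical 2-bit GGUF quant
        n_gpu_layers=20,  # as many layers as fit in VRAM; the rest stay in RAM
        n_ctx=2048,       # context window size
    )

    # Generation is slow with most layers on CPU, but it completes.
    out = llm("Write a short reply:", max_tokens=256)
    print(out["choices"][0]["text"])
    ```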