Joined 1 year ago
Cake day: June 12th, 2023

  • Pumpkin Escobar@lemmy.world to 196@lemmy.blahaj.zone · The Rule
    2 months ago

    There’s quantization, which compresses the model by storing each weight in a smaller data type (for example 8-bit or 4-bit values instead of 16-bit floats). That reduces memory requirements by half or even more.
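To put rough numbers on that, here's a minimal sketch. It only counts the weights (no activations or KV cache), and the symmetric int8 scheme is just one simple illustrative approach, not what any particular library does. The function names are made up for this example.

```python
from typing import List, Tuple


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Toy symmetric quantization: map floats onto integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Weight-only footprint of a 405B-parameter model at different precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(405e9, bits):.1f} GB")
# 16-bit: 810.0 GB
#  8-bit: 405.0 GB
#  4-bit: 202.5 GB

# Round trip on a few toy weights: values come back close, not identical.
q, scale = quantize_int8([0.5, -1.0, 0.25, 0.0])
print(q, dequantize(q, scale))
```

The round trip is lossy: each weight comes back within about half a quantization step of its original value, which is the accuracy you trade for the smaller footprint.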

    There’s also airllm, which loads one part of the model into memory, runs those calculations, unloads that part, loads the next part, and so on. It’s a nice option, but with all that loading and unloading the performance is never going to be great, especially on a huge model like Llama 405B.
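The load/run/unload loop can be sketched roughly like this. This is not airllm's actual API: `make_toy_loader` and `run_layer_by_layer` are hypothetical names, and each "layer" is a toy function standing in for weights that would really be read from disk.

```python
from typing import Callable, Iterable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_loader(factor: float, log: List[str]) -> Callable[[], Layer]:
    """Hypothetical stand-in for reading one layer's weights from disk."""
    def load() -> Layer:
        log.append(f"load x{factor}")  # record each (slow) load
        return lambda xs: [x * factor for x in xs]
    return load


def run_layer_by_layer(x: Vector, loaders: Iterable[Callable[[], Layer]]) -> Vector:
    """Keep only one layer alive at a time: load it, apply it, drop it."""
    for load in loaders:
        layer = load()   # bring this part of the model into memory
        x = layer(x)     # run its calculations
        del layer        # release it before loading the next part
    return x


log: List[str] = []
loaders = [make_toy_loader(f, log) for f in (2.0, 3.0, 0.5)]
result = run_layer_by_layer([1.0, 2.0], loaders)
print(result)    # [3.0, 6.0]
print(len(log))  # 3, one load per layer on every forward pass
```

The catch shows up in the log: every layer gets loaded again on every pass, so generating each new token means re-reading the whole model from storage.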

    Then there are some neat projects to distribute models across multiple computers like exo and petals. They’re more targeted at a p2p-style random collection of computers. I’ve run petals in a small cluster and it works reasonably well.
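One way to picture splitting a model across machines is to give each one a contiguous block of layers, sized by how much memory it has. The sketch below is only that idea in miniature, not the actual scheduling logic of exo or petals. `split_layers` is a hypothetical helper, and the 80-layer count and memory sizes are made-up illustrative numbers.

```python
from typing import List, Tuple


def split_layers(num_layers: int, capacities_gb: List[float]) -> List[Tuple[int, int]]:
    """Assign each machine a contiguous [start, end) range of layers,
    roughly proportional to its share of the total memory."""
    if not capacities_gb or any(c <= 0 for c in capacities_gb):
        raise ValueError("need at least one machine, each with positive capacity")
    total = sum(capacities_gb)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, capacity in enumerate(capacities_gb):
        cumulative += capacity
        if i == len(capacities_gb) - 1:
            end = num_layers  # last machine takes whatever is left
        else:
            end = round(num_layers * cumulative / total)
        ranges.append((start, end))
        start = end
    return ranges


# Three machines with 24 GB, 8 GB and 8 GB sharing a hypothetical 80-layer model.
print(split_layers(80, [24, 8, 8]))  # [(0, 48), (48, 64), (64, 80)]
```

During inference the activations would then hop from machine to machine in layer order, so network latency between the boxes matters as much as their memory.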