Run 70B Llama 3 Inference on a Single 4GB GPU

  • Published Sep 15, 2024
  • Code : github.com/roh...
    🐦 Connect with me on Twitter : / rohanpaul_ai
    AirLLM GitHub - github.com/lyo...
    Check out the MASSIVELY UPGRADED 2nd Edition of my Book (with 1300+ pages of Dense Python Knowledge) 🐍🔥
    Covering 350+ Python 🐍 Core concepts ( 1300+ pages ) 🚀
    🟠 Book Link - rohanpaul.gumr...
    -----------------
    Hi, I am a Machine Learning Engineer | Kaggle Master. Connect with me on 🐦 TWITTER: / rohanpaul_ai - for daily in-depth coverage of Large Language Model bits
    ----------------
    You can find me here:
    **********************************************
    🐦 TWITTER: / rohanpaul_ai
    👨🏻‍💼 LINKEDIN: / rohan-paul-ai
    👨‍🔧 Kaggle: www.kaggle.com...
    👨‍💻 GITHUB: github.com/roh...
    🧑‍🦰 Facebook : / rohan.paul.562
    📸 Instagram: / rohan_paul_2020
    **********************************************
    Other Playlist you might like 👇
    🟠 MachineLearning & DeepLearning Concepts & interview Question Playlist - bit.ly/380eYDj
    🟠 ComputerVision / DeepLearning Algorithms Implementation Playlist - bit.ly/36jEvpI
    🟠 DataScience | MachineLearning Projects Implementation Playlist - bit.ly/39MEigt
    🟠 Natural Language Processing Playlist : bit.ly/3P6r2CL
    ----------------------
    #LLM #Largelanguagemodels #Llama2 #LLMfinetuning #opensource #NLP #ArtificialIntelligence #datascience #textprocessing #deeplearning #deeplearningai #100daysofmlcode #neuralnetworks #datascience #generativeai #generativemodels #OpenAI #GPT #GPT3 #GPT4 #chatgpt #genai

COMMENTS • 54

  • @scottmiller2591
    @scottmiller2591 4 months ago +2

    Good writeup - it covers when this is applicable, plus the pros and cons. I would recommend using it on a machine with a lot of RAM, setting up a RAM disk, and using that for your cache - that would knock the latency down somewhat.
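    A minimal sketch of that RAM-disk idea, assuming a tmpfs mount already exists at /mnt/ramdisk and assuming AirLLM accepts a layer_shards_saving_path argument for where it writes its per-layer files (that parameter name is an assumption here - check the AirLLM repo; the model ID is just an example):
    ```python
    # Create the RAM disk first, e.g.: sudo mount -t tmpfs -o size=64G tmpfs /mnt/ramdisk
    from airllm import AutoModel

    model = AutoModel.from_pretrained(
        "meta-llama/Meta-Llama-3-70B-Instruct",
        # Assumed parameter: point the per-layer cache at the RAM disk instead of the SSD
        layer_shards_saving_path="/mnt/ramdisk/airllm_layers",
    )
    ```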

  • @javiergimenezmoya86
    @javiergimenezmoya86 4 months ago +14

    Is it possible to configure that library to use RAM instead of SSD? It would be useful if you have a computer with a lot of RAM (e.g. 64 GB), because all the layers would fit in memory with 4-bit quantization.

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +4

      I was thinking the same about offloading to RAM, as it has become so cheap. However, on a quick search I could not find that option in the library yet. I will need to investigate more. If you find it, please let me know as well.

    • @i6od
      @i6od 4 months ago +4

      ... isn't this question ironic? Doesn't an LLM naturally load into RAM / VRAM? The whole point of this project is to switch to an actual storage drive, so you can run the 70B from the drive instead of having issues with overloading VRAM / RAM.

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +1

      @@i6od Indeed, this project brings a completely new way to deal with LLMs beyond RAM/VRAM.

    • @brianlink391
      @brianlink391 4 months ago

      Really simple to do: just create a RAM drive with a simple application you can download, then put your model on the RAM drive and load it from there, and you're all set.

    • @poldiderbus3330
      @poldiderbus3330 4 months ago

      I would then just try to use a RAM disk.

  • @honestgoat
    @honestgoat 4 months ago +4

    Using this method, then, is it possible to run, say, a 350B model on a GPU with 20/24 GB of VRAM?
    Say, could Grok-1, which is 314B, run on a 3090/4090 using this method?
    I know it would be slow af, but it could work, right?

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +2

      Theoretically possible. The layered inference approach just does the sequential loading and unloading of model layers. Of course, the latency will accumulate and result in extremely slow inference.
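      A quick back-of-envelope check that a single Grok-1 layer would fit in 20-24 GB of VRAM (the layer count of 64 is an assumption for illustration only - check the model's actual config):
      ```python
      params_total = 314e9   # Grok-1 total parameter count
      n_layers = 64          # assumed layer count, for illustration only
      bytes_per_param = 0.5  # 4-bit quantization

      per_layer_gb = params_total / n_layers * bytes_per_param / 1e9
      print(f"~{per_layer_gb:.1f} GB per layer")  # roughly 2-3 GB, well under 20-24 GB of VRAM
      ```
      So per-layer VRAM is not the blocker; the repeated SSD reads per token are what make it so slow.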

  • @tshawtshi3040
    @tshawtshi3040 4 months ago

    I was thinking about this for a while. I'm glad someone did it. I think, if done properly, you can get performance similar to having all the weights in VRAM.

  • @perelmanych
    @perelmanych 4 months ago

    There are many comments about loading layers from RAM instead of SSD. Basically, it doesn't make sense. You will get better performance doing all the computations on the CPU. Why? Very simple: when you run an LLM on a CPU, the main bottleneck is not CPU speed but RAM bandwidth, and that is why it is much faster to run an LLM on a GPU - it has much higher memory bandwidth. With this lib you would have to copy each layer from RAM to VRAM every time and then compute the layer's output on the GPU. That doesn't make sense, since your CPU computes faster than it can pull data from RAM. So no magic here: if you want to run a very big model and it fits in RAM, just run it on the CPU.
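    A rough illustration of that bandwidth argument. The figures below are assumed ballpark peak bandwidths (real numbers vary by hardware), and a 70B model at 4-bit is roughly 35 GB of weights that must be streamed per generated token:
    ```python
    weights_gb = 35  # ~70B parameters at 4-bit

    # Assumed ballpark peak bandwidths in GB/s
    bandwidths = {
        "GPU VRAM (GDDR6/HBM)": 900,
        "dual-channel DDR4 RAM": 50,
        "PCIe 4.0 x16 (RAM -> VRAM copy)": 32,
        "NVMe SSD": 5,
    }

    for path, gbps in bandwidths.items():
        print(f"{path:32s} ~{weights_gb / gbps:6.2f} s per token just to move the weights")
    ```
    The RAM-to-VRAM copy over PCIe is no faster than simply reading the weights from RAM on the CPU, which is the commenter's point.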

  • @alig.morgado9394
    @alig.morgado9394 1 month ago

    Hi, would four 3060 12 GB GPUs easily run the 405B?

  • @jnchacon
    @jnchacon 4 months ago +4

    Why SSD? Why not RAM?
    If I have enough RAM to hold the entire LLM, can the layers be read from RAM (RAM to VRAM)?

    • @brianmi40
      @brianmi40 4 months ago +1

      Google "RAM disk" - still a thing in Win 11...

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +1

      Yes, offloading to RAM will always be better, given how cheap it is.
      But this library wanted a new way to deal with LLMs, bypassing RAM/VRAM as much as possible.

  • @lou.later269
    @lou.later269 4 months ago +2

    Damn, imagine the same optimization for an 8B model - the speeds would rival Groq.

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +1

      Yes, but the actual speed may not improve much, as you still have to do disk I/O. So you will always be bottlenecked by your SSD read speed.

    • @dinoscheidt
      @dinoscheidt 4 months ago +4

      Exactly. The smaller the model, the higher the proportional I/O overhead compared to compute… at 8B, paging memory in and out like this makes it far slower than it is right now. Compute time grows with parameter count, so very large models spend so much time on compute that I/O overheads like these can become negligible. But there are interesting developments like vLLM, which uses something like virtual memory management to make better use of small GPU memory, skipping the need for disk I/O speed since everything stays in memory on the graphics card.
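      For reference, a minimal vLLM usage sketch, assuming vLLM is installed and the model fits on the GPU (vLLM's PagedAttention pages the KV cache within GPU memory rather than streaming weights from disk); the model name and sampling settings are just examples:
      ```python
      from vllm import LLM, SamplingParams

      llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # example model
      params = SamplingParams(temperature=0.7, max_tokens=64)

      outputs = llm.generate(["Explain layered inference in one sentence."], params)
      print(outputs[0].outputs[0].text)
      ```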

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +1

      @@dinoscheidt Very well explained. Thanks.

    • @damien2198
      @damien2198 4 months ago

      To my understanding, Groq uses a similar trick, as their LPU has only 250 MB (yes, MB) of memory.

    • @lostpianist
      @lostpianist 4 months ago

      @@dinoscheidt Can't wait for vLLM with Llama 3 400B. For a few years I've been hoping for something like that; then really top-level AI can be run locally by anyone with a reasonable computer and an OK graphics card... It will be amazing for productivity, gaming, etc.

  • @gaborcsurke6937
    @gaborcsurke6937 4 months ago

    The question is: if we have more VRAM, like 16 or 24 GB, can that be used to mitigate the SSD bottleneck further? Maybe that way it can read not just one layer but multiple, and be even faster.

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +1

      Yes, I think it's possible, i.e. you can manage the number of layers allocated to the GPU, reducing the frequency of SSD reads.
      Here's the long answer.
      - In the current implementation of their code (check the github repo), the `AirLLMBaseModel` class in `airllm_base.py` loads and processes one layer at a time during the forward pass. However, you can modify the `forward` method to load and cache a certain number of layers based on the available GPU memory.
      For example, you can introduce a configuration parameter to specify the number of layers to cache in GPU memory. Then, in the `forward` method, you can load and store the layers in a cache until the specified number of layers is reached. When processing the next layer, you can check if it is already in the cache before loading it from SSD.
      Here's a simplified example of how you could modify the `forward` method to cache multiple layers:
      ```python
      def forward(self, *args, **kwargs):  # original signature elided
          ...
          max_cached_layers = 4  # maximum number of layers to keep in GPU memory
          if not hasattr(self, "cached_layers"):
              self.cached_layers = {}       # layer_name -> layer resident on the GPU
              self.cached_layer_order = []  # insertion order, used for eviction

          for i, (layer_name, layer) in enumerate(zip(self.layer_names, self.layers)):
              if layer_name in self.cached_layers:
                  # Layer is already cached on the GPU, use it directly
                  layer = self.cached_layers[layer_name]
              else:
                  # Load the layer from SSD, move it to the GPU, and add it to the cache
                  state_dict = self.load_layer_to_cpu(layer_name)
                  self.move_layer_to_device(state_dict)
                  self.cached_layers[layer_name] = layer
                  self.cached_layer_order.append(layer_name)
                  # Evict the oldest cached layer if the cache exceeds the maximum
                  if len(self.cached_layer_order) > max_cached_layers:
                      oldest_layer = self.cached_layer_order.pop(0)
                      del self.cached_layers[oldest_layer]
              # Process the layer
              ...
      ```
      In this example, `max_cached_layers` determines the maximum number of layers to cache in GPU memory, and `cached_layer_order` keeps track of the currently cached layers in load order. When processing a layer, the code first checks whether the layer is already cached. If not, it loads the layer from SSD, adds it to the cache, and evicts the oldest cached layer if the cache exceeds the maximum size.
      - By caching multiple layers in GPU memory, you can reduce the number of SSD reads required during inference.
      Additionally, you may need to handle the case where a single layer itself exceeds the available GPU memory. In such scenarios, you might need to explore other techniques like tensor parallelism or model sharding to distribute the layer across multiple GPUs or devices.

  • @krisKrag
    @krisKrag 4 months ago

    Isn't there a paper from Apple in 2023 doing this? The difference is that Apple targets efficiency in reading chunks specifically on its own hardware. YouTube censored my previous comment where I pasted the link and title of the paper :/

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago

      Yes, I think you are talking about "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory"
      twitter.com/rohanpaul_ai/status/1737425137451073573

  • @Linuslkm
    @Linuslkm 4 months ago

    Have you tried it on a RAM disk? If so, could you make another video comparing performance?

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +1

      No, I haven't tried that yet, but I will.

  • @meyou7041
    @meyou7041 3 months ago

    I'm completely new to this. Is this something I can use with oobabooga?

    • @johnsummerlin7630
      @johnsummerlin7630 3 months ago

      Seconding this question, as I'm interested in the answer too.

  • @nexusphreez
    @nexusphreez 4 months ago

    So my only question is: can this be integrated with Ollama?

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago

      I don't think Ollama supports this.

  • @RobertMcGovernTarasis
    @RobertMcGovernTarasis 4 months ago

    How much disk space would this all need?

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago

      You just need to be able to accommodate the entire model on your SSD.
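      A rough back-of-envelope for a 70B checkpoint; the extra space for AirLLM's per-layer split files is an assumption - check the repo for the exact behavior:
      ```python
      params = 70e9  # Llama 3 70B

      for label, bytes_per_param in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
          print(f"{label}: ~{params * bytes_per_param / 1e9:.0f} GB on disk")

      # Assumption: AirLLM also writes per-layer split files next to the original
      # download, so budget roughly that much space again.
      ```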

  • @ROKKor-hs8tg
    @ROKKor-hs8tg 22 days ago

    Asking: can it run on an iGPU?

    • @glmconseil
      @glmconseil 21 days ago

      If it can do compute, it will work. iGPUs are GPUs too. All it needs is the compute functions (not all GPUs have them).

  • @BrokenOpalVideos
    @BrokenOpalVideos 4 months ago

    How many tokens per second would you get, though?

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +1

      It depends on SSD read speed. It may vary, but on Mac hardware I was getting 1 token per 2 seconds.

    • @Gatrehs
      @Gatrehs 4 months ago

      @@RohanPaul-AI Is this a regular SSD or an NVMe?

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago

      @@Gatrehs It's NVMe.

  • @MuhammadAdnan-tq3fx
    @MuhammadAdnan-tq3fx 4 months ago +1

    Is it possible offline?

    • @RohanPaul-AI
      @RohanPaul-AI  4 months ago +2

      Yes, you can point it at a locally downloaded model's local path, like below:
      model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
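      A fuller offline sketch based on the usage pattern in the AirLLM README (argument names may differ between versions; the prompt and generation settings are just examples):
      ```python
      from airllm import AutoModel

      # Point at the locally downloaded snapshot so no network access is needed.
      model = AutoModel.from_pretrained(
          "/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/"
          "snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f"
      )

      input_tokens = model.tokenizer(
          ["What is the capital of the United States?"],
          return_tensors="pt",
          truncation=True,
          max_length=128,
      )
      output = model.generate(
          input_tokens["input_ids"].cuda(),
          max_new_tokens=20,
          use_cache=True,
          return_dict_in_generate=True,
      )
      print(model.tokenizer.decode(output.sequences[0]))
      ```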

    • @caseyhoward8261
      @caseyhoward8261 4 months ago

      @@RohanPaul-AI Thank you! ❤