LocalAI LLM Testing: Viewer Questions Using Mixed GPUs, and What Is Tensor Splitting? AI Lab Session

  • Published 31 Jul 2024
  • Attempting to answer good viewer questions with a bit of testing in the lab.
    We will take a look at using different GPUs in a mixed scenario, and go down the route of tensor splitting to get the best out of your mixed-GPU machines (a minimal code sketch of the idea follows the video details below).
    We will be using LocalAI with an Nvidia 4060 Ti 16GB and a Tesla M40 24GB.
    Grab your favorite after-work or weekend enjoyment tool and watch some GPU testing.
    Recorded and best viewed in 4K.
  • Science & Technology
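
For readers who want to try tensor splitting themselves, here is a minimal sketch using llama-cpp-python, the same llama.cpp backend that LocalAI wraps; the model path and the 90/10 ratio are placeholder examples, not the exact settings from the video.

    # tensor_split_demo.py - load one model across two GPUs with llama.cpp
    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

    llm = Llama(
        model_path="models/example.gguf",  # placeholder: any GGUF model file
        n_gpu_layers=-1,                   # offload every layer to the GPUs
        tensor_split=[0.9, 0.1],           # ~90% of weights on GPU 0, ~10% on GPU 1
        verbose=False,
    )

    out = llm("What is tensor splitting?", max_tokens=64)
    print(out["choices"][0]["text"])

LocalAI drives the same backend, so its model configuration exposes an equivalent split option.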

COMMENTS • 26

  • @six1free
    @six1free 23 days ago +1

    Hands down one of the best YouTube channels out there - and I'm not just saying that because you flashed my question :D I really do love how thoroughly you've taken to answering it.
    ... this being the pause point... I'm going to guess that CUDA will do it all for you ("as if" - I'm sure :D)
    I am so envious of your test rig... as it is, though, I'd need a data center for power... as for adding the other cards, I'll research tensors further and rewatch this video when applicable :D - downloaded and saved to my good tutorials (very long) playlist... enjoy the well-deserved follow-through.

    • @RoboTFAI
      @RoboTFAI  21 days ago +1

      Thanks for the idea!

  • @jackflash6377
    @jackflash6377 20 days ago +1

    Outstanding!
    Glad I found this channel.
    Thank you sir.

    • @RoboTFAI
      @RoboTFAI  18 days ago

      Thanks for watching!

  • @246rs246
    @246rs246 22 days ago +2

    I'm blown away by this comprehensive answer to my question. Thumbs up and I'm looking forward to more interesting videos.

    • @RoboTFAI
      @RoboTFAI  21 days ago +1

      Awesome, thank you!

  • @kevinclark1466
    @kevinclark1466 7 days ago +1

    Great video! Looking forward to trying this…

  • @SphereNZ
    @SphereNZ 18 days ago

    Great video, great info, really appreciate it, thanks.

  • @AkhilBehl
    @AkhilBehl 23 days ago +3

    This is absolutely awesome stuff.

  • @CoderJon
    @CoderJon 8 days ago

    Love your videos. I appreciate that you leave the interpretation of the results to us, but I would love a video talking about your interpretations of the data. For example: why your results for prompt tokens per second were higher with the 90/10 split. I can assume it's because there is some sort of parallel processing happening during interpretation of the prompt, but I am still new to the AI world, so I would love the education.

    • @RoboTFAI
      @RoboTFAI  8 days ago

      Much appreciated! I try to keep my mouth shut and let the data show the info. I'm definitely not an expert, just learning like everyone else. I never intended to create an actual channel - the first video was to prove out a conversation with friends with hard data, and the testing app is for other uses in my lab. It's just turning into a place where we can all share some data and learn from it, or at least burn some of my power bill together!
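
For anyone who wants to pull numbers like these on their own rig, below is a rough sketch that measures overall tokens per second against a LocalAI server through its OpenAI-compatible endpoint. The host, port, and model name are placeholders for your own deployment, and the wall-clock figure lumps prompt processing and generation together rather than separating them the way the charts in the video do.

    # measure_tps.py - crude tokens/sec measurement against a LocalAI server
    import time
    import requests

    URL = "http://localhost:8080/v1/chat/completions"  # LocalAI's default port
    body = {
        "model": "example-model",  # placeholder: a model name from your config
        "messages": [{"role": "user", "content": "Explain tensor splitting."}],
        "max_tokens": 256,
    }

    start = time.perf_counter()
    resp = requests.post(URL, json=body, timeout=300)
    elapsed = time.perf_counter() - start
    resp.raise_for_status()

    usage = resp.json()["usage"]  # token counts reported by the server
    total = usage["prompt_tokens"] + usage["completion_tokens"]
    print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.1f} tokens/sec")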

  • @andre-le-bone-aparte
    @andre-le-bone-aparte 21 days ago +1

    Question: @03:14 - NVTOP is showing 90+ degrees (86 on the M40) Fahrenheit on each of those cards... WITHOUT any active usage?
    - That seems excessive. I'm currently running a 4x3090 setup at 79 degrees or lower between queries.

    • @RoboTFAI
      @RoboTFAI  20 days ago +1

      The 4060s are stacked right next to each other on the bench node in this test (I don't recommend that - they could use space between them since they have side-facing fans, which is why I normally use a lot of PCIe extenders), and they don't run their fans unless there is a load. The M40 in this test has an active fan on all the time. Also, I live in a hot climate and it's been 85-100 degrees (75+ in the workshop, as it's not conditioned) 🔥

    • @andre-le-bone-aparte
      @andre-le-bone-aparte 20 days ago +2

      @@RoboTFAI 👍 - Just looking to learn ways to extend the life of these GPUs and increase performance for LLM usage when running 10 hours a day (remote work days, as a code assistant)
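
On the temperature thread above: if you would rather log card temperatures between queries than eyeball NVTOP, a small sketch using the NVML Python bindings (pip install nvidia-ml-py) is below. NVML reports in Celsius; the loop just prints one reading per card.

    # gpu_temps.py - print the current temperature of each Nvidia GPU
    import pynvml

    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            temp_c = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            print(f"GPU {i} ({name}): {temp_c} C")
    finally:
        pynvml.nvmlShutdown()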

  • @mbike314
    @mbike314 12 days ago

    Thank you for creating this valuable content. I am pleased to have discovered it. I'm interested in some of the 4060s you mentioned - I sent an email.
    Please keep going with this channel!
    Wonderful stuff!

    • @RoboTFAI
      @RoboTFAI  8 days ago

      Thanks a ton! I didn't see any email - reach out to robot@robotf.ai or ping me on Reddit, etc.

    • @mbike314
      @mbike314 2 days ago

      Thank you. I did send it to the wrong address. Just resent it to the correct address.

  • @tbranch227
    @tbranch227 11 days ago

    Can you run a larger model when you span cards? Or does your model need to fit on each card that you tensor split across? And what happens to performance if you can run larger models by aggregating card RAM?

    • @RoboTFAI
      @RoboTFAI  9 days ago

      You can absolutely span a larger model between cards! These tests are actually doing that. Performance depends on the cards you are splitting between - it will land somewhere between your lowest-end and highest-end cards (if they are different models). Running multiple cards doesn't necessarily increase performance; it's really for expanding your VRAM capacity.
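
As a back-of-the-napkin illustration of that answer, the sketch below divides a hypothetical model's weight footprint by split ratio. The 20 GB model size is a made-up example (not a measurement from the video), and real usage also needs headroom for the KV cache and buffers on each card.

    # vram_split.py - rough per-GPU weight footprint under a tensor split
    def per_gpu_footprint(model_gb, split):
        """Divide model weights by split ratio (ignores KV cache and overhead)."""
        total = sum(split)
        return [model_gb * s / total for s in split]

    # Example: a 20 GB quantized model on a 16 GB 4060 Ti plus a 24 GB M40.
    # Neither card fits it alone, but either split keeps both within budget.
    for ratio in ([0.5, 0.5], [0.4, 0.6]):
        parts = per_gpu_footprint(20.0, ratio)
        print(ratio, [f"{p:.1f} GB" for p in parts])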

  • @tsclly2377
    @tsclly2377 22 days ago

    I think loading is still an important factor. Do you use NVMe drives, like the large, high-write-endurance Optane 900P series, for fast loads? And FPGAs for pre-staging data (like video and pictures) reconstructed into a faster-to-use form?

    • @RoboTFAI
      @RoboTFAI  21 days ago

      I normally leave the unloaded-model test off as it doesn't allow as much resolution in the smaller charts. I use Gen 4 NVMe M.2 drives in each of these systems (rated up to 5000/4800 MB/s... yeah, right).

    • @Zeroduckies
      @Zeroduckies 15 days ago

      Or you can get 1TB of RAM and run a 500GB ramdisk ^^

    • @tsclly2377
      @tsclly2377 13 days ago

      @@Zeroduckies Using HP ML350p machines, you only get up to 768GB of DRAM, which has to be LRDIMM, and that RAM runs on three channels, which actually makes it slower than the 2-channel 256GB configuration because of the required interleaving and processing. It is all in the specification PDF from HP. Only when going to the Gen11 model do you actually get significantly faster RAM (PCIe 5.0 - HP skipped the 4.0 architecture in these machines) and larger capacity, at an astronomical increase in price. So when you get a 'loaded' 256GB-DRAM ML350p Gen8 in a trade for an older gamer machine with a GTX 1660 Ti and a pre-10th-gen i7 (about a $300 value), you have to look for a fast, economical memory solution, and that is where the Optane 900P card comes in (with its ~4,000 MB/s burst). You must also compare that to the rate the GPU can actually ingest, so it is a cheap way to move data in (and out) at rates comparable to DRAM, while only occupying a PCIe x4 link. Now this is all fine and dandy, but on dual-CPU chipsets the PCIe lanes go all over the place, and that is a major consideration: the right and left sides are controlled by different CPUs, and SLI or NVLink can be required for the OS to recognize the linked GPU cards, which is inherently required for proper function logging. The PCIe controllers on these machines are going to be slower than single-CPU, purpose-designed motherboards made by other companies such as the multi-x16 Supermicro or Gigabyte professional models, which have come out specifically for this type of application and use NVMe arrays for storage... and then you are back to the amount of writes that will be applied to the storage.
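
Back on the load-time question: a quick way to sanity-check what a given drive actually delivers when loading model files is to time a large sequential read. A minimal sketch follows; the path is a placeholder, and note that the OS page cache will inflate repeat runs over the same file, so use a fresh file (or drop caches) for honest numbers.

    # read_throughput.py - crude sequential-read benchmark for model files
    import os
    import time

    PATH = "models/example.gguf"  # placeholder: any multi-GB file on the drive
    CHUNK = 64 * 1024 * 1024      # read in 64 MiB chunks

    size = os.path.getsize(PATH)
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:
        while f.read(CHUNK):
            pass
    elapsed = time.perf_counter() - start
    print(f"{size / elapsed / 1e6:.0f} MB/s over {size / 1e9:.1f} GB")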