REALITY vs Apple’s Memory Claims | vs RTX4090m

  • Published Jan 29, 2024
  • I put Apple Silicon memory bandwidth claims to the test against the NVIDIA RTX 4090 powerhouse.
    Run Windows on a Mac: prf.hn/click/camref:1100libNI (affiliate)
    Use COUPON: ZISKIND10
    🛒 Gear Links 🛒
    * 🍏💥 New MacBook Air M1 Deal: amzn.to/3S59ID8
    * 💻🔄 Renewed MacBook Air M1 Deal: amzn.to/45K1Gmk
    * 🎧⚡ Great 40Gbps T4 enclosure: amzn.to/3JNwBGW
    * 🛠️🚀 My NVMe SSD: amzn.to/3YLEySo
    * 📦🎮 My gear: www.amazon.com/shop/alexziskind
    🎥 Related Videos 🎥
    * 💰 MacBook Machine Learning | M3 Max - • Cheap vs Expensive Mac...
    * 🤖 INSANE Machine Learning on Neural Engine - • INSANE Machine Learnin...
    * 👨‍💻 M1 DESTROYS a RTX card for ML - • When M1 DESTROYS a RTX...
    * 🌗 RAM torture test on Mac - • TRUTH about RAM vs SSD...
    * 👨‍💻 M1 Max VS RTX3070 - • M1 Max VS RTX3070 (Ten...
    * 🛠️ Developer productivity Playlist - • Developer Productivity
    - - - - - - - - -
    ❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
    Click here to subscribe: www.youtube.com/@azisk?sub_co...
    - - - - - - - - -
    📱LET'S CONNECT ON SOCIAL MEDIA
    ALEX ON TWITTER: / digitalix
    #m3max #m2max #machinelearning
  • Science & Technology

COMMENTS • 554

  • @AZisk
    @AZisk  2 months ago

    JOIN: youtube.com/@azisk/join

  • @Navierstokes256
    @Navierstokes256 4 months ago +317

    The RTX 4090 in the laptop is a TDP-limited RTX 4080.

    • @gliderman9302
      @gliderman9302 4 months ago +5

      That shouldn’t impact memory right?

    • @headmetwall
      @headmetwall 4 months ago

      @@gliderman9302 Somewhat, but not due to the TDP: the laptop version uses GDDR6 memory chips instead of GDDR6X (bandwidth limit of 576.0 GB/s vs 716.8 GB/s).

    • @matbeedotcom
      @matbeedotcom 4 months ago +67

      @@gliderman9302 what's limiting the memory is the terrible laptop

    • @Architek1
      @Architek1 4 months ago +10

      I was so confused as to why it only had 16GB of VRAM

    • @kahaneck
      @kahaneck 4 months ago +36

      IT IS a 4080, it's the same AD103 chip. The desktop 4090 uses the AD102.

  • @mdxggxek1909
    @mdxggxek1909 4 months ago +545

    My bro the dedication of just "casually" buying a brand new laptop with a 4090 for the tests, my wallet could never

    • @synen
      @synen 4 months ago +50

      Most places in the US have a comfortable return window where you get 100% of your money back.

    • @petersuvara
      @petersuvara 4 months ago +13

      He can just return it after a few days for this. Apple has a 2-week return window.

    • @habsanero2614
      @habsanero2614 4 months ago +8

      Also, resale value on these machines is very high in short windows.

    • @matbeedotcom
      @matbeedotcom 4 months ago +13

      because it's not an actual 4090

    • @Jeannemarre
      @Jeannemarre 4 months ago +10

      @@petersuvara It's cool you guys can do it; in Europe, once you open the box you cannot return it unless it's faulty.

  • @RichWithTech
    @RichWithTech 4 months ago +476

    4:08 when you've been rocking Mac for so long you forget you need to plug in gaming laptops to get full power

    • @NguyenTran-eq2wg
      @NguyenTran-eq2wg 4 months ago +14

      Oh righttttttt!

    • @CHURCHISAWESUM
      @CHURCHISAWESUM 4 months ago +12

      Any windows laptop is like this

    • @eulehund99
      @eulehund99 4 months ago +56

      @@CHURCHISAWESUM *Gaming laptops with a discrete GPU. Any AMD mobile chip from 6th gen and up and any Intel Core Ultra chip have great battery life.

    • @rafewheadon1963
      @rafewheadon1963 4 months ago +61

      Too bad you can't play any fucking games on a Mac.

    • @NguyenTran-eq2wg
      @NguyenTran-eq2wg 4 months ago +21

      @@rafewheadon1963 You actually can lmao. Stop throwing blanket and inaccurate comments around.

  • @Momi_V
    @Momi_V 4 months ago +102

    In "non-unified-memory land", aka the PC world, there is a huge difference between the CPU's memory bandwidth, the GPU's memory bandwidth, and the link in between.
    50-70 GiB/s seems reasonable for dual-channel DDR5 at limited clock speeds (4800-5600 MT/s), so the CPU numbers are correct, but ~16 GiB/s is atrocious in terms of GPU memory bandwidth. That is not the actual GPU memory bandwidth but rather the PCIe transfer bandwidth between the CPU and GPU. It's probably only running a PCIe 4.0 x8 link at 8 * 16 GT/s minus overhead. That test is using the CPU's memory to perform GPU operations and not even utilizing the GPU's dedicated RAM. That's madness and in no way representative of the GPU's capabilities. The 4090 mobile has a theoretical memory bandwidth of 576 GiB/s and should be able to reach around 400-500 GiB/s in those "memory access" microbenchmarks (if they were actually testing GPU memory). I am running a 3080 Ti mobile (512 GiB/s theoretical) and get around 400-450 GiB/s depending on the test. CPU-to-GPU bandwidth is still important, but basically all real-world workloads (including AI training and inference) either copy the working set to the GPU's memory up front or stream relevant sections in and out. For the first method the interface bandwidth is negligible, as it could only affect startup time (loading the model), and that's basically always bottlenecked by storage performance (it does not matter whether the CPU-GPU link is 16 GiB/s or 200 GiB/s if your drive only reads at 4 GiB/s). For the second method it's a bit more relevant, but even then the bandwidth required to move sections of, for example, training data is orders of magnitude smaller than the bandwidth required to actively perform calculations on that data. This is due to the nature of GPU workloads, where a lot of parallel operations are performed repeatedly on a bounded dataset. For each of those operations the data has to move into, and the result out of, the GPU core, but it does not have to move back and forth between CPU and GPU every time.
    The communication is limited to instructions about what to do with the data and occasionally new pieces of data that are transferred once instead of over and over again. The results of those calculations might also be streamed back, but that's usually equal to or smaller than the input and does not compete for bandwidth, as PCIe is full duplex. If such a high processor-to-processor bandwidth is actually required, Nvidia's NVLink exists and can do up to 1.2 TiB/s per link. The main benefit of Apple's unified memory is a more flexible and efficient allocation of RAM, as data does not have to be duplicated between CPU and GPU and the amount of RAM available to each is not fixed but dynamic. You simply cannot get a PC laptop with more than 24 GiB of VRAM right now.
    The 1 TiB/s number is due to the AD103's 64 MiB L2 cache. If the dataset of the test is small enough, it just sits in the GPU's cache.

    • @gavinbad2371
      @gavinbad2371 4 months ago +2

      Thank you so much for the explanation!

    • @aquss33
      @aquss33 4 months ago +1

      Dayum, that's one hell of an explanation. It's really interesting that the max bandwidth the 4090 achieved was due to its large cache size relative to the size of the specific dataset being tested. Your explanation made a lot of sense and I understood most of it, but I still don't understand how you get such info from watching this video with its limited details; truly fascinating. I was trying to figure something out on my own, and the best I got to was that the Windows laptop wasn't plugged in (saw someone comment about that already). Does that make any difference in bandwidth compared to it being plugged into the wall?

    • @Momi_V
      @Momi_V 4 months ago +5

      @@aquss33 honestly, I almost completely forgot that the laptop was actually not plugged in. It might even have used the iGPU for some tests, those

    • @sprockkets
      @sprockkets 4 months ago

      IDK, didn't DirectX 12 eliminate the whole need to copy memory around in the first place?

    • @nightthemoon8481
      @nightthemoon8481 4 months ago

      You can't get current-gen PC laptops with more than 24 GB of VRAM, but there are ones with last-gen Quadros with 48 GB.
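The back-of-the-envelope figures in the long comment above can be reproduced from interface specs alone. A minimal sketch in Python; the PCIe 4.0 x8 link, the 128b/130b encoding, and GDDR6 at 18 Gbps per pin on a 256-bit bus are assumptions taken from public datasheets, not measurements from the video:

```python
# Illustrative spec-sheet arithmetic; all clocks and bus widths below are
# assumed datasheet values, not measurements from the video.

def bus_bandwidth_gbs(transfers_per_sec: float, bus_width_bits: int) -> float:
    """Peak theoretical bandwidth in GB/s: transfer rate times bus width in bytes."""
    return transfers_per_sec * (bus_width_bits / 8) / 1e9

# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding, 8 lanes.
pcie4_x8 = 8 * 16e9 * (128 / 130) / 8 / 1e9  # GT/s are serial bits per lane

# RTX 4090 Laptop: GDDR6 at 18 Gbps per pin on a 256-bit bus.
rtx4090m_vram = bus_bandwidth_gbs(18e9, 256)

print(f"PCIe 4.0 x8 link: ~{pcie4_x8:.2f} GB/s")      # ~15.75, the ~16 GiB/s figure
print(f"4090M GDDR6 VRAM: {rtx4090m_vram:.0f} GB/s")  # 576
```

The roughly 36x gap between the two numbers is why a benchmark that streams over PCIe says nothing about on-card VRAM bandwidth.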

  • @eivis13
    @eivis13 4 months ago +165

    The title is a bit misleading, since the RTX 4090 and RTX 4090M (a nerfed RTX 4080?) are two different GPUs with different memory bandwidths and internal cache layouts.

    • @AZisk
      @AZisk  4 months ago +26

      This was a comparison of mobile machines. I haven't done a desktop RTX 4090 test yet.

    • @matbeedotcom
      @matbeedotcom 4 months ago +16

      @@AZisk have you seen the size of a 4090? It’s not gonna fit in a laptop

    • @AZisk
      @AZisk  4 months ago +54

      @@matbeedotcom I'll stuff it in.

    • @eivis13
      @eivis13 4 months ago +10

      @@AZisk After that please fit a Bugatti(VW) W16 into/onto a Vespa.
      Just food for future videos ;)

    • @eivis13
      @eivis13 4 months ago

      @@matbeedotcom sure it will, but it will have to sit on a dry ice block.

  • @mdxggxek1909
    @mdxggxek1909 4 months ago +120

    The read & write GFX RAM tests run purely on the card executing OpenCL, while the peak write and peak read GFX RAM tests measure the PCI Express bus speed. Transferring data over PCIe is a lot slower than just reading and writing the memory on the GPU itself.

    • @himynameisryan
      @himynameisryan 4 months ago +5

      Thank you for putting my exact thoughts into words that are understandable lmao

    • @oloidhexasphericon5349
      @oloidhexasphericon5349 4 months ago +2

      so theoretically, if we could have a direct GPU-CPU connection in a PC, it would be 1 TB/s as opposed to 800 GB/s for the M2 Ultra?

    • @himynameisryan
      @himynameisryan 4 months ago +9

      @@oloidhexasphericon5349 that's correct.
      If Nvidia VRAM were used like unified memory is in a Mac, it would be 1 TB/s.
      But that extra latency is an issue, apparently,
      which is mildly disappointing as a gaming PC owner.

    • @Debilinside
      @Debilinside 4 months ago +3

      @@himynameisryan I think this is more of a problem for laptops. Desktops usually have much better bandwidth, more PCIe lanes, etc...

    • @himynameisryan
      @himynameisryan 4 months ago +3

      @@Debilinside no the delay still occurs in my gaming PC. It's a real issue for some workloads but not mine

  • @Egor9090
    @Egor9090 4 months ago +30

    AppleGPUInfo isn't measuring bandwidth, it's just doing 2 * clock * (bus bits / 8)
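If that is right, the number is easy to reproduce from the spec sheet. Here is a minimal Python sketch of the formula the comment describes; the 3.2 GHz LPDDR5 clock and 512-bit bus are assumed M2 Max figures, not values read out of AppleGPUInfo:

```python
# Sketch of the reported AppleGPUInfo arithmetic:
#   bandwidth = 2 * memory clock * (bus bits / 8)
# The factor of 2 is the DDR double data rate (transfers on both clock edges).

def theoretical_bandwidth_gbs(clock_hz: float, bus_bits: int) -> float:
    return 2 * clock_hz * (bus_bits / 8) / 1e9

# Assumed M2 Max configuration: LPDDR5-6400 (3.2 GHz clock) on a 512-bit bus.
m2_max = theoretical_bandwidth_gbs(3.2e9, 512)
print(f"{m2_max:.1f} GB/s")  # 409.6, i.e. the marketed "400 GB/s"-class figure
```

In other words, the number is derived from the memory configuration, not measured by actually moving data.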

  • @jihadrouani5525
    @jihadrouani5525 4 months ago +83

    Yeah, I think that was pretty clear. Nvidia's GPUs tend to hit 1 TB/s of VRAM bandwidth very easily, so if whatever you're trying to run is loaded into VRAM, Nvidia would squash Apple any day of the week; bandwidth to system RAM, however, is much slower since it runs through PCIe. Games and the like tend to load data into VRAM so the GPU doesn't sit idle waiting for meshes and textures to load from system RAM.

    • @Getfuqqedfedboy
      @Getfuqqedfedboy 4 months ago +26

      The desktop 4090 and the Radeon VII both hit around 1 TB/s peak bandwidth.
      I stopped this video halfway because he spent $3 grand plus on an RTX 4090 laptop, then used the integrated graphics and battery power for his testing. It should have been a given to anyone technical: assign the GPU manually on both, even if it seems to be working on one, and always test on wall power to remove variables. Especially when testing a PC, because by default Windows dials things down to save power on battery. With such a high-performance GPU, often the only real way to save battery is to turn off the dedicated card, or else you'd have like half an hour of battery.

    • @jihadrouani5525
      @jihadrouani5525 4 months ago +4

      @@Getfuqqedfedboy The bandwidth test was done on the dedicated GPU on the PC laptop, not the iGPU, and battery is irrelevant here because the bandwidth doesn't go down while on battery, it's a simple wide interface that transfers data, it has no power requirement of its own to throttle on battery. Basically, the bandwidth is 1TB/s no matter what you do.

    • @Getfuqqedfedboy
      @Getfuqqedfedboy 4 months ago

      @@jihadrouani5525 Everything is dynamically clocked these days for various optimizations. Even before where we are today, we had power states (still used today, but differently) with different clocks and power limits making up a curve. These days GPUs can clock themselves dynamically and decouple their clocks from their memory based on temperature, power, power availability, thermal headroom, thermal saturation (more of a Radeon thing with STAPM/skin-temp awareness), lack of utilization (specifically how often memory is being accessed), etc.
      Plus, yes, Windows, in settings deeper than just the power options in Control Panel, will set a lower power state on the GPU when you unplug from the wall, and on a hybrid system it will almost always suspend the dGPU in favor of the iGPU. Either way it will noticeably reduce performance. Also, at the point I stopped, it was 48 GB/s. That is perfectly in line with what I expect out of dual-channel system memory bandwidth….

    • @Syping
      @Syping 4 months ago

      @@jihadrouani5525 The 1 TB/s of bandwidth is not there all the time; the L2 cache is inflating the speed. As soon as the data sizes get too big, the speed decreases to ~500 GB/s, which is still good but not 1 TB/s anymore.

    • @jihadrouani5525
      @jihadrouani5525 4 months ago +5

      @@Syping The bandwidth actually stays the same as long as needed; it is not limited by the L2 cache. The L2 cache is less than 40 MB; if that were the limitation, the bandwidth would be crippled in milliseconds. 1 TB/s can be sustained as long as the GPU itself can crunch that data in real time, and in gaming and many other use cases 1 TB/s can be sustained throughout the entire play session.

  • @RomPereira
    @RomPereira 4 months ago +25

    I mean, if you can't do it as a developer, you can always hang this MSI laptop in a silver chain and go be part of the 'hood.

  • @celderian
    @celderian 4 months ago +21

    I was definitely not expecting my local Microcenter to be featured in one of your videos XD

    • @AZisk
      @AZisk  4 months ago +8

      hey neighbor

  • @sveinjohansen6271
    @sveinjohansen6271 4 months ago +23

    Next, Alex goes undercover into NVIDIA HQ to buy a laptop with an A100 under the hood! Excellent content, Alex. This is what separates your channel from all the other review channels: the developer-focused reviews of the machines, and not only Macs. I have a 3090, and had an AMD R7 in the past with 1024-bit HBM2 memory. Man, the R7 card was really great, but it couldn't do CUDA. A100 next, Alex? :):)

    • @AZisk
      @AZisk  4 months ago +13

      I think for the A100, I was considering renting one in the cloud to do some tests. Definitely not worth it for me to buy one yet.

  • @cyclone760
    @cyclone760 4 months ago +13

    I didn't know what unified memory did before this video. I thought they just meant it was mounted on the silicon. Great to know that the GPU can access it too.

    • @lamelama22
      @lamelama22 4 months ago +5

      It's not just the GPU having access to the RAM instead of having to go through the slower PCIe interface; the memory bandwidth is also much higher than in traditional PCs.
      I have seen *some* limited math / AI / etc. workloads, just on the CPU, that had something like a 100-1000x speed increase b/c of the unified memory, and it was the biggest speed increase they ever had in like 10-20 years of algorithm development. So it's not just the GPU. Yes, for normal workloads it doesn't necessarily give you a speed increase, but for algorithms that are memory bound, not CPU bound, and aren't streaming data off your storage device, you can have a radically huge speedup. Even without the GPU / AI engines.
      The downside, of course, is that you can never upgrade it without upgrading the CPU, and Apple has also made that impossible; they are intentionally making their products as hard to repair as possible so you have to buy a whole new machine every time. Though nothing's stopping, say, AMD or Intel from making an all-in-one SoC that is socketed & upgradable....

    • @FernandoAES
      @FernandoAES 4 months ago +1

      I think every block in the package can access it. I remember reading on AnandTech that the bandwidth was not the same for each, but the encoders/decoders, CPU, GPU, and NPU all had direct access to the unified memory.

    • @s.i.m.c.a
      @s.i.m.c.a 4 months ago

      @@lamelama22 and now imagine: you have only 8 GB for everything, and Apple considers it enough, with the statement that it is like 16 GB lol )))

  • @TheDanEdwards
    @TheDanEdwards 4 months ago +66

    Apple gets their numbers simply from the LPDDR5 spec. Each of those LPDDR5 chips does ~51 GB/s, using 96 pins.

    • @chebrubin
      @chebrubin 4 months ago +16

      Agreed. Unified Memory Architecture is such a dumbed-down step back from the modular work Intel did with the south bridge, north bridge, and PCI. Sure, they can claim high speeds, but there is no modular connectivity between the CPU and the L1, L2, L3 caches. Wait until GPUs are running native PCIe 5 with HBM memory, aka AMD Vega from a decade ago. That's why Elon Musk wants to acquire every new AMD Instinct™ MI300X Platform card with an HBM3 memory interface. Apple Silicon is a joke meant to put an iPhone Pro Max in a laptop enclosure.
      Want to build out a multiuser AI workstation cluster? You need AMD.

    • @Getfuqqedfedboy
      @Getfuqqedfedboy 4 months ago +3

      @@chebrubin No; with the IMC integrated into the SoC these days, it's the most efficient way of handling memory.

    • @chebrubin
      @chebrubin 4 months ago +7

      @iDoWayTooMuchAcid Efficient is the word. These days?
      Why don't you go fanboy your TSLA and AAPL trades somewhere else. Integrated memory does not scale to more than 1 CPU. All these "cores" are meaningless when the bus is saturated with network IO. Apple's claims will not scale either.
      There is a reason Intel worked on the bus and the CPU/GPU management architecture for 25 years before you woke up. Take your single-CPU/GPU code tests on 1 machine and take off your tinted Vision Pro headset.
      A Mac Pro from 10 years ago will scale better for network IO under multiple connections.

    • @Getfuqqedfedboy
      @Getfuqqedfedboy 4 months ago +7

      @@chebrubin LOL, man, put down your handbook from two decades ago. Yes, it is far more efficient; if you understood it at a low level, like the silicon level, you would agree: there's a lot less running in circles, fewer cycles needed, fewer misses, less latency, etc. Now, where did you see me say that it's the fastest? I didn't, because outright it isn't the fastest way; but a 4090 running at max bandwidth, while capable of running rings around this M chip, will also suck down a 💩 ton more power doing so. Also, fun fact: Macs and PCs have both been using unified memory wherever they can for a while now…. 😂
      I said nothing about scaling across multiple CPUs, GPUs, etc., and why would I when the platforms being discussed use SoCs, which are highly integrated? However, if you decoupled the entire memory subsystem and put it on its own internal bus, like AMD is starting to do with Infinity Fabric in their mobile APUs (and seeing impressive gains too), I can see it scaling further. I can't speak for what Apple has done; maybe they already have a similar memory subsystem setup, maybe not, I don't know.

    • @chebrubin
      @chebrubin 4 months ago +2

      @iDoWayTooMuchAcid Precisely. Check with Apple Services what tech they are procuring for running their Apple AI cloud; it is probably AMD Instinct and Supermicro racks and cages. Alex is benching client laptop compute: 1 man, 1 machine, 1 C compilation runtime.
      Let's discuss the new Apple Mac Pro with NO GPU bus lanes; this SoC was hobbled together at the last minute to ditch IA. No Thunderbolt external GPU. It is an iPhone Pro Max with all the RAM and SSDs soldered. Steve Jobs is alive and kicking. The Woz needs to help find a bus for Apple's SoC.

  • @heyitsmejm4792
    @heyitsmejm4792 4 months ago +8

    The fastest memory bandwidth on a consumer GPU that I remember was AMD's Radeon VII, with HBM2 memory mounted directly on the GPU package. Memory bus: 4096-bit, with a bandwidth of 1,024 GB/s.

    • @xeridea
      @xeridea 4 months ago

      Sad we don't have HBM on consumer cards anymore. Anyway, the 4090 spec is 1008 GB/s bandwidth.

    • @heyitsmejm4792
      @heyitsmejm4792 4 months ago

      @@xeridea Seems like the memory chip technology is the bottleneck here, since I doubt a 4096-bit bus can only do 1,024 GB/s.

  • @RomPereira
    @RomPereira 4 months ago +2

    Nice and interesting video, as always! Thank you Alex

  • @calingligore
    @calingligore 4 months ago +1

    Wasn't there any way of running the tests natively on Windows rather than through WSL?

  • @AlmorTech
    @AlmorTech 4 months ago +2

    Wow, great video! Thumbnail and editing are awesome 🤩😄

    • @AZisk
      @AZisk  4 months ago +1

      Thank you so much 😁

  • @adriannasyraf3534
    @adriannasyraf3534 4 months ago +2

    did you test the msi laptop while plugged in?

  • @hamzaababou6523
    @hamzaababou6523 4 months ago +10

    I am not sure about the 4090 laptop setup: was the testing done with WSL on Windows 11, or in a VM? If so, don't you think it's worth properly testing on native Linux and comparing these cases?

    • @eliaserke5267
      @eliaserke5267 4 months ago +1

      Good point!

    • @sveinjohansen6271
      @sveinjohansen6271 4 months ago +3

      Also, the latest versions of the Linux kernel do some memory magic. Worth looking into that vs Windows.

    • @jksoftware1
      @jksoftware1 4 months ago +8

      Also, a laptop 4090 is not a real 4090... It's only comparable to a low-wattage 4080.

    • @game_time1633
      @game_time1633 4 months ago +4

      @@jksoftware1 the comparison is between laptops. Not a desktop and a laptop

    • @jksoftware1
      @jksoftware1 4 months ago +7

      @@game_time1633 Then the title should be changed to "REALITY vs Apple's Memory Claims | vs a laptop RTX4090", because it's deceptive without that: the laptop 4090 uses the AD103 chip while the desktop 4090 uses the AD102 chip. They are completely different.

  • @arozendojr
    @arozendojr 4 months ago

    Is the performance of the M3 using Xcode very similar to the M2? I notice there are no comparisons of mobile use: the iOS simulator using 8 GB, 16 GB, or 18 GB of RAM.

  • @EHKvlogs
    @EHKvlogs 4 months ago

    What are your expectations for the Snapdragon X Elite?

  • @MrArod001
    @MrArod001 4 months ago

    Is there a test somewhere where you compare the M1 Max MBP vs the M2 Pro MBP? I'm looking at the used market and I've seen deals on these 2 chips, and I'm wondering which is better. Things I've seen online seem to put them close, but your testing doesn't include the M1 Max in the vids I've seen.

  • @ShaunBrown8378
    @ShaunBrown8378 4 months ago +1

    You might also check whether you are using the onboard GPU vs the 4090. You can force certain applications to use one or the other.

  • @enzocaputodevos
    @enzocaputodevos 4 months ago +3

    I must express my admiration for the extensive and informative data and analyses you have presented. However, I am intrigued to learn whether MLX could be used to further maximize the performance of the already exceptional M series?

    • @AZisk
      @AZisk  4 months ago

      video on the way

  • @hi2chan
    @hi2chan 4 months ago

    3:49 Do you have a gaming laptop for the video?

  • @Dominik-K
    @Dominik-K 4 months ago

    Honestly, you're one of the best channels for these comparisons! Love your AI and machine learning content, and I've personally had good experiences running those benchmarks on my M2 Max too.

  • @envt
    @envt 4 months ago +7

    The Windows machine needs to be run while plugged in?

    • @lesleyhaan116
      @lesleyhaan116 4 months ago

      yes it does, just like every x86 Windows laptop

    • @TheRealMafoo
      @TheRealMafoo 4 months ago +8

      @@lesleyhaan116 Just like 99.5% of the Macs in the real world that would be doing this workload. I mean, it's a cool party trick and all, but who runs these kinds of things at a coffee shop?

    • @BenjaminSchollnick
      @BenjaminSchollnick 4 months ago

      @@TheRealMafoo Actually, Apple Silicon performs the same with and without being plugged in. That's one of the major benefits: you still get the same performance without being plugged in, while still getting great battery life.

  • @sigma_z
    @sigma_z 3 months ago

    I am not sure I understand the comparison. The RTX 4090 in a laptop is a mobile version with a low TDP. Shouldn't a real test use the desktop version?

  • @abduislam23
    @abduislam23 4 months ago +1

    I am wondering if there would be meaningful differences between the RTX 4090 (the desktop original, not the laptop version) and the M2 Ultra.

    • @whohan779
      @whohan779 4 months ago

      Nvidia laptop variants are and always were kind of mislabeled (apart from Pascal/GTX 10-series, where the mobile 1070 even had more CUDA cores), having fewer CUDA cores, lower clocks, and often less VRAM.
      This "RTX 4090" laptop has less performance than a desktop "RTX 4070 Ti Super" in most cases (or even always when on battery).

    • @giornikitop5373
      @giornikitop5373 4 months ago

      @@whohan779 and severely limited in tdp.

    • @Slav4o911
      @Slav4o911 3 months ago

      Yes, the 4090 is much faster: I think 2.5x to 4x faster, depending on the model quantization, but only as long as you don't go above the VRAM. People usually use 2x or 3x 4090s for inference, and that's how you run the biggest models really fast. You can also use the 4090 for training.

  • @codeline9387
    @codeline9387 4 months ago

    Did you compare the info with a benchmark? AppleGPUInfo is just an info calculation: 2 * clock * Double(bits / 8)

  • @jchi6822
    @jchi6822 4 months ago +1

    For the dGPU you showed us RAM-to-GPU-memory transfer speed, which could well be 15 GB/s, even though that seems a bit slow. Maybe your laptop was on battery? For VRAM-to-VRAM dGPU tests, check the clinfo CLI tool and the GPU-Z utility.

  • @istvanszabo5745
    @istvanszabo5745 4 months ago

    I'd be curious to know if that kind of system ram bandwidth has any benefits. My guess is maybe, it depends on the use case, but not really.

  • @RocketLR
    @RocketLR 4 months ago

    I can't decide which M3 MBP I should get for development and running LLMs. I'm aiming for the M3 Pro, 36 GB, 512 GB storage. Or should I go for the Max chip with 1 TB and 36 GB?

    • @whohan779
      @whohan779 4 months ago

      I've never owned a Mac, but personally the integrated SSDs wouldn't be worth it for me #AppleTax. Better to go with the most unified RAM you can afford (within your needs) and strap a USB-enclosed M.2 to the back. Total USB/TB bandwidth should be around 80 Gbit/s, or 10 GB/s: easily enough.
      I've never understood people buying the Apple Silicon version with like 16 GB RAM or below when you can't upgrade them. Even on DDR4 I can easily buy

  • @User-ry1wj
    @User-ry1wj 4 months ago +2

    Interesting insight 👍

  • @hardi_stones
    @hardi_stones 4 months ago +1

    Thumbs up for the amount of research done.

  • @chadramey1140
    @chadramey1140 4 months ago

    Are these tests done with the laptop plugged in or on battery? Because I know the Nvidia driver limits power consumption when on battery.

    • @AZisk
      @AZisk  4 months ago

      when the GPU is used, you gotta have the machine plugged in

  • @mariusmkv1
    @mariusmkv1 4 months ago +1

    Is the M1 Max still good in 2024, and for the next 5 years?

  • @TheThaiLife
    @TheThaiLife 4 months ago +4

    I tried to train the same model on my M2 Max 32 GB, and the swap went up to 200 GB before it terminated. Even if it could access the full 128 GB, it wouldn't be enough; I think it's going to need at least 1 TB of RAM. This is a very sad no-go on any Mac in existence, as far as I can determine.

    • @s.i.m.c.a
      @s.i.m.c.a 4 months ago

      Why did someone decide that a Mac is good for training models? Some puny model for fun, maybe, for hipsters, but real work is done on Nvidia's specialized GPUs, which can be clustered to however many TBs you need. Also, you need to understand that there is a limit to how much memory you can place in a CPU package.

    • @totalermist
      @totalermist 4 months ago +3

      @@s.i.m.c.a I don't know about "puny models for hipsters", but the main concern isn't so much hardware capabilities with models like Mistral 7B (which is very capable indeed) as time constraints. Foundation models of the LLM variety simply cannot be trained, or even reasonably finetuned (if Mixtral 8x7B or bigger), on consumer hardware, full stop.
      Just checking the model cards of even "just" smaller generative models like SDXL reveals that they take tens of days to train from scratch on 256-GPU clusters...
      Just for fun, I calculated that it'd take about half a year of constant running on a single GPU to train SDXL. Consumer hardware (especially laptops) likely wouldn't even survive that. Finetuning, on the other hand, should be perfectly doable even for smaller LLMs, provided a reasonably quantized checkpoint is used. It'd still take several days or even weeks, though.

    • @TheThaiLife
      @TheThaiLife 4 months ago

      @@s.i.m.c.a Yep, I get it. I have had some pretty good models going on my Mac, though. But yeah, having a 4090 plugged into the wall with massive amounts of RAM is the way to go for now.

  • @willidriver
    @willidriver 4 months ago

    Windows laptops often have a dynamic PCIe link, which might use only one lane while idle. This could change the bandwidth as well.

  • @milleniumdawn
    @milleniumdawn 4 months ago +3

    There's a risk-free way to change the max memory available to the GPU.
    It's a simple terminal command, and it resets at reboot.
    It sets the new max memory you want your GPU to have access to (leave 8 GB for the system and it's all stable):
    sudo sysctl iogpu.wired_limit_mb=57344
    Example for a 64 GB model.
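The 57344 in that command is just (64 GB - 8 GB) expressed in MiB. A hypothetical Python helper to derive the value for other RAM sizes; the function name and the 8 GB reserve are assumptions for illustration, not part of the sysctl interface:

```python
# Derive a value for macOS's iogpu.wired_limit_mb: total RAM minus a
# headroom reserve for the system, expressed in MiB.
# gpu_wired_limit_mb is a hypothetical helper, not an Apple API.

def gpu_wired_limit_mb(total_ram_gb: int, reserve_gb: int = 8) -> int:
    if total_ram_gb <= reserve_gb:
        raise ValueError("not enough RAM left for the system reserve")
    return (total_ram_gb - reserve_gb) * 1024

print(gpu_wired_limit_mb(64))   # 57344, the value in the command above
print(gpu_wired_limit_mb(128))  # 122880
```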

  • @adeguntoro
    @adeguntoro 4 months ago

    What about an eGPU with a desktop RTX 4090 on that "beast" Windows machine? Would it show a better result?

  • @5133937
    @5133937 4 months ago

    @7:15 Some models can layer or shard themselves between GPU VRAM and system RAM, so you can actually run larger models on smaller memory. Performance suffers obviously, but it's improving.

  • @zmeta8
    @zmeta8 4 months ago

    FYI, the memory limit for the GPU is configurable in macOS with a single sysctl command.

  • @TheRealMafoo
    @TheRealMafoo 4 months ago

    What would be good to see, given the massive difference in architecture, is actually running something useful on both and seeing how they compare.

  • @RichardGetzPhotography
    @RichardGetzPhotography 4 months ago +1

    Will Apple allow an eGPU again? That would help greatly with ML. Or will the Mac Pro possibly get ML Afterburners?

    • @tonyburzio4107
      @tonyburzio4107 4 months ago

      No. The best bet is to expand memory access. The M3 came before ML; expect Apple to create something nifty.

    • @RichardGetzPhotography
      @RichardGetzPhotography 4 months ago

      @@tonyburzio4107 They can come up with something new, but that won't get us the computational power needed to train or run inference on LLMs. The M4 won't get us close enough to a couple of 4090s, and 4090s won't get us close enough to A100s.
      If Apple wants to play in this space and get those fat hardware $$$$, then it will need something like Afterburners for ML.
      An ML Afterburner could easily just be their current-gen GPU in mass quantity, with gobs of RAM, interconnected to other Afterburners with an insane fabric.

    • @whohan779
      @whohan779 4 місяці тому

      That would probably only work in Linux or heavily modded OS-X (if that wouldn't set off an integrity violation of sorts). Apple likely doesn't even have support for a Radeon GPU that was actually shipped with another Intel-based Mac (of any kind, including cheesegrater & trashcan).

  • @artnotes
    @artnotes 4 місяці тому

    Is it possible that when you ran the copy test on the M1, the data was copied from CPU to GPU and they shared the 400GB/s bus? With unified memory a copy shouldn't be necessary, but if you force a copy anyway, reads and writes might share the same lane.

  • @Cestpasfaux-
    @Cestpasfaux- 4 місяці тому

    I really like Ziskind of test, thanks!

  • @Fenrasulfr
    @Fenrasulfr 4 місяці тому +2

    I wonder if you could make use of the DirectStorage API in machine learning applications; that way, theoretically, you could bypass the CPU entirely.

    • @giornikitop5373
      @giornikitop5373 4 місяці тому

      If you want to transfer data from or to the RAM, the CPU cannot be bypassed, end of story, stop repeating that BS. DirectStorage just relieves the CPU from doing the transfer itself, same as RDMA or any DMA transfer.

    • @Fenrasulfr
      @Fenrasulfr 4 місяці тому

      @@giornikitop5373 Well I made a mistake, most news explains it as bypassing the cpu and sending data directly to vram. Next time try being nicer in explaining to others when they make a mistake.

    • @giornikitop5373
      @giornikitop5373 4 місяці тому

      @@Fenrasulfr you don't get to tell me what to do.

    • @Fenrasulfr
      @Fenrasulfr 4 місяці тому

      @@giornikitop5373 You don't get to be an ass to others just because you know a little more information on one specific topic that the vast majority of people don't give a sh*t about.

    • @giornikitop5373
      @giornikitop5373 4 місяці тому

      @@Fenrasulfr WHAT DID I SAY?

  • @user-ho3ez8zj8c
    @user-ho3ez8zj8c 4 місяці тому +20

    3:18 thanks for including your phone number in this video 😂

  • @TheRealLink
    @TheRealLink 4 місяці тому

    Would be curious how it stacks up against a desktop 3090 / 4080 / 4090. Obviously you're testing laptop to laptop rather than desktop parts but it would be neat for conjecture.

    • @Slav4o911
      @Slav4o911 3 місяці тому

      Desktop parts will be much faster.

  • @georgiecooper5958
    @georgiecooper5958 4 місяці тому

    A couple of questions:
    1. How did you test STREAM multiprocessor? If it's using MPI, then MPI doesn't support shared-memory nodes.
    2. A lot of these 3rd-party tools can have huge bugs; it would be great if you could mention that as well.
    If you're planning to test ML models, use MLX for Apple Silicon Macs and CUDA-optimized PyTorch for Nvidia. The reason is that although PyTorch works with MPS, it's not really using the shared-memory concept properly. (My guess is that due to how PyTorch tensors operate, data is copied between the GPU and the CPU based on the device attribute.)

  • @Dashient
    @Dashient 4 місяці тому +1

    Haha I love the editing of the unboxing of that gaming laptop

  • @markjacobs1086
    @markjacobs1086 4 місяці тому

    The 1.05 TB/s number is "effective bandwidth". Only really reachable when whatever you're doing fits into the GPU's cache (which is rather large on Ada Lovelace compared to other generations).
    What Nvidia advertises is the typical bandwidth you could expect in most applications.

  • @paul7408
    @paul7408 4 місяці тому

    I feel like Microcenters are always next to a discount clothing store, my local one is next to a Burlington coat factory

  • @magfal
    @magfal 4 місяці тому +6

    You can add more system memory to that laptop.
    You can upgrade to 48GB DIMMs, and if it's the quad-SODIMM machine I suspect it might be, you can go up to 192GB of system memory.
    Running 96GB in my Asus Scar 18 2023.

  • @noodlz3660
    @noodlz3660 4 місяці тому

    I believe you can actually deal with the memory limit in Terminal with a command like sudo sysctl iogpu.wired_limit_mb=26624 (change the MB to whatever, but keep 8GB for the system). The thing, though, is that I think they changed this command once already in a system update, and you need to do this every time you restart the system.

  • @arneczool6614
    @arneczool6614 4 місяці тому

    You can't expect the advertised memory bus bandwidth (which is simply bit width x frequency) to match the bandwidth you measure in real usage, as any measurement is simply a combination of multiple bottlenecks - a good measurement there probably requires a deeper dive into the platform architecture.

  • @BrookZerihun
    @BrookZerihun 4 місяці тому

    Nice. What would the results be with a full card, not a mobile GPU? They are so power-constrained; would love to see a full desktop GPU test.

  • @ubaft3135
    @ubaft3135 4 місяці тому

    Did you use WSL? Could be slowing it a bit.

  • @Johno2518
    @Johno2518 3 місяці тому

    Would be good to see if Resizable BAR changes the performance of the 4090m and by how much

  • @chills5100
    @chills5100 4 місяці тому

    The RTX 4090 has a TB/s of bandwidth, correct me if I am wrong.

  • @danielgall55
    @danielgall55 4 місяці тому +2

    Jesus!!! Finally! Thank you very much!!! I've been waiting for someone to do so for two years! No, seriously, truly, thank you!!!!

  • @gianlucab2261
    @gianlucab2261 4 місяці тому +1

    For some reason the Quality setting for this video is grayed out in the mobile app (Samsung S5e here), leaving it fixed at an abysmal value; it looks like 240p. First time I'm facing such an issue. BTW: great content, as usual.

    • @AZisk
      @AZisk  4 місяці тому

      sorry to hear that, I hope it was just a one-off for you. I checked the quality on this side and it seemed ok before i published

  • @hishnash
    @hishnash 4 місяці тому

    That TB/s number is hitting the cache that is on the GPU die. You need to look at sustained performance over a large write so the cache does not impact the results too much.

  • @srivatsansamraj2768
    @srivatsansamraj2768 4 місяці тому

    I was thinking of posting a comment asking you to try the 40-series mobile cards for comparison, but here you are. Usually people compare benchmarks that are for x86 PCs, and the Mac has to go through lots of translation layers... but ML isn't that way. So for comparison - I know it will be in the making already - we want raw GPU perf and ML perf based on model execution, training, max batch sizes, etc., compared between the two GPUs... thank you for exploring this niche genre in tech.

  • @fai8t
    @fai8t 4 місяці тому

    so m3max 235 gb vs 4090 48gb copy?

  • @jolness1
    @jolness1 4 місяці тому

    Something to keep in mind; the mobile 4090 uses the same die as the 4080 desktop and the same narrower bus. The desktop 4090 has 24GB of VRAM and a 50% wider bus. There is also the a6000 which is an 18k “core” (vs 16k on 4090) model with 48GB of memory on the same bus. The 4090 is $1600-$2000 and a6000 is around $6000.
    So if needs exceed a mobile 4090, there are other options if willing to go to a desktop. I have a 4090 and it’s a great card for hobbyists like myself, plus I play games on it sometimes.
    Great video!

  • @PeterMartens98
    @PeterMartens98 4 місяці тому

    Really like your videos.

  • @prasadsawool
    @prasadsawool 4 місяці тому

    bro looks like Neo while casually buying a top of line 4090 laptop

  • @henfibr
    @henfibr 4 місяці тому

    Latest Nvidia 40XX series cards have increased L2 cache memory. The mobile 4090 has 64MB, compared to the previous generation which only had 5-6MB. This may explain why the system measures 1 TB/sec bandwidth out of a 576 GB/sec card.

  • @QUANTUMJOKER
    @QUANTUMJOKER 4 місяці тому

    This is very interesting. Thanks for the insight.
    Perhaps this information is inaccurate or outdated (as in, it doesn't apply to M2 and M3-based Macs), but I read a while back that an M1 chip with 8 GB of unified memory has a 2 GB ceiling for its GPU, and this scales up with more memory. My M1 Mac Mini with 8 GB of memory had a limit of 2 GB for its GPU, but my current M1 Max Mac Studio with 32 GB of unified memory has a limit of 8 GB for its GPU. Is this correct?
    If the GPU is actually limited to 75% of the unified memory, then it's pretty cool to think that my Mac Studio has as much as 24 GB for the GPU.

  • @kToni73
    @kToni73 4 місяці тому

    3:45 Now that was a smooth Unboxing 😎

  • @aumortis
    @aumortis 4 місяці тому

    2:11 where's the link?

  • @renanmonteirobarbosa8129
    @renanmonteirobarbosa8129 4 місяці тому +1

    Also, Apple's problem is that its chip has the theoretical speed but the software doesn't let you use it, while with Nvidia you can use the GPU to its fullest given you put in the effort.

  • @kennibal666
    @kennibal666 4 місяці тому

    I hope all tests on the 4090 are run with the laptop plugged in.

  • @ManishKumar-vm8nq
    @ManishKumar-vm8nq 4 місяці тому +1

    Video : Nvidia vs Apple
    Ad : Samsung

  • @woolfel
    @woolfel 4 місяці тому +1

    nice find.

  • @vasudevmenon2496
    @vasudevmenon2496 4 місяці тому

    You might need to rerun the benchmark on AC power, since the CPU and Nvidia dGPU power-throttle to maximize battery longevity. System RAM bandwidth seems very low on 13th gen, which should peak at 60GB/s in dual/quad-channel mode depending on the number of DDR5 SODIMMs at 5200 or 6400 MT/s. I think MSI is using very loose memory timings. Something like HyperX or G.Skill Ripjaws or Corsair Vengeance should boost the performance quite a bit. I've seen better FPS and compute performance after upgrading my system RAM from DDR4-2133 to 2666 MT/s.

  • @toddsimone7182
    @toddsimone7182 4 місяці тому

    Sounds like they are advertising memory bandwidth for the GPU. The 4090 laptop version has 576.0 GB/s unlike the desktop version with 1,008 GB/s.

  • @vernearase3044
    @vernearase3044 4 місяці тому

    In the Mac memory model, the CPU, GPU, and all IP blocks can hit memory simultaneously and without moving data, and get some pretty high memory bandwidth speeds.
    In the Win memory model, the CPU formats a GPU request and data in main memory, compresses it, transmits it over PCIe where the GPU receives the request and data from PCIe, decompresses it into VRAM, and runs the request. If it's a compute request, the results are compressed in VRAM, transmitted via PCIe, received by the CPU from PCIe into main memory, and decompressed. If we're talking an iterative request, the data flows back and forth over PCIe as many times as necessary.
    So … on the graphics card the GPU can hit the VRAM at tremendous speed - but the speed is bottlenecked by the PCIe transfer speed which is around 50 GB/sec.
    This is the marketing model employed by PC designers because it keeps CPU, GPU, and mother board vendors happy and separate and distinct - allowing each to play in their own sandbox and sell their wares to consumers independently - but the reality is the _“secret”_ win overhead every x86 user pays to keep everything separate.
    Wintel graphics cards have insane speed once the request and data have been set up in VRAM - but there's a _lot_ of steps they have to go through to get the data into VRAM, and a lot of steps required to return graphics card results to the CPU's main memory.
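    For context on the ~50 GB/s figure in the comment above, a back-of-envelope PCIe x16 calculation from the spec numbers (one direction, before protocol overhead; real transfers land somewhat lower):

    ```python
    # Rough one-direction PCIe x16 bandwidth:
    # transfer rate (GT/s) x lanes x line-encoding efficiency / 8 bits per byte.
    def pcie_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
        encoding = 128 / 130          # 128b/130b encoding (PCIe 3.0 and later)
        return gt_per_s * lanes * encoding / 8

    gen4 = pcie_gb_per_s(16)          # PCIe 4.0 x16
    gen5 = pcie_gb_per_s(32)          # PCIe 5.0 x16
    print(round(gen4, 1), round(gen5, 1))   # 31.5 63.0
    ```

    So a Gen4 x16 link tops out around 31.5 GB/s per direction; the ~50 GB/s quoted would need a Gen5 link.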

  • @ansoncall6497
    @ansoncall6497 День тому

    only 96 gigabytes of vram. I want you to repeat that SLOWLY....

  • @Theodosc
    @Theodosc 4 місяці тому +8

    I'm a computer engineering and informatics student at the moment, and I watch many tech videos every day for different things. You are by far the best channel I've come across, from setting up my PCs for programming to comparing the best choices for new hardware I want to purchase. What I like the most is that you are professional, you present real stats, and you give advice from your personal experience as a programmer! That's a real tech guy right there. Keep up the great work, you deserve more followers!

    • @AZisk
      @AZisk  4 місяці тому +1

      Wow, thanks!

  • @ggoddkkiller1342
    @ggoddkkiller1342 4 місяці тому +1

    Memory bandwidth doesn't mean anything without the tensor cores that Nvidia cards have! This is the reason why people are literally fighting each other to buy 3090/4090s, not Apple products at all. Sure, you can run large models on a MacBook, but it will be painfully slow despite the large VRAM capacity and bandwidth...

  • @jasonhurdlow6607
    @jasonhurdlow6607 4 місяці тому +1

    FYI, a 4090 laptop chip is not an AD102 chip (used in a desktop 4090); it's really an AD103 (4080 desktop) chip. Try it on a desktop 4090.

    • @AZisk
      @AZisk  4 місяці тому

      it says rtx4090 on the laptop. we all know desktop 4090 is a heck of a lot more powerful, but this was a laptop test

  • @leorickpccenter
    @leorickpccenter 4 місяці тому

    those front lights on the MSI is somewhat annoying to look at.

  • @WarshipSub
    @WarshipSub 4 місяці тому +5

    I love how he casually goes and buys a $3200 laptop. Damn, kind of a life goal for me :P
    Well done Alex :D

    • @whohan779
      @whohan779 4 місяці тому +1

      Really only worth it if you need it extremely portable. Even a 1000 Wh portable battery (with standard AC or laptop DC output) plus an RTX 4070 Ti Super, 4k240 OLED and decent base platform plus portable peripherals is around US$800 cheaper and much faster.
      This laptop only replaces some US$2k worth of components plus peripherals when plugged in. Sadly RTX 4070 mobile is only 8 GB (just like the Desktop variant), so really not future proof even though the sweet-spot for actually using it almost to the fullest while on battery.

  • @callowaysutton
    @callowaysutton 15 днів тому

    The 1TB/s is the bus speed. If you open up GPU-Z and overclock the memory, you'll see the GPU's bandwidth change proportionally to the memory's clock speed. MacBooks have a governor limiting their maximum memory clock, which is why it'll stay around 400GB/s +/- 10GB/s unless it thermally throttles.
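    The advertised peak figures scattered through this thread all fall out of the same simple formula: bus width x per-pin data rate / 8. A quick sketch (bus widths and data rates below are spec-sheet numbers, not measured here):

    ```python
    # Theoretical peak memory bandwidth in GB/s:
    # bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte.
    def peak_gb_per_s(bus_bits: int, gbps_per_pin: float) -> float:
        return bus_bits * gbps_per_pin / 8

    print(peak_gb_per_s(256, 18))     # mobile RTX 4090, GDDR6:   576.0
    print(peak_gb_per_s(384, 21))     # desktop RTX 4090, GDDR6X: 1008.0
    print(peak_gb_per_s(512, 6.4))    # M3 Max, LPDDR5-6400:      409.6
    ```

    Which is why the 576 GB/s and ~400 GB/s numbers show up in the measurements, and the 1 TB/s figure belongs to the desktop card (or to cache hits).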

  • @CRC.Mismatch
    @CRC.Mismatch 4 місяці тому

    Where's the link to the repository? 😑

  • @hadeseh6808
    @hadeseh6808 4 місяці тому

    Can you please make a review of budget laptops for programming?

    • @synen
      @synen 4 місяці тому +4

      Used Macbook Air M1

    • @sveinjohansen6271
      @sveinjohansen6271 4 місяці тому +2

      An M1 Mini with 16GB memory and you're good to go on a budget.

    • @Andre.A.C.Oliveira
      @Andre.A.C.Oliveira 4 місяці тому

      Slimbook Elemental

  • @tapiolehto5312
    @tapiolehto5312 3 місяці тому

    I play Company of Heroes 2 on my MacBook Pro i9 2019 16-inch; works like a dream. Rome, just the same. There are many more and they work fine with the touchpad.

  • @davidraborn3654
    @davidraborn3654 4 місяці тому

    Not a Mac fan after the update basically made my iPad useless for anything I was using it for.

  • @Heythisismychannel
    @Heythisismychannel 4 місяці тому

    The 4090's memory bandwidth is actually supposed to be 1TB/s, but not the mobile version's...

  • @ahsaft
    @ahsaft 4 місяці тому +1

    No, you don't need an A100, just get the desktop 4090...
    The laptop 4090 is basically a 4080 spec-wise and also has a way lower power limit.

  • @SilentShadow-ss5xp
    @SilentShadow-ss5xp 4 місяці тому

    Just wanted to add some info here. To my knowledge, Nvidia uses compression on the memory bus to help save bandwidth. The memory controller does this in real time and transparently. This is why you see higher effective bandwidth than advertised. Obviously real-world numbers will vary, though.

  • @sultonbekrakhimov6623
    @sultonbekrakhimov6623 4 місяці тому +2

    So in short, VRAM in Nvidia graphics cards is faster than the unified memory and GPU bandwidth in Macs, but the bandwidth between RAM and GPU in PC machines actually makes it 10 times worse than what the 4090 is really capable of.

    • @AZisk
      @AZisk  4 місяці тому

      best summary of the situation

  • @TheLokiGT
    @TheLokiGT 4 місяці тому

    Alex, the GPU can use ALL the unified memory, all it takes is a shell command. ~75% is just the default setting.

    • @tempeleng
      @tempeleng 4 місяці тому

      Maybe the default is set that way to reserve some RAM for MacOS? I can't imagine having the OS running well when it's memory starved.

  • @BeaglefreilaufKalkar
    @BeaglefreilaufKalkar 4 місяці тому

    "Machine learning requires a lot of troughput of data going between the cpu's and gpu's"
    What about the npu/neural engine?

  • @ultralaggerREV1
    @ultralaggerREV1 4 місяці тому

    Here’s my theory, the M2 Ultra DOES have the 800GB/s, however, one chunk of it is used by the kernel of the OS while the rest is what we see.

  • @DimitrisConstantinou
    @DimitrisConstantinou 4 місяці тому

    Spending $3200 to find out the PCIe speed. Nice. Also, the 1 TB/s is probably the SUM of write and read speed between the GPU and GDDR RAM without communicating with the CPU.