REALITY vs Apple’s Memory Claims | vs RTX4090m

  • Published 24 Aug 2024
  • I put Apple Silicon memory bandwidth claims to the test against the Nvidia RTX 4090 powerhouse.
    Run Windows on a Mac: prf.hn/click/c... (affiliate)
    Use COUPON: ZISKIND10
    🛒 Gear Links 🛒
    * 🍏💥 New MacBook Air M1 Deal: amzn.to/3S59ID8
    * 💻🔄 Renewed MacBook Air M1 Deal: amzn.to/45K1Gmk
    * 🎧⚡ Great 40Gbps T4 enclosure: amzn.to/3JNwBGW
    * 🛠️🚀 My NVMe SSD: amzn.to/3YLEySo
    * 📦🎮 My gear: www.amazon.com...
    🎥 Related Videos 🎥
    * 💰 MacBook Machine Learning | M3 Max - • Cheap vs Expensive Mac...
    * 🤖 INSANE Machine Learning on Neural Engine - • INSANE Machine Learnin...
    * 👨‍💻 M1 DESTROYS a RTX card for ML - • When M1 DESTROYS a RTX...
    * 🌗 RAM torture test on Mac - • TRUTH about RAM vs SSD...
    * 👨‍💻 M1 Max VS RTX3070 - • M1 Max VS RTX3070 (Ten...
    * 🛠️ Developer productivity Playlist - • Developer Productivity
    - - - - - - - - -
    ❤️ SUBSCRIBE TO MY YouTube CHANNEL 📺
    Click here to subscribe: www.youtube.co...
    - - - - - - - - -
    📱LET'S CONNECT ON SOCIAL MEDIA
    ALEX ON TWITTER: / digitalix
    #m3max #m2max #machinelearning

COMMENTS • 563

  • @AZisk
    @AZisk  4 months ago

    JOIN: youtube.com/@azisk/join

  • @R3endevous
    @R3endevous 6 months ago +342

    The RTX 4090 in a laptop is a TDP-limited RTX 4080.

    • @gliderman9302
      @gliderman9302 6 months ago +6

      That shouldn’t impact memory right?

    • @headmetwall
      @headmetwall 6 months ago

      @@gliderman9302 Somewhat, but not due to the TDP: the laptop version uses GDDR6 memory chips instead of GDDR6X (bandwidth limit of 576.0 GB/s vs 716.8 GB/s).

    • @user-jk9zr3sc5h
      @user-jk9zr3sc5h 6 months ago +69

      @@gliderman9302 what's limiting the memory is the terrible ass laptop

    • @Architek1
      @Architek1 6 months ago +10

      I was so confused as to why it only had 16GB of VRAM

    • @kahaneck
      @kahaneck 6 months ago +40

      IT IS a 4080, it's the same AD103 chip. The desktop 4090 uses the AD102.

  • @mdxggxek1909
    @mdxggxek1909 6 months ago +570

    My bro the dedication of just "casually" buying a brand new laptop with a 4090 for the tests, my wallet could never

    • @synen
      @synen 6 months ago +54

      Most places in the US have a comfortable return window where you get 100% of your money back.

    • @petersuvara
      @petersuvara 6 months ago +14

      He can just return it after a few days for this. Apple has a 2-week return window.

    • @habsanero2614
      @habsanero2614 6 months ago +9

      Also, resale value on these machines is very high in short windows.

    • @user-jk9zr3sc5h
      @user-jk9zr3sc5h 6 months ago +13

      because it's not an actual 4090

    • @Jeannemarre
      @Jeannemarre 6 months ago +11

      @@petersuvara it's cool you guys can do it; in Europe, once you open the box you cannot return it unless it's faulty

  • @RichWithTech
    @RichWithTech 6 months ago +491

    4:08 when you've been rocking Mac for so long you forget you need to plug in gaming laptops to get full power

    • @NguyenTran-eq2wg
      @NguyenTran-eq2wg 6 months ago +14

      Oh righttttttt!

    • @CHURCHISAWESUM
      @CHURCHISAWESUM 6 months ago +12

      Any windows laptop is like this

    • @eulehund99
      @eulehund99 6 months ago +59

      ​@@CHURCHISAWESUM*gaming laptops with a discrete GPU. Any AMD mobile chip from 6th gen and up and any Intel Core Ultra Chip have great battery life.

    • @rafewheadon1963
      @rafewheadon1963 6 months ago +64

      too bad you can't play any fucking games on a Mac.

    • @NguyenTran-eq2wg
      @NguyenTran-eq2wg 6 months ago +23

      @@rafewheadon1963 You actually can lmao. Stop throwing blanket and inaccurate comments around.

  • @Momi_V
    @Momi_V 6 months ago +112

    In "non unified memory land" aka PC world there is a huge difference between the CPUs memory bandwith, the GPUs memory bandwith and the link in between.
    50-70 GiB/s seems reasonable for dual channel DDR5 at limited clock speeds (4800-5600 MT/s), so the CPU numbers are correct, but ~16GiB/s is atrocious in terms of GPU memory bandwith. This is not the actual GPU memory bandwith but rather the PCIe transfer bandwith between the CPU and GPU. It's probably only running a PCIe 4.0 x8 link with 8 * 16 GT/s - overhead. That test is using the CPUs memory to perform GPU operations and not even utilizing the GPUs dedicated RAM. That's madness and in no way representative of the GPUs capabilitys. The 4090 mobile has a theoretical memory bandwith of 576 GiB/s and should be able to reach around 400-500 GiB/s in those "memory access" microbenchmarks (if they were actually testing GPU memory). I am running a 3080 Ti mobile (512 GiB/s theoretical) and get around 400-450 GiB/s depending on the test. CPU to GPU bandwith is still important, but basically all real world workloads (including AI training and inference) either copy the working set to the GPUs memory upfront or stream relevant sections in and out. For the first method the interface bandwith is neglegible as it could only affect startup time (loading the model) and that's basically always bottlenecked by the storage performance (it does not matter if the CPU GPU link is 16 GiB/s or 200 GiB/s if your drive only reads at 4 GiB/s). For the second method it's a bit more relevant, but even in that case the bandwith required to move sections of, for example training data is orders of magnitude smaller than the bandwith required to actively perform calculations on that data. This is due to the nature of GPU workloads where a lot of parallel operations are performed repeatedly on a bounded dataset. For each of those operations that data has to move in and the result out of the GPU core, but it does not have to move back and forth between CPU and GPU every time. The communication is limited to instructions about what to do with the data and occasionally new pieces of data that are transferred once, insted of over and over again. The results of those calculations might also be streamed back, but thats usually equal to or smaller than the Input and does not compete for bandwith, as PCIe is full duplex. If souch a high processor to processor bandwith is actually required, Nvidias NVLink exists and can do up to 1.2 TiB/s per link. The main benefit of Apples unified memory is a more flexible and efficient allocation of RAM as data does not have to be duplicated between CPU and GPU and the amount of RAM available to each is not fixed but dynamic. You simply can not get a PC laptop with more that 24 GiB of VRAM right now.
    The 1 TiB/s number is due to the AD103's 64 MiB L2 cache. If the dataset of the test is small enough it just sits in the GPUs cache.

    • @gavinbad2371
      @gavinbad2371 6 months ago +2

      Thank you so much for the explanation!

    • @aquss33
      @aquss33 6 months ago +1

      dayum, that's one hell of an explanation. It's really interesting to hear that the max bandwidth the 4090 achieved was due to its large cache size relative to the size of the specific dataset being tested. Your explanation made a lot of sense and I understood most of it, but I still don't understand how you get such info from watching this video with limited details; truly fascinating. The best thing I figured out on my own was that the Windows laptop wasn't plugged in (saw someone comment about that already). Does that make any difference in bandwidth compared to it being plugged into the wall?

    • @Momi_V
      @Momi_V 6 months ago +6

      @@aquss33 honestly, I almost completely forgot that the laptop was actually not plugged in. It might even have used the iGPU for some tests, those

    • @sprockkets
      @sprockkets 6 months ago

      IDK, didn't like DirectX12 eliminate the whole need to copy memory around in the first place?

    • @nightthemoon8481
      @nightthemoon8481 6 months ago

      you can't get current-gen PC laptops with more than 24 GB of VRAM, but there are ones with last-gen Quadros with 48 GB
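
    To make @Momi_V's PCIe-vs-VRAM distinction above concrete, here is a minimal sketch of the kind of microbenchmark involved, assuming a CUDA-capable PC with PyTorch installed (buffer size and iteration count are arbitrary choices, not the video's actual test):

        import time
        import torch

        def gbps(fn, bytes_per_iter, iters=20):
            fn()                         # warm-up
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(iters):
                fn()
            torch.cuda.synchronize()     # wait for queued copies to finish
            return bytes_per_iter * iters / (time.perf_counter() - t0) / 1e9

        n = 256 * 1024 * 1024                    # 256M floats = 1 GiB per buffer
        host = torch.empty(n, pin_memory=True)   # pinned system RAM
        dev_a = torch.empty(n, device="cuda")    # VRAM
        dev_b = torch.empty(n, device="cuda")
        nbytes = n * host.element_size()

        # Host -> device: bounded by the PCIe link (tens of GB/s).
        h2d = gbps(lambda: dev_a.copy_(host, non_blocking=True), nbytes)
        # Device -> device: bounded by VRAM (hundreds of GB/s; counts read + write).
        d2d = gbps(lambda: dev_b.copy_(dev_a, non_blocking=True), 2 * nbytes)
        print(f"host->device: {h2d:.0f} GB/s   device->device: {d2d:.0f} GB/s")

    A 1 GiB buffer also sidesteps the L2-cache effect mentioned above, since it cannot fit in the AD103's 64 MiB cache.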

  • @mdxggxek1909
    @mdxggxek1909 6 months ago +122

    The read & write GFX RAM numbers are purely on the card executing OpenCL, while the peak write and peak read GFX RAM numbers measure the PCI Express bus speed. Transferring data over PCIe is a lot slower than just reading and writing to the memory on the GPU itself.

    • @himynameisryan
      @himynameisryan 6 months ago +5

      Thank you for putting my exact thoughts into words that are understandable lmao

    • @oloidhexasphericon5349
      @oloidhexasphericon5349 6 months ago +2

      so theoretically if we could have a direct GPU-CPU connection in a PC it would be 1 TB/s as opposed to 800 GB/s for the M2 Ultra?

    • @himynameisryan
      @himynameisryan 6 months ago +9

      @@oloidhexasphericon5349 that's correct.
      If Nvidia VRAM were used like unified memory is in a Mac, it would be 1 TB/s.
      But that extra latency is an issue apparently,
      which is mildly disappointing as a gaming PC owner.

    • @Debilinside
      @Debilinside 6 months ago +3

      @@himynameisryan I think this is more of a problem for laptops. Desktops usually have much better bandwidth, more PCIe lanes, etc...

    • @himynameisryan
      @himynameisryan 6 months ago +3

      @@Debilinside no, the delay still occurs on my gaming PC. It's a real issue for some workloads, but not mine.

  • @eivis13
    @eivis13 6 months ago +168

    Title is a bit misleading, since the RTX 4090 and RTX 4090M (a nerfed RTX 4080?) are 2 different GPUs with different memory bandwidths and internal cache layouts.

    • @AZisk
      @AZisk  6 months ago +26

      This was a comparison of mobile machines. I haven't done a desktop RTX 4090 test yet.

    • @user-jk9zr3sc5h
      @user-jk9zr3sc5h 6 months ago +17

      @@AZisk have you seen the size of a 4090? It’s not gonna fit in a laptop

    • @AZisk
      @AZisk  6 months ago +55

      @@user-jk9zr3sc5h I'll stuff it in.

    • @eivis13
      @eivis13 6 months ago +10

      @@AZisk After that please fit a Bugatti(VW) W16 into/onto a Vespa.
      Just food for future videos ;)

    • @eivis13
      @eivis13 6 months ago

      @@user-jk9zr3sc5h sure it will, but it will have to sit on a dry ice block.

  • @jihadrouani5525
    @jihadrouani5525 6 months ago +86

    Yeah, I think that was pretty clear. Nvidia's GPUs tend to hit 1 TB/s of VRAM bandwidth very easily, so if whatever you're trying to run is loaded in VRAM then Nvidia would squash Apple any day of the week. Bandwidth to system RAM, however, is much slower since it's running through PCIe. Games and the like tend to load data into VRAM so the GPU doesn't sit idle waiting for meshes and textures to load from system RAM.

    • @Honeypot-x9s
      @Honeypot-x9s 6 months ago +28

      The desktop 4090 and the Radeon VII both sit at around 1 TB/s peak bandwidth.
      I stopped this video halfway because he spent $3 grand plus buying an RTX 4090 laptop, then used the integrated graphics and battery power for his testing. It should be a given to anyone technical: assign the GPU manually on both, even if it's working on one, and always test on wall power to remove variables. Especially when testing a PC, because by default Windows wants to dial down and save battery when on battery. With such a high-performance GPU the only real way to do that is to turn off the dedicated card, or else you'd have about half an hour of battery.

    • @jihadrouani5525
      @jihadrouani5525 6 months ago +4

      @@Honeypot-x9s The bandwidth test was done on the dedicated GPU on the PC laptop, not the iGPU, and battery is irrelevant here because the bandwidth doesn't go down while on battery. It's a simple wide interface that transfers data; it has no power requirement of its own to throttle on battery. Basically, the bandwidth is 1 TB/s no matter what you do.

    • @Honeypot-x9s
      @Honeypot-x9s 6 months ago

      @@jihadrouani5525 everything is dynamically clocked these days for various optimizations. And even before where we are today, we had power states (still used today, but differently) with different clocks and power limits making up a curve. These days GPUs can clock themselves dynamically and decouple their clocks from their memory based on temperature, power, power availability, thermal headroom, thermal saturation (more of a Radeon thing with STAPM/skin-temp awareness), lack of utilization (specifically how often memory is being accessed), etc.
      Plus, yes, Windows, in some settings deeper than just the power options in Control Panel, will set a lower power state on the GPU when you unplug from the wall, and if you've got a hybrid system it will almost always suspend the dGPU in favor of the iGPU. Either way it will noticeably reduce performance. Also, at the point I stopped it was 48 GB/s. That is perfectly in line with what I expect out of dual-channel system memory bandwidth...

    • @Syping
      @Syping 6 months ago

      @@jihadrouani5525 The 1 TB/s of bandwidth is not there all the time; the L2 cache is affecting the measurement. As soon as the data sizes get too big, the speed will decrease to ~500 GB/s, which is still good but not 1 TB/s anymore.

    • @jihadrouani5525
      @jihadrouani5525 6 months ago +6

      @@Syping The bandwidth actually stays the same as long as needed; it is not limited by the L2 cache. The L2 cache is less than 40 MB; if that were the limitation, the bandwidth would be crippled within milliseconds. 1 TB/s can be sustained as long as the GPU itself can crunch that data in real time, and in gaming and many other use cases 1 TB/s can be sustained throughout the entire play session.

  • @Egor9090
    @Egor9090 6 months ago +30

    AppleGPUInfo isn't measuring bandwidth, it's just doing 2 * clock * (bus bits / 8)
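
    That formula is easy to check by hand. A quick sketch, plugging in the M2 Max figures another commenter cites below (LPDDR5 at 6.4 GT/s, i.e. a 3.2 GHz double-data-rate clock, on a 512-bit bus):

        # Theoretical peak = transfers per second * bus width in bytes.
        # The 2x is the double data rate: two transfers per clock.
        clock_ghz = 3.2        # assumed LPDDR5-6400 I/O clock
        bus_bits = 512         # M2 Max memory bus width
        print(2 * clock_ghz * (bus_bits / 8))   # 409.6 GB/s, the advertised number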

  • @EfrainDeLaRocha
    @EfrainDeLaRocha 6 months ago +3

    buys a $3000 machine to run a 5-minute test to settle one question.

  • @celderian
    @celderian 6 months ago +25

    I was definitely not expecting my local Microcenter to be featured in one of your videos XD

    • @AZisk
      @AZisk  6 months ago +10

      hey neighbor

  • @sveinjohansen6271
    @sveinjohansen6271 6 months ago +23

    Next, Alex goes undercover into NVIDIA HQ to buy a laptop with an A100 under the hood! Excellent content, Alex. This is what separates your channel from all other review channels: the developer-focused reviews of the machines, and not only Macs. I have a 3090, and had an AMD R7 in the past with 1024-bit HBM2 memory. Man, the R7 card was really great but couldn't do CUDA. A100 next, Alex? :):)

    • @AZisk
      @AZisk  6 months ago +14

      I think for the A100, I was considering renting one in the cloud to do some tests. Definitely not worth it for me to buy one yet.

  • @cyclone760
    @cyclone760 6 months ago +13

    I didn't know what unified memory did before this video. I thought they meant it was mounted on the silicon. Great to know the GPU can access it too.

    • @lamelama22
      @lamelama22 6 months ago +5

      It's not just the GPU having access to the RAM instead of having to go through the slower PCIe interface; the memory bandwidth is also much higher than in traditional PCs.
      I have seen *some* limited math / AI / etc. workloads, just on the CPU, that had something like a 100-1000x speed increase because of the unified memory, the biggest speed increase they ever had in like 10-20 years of algorithm development. So it's not just the GPU. For normal workloads it doesn't necessarily give you a speed increase, but for algorithms that are memory-bound, not CPU-bound, and aren't streaming data off your storage device, you can have a radically huge speedup, even without the GPU / AI engines.
      The downside, of course, is that you can never upgrade it without upgrading the CPU, and Apple has also made that impossible; they are intentionally making their products as hard to repair as possible so you have to buy a whole new machine every time. Though nothing's stopping, say, AMD or Intel from making an all-in-one SoC that is socketed & upgradable....

    • @FernandoAES
      @FernandoAES 6 months ago +1

      I think every block in the package can access it. I remember reading on AnandTech that the bandwidth was not the same, but the encoders/decoders, CPU, GPU, and NPU all had direct access to the unified memory.

    • @s.i.m.c.a
      @s.i.m.c.a 6 months ago

      @@lamelama22 and now imagine: you have only 8 GB for everything, and Apple considers it enough, with the claim that it is like 16 GB lol )))

  • @heyitsmejm4792
    @heyitsmejm4792 6 months ago +8

    the fastest memory bandwidth on a consumer GPU that I remember was AMD's Radeon VII, with HBM2 memory soldered directly to the GPU die itself. Memory bus: 4096-bit, with a bandwidth of 1,024 GB/s

    • @xeridea
      @xeridea 6 months ago

      Sad we don't have HBM on consumer cards anymore. Anyway, the 4090 spec is 1008 GB/s bandwidth.

    • @heyitsmejm4792
      @heyitsmejm4792 6 months ago

      @@xeridea seems like the memory chip technology is the bottleneck here, since I doubt a 4096-bit bus can only do 1,024 GB/s.

  • @TheDanEdwards
    @TheDanEdwards 6 months ago +66

    Apple gets their numbers simply from the LPDDR5 spec. Each of those LPDDR5 chips does 51 GT/s, using 96 pins.

    • @chebrubin
      @chebrubin 6 months ago +16

      Agreed. Unified Memory Architecture is such a dumbed-down step back from the modular work Intel did with the south bridge, north bridge, and PCI. Sure, they can claim high speeds, but there is no modular connectivity between the CPU and the L1, L2, L3 caches. Wait until GPUs are running native PCIe 5 with HBM memory, aka AMD Vega from a decade ago. That's why Elon Musk wants to acquire every new AMD Instinct™ MI300X Platform card with an HBM3 memory interface. Apple Silicon is a joke meant to put an iPhone Pro Max in a laptop enclosure.
      Want to build out a multiuser AI workstation cluster? You need AMD.

    • @Honeypot-x9s
      @Honeypot-x9s 6 months ago +3

      @@chebrubin no, with the IMC integrated into the SoC, these days it's the most efficient way of handling memory.

    • @chebrubin
      @chebrubin 6 months ago +7

      @iDoWayTooMuchAcid efficient is the word. These days?
      Why don't you go fanboy your TSLA and AAPL trades somewhere else. Integrated memory does not scale to more than 1 CPU. All these "cores" are meaningless when the bus is saturated with network IO. Apple's claims will not scale either.
      There is a reason why Intel worked on the bus and the CPU and GPU management architecture for 25 years before you woke up. Take your single CPU / GPU code tests on 1 machine and take off your tinted Vision Pro headset.
      A Mac Pro from 10 years ago will scale better for network IO under multiple connections.

    • @Honeypot-x9s
      @Honeypot-x9s 6 months ago +7

      @@chebrubin LOL, man, put down your handbook from two decades ago. Yes, it is far more efficient; if you understood it at a low level, like at the silicon level, you would agree. There's a lot less running in circles being done: fewer cycles are needed, fewer misses, less latency, etc, etc, etc. Now, where in that did you see me say that it's the fastest? I didn't, because outright it isn't the fastest way, but a 4090 running at max bandwidth, while capable of running rings around this M chip, will also suck down a 💩 ton more power doing so. Also, fun fact, Macs and PCs have both been using unified memory wherever they can for a while now.... 😂
      I said nothing about scaling across multiple CPUs, GPUs, etc., and why would I when the platforms being discussed use SoCs, which are highly integrated. However, if you decoupled the entire memory subsystem and put it on its own internal bus, like AMD is starting to with Infinity Fabric as the bus in their mobile APUs (and seeing impressive gains too), I can see it scaling further. I can't speak for what Apple has done; maybe they already have a similar memory subsystem setup... maybe not, I don't know.

    • @chebrubin
      @chebrubin 6 months ago +2

      @iDoWayTooMuchAcid precisely. Check with Apple Services what tech they are procuring for running their Apple AI cloud; it is probably AMD Instinct and Supermicro racks and cages. Alex is benching client laptop compute: 1 man, 1 machine, 1 C compilation runtime.
      Let's discuss the new Apple Mac Pro with NO GPU bus lanes; this SoC was hobbled together last minute to ditch IA. No Thunderbolt external GPU. It is an iPhone Pro Max with all the RAM and SSDs soldered. Steve Jobs is alive and kicking. The Woz needs to help find a bus for Apple's SoC.

  • @RomPereira
    @RomPereira 6 months ago +26

    I mean, if you can't do it as a developer, you can always hang this MSI laptop on a silver chain and go be part of the 'hood.

  • @user-ho3ez8zj8c
    @user-ho3ez8zj8c 6 months ago +20

    3:18 thanks for including your phone number in this video 😂

  • @clanzu2
    @clanzu2 6 months ago +4

    so basically the Mac is slower?

  • @bigdaddy5303
    @bigdaddy5303 6 months ago +1

    As has been the case for 50 years - if you want to make sure your computer can do something, don't buy a Mac.

    • @AZisk
      @AZisk  6 months ago

      macs can’t do anything, is that what you’re saying?

  • @jasonhurdlow6607
    @jasonhurdlow6607 6 months ago +1

    FYI, a 4090 laptop chip is not an AD102 chip (used in a desktop 4090); it's really an AD103 (desktop 4080) chip. Try it on a desktop 4090.

    • @AZisk
      @AZisk  6 months ago

      it says RTX 4090 on the laptop. We all know the desktop 4090 is a heck of a lot more powerful, but this was a laptop test.

  • @ahsaft
    @ahsaft 6 months ago +1

    no, you don't need an A100, just get the desktop 4090...
    the laptop 4090 is basically a 4080 spec-wise and also has a way lower power limit.

  • @ManishKumar-vm8nq
    @ManishKumar-vm8nq 6 months ago +1

    Video : Nvidia vs Apple
    Ad : Samsung

  • @envt
    @envt 6 months ago +7

    The windows machine needs to be run while plugged in?

    • @lesleyhaan116
      @lesleyhaan116 6 months ago

      yes it does, just like every x86 Windows laptop

    • @TheRealMafoo
      @TheRealMafoo 6 months ago +8

      @@lesleyhaan116 Just like 99.5% of the Macs in the real world that would be doing this workload. I mean, it's a cool party trick and all, but who runs these kinds of things at a coffee shop?

    • @BenjaminSchollnick
      @BenjaminSchollnick 6 months ago

      @@TheRealMafoo Actually, Apple Silicon performs the same with and without being plugged in. That's one of the major benefits: you can still have the same performance without being plugged in, while still getting the same battery life.

  • @WarshipSub
    @WarshipSub 6 months ago +6

    I love how he casually goes and buys a $3200 laptop. Damn, kind of a life goal for me :P
    Well done Alex :D

    • @whohan779
      @whohan779 6 months ago +1

      Really only worth it if you need it extremely portable. Even a 1000 Wh portable battery (with standard AC or laptop DC output) plus an RTX 4070 Ti Super, a 4K 240 Hz OLED, and a decent base platform plus portable peripherals is around US$800 cheaper and much faster.
      This laptop only replaces some US$2k worth of components plus peripherals when plugged in. Sadly the RTX 4070 mobile is only 8 GB (just like the desktop variant), so really not future-proof, even though it's the sweet spot for actually using it almost to the fullest while on battery.

  • @TheThaiLife
    @TheThaiLife 6 months ago +4

    I tried to train the same model on my M2 Max 32 and the swap went up to 200GB before it terminated. Even if it could access the 128 it wouldn't be enough. I think it's going to need at least 1 TB ram. This is a very sad no-go on any Mac in existence as far as I can determine.

    • @s.i.m.c.a
      @s.i.m.c.a 6 months ago

      Why did someone decide that a Mac is good for training models? Some puny model for fun probably, for some hipsters, but real work is done on Nvidia's specialized GPUs, which can be clustered to however many TBs you need. Also, you need to understand that there is a limit to how much memory you can place in a CPU package.

    • @totalermist
      @totalermist 6 months ago +3

      @@s.i.m.c.a I don't know about "puny models for hipsters", but the main concern isn't so much hardware capabilities with models like Mistral 7B (which is very capable indeed), but time constraints. Foundation models of the LLM variety simply cannot be trained or even reasonably finetuned (if Mixtral 8x7B or bigger) on consumer hardware, full stop.
      Just checking the model cards of even "just" smaller generative models like SDXL reveals that they take tens of days to train from scratch on 256-GPU clusters...
      Just for fun, I calculated that it'd take about half a year of constant running on a single GPU to train SDXL. Consumer hardware (especially laptops) likely wouldn't even survive that. Finetuning, on the other hand, should be perfectly doable even for smaller LLMs, provided a reasonably quantized checkpoint is used. It'd still take several days or even weeks, though.

    • @TheThaiLife
      @TheThaiLife 6 months ago

      Yep, I get it. I have had some pretty good models going on my Mac though. But yeah, having a 4090 plugged into the wall with massive amounts of RAM is the way to go for now. @@s.i.m.c.a

  • @CodeMonkeX
    @CodeMonkeX 6 months ago +1

    Yeah, this is par for the course with Apple. They never say how their slides and performance numbers are calculated, so they can cherry-pick numbers for slides. They don't seem to flat-out lie about numbers, but what they do is very dishonest.

    • @facetubetwit1444
      @facetubetwit1444 6 months ago

      Of course it's Apple; how else are they going to milk their shills? Anyone with half a brain knows Apple is overpriced garbage no matter what they claim. So keep this in mind when you start seeing idiots walking around with their latest Vision Pro crap.

  • @hahahahahahahaha6682
    @hahahahahahahaha6682 2 months ago

    The RTX 5090 is said to be on GDDR7 with 10 TB/s+ memory bandwidth, and all of that speed at half the power consumption

  • @ultralaggerREV1
    @ultralaggerREV1 6 months ago

    Here's my theory: the M2 Ultra DOES have the 800 GB/s; however, one chunk of it is used by the kernel of the OS, while the rest is what we see.

  • @paul1979uk2000
    @paul1979uk2000 6 months ago +2

    Apple reminds me of Nintendo and Disney: target the audience that doesn't know their arse from their elbow, build a strong reputation and loyal fan base, and then milk them for all it's worth.
    In the case of Nintendo and Disney, they target kids, knowing that most parents of those kids don't know much about tech (in the case of Nintendo) and just want to stop the kids from nagging, so they buy them a Nintendo console or take them to Disneyland.
    In the case of Apple, they target the non-tech people out there, most of whom don't even know what is in the hardware they are buying and don't know the value, or lack of value, in it. All those groups of people are primed to be taken advantage of: Nintendo and Disney take advantage of the kids because they are easy targets, and Apple goes after the users that don't know much about tech and are easy targets. Hence, when you look at Apple hardware relative to performance, the price point is insane compared to what rivals offer, and if you want to make any little upgrade to that hardware, the price really goes through the roof lol.
    With that, all 3 companies have one thing in common: they are all ruthless with their policies, and there are so many examples of that over the decades from all 3 companies that I'm surprised anyone would buy into their ecosystems. Then again, most that do are unaware that they are being taken for fools.
    In the case of AI, Apple does have the advantage, mainly in RAM size, not performance; for almost every other use case, the 4090 wipes the floor with it, even with AI, as long as you can fit the model in the limited VRAM. That far bigger pool of memory to play with is the real advantage Apple has when it comes to AI.
    With that said, there's nothing stopping the PC from being able to do the same thing with an APU, but I think a new motherboard standard will be needed to support far more memory bandwidth, since DDR5 won't be enough. Even with DDR5, performance would still be good, especially when you take into account the cost difference compared to Apple products.
    That said, the APU market seems to be heating up in the PC space, so the advantages Apple has could quickly start to shrink over the coming years.

  • @____trazluz____9804
    @____trazluz____9804 6 months ago

    Intel kinda does this: "this new CPU is 30% faster!!" but they never say what the comparison is against.

  • @renanmonteirobarbosa8129
    @renanmonteirobarbosa8129 6 months ago +1

    Also, Apple's problem is that its chip has the theoretical speed but the software doesn't let you use it, while with Nvidia you can use the GPU to its fullest given you put in the effort.

  • @xeridea
    @xeridea 6 months ago

    The 1 TB/s on the 4090 is in line with the specs; this is essentially the bandwidth of the RAM on the card. The other numbers are all going across the PCIe bus and are also affected by system RAM and CPU speeds. Apple is lying a bit saying their bandwidth is 10x that of the fastest desktop card, when in reality it is less.

  • @AlmorTech
    @AlmorTech 6 months ago +2

    Wow, great video! Thumbnail and editing are awesome 🤩😄

    • @AZisk
      @AZisk  6 months ago +1

      Thank you so much 😁

  • @sultonbekrakhimov6623
    @sultonbekrakhimov6623 6 months ago +2

    So in short, VRAM in Nvidia graphics cards is faster than the unified memory and GPU bandwidth in Macs, but the bandwidth between RAM and GPU in PC machines makes it 10 times worse than what the 4090 is actually capable of

    • @AZisk
      @AZisk  6 months ago

      best summary of the situation

  • @ShaunBrown8378
    @ShaunBrown8378 6 months ago +1

    You might also check whether you are using the onboard GPU vs the 4090. You can force certain applications to use one or the other.

  • @hamzaababou6523
    @hamzaababou6523 6 months ago +10

    I am not sure about the setup you used with the 4090 laptop, but was the testing done with WSL on Windows 11, or was it in a VM? If so, don't you think it's worth properly testing it on native Linux and doing a comparison between these cases?

    • @eliaserke5267
      @eliaserke5267 6 months ago +1

      Good point!

    • @sveinjohansen6271
      @sveinjohansen6271 6 months ago +3

      Also, the latest versions of the Linux kernel do some memory magic. Worth looking into that vs Windows.

    • @jksoftware1
      @jksoftware1 6 months ago +8

      Also, a laptop 4090 is not a real 4090... It would only be comparable to a low-wattage 4080.

    • @game_time1633
      @game_time1633 6 months ago +4

      @@jksoftware1 the comparison is between laptops, not a desktop and a laptop

    • @jksoftware1
      @jksoftware1 6 months ago +7

      @@game_time1633 Then the title should be changed to "REALITY vs Apple's Memory Claims | vs a laptop RTX4090", because it's deceptive without that: the laptop 4090 uses the AD103 chip while the desktop 4090 uses the AD102 chip. They are completely different.

  • @jolness1
    @jolness1 6 months ago

    Something to keep in mind: the mobile 4090 uses the same die as the desktop 4080 and the same narrower bus. The desktop 4090 has 24GB of VRAM and a 50% wider bus. There is also the A6000, which is an 18k-"core" (vs 16k on the 4090) model with 48GB of memory on the same bus. The 4090 is $1600-$2000 and the A6000 is around $6000.
    So if your needs exceed a mobile 4090, there are other options if you're willing to go to a desktop. I have a 4090 and it's a great card for hobbyists like myself, plus I play games on it sometimes.
    Great video!

  • @callowaysutton
    @callowaysutton 2 months ago

    The 1 TB/s is the bus speed. If you open up GPU-Z and overclock the memory you'll see the GPU's bandwidth change in proportion to the memory clock speed. MacBooks have a governor limiting their maximum memory clock, which is why it'll stay around 400 GB/s +/- 10 GB/s unless it thermally throttles

  • @RomPereira
    @RomPereira 6 months ago +2

    Nice and interesting video, as always! Thank you Alex

  • @MatthewMS.
    @MatthewMS. 8 days ago

    I was curious why Apple made their own chip and didn't use an Nvidia GPU like the others. Now I remember this is just Apple's MO; since I used a Mac LC II in the '90s, Apple has refused to outsource and/or let their machines be upgradable/customized. Just like their "Apple Intelligence" instead of calling it AI like the rest of the world. It's what they do: everything is in-house, and consistent marketing is a top priority.

  • @milleniumdawn
    @milleniumdawn 6 months ago +3

    There's a risk-free way to change the max memory the GPU can access.
    It's a simple terminal command, and it resets at reboot.
    It sets the new max memory you want your GPU to have access to (leave 8 GB for the system and it's all stable).
    sudo sysctl iogpu.wired_limit_mb=57344
    Example for a 64 GB model.
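
    For what it's worth, the 57344 figure is just (total RAM minus the suggested 8 GB reserve) expressed in MiB; a quick sketch to derive it for other configurations (the reserve size is the commenter's rule of thumb, not an Apple requirement):

        total_gb, reserve_gb = 64, 8            # machine RAM, headroom left for macOS
        print((total_gb - reserve_gb) * 1024)   # 57344 -> sysctl iogpu.wired_limit_mb=57344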

  • @enzocaputodevos
    @enzocaputodevos 6 months ago +3

    I must express my admiration for the extensive and informative data and analyses that you have presented to us. However, I am intrigued to learn whether MLX could be used to further maximize the performance of the already exceptional M series?

    • @AZisk
      @AZisk  6 months ago

      video on the way

  • @ioscruz24
    @ioscruz24 6 months ago

    It's common practice to advertise the theoretical bandwidth. The actual bandwidth achieved is too dependent on the application, run conditions, and machine architecture. Unified memory also means that memory bandwidth is shared between the CPU and GPU. It is very naive to try to infer bandwidth from benchmark results under these premises.

  • @adriannasyraf3534
    @adriannasyraf3534 6 months ago +2

    did you test the MSI laptop while plugged in?

  • @ansoncall6497
    @ansoncall6497 2 months ago

    only 96 gigabytes of VRAM. I want you to repeat that SLOWLY....

  • @KellyWu04
    @KellyWu04 5 months ago

    No CPU architecture allows you to access the full memory bandwidth the memory controllers can provide.

  • @danielgall55
    @danielgall55 6 months ago +2

    Jesus!!! Finally! Thank you very much!!! I've been waiting for someone to do so for two years! No, seriously, truly, thank you!!!!

  • @prasadsawool
    @prasadsawool 6 months ago

    bro looks like Neo while casually buying a top-of-the-line 4090 laptop

  • @magfal
    @magfal 6 months ago +6

    You can add more system memory to that laptop.
    You can upgrade to 48 GB DIMMs, and if it's the quad-SODIMM machine I suspect it might be, you can go up to 192 GB of system memory.
    Running 96 GB in my Asus Scar 18 2023.

  • @marsrocket
    @marsrocket 6 months ago

    Marketing claims are always subject to specific definitions and configurations, especially when it comes to benchmarks. Apple isn’t doing anything different from every other manufacturer out there.

  • @espi742
    @espi742 6 months ago

    1. The CPU can't access the whole memory bandwidth simply because the CPU cores aren't 'fast' enough to do it, or more specifically, the cores don't have enough load/store units to do transfers fast enough to saturate the memory, simply because they don't need to: CPU workloads aren't that memory-bandwidth-intensive.
    2. The 400GB/s Apple markets is the amount of data the SoC as a whole can pull from memory, on either combined workloads or very memory-intensive GPU workloads.
    3. applegpuinfo doesn't actually benchmark anything; it just shows the theoretical bandwidth (6.4GT/s * 512bit / 8 = 409.6GB/s).

  • @Slav4o911
    @Slav4o911 6 months ago

    13B models will run on that 4090 GPU with 16GB of VRAM. 13B models run on my RTX 3060, sometimes barely, but most of them run, so they'll run easily on a 4090.

  • @vernearase3044
    @vernearase3044 6 months ago

    In the Mac memory model, the CPU, GPU, and all the IP blocks can hit memory simultaneously, without moving data, and get some pretty high memory bandwidth.
    In the Win memory model, the CPU formats a GPU request and data in main memory, compresses it, and transmits it over PCIe, where the GPU receives the request and data, decompresses it into VRAM, and runs the request. If it's a compute request, the results are compressed in VRAM, transmitted via PCIe, received by the CPU from PCIe into main memory, and decompressed. If we're talking about an iterative request, the data flows back and forth over PCIe as many times as necessary.
    So … on the graphics card the GPU can hit the VRAM at tremendous speed - but the speed is bottlenecked by the PCIe transfer speed, which is around 50 GB/sec.
    This is the marketing model employed by PC designers because it keeps CPU, GPU, and motherboard vendors happy and separate and distinct - allowing each to play in their own sandbox and sell their wares to consumers independently - but the reality is the _“secret”_ Win overhead every x86 user pays to keep everything separate.
    Wintel graphics cards have insane speed once the request and data have been set up in VRAM - but there are a _lot_ of steps they have to go through to get the data into VRAM, and a lot of steps required to return graphics card results to the CPU's main memory.

  • @calingligore
    @calingligore 6 months ago +1

    Wasn't there any way of running the tests natively on Windows and not through WSL?

  • @hardi_stones
    @hardi_stones 6 months ago +1

    Thumbs up for the amount of research done.

  • @MLWJ1993
    @MLWJ1993 6 months ago

    The 1.05 TB/s number is "effective bandwidth". Only really reachable when whatever you're doing fits into the GPU's cache (which is rather large on Ada Lovelace compared to other generations).
    What Nvidia advertises is the typical bandwidth you could expect in most applications.

  • @Johno2518
    @Johno2518 6 months ago

    Would be good to see if Resizable BAR changes the performance of the 4090m and by how much

  • @DimitrisConstantinou
    @DimitrisConstantinou 6 months ago

    Spending $3200 to find out the PCIe speed. Nice. Also, the 1 TB/s is probably the SUM of write and read speeds between the GPU and GDDR RAM, without communicating with the CPU.

  • @jchi6822
    @jchi6822 6 months ago +1

    For the dGPU you showed us RAM-to-GPU-memory operation speed, which could be 15 GB/s, even though it seems a bit slow. Maybe your laptop was on battery? For VRAM-to-VRAM dGPU tests, check the clinfo CLI tool and the GPU-Z utility.

  • @yum33333
    @yum33333 6 months ago

    What is even confusing about this? You're obviously measuring the PCIe bandwidth when you see 40 GB/sec, which has little to do with the real system performance.

  • @SciTechEnthusiasts
    @SciTechEnthusiasts 2 months ago

    Only one photo 😂 7:40

  • @Theodosc
    @Theodosc 6 months ago +8

    I'm a computer engineering and informatics student atm and I watch many tech videos every day for different things. You are by far the best channel I've come across, from setting up my PCs for programming to comparing the best choices for new hardware I want to purchase. What I like the most is that you are professional, you present real stats, and you give advice from your personal experience as a programmer! That's a real tech guy right there. Keep up the great work, you deserve more followers!

    • @AZisk
      @AZisk  6 months ago +1

      Wow, thanks!

  • @GShockWatchFan.
    @GShockWatchFan. 6 months ago +1

    That beast is $3k. In 5 years you will need to spend another $3k to replace it.

    • @whohan779
      @whohan779 6 months ago

      It's also nonsense for mobile usage unless you severely underclock it (below regular spec) as it's essentially a more efficient RTX 4070 Ti Super, even the largest airplane-permissible batteries could only power it for about half an hour under load (that's why most of these underclock them below what the battery can deliver in wattage).
      You're likely better off buying a cheaper model with RTX 4070 or below and having a real desktop 4090 with a 5800X3D or smth. for gaming.

  • @ggoddkkiller1342
    @ggoddkkiller1342 6 months ago +1

    Memory bandwidth doesn't mean anything without the tensor cores that Nvidia cards have! This is the reason people are literally fighting each other to buy 3090/4090s, not Apple products at all. Sure, you can run large models on a MacBook, but it will be painfully slow despite the large VRAM capacity and bandwidth...

  • @boshi9
    @boshi9 6 months ago

    A CPU will never saturate 400 GB/s by itself anyway.

  • @willidriver
    @willidriver 6 months ago

    Windows laptops often have a dynamic PCIe link, which might only use one lane while idle. This could change the bandwidth as well.

  • @Heythisismychannel
    @Heythisismychannel 6 months ago

    The 4090's memory bandwidth is actually supposed to be 1 TB/s, but not the mobile version...

  • @Smirnoff67
    @Smirnoff67 6 months ago

    What a clown world we live in..

  • @SilentShadow-ss5xp
    @SilentShadow-ss5xp 6 months ago

    Just wanted to add some info here. To my knowledge, Nvidia uses compression on the memory bus to help save bandwidth. The memory controller does this in real time and transparently. This is why you see higher effective bandwidth than advertised. Obviously, real-world numbers will vary tho.

  • @codymcarthur1932
    @codymcarthur1932 6 months ago +1

    More sneaky Apple marketing! Why do people support this company so much!?

  • @toddsimone7182
    @toddsimone7182 6 months ago

    Sounds like they are advertising memory bandwidth for the GPU. The 4090 laptop version has 576.0 GB/s unlike the desktop version with 1,008 GB/s.

  • @fungo6631
    @fungo6631 6 months ago

    PCMR wins again!
    One problem with shared RAM, as seen with the N64 back in the day, is that the GPU and CPU have to fight for RAM access. You need to write code differently than in non-shared architectures, smaller code being faster due to simply fitting in the cache. Now of course, modern systems have way more cache than the N64, but the logic still applies. If either side is a memory hog you're gonna have a bad time.

  • @gianlucab2261
    @gianlucab2261 6 months ago +1

    For some reason the quality setting for this video is grayed out in the mobile app (Samsung S5e here), stuck at an abysmal value; it looks like 240p. First time I've faced such an issue. BTW: great content, as usual.

    • @AZisk
      @AZisk  6 months ago

      sorry to hear that, I hope it was just a one-off for you. I checked the quality on my side and it seemed OK before I published.

  • @qwertzuiop875
    @qwertzuiop875 6 months ago

    Apple never stated that the whole bandwidth can be taken advantage of solely by the GPU. The bandwidth number is for the whole SoC.
    Workloads that stress the CPU, GPU, and media engines all at the same time will take full advantage of the full system memory bandwidth.

  • @arneczool6614
    @arneczool6614 6 months ago

    You can't expect the advertised memory bus bandwidth (which is simply bit width x frequency) to match the bandwidth you measure in real usage, as any measurement is simply a combination of multiple bottlenecks - a good measurement probably requires a deeper dive into the platform architecture.

  • @SpencerHHO
    @SpencerHHO 6 months ago

    For a laptop solution, a unified memory pool with a very wide bus feeding an SoC that is physically very close to the RAM is almost impossible to beat with a more conventional topology. The comparison would be with a workstation laptop running fast DDR5 and a workstation GPU with as much memory as possible. These are hyper-niche machines and present worse value than Apple Silicon.
    Running significant machine learning tasks on a laptop does seem insane to me, though. I can understand some projects being done on the Macs, and it will be interesting to see how upcoming products from AMD and Intel fare with their dedicated hardware acceleration, but I'd rather run such projects on a server in a closet somewhere.
    AMD's upcoming MI350X chips will be very interesting for this: they aren't monolithic, but they will have huge bus widths, CPU and GPU dies on an interposer with HBM3 memory, in addition to (I think) 12-channel DDR5 memory. I believe they may also have 3D-stacked SRAM cache dies on top of logic dies as well. Of course this product would cost more than just one organ and isn't comparable at all to laptop applications, but maybe AMD might release a much smaller SoC based on the same layout in a few years.

  • @Sanchuniathon384
    @Sanchuniathon384 6 months ago

    800 GB/s is great and all, but the real problem with Apple Silicon is the memory clock speed and the processor clock speed. The GPU's clock speed is 1400 MHz, and there's a lack of shader cores. Apple's CPU is at the top of the game. However:
    The M3 Max has 5120 shader cores with a GPU clock running at up to

  • @TheRealMafoo
    @TheRealMafoo 6 months ago

    What would be good to see, due to the massive difference in architecture, is actually running something useful on both, and seeing how they compare.

  • @Dashient
    @Dashient 6 months ago +1

    Haha I love the editing of the unboxing of that gaming laptop

  • @ArtemDzhadzha
    @ArtemDzhadzha 6 months ago

    1000 GB/s vs 400 GB/s is not such a big difference, technically speaking. So to me it looks totally true

  • @ousi00
    @ousi00 6 months ago

    The VRAM bandwidth is likely calculated. The 4090 has a 384-bit bus, which is much wider than the 64-bit bus running DDR5. These are GDDR6X memory chips.

  • @5133937
    @5133937 6 months ago

    @7:15 Some models can layer or shard themselves between GPU VRAM and system RAM, so you can actually run larger models on smaller memory. Performance suffers obviously, but it's improving.
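
    One common way to do that layering in practice is llama.cpp-style layer offloading. A minimal sketch with the llama-cpp-python bindings (the model path and layer count here are illustrative, not anything from the video): n_gpu_layers controls how many transformer layers live in VRAM, and the remainder are evaluated from system RAM.

        from llama_cpp import Llama

        # Hypothetical local 13B GGUF checkpoint; point this at a real file.
        llm = Llama(
            model_path="models/llama-13b.Q4_K_M.gguf",
            n_gpu_layers=20,   # keep 20 layers in VRAM; the rest run from system RAM
        )
        out = llm("Memory bandwidth matters because", max_tokens=32)
        print(out["choices"][0]["text"])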

  • @henfibr
    @henfibr 6 months ago

    The latest Nvidia 40-series cards have increased L2 cache. The mobile 4090 has 64 MB, compared to the previous generation which only had 5-6 MB. This may explain why the system measures 1 TB/s of bandwidth out of a 576 GB/s card.

  • @jusk2ru
    @jusk2ru 6 months ago

    Imagine when this guy discovers that dedicated GPUs exist.

    • @AZisk
      @AZisk  6 months ago +1

      this guy is blown away.

  • @Dominik-K
    @Dominik-K 6 months ago

    Honestly, you're one of the best channels for these comparisons! Love your AI and machine learning content, and I've personally had good experiences with my M2 Max running those benchmarks too.

  • @paul7408
    @paul7408 6 months ago

    I feel like Micro Centers are always next to a discount clothing store; my local one is next to a Burlington Coat Factory.

  • @desertfish74
    @desertfish74 6 months ago

    Expensive way to learn about Nvidia shenanigans on laptops.

  • @12Burton24
    @12Burton24 6 months ago

    But is Apple not talking about the memory bandwidth of the RAM?

  • @daehxxiD
    @daehxxiD 6 months ago

    You can just get GPU-Z, which reports the actual VRAM bandwidth. I always assumed this is just a very clear number that results from multiplying the bus width x memory clock. No need to actually measure it.
    Then again, PC GPUs have a tendency to downclock both GPU and memory on battery, so you'd better test this connected to mains.

  • @sigma_z
    @sigma_z 5 months ago

    I am not sure I understand the comparison. The RTX 4090 in a laptop is a mobile version with a low TDP. For a real test, one must use the desktop version?

  • @Fenrasulfr
    @Fenrasulfr 6 months ago +2

    I wonder if you could make use of the DirectStorage API in machine learning applications; that way you could theoretically bypass the CPU entirely.

    • @giornikitop5373
      @giornikitop5373 6 months ago

      if you want to transfer data from or to RAM, the CPU cannot be bypassed, end of story, stop repeating that bs. DirectStorage just relieves the CPU from doing the transfer itself, same as RDMA or any DMA transfer.

    • @Fenrasulfr
      @Fenrasulfr 6 months ago

      @@giornikitop5373 Well, I made a mistake; most news explains it as bypassing the CPU and sending data directly to VRAM. Next time try being nicer when explaining to others that they made a mistake.

    • @giornikitop5373
      @giornikitop5373 6 months ago

      @@Fenrasulfr you don't get to tell me what to do.

    • @Fenrasulfr
      @Fenrasulfr 6 months ago

      @@giornikitop5373 You don't get to be an ass to others just because you know a little more information on one specific topic that the vast majority of people don't give a sh*t about.

    • @giornikitop5373
      @giornikitop5373 6 months ago

      @@Fenrasulfr WHAT DID I SAY?

  • @JakeeY
    @JakeeY 6 months ago

    If the GPU has 409 and the CPU has 220 at the same moment, it should be 6xx bandwidth.

  • @tapiolehto5312
    @tapiolehto5312 6 months ago

    I play Company of Heroes 2 on my 16-inch 2019 MacBook Pro i9; works like a dream. Rome, just the same. There are many more, and they work fine with the touchpad.

  • @mattqson122
    @mattqson122 6 months ago

    An Apple exec claimed their 8 GB of RAM is equivalent to a PC with 16 GB of RAM. What a boatload of crap.

  • @codyf1
    @codyf1 5 months ago

    I know Philip Turner IRL! He's a nice person

  • @laurentitolledo1838
    @laurentitolledo1838 6 months ago

    Apple can claim what they want....however they want....
    we the consumers will decide for ourselves....

  • @Monstermotivationalofficial
    @Monstermotivationalofficial 3 days ago

    Apple advertises what's convenient for them, obviously... BUT new Macs are still incredibly powerful compared to Windows laptops. If we also consider that Windows machines need to be plugged in to deliver full power, and only last a few hours on battery... there is no competition. I'm not an Apple fan, as I've always used, and still have, powerful Windows laptops. But this year I decided to also buy a Mac with the M3 Max, and OMG, this thing is a completely different monster. It completes heavy tasks in a few minutes while also playing music, working on heavy files, running the browser, and other stuff simultaneously like it's nothing. The speakers are awesome and the screen is the best I have ever seen on a laptop. And I can go a full working day unplugged from a power source. My maxed-out MSI GE67 Raider looks prehistoric in comparison.

  • @hishnash
    @hishnash 6 months ago

    That TB/s number is hitting the cache that is on the GPU die. You need to look at sustained performance over a large write so the cache does not impact the results too much.

  • @EHKvlogs
    @EHKvlogs 6 months ago

    what are your expectations for the Snapdragon X Elite?

  • @TheLokiGT
    @TheLokiGT 6 months ago

    Alex, the GPU can use ALL the unified memory, all it takes is a shell command. ~75% is just the default setting.

    • @tempeleng
      @tempeleng 6 months ago

      Maybe the default is set that way to reserve some RAM for macOS? I can't imagine the OS running well when it's memory-starved.