Could use something like memtest-vulkan as an alternative to measure the bandwidth drop. Typically the AM5 PCIe lane config for the least restricted/full-lane M.2 configuration is: CPU (Mode: x16 or x8/x4/x4) + CPU (x4 5.0) + Chipset (x4 4.0). SSDs based on the new SM2508 controller will finally be what I consider thermally viable.
It was the same story when PCIe 4.0 and 3.0 released. What I'm really excited about when it comes to PCIe upgrades is the extra lanes I can suddenly use for storage devices.
I've been looking for updated videos on the 4080 on this subject. Using the top NVMe slot on the Apex Encore drops it from 16 lanes to 8 lanes. I'd like to see the effects in modern games.
@super2k128 I'm not talking about the 5090, I'm talking about the 4080. The 5090 is PCIe 5, the 4080 is PCIe 4, and when you use the top NVMe slot it drops to x8 from x16.
@@toonnut1 It's going to be a similar situation for most recent cards. Gen3 x16 is basically the lower limit for GPUs these days for GPU bandwidth, even for cards as low as the RX 6600. So, Gen3x16, Gen4x8, and even Gen5x4 will be enough for all GPUs right now.
The good news is that when using only 8 lanes for the GFX, it allows for two more SSDs to be connected directly to the CPU (at least on the X870E platform).
For completeness I'd love to see 5x8 tested since some people want to use those extra PCIE5 lanes in modern motherboards for other things now. I know that 5x8 is the same throughput as 4x16 so it's probably just fine, but you never know for sure until you measure.
Maybe in a very niche application in which the VRAM overflows - but only on a high bandwidth platform (-> Threadripper or server platforms with more than four channels of memory). But then again the memory will be at a glacial speed compared to the GDDR7@512bit
this is why I wish we would see some motherboards out there with a full x16 4.0 slot on the bottom and let me have more 5.0 lanes for m.2 (not that I really need it there, but at least there is a show-able difference)
I remember seeing someone explain that the PCIe gen only impacts performance when VRAM is limited; it was an interesting video. Hence the low performance loss here, thanks to the 32GB.
Would love to see results with the 16x slot running at 8x electrical at 5.0, 4.0 and 3.0 in the charts. This is the more relevant test with folks wanting to run multiple NVMEs.
I imagine the average use case is bottlenecked on the work being done to the data, rather than the transfers happening in the background. If you always finish the work after the transfer is finished, at either transfer speed, the transfer speed doesn't matter.
PCIe is waaaaaaaaaay ahead of the needs of graphics cards. Because it's primarily a standard used by servers with much greater throughput needs. I think GPU makers use the latest pcie standards just to 'hope' for one extra frame in comparative reviews. And because pcie interfaces are cheap to update.
Maybe the difference would only be visible when loading textures or other assets from the SSD to the VRAM, or the other way around; based on the bandwidth it should impact the time those operations take. But it needs to be tested in the "real world" to see if there's added latency or something.
Each Nvidia H200 has 141 GB of VRAM with 4.8 TB per second of bandwidth, meaning that you can use 8 H200's in a node to inference DeepSeek v3 in FP8, taking into account KV Cache needs.
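As a rough sanity check of that math, here is a back-of-envelope sketch in Python. The ~671B total parameter count for DeepSeek V3 and the 1-byte-per-parameter FP8 assumption are my own approximations, and the real KV-cache budget depends on context length and batch size:

```python
# Back-of-envelope: does DeepSeek V3 in FP8 fit on an 8x H200 node?
params_b = 671            # assumed total parameter count, in billions
bytes_per_param = 1       # FP8 weights
weights_gb = params_b * bytes_per_param          # ~671 GB of weights
node_vram_gb = 8 * 141                           # 8x H200 = 1128 GB
kv_cache_budget_gb = node_vram_gb - weights_gb   # what's left for KV cache

print(f"weights:                          {weights_gb} GB")
print(f"node VRAM:                        {node_vram_gb} GB")
print(f"left for KV cache & activations:  {kv_cache_budget_gb} GB")
```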
Thanks for the test, I was already a bit worried with my PCIe 3.0 board, even if I will only upgrade to a 5070 Ti, or maybe a 9070 XT if FSR4 turns out good and FMF2 is available in ALL games, unlike DLSS Frame Generation..
Well, I would think that the PCIe bus only matters when you're uploading textures, and since the 5090 has 32 GB of VRAM, it has enough space to hold all of a game's textures beforehand.
Exactly! Here I have a 4090 PCIe 4.0 at x8 because it shares bandwidth with an Intel Optane 960 GB PCIe drive. Performance loss is negligible, but with a 5090, PCIe 5.0 x8 would be equivalent to full 4.0 x16 bandwidth. So it is always good that the 5090 is PCIe 5.0 even if we don't take full advantage of it (and even if only for peace of mind lol)
Not having to sacrifice speed at x8 in order to populate more drives or other devices is huge for a lot of folks, and for some it saves the cost of going to a workstation platform. I don't know why anyone would think it's useless.
The crazy thought to me is you could run a 5090 off of a pcie 5 x4 connection which is the same as pcie 3 x16 and be fine. Kinda cool when you think about it.
I was talking with someone on the AMD subreddit the other day about the move to higher PCIE standards, how it was done not due to performance needs but mostly just because everything else was moving forward a generation. As a cost-cutting measure, they could honestly probably wire these cards for PCIE 5.0 x4 (which has the equivalent bandwidth of 4.0 x8/3.0 x16). The only major issue would be legacy systems, and probably the reason why they generally don't; the card in a PCIE 4.0 system (like Ryzen 5000) would probably be okay for the most part (4.0 x4/3.0 x8), but it would start to be severely hampered on a PCIE 3.0 board (Ryzen 3000 and below/Core 10th Gen and below). Low-end cards like the RTX xx60/RX x600, which have access to 8 lanes of PCIE 4.0, lose noticeable performance when dropping to 3.0 (but even then it's not much). We got into discussing PCIE switches to "convert" said cards' 4.0 x8 into 3.0 x16 in order to maintain the bandwidth and keep legacy systems viable for a while longer, and got discouraged at the exorbitant prices for them...
The 3DMark PCI Bandwidth test is highly affected by rBAR too. Use Nvidia Profile Inspector to enable the rBAR feature and change the rBAR size limit to a setting greater than 2GB and the test will likely have much higher results. I found this a year ago when trying to find why my system seemed stuck at 24 GB/s when some lesser systems saw more. IIRC you have to set it globally to affect the actual test app 3DMark's UI spawns. On a 13900K + 4090, a rBAR size limit under 2GB always gave ~24 GB/s, a 2GB limit gave ~28 GB/s and any rBAR window over 2GB gave ~60 GB/s.
0x0000000080000000 = 2GB VRAM
0x0000000100000000 = 4GB VRAM
0x0000000200000000 = 8GB VRAM
0x0000000280000000 = 10GB VRAM
Yakumo.
@@MissMan666 There are pros and cons with both. There are most definitely valid reasons to only make it 8x - size is one of them. Lets take a PCIe 5.0 8x card that consumes the entire bandwidth of 5.0 8x. Add this card to a PCIe 4.0 16x motherboard, and you won't have the bandwidth to support it, even though 16 lanes of 4.0 should be the same as 8 lanes of 5.0, the card only has 8 lanes, so you're limited to 8 lanes of 4.0, and that's not enough. Depending on the cost and type of product, they could have made it 5.0 16x. Yes, it would only ever need half of that slots bandwidth, but it would allow the card to run in multiple configurations. Like in a bifurcated 5.0 x8/x8 configuration or a full 4.0 x16 slot. It's quite common for PCIe slots to allow longer cards than the actual slot length, so unless size of the card is a concern, there's no reason not to make it 16x, well, except cost...
This almost looks like the game engines conservatively restrict transactional data to the bandwidth of PCIe 3 x16. This in turn looks bright for the oculink 4i for gen 5 external GPUs until the new titles start to shift this limit to PCIe 4 x16.
PCIe 5 is going to be a lot more important for lower VRAM boards. The 5090 has more memory than any game will use, so has very little need to ever transfer data in and out during play. I'd like to see this test done again with a 5060
I think the one advantage that I'd gain with PCI-E 5.0 is that the current X870/X870E motherboards from AMD are kind of starved for PCI-E lanes. I use a 10G NIC in my second slot, which means my first PCI-E slot runs at 8x instead of 16x. So, using a PCI-E 5.0 card means I'd theoretically get my effective PCI-E 4.0 16x speeds back since PCI-E 5.0 is twice as fast. It's not a huge deal, but it's a nice little bonus.
We always knew gen 5 would be useless for graphics cards from the pure performance POV. But opening the option for x8/x4/X4 is the real deal here. Maybe even in the card itself, with SSDs on the back of the board holding all the assets. The way is paved.
@@riba2233 Ah sure. I suppose if it did make a big difference you'd have to get a 5.0 certified cable, and you know they'd charge way extra just to say it works at 5.0 ha ha.
I'd like to propose an explanation for the PCI bandwidth test results. I know that back in 2020, Nvidia introduced a library to do on-the-fly LZ4 compression for data on the PCIe bus. This was meant for large-scale systems using NVLink and CUDA workloads, but it's possible this has now made its way down to consumer cards, especially with the recent push for disk streaming. It'd be interesting to see if you could run this benchmark for different card generations, all the way back to 2018 maybe, for both Nvidia and AMD cards. As a side note, I also wonder if you might be able to notice the drop from 5.0 to 4.0 if your game is using disk streaming.
New standard like PCIe 5.0 is NEVER useless! As someone who typically runs multiple GPUs, higher standard not only allows to use older PCIe slots, it also allows to insert GPU into PCIe 5.0 x8 or even x4 slot while losing minimal amount performance. This is also notable with older enterprise NICs that were built with PCIe 3.0, which could either be used with 2.0 x8 slots or 3.0 x4 slots with full speed and 3.0 x4 is easy to find these days, while dedicating 8 lanes just for dual-port NIC feels excessive.. Similarly latest Aquantia 10G chips support PCIe 4.0 x1 or PCIe 3.0 x4 to run at full speed, which is great in case you only have a single x1 slot left on the motherboard and latest motherboards often have those slots at 4.0 speeds going through the chipset. So in all possible cases I would choose a GPU/NIC/whatever that supports higher PCIe version rather than lower, even if it technically doesn't require it today.
I love how the title is not a clickbait, but a short conclusion
only the stupid question mark....
Hear hear!
I guess the real youtubers beat him to anything that was relevant or important, or maybe he just likes to troll Nvidia.
@@Martin-wj6um Stupid for you, necessary for him, more clicks/views, more ppl see why the conclusion is what it is. Thinking about yourself only while consuming a product of others is like trying to mix water with oil...
as it should be
We need PCIe AI for better performance.
psu ai for better overlocking power, trust bro, just one more ai
💀
You are joking but more PCI bandwidth is great for multi gpu communication which is useful for running big LLMs
@@w04h PCIe 5 gives 128GB/s vs 1.8TB/s for NVlink v5. PCIe is RUBBISH for inter-gpu communication. PCIe is WAYYYYYY slower. That's why you want an LLM to fit into GPU memory or at worst use NVLink, you NEVER want to go over PCIe, even v5, it's simply too slow.
With the new DL PCIe from Nvidia, you could easily inject 4 times the bandwidth of 5.0 while running your system on 1.0.
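To put the numbers from the comment above in perspective, a minimal sketch (Python) of how long a given chunk of data takes to cross each link at the quoted peak bandwidths. The 20 GB payload is a made-up illustrative figure, and real multi-GPU traffic also pays latency and synchronization costs not modeled here:

```python
# Time to move a payload over a link, using the peak figures quoted above
links_gb_s = {
    "PCIe 5.0 x16 (bidirectional)": 128,
    "NVLink 5": 1800,
}

payload_gb = 20  # hypothetical per-step inter-GPU traffic

for name, bw in links_gb_s.items():
    print(f"{name}: {payload_gb / bw * 1000:.1f} ms to move {payload_gb} GB")
```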
I was googling this since the 5090 reviews were up and no reviewers were talking about this topic. Thank you so much der8auer for focusing on this topic.
Agreed, I'm surprised no one but him is talking about this for the people that don't want to upgrade yet.
TechPowerUp did PCIe scaling testing.
VideoCardz also did this kind of testing
While the desktop PC market doesn't necessarily need PCIe 5.0, it is a huge step forward for eGPU setups. Without it, stuff like Thunderbolt 5.0, CopprLink 4i, USB 4.2 and other forward-looking eGPU tech would all have been hard capped to 64 Gbps, which has been our limit for quite some time.
People forget that the enterprise market is the one mostly demanding PCI-E 5.0. Nvidia, AMD and Intel are not going to design or create separate chipsets just for consumers and such. All is made for one.
That is my hope too. A world where egpus are viable could lead to some very cool options for people
@@vanderlinde4you And that's where the bundled cost is a factor. But the gaming segment is so short-sighted because they often don't do anything productive other than leverage cards for games.
@@davitdavid7165 Possibly. The performance of a full-power mobile 4080 and 4090 is actually quite impressive (with the mobile 4090 being hilariously close to a desktop 4080), and they're actually not even that expensive now. So, the only real option would be an external 4090 or 5090 connected to a laptop, which would still not get full performance. And it has to be a laptop with Thunderbolt 5.0, which limits you to only the newest of the new. This does not benefit those with older gaming laptops, you can connect a 4090 or 5090 eGPU and it'd still get clowned on by a 4080 laptop.
also, try looking into PCIe slot speeds in HWMonitor or something... it's funny when I see like 6-8 GB/s of traffic playing CP2077. Yeah, I have a rather oldish 10600 with a 3060 Ti, but still... btw, it could be an interesting test to see the real throughput needs for gaming PCs ;)
The only thing I see pcie 5.0 being useful for, is that now I can use x8 instead of x16 for more expansion
and that is massive to me.
with the absolute dearth of available lanes these days, this is hugely important. As long as the motherboards let us use them.
bro you could use the 8x 4.0 and not lose much performance. And people were losing their minds over lane sharing :'D was hilarious to see who knew nothing about bandwidth and realistic performance of the 50 series.
@@GraveUypo You know what else is massive
just curious, what would you even be running in expansion slots in a gaming rig?
Happy to hear that 5.0 x8 would run nicely, so you could use 2 extra NVMe slots.
Like I do now with my 3090, with 3x NVMe. (1x OS and 2x in RAID 0)
NVMe with 4.0 will in most cases run just fine at only 2 lanes, and 5.0 can run on 1 lane; with 8 PCIe 5.0 lanes you can run 8 NVMe drives at really high speeds.
I honestly kind of wish this just became the norm. There just aren't enough PCI-E lanes in a consumer computer to have a lot of drives.
But GPU PCIE is separated from NVMe, no?
nobody asked
@ Not really. NVMe is usually just a 4x PCIE slot but can be 2x or 1x as well. The x16 slot you put your graphics card in is no different other than the x16 part. What matters is how lanes are split up electrically. But you can buy cards you could put into that x16 slot to get 4 NVMe slots at x4 each or 8 at x2, etc so long as the motherboard supports that level of bifurcation.
Awesome video! Something that would be interesting to test would be how the PCIe generations affect the 5090 at PCIe x8. The big thing I've noticed on the current generation motherboards (and CPU lane allocation) is that the PCIe Gen5 x16 slot is switched/split with the lower slot and if both are in use, they are only x8 slots. In addition, if you end up using more than 2 M.2 drives (on both X870E and Z890 boards) the main PCIe_1(G5) slot will only run at x8. From a theoretical standpoint PCIe Gen5 x8 should equal PCIe Gen4 x16 in bandwidth, but I haven't been able to find any info on whether the fewer lanes cause a performance hit. As you stated in the video too, if the slot is x16, then using a PCIe Gen4 riser cable shouldn't be an issue, but if you are using 3 M.2s at the same time the bandwidth could be reduced enough to see a performance hit. It would be really helpful/interesting to see how the 5090 performs in an "x8" slot at the various PCIe generations.
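For reference on the lane/generation equivalences discussed here, a small sketch (Python) using the commonly cited approximate per-lane throughput of about 1/2/4 GB/s per lane for gen 3/4/5 (one direction, ignoring protocol overhead; real-world figures are a bit lower):

```python
# Approximate per-lane throughput in GB/s (one direction), overhead ignored
per_lane_gb_s = {"3.0": 1.0, "4.0": 2.0, "5.0": 4.0}

for gen, bw in per_lane_gb_s.items():
    for lanes in (4, 8, 16):
        print(f"PCIe {gen} x{lanes:<2}: ~{bw * lanes:>4.0f} GB/s")

# Note how PCIe 5.0 x8 lands on the same ~32 GB/s figure as PCIe 4.0 x16,
# and PCIe 5.0 x4 on the same ~16 GB/s as PCIe 4.0 x8 / 3.0 x16.
```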
I think it is worth mentioning in the intro that users might be running bifurcation and 2x8 instead 1x16. The video is focusing on gaming so not really an error.
PCIe 5.0 is really useful when you use a graphics card over OCuLink, because it's limited to only 4 PCIe lanes
Does anything with oculink @5.0 even exist?
@Apokathelosis I wrote oculink because it's more popular than MCIO (which is oculink but for 8 PCI lanes), but my point is PCIe 5.0 is a neat feature when you can't have all 16 lanes
oculink?
@@Apokathelosis There are adapters on Aliexpress PCI 5.0 but without oculink
Nobody runs that crap, stop coping.
Glad you mentioned riser compatibility at the end - just like when the 3000 series RTX cards landed, users will again need to pay attention to the compatibility of their specific case or board riser/adapters, or risk a no-POST/no output situation. SFF builds in particular can be tricky, as you may need to completely disassemble them in order to directly slot the GPU into the motherboard, providing access to the BIOS in order to manually select Gen4 (and then reassemble it again). With most AM5 Ryzens now coming with iGPUs, this should make troubleshooting or circumventing this particular step a bit easier, but it's still something that a lot of people may forget to check. Make sure you're on a Gen5 capable riser if possible, it'll save a lot of headaches. :)
Yes, this is quite an annoying thing.
Same with Intel B580 with no UEFI support and standard bios values of manufacturers.
I think maybe the bios manufacturers should make some better "fail safe" for lower pci-e speed and uefi settings if CMOS is cleared.
Or maybe they should make some new key we can hold to get it to default bios into "safe mode"..
Yup, SFF riser builds can be a pain. Especially if you have to reset or update the BIOS, afterwards it'll no longer POST. I'm glad my current X670E-I has a switch on it with positions for auto (gen 5), gen 4 or gen 3. More boards should do this.
Didn't the radeon 5000 series have similar issues with risers as well?
if you put a 600w card in an sff you deserve the hell you wanted, please be reasonable 300w is already way too much unless its liquid cooled
@fredEVOIX depends on the case, some ITX cases can handle it. Especially if paired with a 9800X3D, which uses very little energy. I will try it in the NCase M2 later
One area that the video overlooked is large AI workloads, or games, where there is VRAM spillover to system memory. For example, a 12 to 16GB card working on a 20 to 40GB dataset. In cases like that, the PCIe bus bottleneck has a massive impact on the workload. For example, having Stable Diffusion do a 40GB workload on an RTX 4070, dropping from PCIe 4.0 x16 to PCIe 3.0 x16 (effectively simulating PCIe 4.0 x8) doubles the time needed for the render.
In such a case, a 4070 12GB performs similarly to a 3060 12GB.
While most people will not do AI image generation, it is the easiest test to do since the setup wizard is automated, and it represents a very real issue. Consumer cards tend to lack sufficient VRAM, but many AI workloads are memory hungry. This means that, depending on the model, a user can easily end up with a workload wanting to allocate 2 to 4x the amount of VRAM the card actually has, which makes the performance bottleneck the speed of the PCIe bus and not the GPU cores themselves.
Well 5090 seems not to run out of vram!
😂
So the test is very valid!
But the 4060 is slow even with PCIe 5 when it runs out of VRAM… so the PCIe version does not matter even there. I.e. if you run out of memory, it does not matter if you had PCIe 7; it is too slow to compensate. The main point of the PCIe slot is to deliver instructions to the GPU, and PCIe 3 is fast enough for that… for any GPU. The 5090 is the fastest, so it needs the CPU to talk to it faster than any other GPU… so there are no GPUs that need anything faster than PCIe 3.
@@haukikannel the 4060 is only pcie 4.0 though, and 8 lanes
@@haukikannel If you are not running out of VRAM, then the PCIe bus speed has very little impact on performance, and newer, more demanding games have often become even less PCIe bus speed sensitive, since many newer rendering methods rely more on VRAM and computation within the card. For example, Unreal Engine's Nanite functions do not require many additional instructions over the PCIe bus, but they require drastically more reads and writes to the VRAM, since the card needs to maintain the original asset in VRAM as well as a new computed asset that gets used in the viewport and is updated every frame. While it is computationally simple, thus low compute overhead, it requires a lot of VRAM throughput, which is usually not an issue for modern cards with 500-800+GB/s of throughput (though it is an issue for cards like the RTX 4060, which take a disproportionately larger hit since their VRAM throughput is so low, similar to that of the GTX 970 and GTX 980).
While the PCIe bus is always a bottleneck when pulling double duty as a bus for video memory, modern cards do prioritize memory assets. For example, they will use it for idle and low-usage assets that are less sensitive to slower access. A good example is a game like Ratchet & Clank: Rift Apart; it can allocate 4+GB of system memory as video memory and not suffer a noticeable performance hit outside of a slightly higher frame time spike for a few milliseconds when there is a sudden rift change. This is because the game preloads all assets for other rifts below the main map, and since occlusion culling prevents them from rendering, there is no noticeable performance hit. Once you travel through the rift, it simply sends you below the map, and then NPC AI starts up and there is a small frame time increase as data is swapped.
With that in mind, there are limits to the prioritization where at some point you reach a level of shared memory usage where data that needs the higher throughput ends up in system RAM, and that leads the PCIe bus to reach saturation, then at that point, you start to notice major performance hits.
The faster the PCIe bus is, the more bandwidth-intensive assets can be kept in system memory before saturation is reached. This is why, for example, a game can use 9GB of VRAM on an 8GB card (a spillover of 1GB) and in many games only suffer a slight performance drop, but as soon as the bus usage on the read side reaches full saturation, the game starts to hitch and suffer large frame rate drops. With a few exceptions, PCIe 4.0 x16 can handle 1.5-2GB of shared memory without major hitching, while PCIe 4.0 x8 will only do around 500-700MB before major hitching is encountered. A faster bus has more headroom to pull double duty.
I was actually thinking about what kind of impact this would have on direct storage games.
@@mromutt It doesn't seem like PCIe 5.0 will do much for direct storage, since the main bottleneck even now is the read speed of the SSD. With PCIe 4.0 you get a real world read speed of around 25-26GB/s over the PCIe bus, while the current best PCIe 5.0 SSDs will do 12-14GB/s. While a RAID 0 can improve performance in those areas, most game loads are low queue depths, often there is a single thread overhead in those workloads, thus even saturating the PCIe 4.0 X16 bus will be a challenge, unless they start to optimize it for more database style workloads. Beyond that, for most consumer platforms, often you will have 4 CPU lanes for the SSD, and then 4 lanes going to the chipset which is shared with everything else on the chipset, thus scaling will not be 100%.
With more workstation and server platforms with more PCIe lanes, there is more throughput to work with but often at lower single threaded performance. Since the CPU does not need to do decompression, it is left as a single threaded task.
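A back-of-envelope sketch (Python) of the spillover argument in this thread: whatever share of the spilled-over assets has to cross the bus each frame adds directly to frame time, so a faster bus gives more headroom before hitching. The spill size and touched fraction below are made-up illustrative numbers:

```python
# Extra frame time added when part of the spilled-over working set must be
# re-read over the PCIe bus each frame.
def extra_frame_time_ms(spill_gb, touched_fraction, bus_gb_s):
    return spill_gb * touched_fraction / bus_gb_s * 1000

buses_gb_s = {"PCIe 4.0 x16": 32, "PCIe 4.0 x8": 16, "PCIe 3.0 x16": 16}

for name, bw in buses_gb_s.items():
    # 1.5 GB spilled to system RAM, 10% of it touched per frame (illustrative)
    print(f"{name}: +{extra_frame_time_ms(1.5, 0.10, bw):.1f} ms per frame")
```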
Not only is der8auer content better but the comments he gets are better too.
der8auer viewers actually know what the hell they're talking about and understand what is being said in the video.
@@beowulf885 Sometimes.
@@beowulf885 Actually I'm here for the cat.
Thanks for making this video, a lot of the sites just gloss over this bandwidth testing between the different versions of the PCIe standard. I've been curious what the benchmarking would show, and thanks to you I at least get a glimpse of the answer 🤔
Great video... I've wondered about this and it's valuable information to see that FPS isn't really impacted that much by PCIE.
Thank you for doing this. There are some old videos about PCIe generation around but they are mostly inconclusive. Yours is on point.
There's a whole bunch of tests on TechPowerUp.
I love your conciseness man, super efficient and totally awesome, thanks a lot!
As expected, thx for confirming derBauer.
This is the video that I needed. Thanks as always
Oh dear. When I built my AM5 system a few months back, I paid a bit more than planned on the motherboard because it has a gen 5x16 slot. Looks like I wasted a bit of money. Great info Roman, thanks!
There is still value in having gen5x16 PCIe slots, just maybe not gaming GPU scenarios. You can bifurcate for 4x NVMe, and the bandwidth is far more important if you use AI models that don't fit into VRAM and need to transfer across PCIe into system RAM.
@mattrogers6646 good point
Running on PCIe3.0 x16, usually only shows visible impact once you go over VRAM budget.
what is the meaning of your comment?
good luck running out of 32GB of VRAM on the 5090 :D
@@h5b12345 super easy with an llm, those can use hundreds of gb.
@@h5b12345 that's the point of my comment
I'm on PCIe gen 3, and if my RTX 4080 drops to x8, performance drops almost linearly with it, where it wouldn't budge at all on PCIe gen 4, and on PCIe gen 5 it could go to x4 just fine. So no.
I requested this specific test on your last vid, thanks so much.
This makes me feel better running my 7900 XTX in PCIe 4 x8 (same bandwidth as PCIe 3 x16) due to the PCIe lane sharing with M.2 slots on X870E. I already figured it wouldn't make a difference but this just solidifies all doubts now! :D
Happy to see this video. Now I can not worry about using my SSD in my gen 5 slot.
Thanks for the information! I'm running pcie 3.0 with i7-5960x so I was looking for this.
Bring back the old outro! Love the videos keep it up ❤
Would love to see it in pcie 2.0 for data 😊. I got an old spare pc from 2010 that I wanna upgrade the GTX 480 to something modern for fun.
thank you I was wondering this myself.
concise, clear, interesting.
i skim watched and enjoyed. thanks
also i liked the sponsor
Great work as usual, sir!
Finally someone tests PCIe 5! Thanks!
I'm very happy with the outcome of these tests. It means that you'd be able to configure the mobo to split lanes out for M.2 devices etc. and still achieve plenty fast enough bandwidth to the GPU. (Not that I'm gonna be in the market for a 5090 any time this century, but a more modest sibling might be a possibility.)
You nailed it, thanks, I was wonder about it.
Does limiting the number of lanes at 5.0 cause any differences in the scaling vs dropping to older versions still at x16? I know 5.0 x8 is approx the same bandwidth as 4.0 x16, and 5.0 x4 is about the same bandwidth as 3.0 x16 or 4.0 x8, but does having fewer, faster lanes have any implications for latency that might cause performance differences despite being the same raw bandwidth?
It's identical. TPU tested this.
@@Arcona I looked at their 5090 article and the methodology is not laid out; they don't explain what they tested, they just say that they should be identical.
This is what I want to know as well. In theory it is identical but in practice? Lots of x870 mobos cut pcie5.0x16 into x8 when certain m.2 slots are populated
This would be good to know. Also would be good to see TB3 and TB5 GPU enclosure testing
@ Not on their 5090 review, I mean they did a dedicated article a while back testing a bunch of stuff on different generations of PCI-E spec.
Thanks for benchmarking this and showing gpu-z for how to verify your configuration. The Motherboard info on my Gigabyte was slightly confusing for me whether my 4.0 ran at x16 or x8 with an M.2 in a particular slot. Edit: Also realized my 2080 only runs at 3.0 x16, which is why my MOBO reported 3.0. Was confused for a few minutes, good to know.
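If you prefer checking the negotiated link from a script rather than GPU-Z, something like this works on NVIDIA cards. A sketch that assumes nvidia-smi is installed and on PATH; note the link often downshifts at idle, so read it while the GPU is under load:

```python
import subprocess

# Ask the driver for the currently negotiated PCIe generation and link width,
# plus the maximums the card/slot combination supports.
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,pcie.link.gen.current,pcie.link.width.current,"
     "pcie.link.gen.max,pcie.link.width.max",
     "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```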
It definitely makes me think about what would happen with an x8 or an x4 connection. Especially if the extra lanes were routed to fast storage.
i was worried because i have a Hyte Y70 with the pcie 4.0 riser, thanks Roman
Great info and results. Thanks!
Thank you! Exactly the info I was looking into yesterday. If I had $2000 to spend on a RTX 5090, I'd buy a better CPU & Mobo first anyway. Way more performance per dollar.
I would love to see a comparison of 5090 at 5.0x16 and x8 vs 4.0x16 and x8 vs 3.0x16 and x8 .
One of the big benefits of a Gen5 GPU is splitting lanes on the x16 slot.
5.0 x8 = 4.0 x16, and same for 4.0 x8 = 3.0 x16, etc. So it would be the same as shown in the video. He would only have to do 3.0 x8 then, that's the only one missing
@@EquilibriuMind Yes and no. For instance you see power differences in this video. Does this mean gen 5 at x8 results in gen 4 performance with gen 5 power consumption? I think it would still be interesting.
Quite a few people are running add-on cards, and very few motherboards won't drop your slot to x8 in that situation.
Yes, this was my first thought when seeing this video hope we can get something to show this. As the motherboard I have has split lanes for the GPU and the first m.2 slot would be interesting to see if you still get the full speed on x8 rather than x16.
@@xlias5636 Aren't power differences mostly due to higher FPS? If GPU renders 2% more FPS, it needs X% more power.
@@hagaiak Maybe? But that's kind of why it would be interesting to see the data.
I still use 9900k which is pcie 3.0 with my 3090 kp.
I just finished setting up my mora 400. Thanks Roman......
Only reason it's necessary is because of motherboard manufacturers doing pcie lane shenanigans with M.2 slots. Being able to run at 5.0 x8 and get full 4.0 16x bandwidth will be nice for people who use their other PCIE slots.
I am actually rather surprised that the performance difference between 5.0 and 4.0 was different at all, wouldn't expect the bandwidth difference to matter even if it's only a few %.
I sense another victim of the lie they keep pushing, like I was. They always write that 5.0 drops to x8, but it's not true: it's x8 of whatever you plug in, i.e. 4.0 x8 on a 4090. I'm on MSI motherboards now; you can use all the M.2 slots and keep PCIe_1 at x16 on their good boards, with no or very little bandwidth sharing. The X670E MEG Ace is great.
Not entirely true. You also have to think about optics. Think back to how everyone was reacting to Intel 10th Gen not supporting PCIe Gen 4. Almost everyone was taking them to task for it, even though it made no real world difference. If NVIDIA hadn't made Blackwell Gen 5 then they would have caught no end of grief over it. Doesn't matter if the bandwidth is necessary. Gen 5 is there, so it _has_ to be supported. The critics won't allow anything else.
@@Michael_mki233 A CPU supporting higher PCIe specifications is entirely different from a GPU doing so. The CPU's PCIe lanes have multiple applications where it can benefit (like NVMe SSDs and such), compared to the GPU where it's more of a toss-up.
I'm still using a 2019 Mac Pro with a 16-core Cascade Lake Xeon and an FE 4090. I may upgrade the GPU in the future, the PC is still fine. Running Windows 11 of course. So good to see 3.0 is not a huge performance drop.
So my 5800X3D lives to fight another day. Thanks for the video!
Thanks for the info. Maybe my 10900k will live on.
it would be interesting to see this in games that actually stress the Vram usage of the card - like Microsoft flight sim, or large open world games - where new data is constantly loading in and swapped out. in these games I would expect the bandwidth to matter more.
I do actually have a good use for this. It lets me use an x8 link to my cpu without performance loss, meaning i can have high speed networking as well as a GPU, which is a big plus for me.
Thank you this was the video i wanted to see.
On many motherboards if you populate all the M.2 slots the PCIE slot will be dropped from x16 to 8x. So the point of PCIE gen5 on this card is so that it can run in gen5 x8 in that setup which is equivalent to gen4 x16 speeds which is fine. If this were still a gen4 card, and the motherboard again dropped the lanes for it to x8, it would be a much different story (basically gen3 x16 speeds).
Wow, the CP2077 result really ended up being a good way to measure... relative PCIe controller power usage! We see the hints in the other graphs but I suspect most or all of the power difference comes from the PCIe x16 controller. And the numbers are about what one would guess/expect but it's still neat to see it that graphically.
"Negligible" not "Neglectible" ;-) Thank you for the video Roman. Du bist der Hammer.
In case of Nvidia its both...
Your english is very good if you noticed this error. Even most english native people would miss that! 😂
@@TransformHypnosis They do mean the same thing at the core! I caught that too, I wouldn't say it's wrong or inaccurate. Roman's English grammar and use is better than most of the people I encounter daily!
As I am on an X370 MoBo with a 5800X3D, I am still rocking PCIe 3.0. I was worried about needing to upgrade for this. I am glad that there is not much of a difference between the PCIe generations. I still have a 3080 Ti, which is still excellent. I will still be skipping the 50xx series. I just don't see the need to upgrade. I still get decent frames in the games I play.
Never believe the latest games telling you it is now time to upgrade, you'd be supporting lazy developers and not actual progress. It has mostly been the case past, like, 2010 or something, but nowadays it is exponentially true.
same mobo just lower gen cpu 2600 and 1050ti :D now saving money for 5800x3d and maybe 3060 or amd gpu :)
Yep i'm rocking a 5800x3D and a 6950xt no need for an upgrade especially since I game at 1440p.
@@pavelstoikov3780 3060 cost the same as 6750XT that is 50-100% faster.
@@danielainger I've got a rx 6900XT and 5800x which was my first gaming PC....of course I did it at the worst possible time lol. It's been great for all the games I play, esp since I don't care about raytracing whatsoever. However, I was hoping to see an 8900XT (9090XT) this generation but AMD decided not to have a higher end card this gen which is very disappointing. I'm hoping they don't permanently stop making a flagship card. I like AMD's value (at least in the few years I've been into gaming PCs) compared to Nvidia, even if they can't go completely toe to toe.
If you want a game to add to PCIe testing specifically. Look at Metro Exodus. It was one of the most impacted back in TPUs 4090 test from dropping down to 3.0.
Is the data at 6:35 accurate? That’s a 48% increase in average FPS from 4090 to 5090. The 1% lows seems about right (27%). Seems like you put the 5090’s 1% low in place of the 4090 average, perhaps?
I bought an R5 3600 with a broken PCIe x16 lane for ~$30. Go to the BIOS and change it to x4x4x4x4 and it runs fine. Bought an RX 5600 OEM and it's now running at PCIe 3.0 x2. It had some problem booting with no display, and I figured out that you need to loosen the GPU in the socket (let it sag). I'm using a 400W PSU and somehow the GPU can run at 150W just fine. But I cap all games at 60fps and use Lossless Scaling FSR and FG x2 on my 1080p 60Hz monitor, which is overclocked to 100Hz (it is a 3D 120Hz panel but runs at 1080p 60Hz). Thank you for listening to my rant.
great video! thank you :)
Ok! I need to see you overclock the ASTRAL 5090, ENOUGH WITH THESE LITTLE TEASER VIDEOS 😂😂😂😂😂 LOVE YOUR WORK BROTHER! 😁
As others have pointed out - this means that the vast majority of users do not have to worry about being capped at x8 due to lane sharing - something that Gigabyte is doing a lot of with their latest motherboards. (Not sure who needs 4 Gen 5 NVMe slots though).
I imagine the PCIe bandwidth would have a bigger impact in games with DirectStorage and/or Resizable BAR, where much more data is being transmitted over PCIe.
Especially when a game needs more memory than your card has VRAM. There it has a huge impact.
There is pretty much nothing gaming-wise that could saturate PCIe 5.0 bandwidth; it's just "bigger number means better" logic at this point.
Nah, that's still small compared to the GPU's own memory bandwidth, and makes no noticeable difference.
@@Arcona Check the 4060 Ti 8 GB and the impact of the PCIe version when the GPU runs out of VRAM :)
It's "nothing" only as long as you stay within VRAM.
@@KazutoKirigaya-hi8zh Running out of VRAM is going to make games run like crap regardless, though. The best option there is to play at settings where you won't run out of memory.
PCIe 5.0 is extremely useful for deep learning/AI, where moving training data from system memory to GPU memory is a bottleneck. Combined with the ability to put 4 cards on a single motherboard without having to replace the coolers with water cooling, it makes it relatively simple to build very beefy deep-learning workstations (if you can get the heat under control).
Basically, for gaming the 5090 may be "meh". But as a budget AI card, it's a big leap from the 4090.
And while $2,000 may seem like a lot to gamers, compared to the $30-40k for an H100 it's a bargain. If even some of the workloads that used to require H100s can be handled by systems with 5090s, it basically makes such systems 10x cheaper.
The memory will overheat if you put four 5090s right behind each other.
I haven't heard as funny of a thing as a "deep learning workstation" since crypto mining stations.
@@Gary_Hun Spot on, this is the single-user segment for this product on a gaming-oriented channel.
What kind of abominations of fans would one need to drive four 5090s next to each other? I can't even imagine that with those "will chop a carrot" server fans requiring hearing protection. They will suffocate each other.
@@Gary_Hun /r/LocalLLaMA would like a word with you
Would have liked to see riser cables of the various PCIe gens tested as well. Interesting information nonetheless, thank you.
The CUDA code samples include a PCIe bandwidth test. The PCIe speed should influence loading times, as long as the texture data is already cached in system memory.
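For anyone without the CUDA samples installed, here's a rough sketch of the same idea in Python, assuming a PyTorch build with CUDA support (buffer size and iteration count are arbitrary; this only approximates what the bandwidth test measures):

```python
import torch, time

# 1 GiB pinned host buffer and a matching device buffer.
src = torch.empty(1 << 30, dtype=torch.uint8, pin_memory=True)
dst = torch.empty_like(src, device="cuda")

iters = 10
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(iters):
    dst.copy_(src, non_blocking=True)   # host-to-device copy over the PCIe link
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

print(f"Host-to-device: {iters * src.numel() / 1e9 / elapsed:.1f} GB/s")
```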
It's just the same as when the first PCIe 4.0 cards came out, no surprise.
This is AMAZING! Not because it gives us more performance (it doesn't), but because it lets us run this card at PCIe 5.0 x8 instead of x16. That means we GAIN a CPU-connected PCIe Gen5 x8 slot to use as we please without losing any performance on our new 5090. I have been complaining for YEARS that Intel and AMD are not giving us enough PCIe lanes on current-generation enthusiast CPUs. This now gives us the opportunity to use up to 8 lanes of PCIe 5.0 however we like: two more PCIe 5.0 SSDs at full direct-to-CPU bandwidth, a 4x 4K video capture card, or a 25G, 50G or 100G network card. It used to be that if you had a top-end consumer CPU, you had zero expansion left if you wanted all of your GPU performance... NOT TODAY. This is FANTASTIC!
Could use something like memtest-vulkan as an alternative way to measure the bandwidth drop.
Typically the AM5 PCIe lane config for least restricted/full-lane M.2 configuration is:
CPU(Mode: x16 or x8/x4/x4) + CPU(x4 5.0) + Chipset(x4 4.0)
SSDs based on the new SM2508 controller will finally be what I consider thermally viable.
It was the same story when PCIe 4.0 and 3.0 released
What I'm really excited about when it comes to PCIe upgrades is the extra lanes I can suddenly use for storage devices.
I've been looking for updated videos on the 4080 regarding this subject. Using the top NVMe slot on the Apex Encore drops it from 16 lanes to 8 lanes. I'd like to see the effects in modern games.
Bandwidth of PCIe 5.0 x8 = PCIe 4.0 x16 = ~1% loss with the 5090 (at least in games).
@super2k128 I'm not talking about the 5090, I'm talking about the 4080. The 5090 is PCIe 5.0, the 4080 is PCIe 4.0, so when you use the top NVMe slot it drops from x16 to x8.
@@toonnut1 It's going to be a similar situation for most recent cards. Gen3 x16 is basically the lower limit of bandwidth GPUs need these days, even for cards as low-end as the RX 6600.
So Gen3 x16, Gen4 x8, and even Gen5 x4 will be enough for all GPUs right now.
@@toonnut1 Then you need to look for 4080 PCIe 3.0 x16 benchmarks
I expect at least 5% loss
Nice work. It would be great to showcase where the PCIe gen actually makes a difference.
Since it's now modular, it would be nice to see if we could get a board to run it on a four-lane PCIe 5.0 slot.
The good news is that when the GPU uses only 8 lanes, two more SSDs can be connected directly to the CPU (at least on the X870E platform).
For completeness I'd love to see 5.0 x8 tested, since some people want to use those extra PCIe 5.0 lanes on modern motherboards for other things now. I know 5.0 x8 has the same throughput as 4.0 x16, so it's probably just fine, but you never know for sure until you measure.
We need some tests with PCIe 4.0 risers like the one included with Hyte cases to test compatibility and stability at PCIe 5.0 speeds.
Great video! Please do the same with 5080 when it's available, it should have a little faster memory so maybe a bigger difference from 3 to 4?
Maybe in a very niche application where the VRAM overflows - but only on a high-bandwidth platform (Threadripper or server platforms with more than four channels of memory). And even then, system memory will be glacially slow compared to GDDR7 on a 512-bit bus.
This is why I wish we would see some motherboards with a full x16 4.0 slot on the bottom, leaving more 5.0 lanes for M.2 (not that I really need them there, but at least there would be a demonstrable difference).
I remember someone explaining that the PCIe gen only impacts performance when VRAM is limited; it was an interesting video. Hence the low performance loss here, thanks to the 32 GB.
Would love to see results with the x16 slot running at x8 electrical at 5.0, 4.0 and 3.0 in the charts. That's the more relevant test for folks wanting to run multiple NVMe drives.
I imagine the average use case is bottlenecked on the work being done on the data, rather than on the transfers happening in the background. If the work always finishes after the transfer does, at either transfer speed, then the transfer speed doesn't matter.
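A toy model of that point, if it helps (purely illustrative numbers, assuming copies can overlap compute):

```python
# Toy model: if the copy for the next batch overlaps compute on the current one,
# per-step time is max(compute, transfer); otherwise it's the sum.
def step_time_ms(compute_ms, transfer_ms, overlapped=True):
    return max(compute_ms, transfer_ms) if overlapped else compute_ms + transfer_ms

# Say the GPU needs 20 ms of work per batch; halve the link speed (4 ms -> 8 ms copies):
for transfer_ms in (4, 8):
    print(f"{transfer_ms} ms copy -> {step_time_ms(20, transfer_ms)} ms per step")
# Both cases come out to 20 ms: the slower link is completely hidden behind the compute.
```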
PCIe is waaaaaaaaaay ahead of the needs of graphics cards. Because it's primarily a standard used by servers with much greater throughput needs.
I think GPU makers use the latest PCIe standards just to 'hope' for one extra frame in comparative reviews, and because PCIe interfaces are cheap to update.
Maybe the difference would only be visible when loading textures or other assets from the SSD to VRAM, or the other way around; based on the bandwidth, it should affect how long those operations take. But it needs to be tested in the real world, to see if there's added latency or something.
Each Nvidia H200 has 141 GB of VRAM with 4.8 TB/s of bandwidth, meaning you can use 8 H200s in a node to run DeepSeek v3 inference in FP8, taking KV cache needs into account.
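Rough math behind that claim, assuming DeepSeek v3 is ~671B parameters at ~1 byte per parameter in FP8 (both figures are assumptions on my part, not from the video):

```python
# VRAM budget for an 8x H200 node (141 GB each) against an FP8 copy of the weights.
params = 671e9                      # assumed total parameter count for DeepSeek v3
weights_gb = params * 1 / 1e9       # ~1 byte per parameter in FP8 -> ~671 GB
node_vram_gb = 8 * 141              # 1128 GB of VRAM in the node
print(f"weights ~{weights_gb:.0f} GB, node VRAM {node_vram_gb} GB, "
      f"~{node_vram_gb - weights_gb:.0f} GB left for KV cache and activations")
```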
Thank goodness, this means that my B450's PCIe 3.0 x16 slot should be good for my RTX 5070.
Thanks for the test, I was already a bit worried about my PCIe 3.0 board, even though I will only upgrade to a 5070 Ti, or maybe a 9070 XT if FSR4 turns out good and FMF2 becomes available in ALL games, unlike DLSS Frame Generation...
It's not bad to have it: by dropping to x8 on high-end motherboards, you can use the other shared PCIe slot or add 2 additional M.2 drives.
Well, I would think the PCIe bus only matters when you're uploading textures, and since the 5090 has 32 GB of VRAM, it has enough space to hold all of a game's textures beforehand.
Good to know, especially if one of your NVMe drives shares PCIe lanes. So 5.0 x16 would drop to 5.0 x8, which has the same bandwidth as 4.0 x16.
Exactly! Here I have a 4090 PCIe 4.0 at x8 because it shares bandwidth with an Intel Optane 960 GB PCIe drive. Performance loss is negligible, but with a 5090, PCIe 5.0 x8 would be equivalent to full 4.0 x16 bandwidth. So it is always good that the 5090 is PCIe 5.0 even if we don't take full advantage of it (and even if only for peace of mind lol)
Not having to sacrifice speed at x8 in order to populate more drives or other devices is huge for so many folks, and for some it saves the cost of moving to a workstation platform. I don't know why anyone would think it's useless.
The crazy thought to me is that you could run a 5090 off a PCIe 5.0 x4 connection, which is the same bandwidth as PCIe 3.0 x16, and be fine. Kinda cool when you think about it.
I was talking with someone on the AMD subreddit the other day about the move to higher PCIE standards, how it was done not due to performance needs but mostly just because everything else was moving forward a generation.
As a cost-cutting measure, they could honestly probably wire these cards for PCIe 5.0 x4 (which has the equivalent bandwidth of 4.0 x8 / 3.0 x16). The only major issue would be legacy systems, which is probably why they generally don't: the card in a PCIe 4.0 system (like Ryzen 5000) would probably be okay for the most part (4.0 x4 = 3.0 x8), but it would start to be severely hampered on a PCIe 3.0 board (Ryzen 3000 and below / Core 10th Gen and below). Low-end cards like the RTX xx60/RX x600, which only have 8 lanes of PCIe 4.0, lose noticeable performance when dropping to 3.0 (though even then it's not much).
We got into discussing PCIe switches to "convert" such a card's 4.0 x8 into 3.0 x16 in order to maintain the bandwidth and keep legacy systems viable a while longer, and got discouraged by the exorbitant prices for them...
The 3DMark PCIe Bandwidth test is highly affected by rBAR too. Use Nvidia Profile Inspector to enable the rBAR feature and change the rBAR size limit to a value greater than 2 GB, and the test will likely show much higher results. I found this a year ago when trying to figure out why my system seemed stuck at 24 GB/s while some lesser systems saw more. IIRC you have to set it globally to affect the actual test app that 3DMark's UI spawns.
On a 13900K with a 4090, an rBAR size limit under 2 GB always gave ~24 GB/s, a 2 GB limit gave ~28 GB/s, and any rBAR window over 2 GB gave ~60 GB/s.
0x0000000080000000 = 2GB VRAM
0x0000000100000000 = 4GB VRAM
0x0000000200000000 = 8GB VRAM
0x0000000280000000 = 10GB VRAM
Yakumo.
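For anyone wondering where those hex values come from, they're just the BAR window size in bytes; a quick check:

```python
# Convert the rBAR size-limit values from bytes (hex) to GiB.
for limit in (0x0000000080000000, 0x0000000100000000,
              0x0000000200000000, 0x0000000280000000):
    print(f"{limit:#x} = {limit / 2**30:g} GB")
# 0x80000000 = 2 GB, 0x100000000 = 4 GB, 0x200000000 = 8 GB, 0x280000000 = 10 GB
```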
My problem is when GPU manufacturers cut the number of physical pins to force a PCIe x8 connection. It happened recently with the Radeon RX 6600 XT.
So? The card or chip would not see any benefit from going to x16. By cutting it, you're saving costs: you don't need a PCB with the extra traces.
Why do you have a problem with that? It is exactly as vanderlinde just explained. So it's a you problem, not an actual problem.
@@MissMan666 There are pros and cons with both. There are most definitely valid reasons to only make it 8x - size is one of them.
Let's take a PCIe 5.0 x8 card that consumes the entire bandwidth of 5.0 x8.
Put this card in a PCIe 4.0 x16 motherboard and you won't have the bandwidth to support it. Even though 16 lanes of 4.0 equal 8 lanes of 5.0 in throughput, the card only has 8 lanes, so you're limited to 8 lanes of 4.0, and that's not enough (rough sketch of the negotiation below).
Depending on the cost and type of product, they could have made it 5.0 x16. Yes, it would only ever need half of that slot's bandwidth, but it would let the card run at full speed in more configurations, like a bifurcated 5.0 x8/x8 setup or a full 4.0 x16 slot.
It's quite common for PCIe slots to accept cards longer than the slot itself, so unless the size of the card is a concern, there's no reason not to make it x16... well, except cost.
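A tiny sketch of that negotiation logic, using the same rounded per-lane numbers as above (purely illustrative):

```python
# Both ends of the link fall back to the highest generation and widest width they
# share, so usable bandwidth = per-lane speed of the common gen x common lane count.
PER_LANE_GB_S = {3: 0.985, 4: 1.969, 5: 3.938}

def negotiated_gb_s(card_gen, card_lanes, slot_gen, slot_lanes):
    gen = min(card_gen, slot_gen)
    lanes = min(card_lanes, slot_lanes)
    return PER_LANE_GB_S[gen] * lanes

print(negotiated_gb_s(5, 8, 4, 16))    # x8-wired Gen5 card in a Gen4 x16 slot: ~15.8 GB/s
print(negotiated_gb_s(5, 16, 4, 16))   # the same silicon wired x16 would get ~31.5 GB/s
```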
@@vanderlinde4you The difference rears its head when you run that card in an older system, which makes sense given that it is a budget card.
@@MissMan666 It's a budget PC builder's problem, or an old PC owner's problem, perhaps. Your perspective seems...young.
This almost looks like game engines conservatively restrict transactional data to the bandwidth of PCIe 3.0 x16. That in turn looks bright for OCuLink 4i for Gen5 external GPUs, at least until new titles start shifting this limit up to PCIe 4.0 x16.
PCIe 5.0 is going to be a lot more important for lower-VRAM cards. The 5090 has more memory than any game will use, so it has very little need to ever transfer data in and out during play. I'd like to see this test done again with a 5060.
I'm glad to know that my PCI-E 3.0 motherboard won't be a huge limitation, and might actually benefit my wimpy PSU.
I think the one advantage that I'd gain with PCI-E 5.0 is that the current X870/X870E motherboards from AMD are kind of starved for PCI-E lanes. I use a 10G NIC in my second slot, which means my first PCI-E slot runs at 8x instead of 16x. So, using a PCI-E 5.0 card means I'd theoretically get my effective PCI-E 4.0 16x speeds back since PCI-E 5.0 is twice as fast. It's not a huge deal, but it's a nice little bonus.
We always knew Gen5 would be useless for graphics cards from a pure performance POV. But opening up the option for x8/x4/x4 is the real deal here. Maybe even on the card itself, with SSDs on the back of the board holding all the assets. The way is paved.
Great news for us SFF case manufacturers 😊
Why? Using an SFF case doesn't mean you have to use an older gen PCI-E does it? Pretty sure you can get gen5 boards in Mini ITX form.
@@Arcona true, I am referring to many SFF cases that use riser cables.
@@riba2233 Ah sure. I suppose if it did make a big difference you'd have to get a 5.0 certified cable, and you know they'd charge way extra just to say it works at 5.0 ha ha.
I'd like to propose an explanation for the PCIe bandwidth test results. Back in 2020, Nvidia introduced a library to do on-the-fly LZ4 compression of data on the PCIe bus. It was meant for large-scale systems using NVLink and CUDA workloads, but it's possible this has now made its way down to consumer cards, especially with the recent push for disk streaming. It'd be interesting to run this benchmark across card generations, maybe all the way back to 2018, for both Nvidia and AMD cards.
As a side note, I also wonder if you might be able to notice the drop from 5.0 to 4.0 if your game is using disk streaming.
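If someone wants to poke at the compression theory, a crude way to test it is to compare transfer speed for highly compressible data (all zeros) against incompressible random bytes. A sketch assuming a CUDA-enabled PyTorch install (buffer sizes are arbitrary, and a result proves nothing on its own):

```python
import torch, time

def h2d_gb_s(src, iters=20):
    # Time repeated host-to-device copies of a pinned buffer.
    dst = torch.empty_like(src, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    torch.cuda.synchronize()
    return iters * src.numel() / 1e9 / (time.perf_counter() - t0)

n = 1 << 30  # 1 GiB buffers
compressible = torch.zeros(n, dtype=torch.uint8).pin_memory()
incompressible = torch.randint(0, 256, (n,), dtype=torch.uint8).pin_memory()
print(f"zeros:  {h2d_gb_s(compressible):.1f} GB/s")
print(f"random: {h2d_gb_s(incompressible):.1f} GB/s")
# If anything along the path compresses, the all-zeros buffer should come out noticeably faster.
```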
This is a pretty clever benchmark. Curious what prompted der8auer to run this benchmark. :)
You get the same performance using half or even a quarter of the PCIe lanes. That means more lanes left over for a second graphics card, M.2 drives, and so on.
A new standard like PCIe 5.0 is NEVER useless!
As someone who typically runs multiple GPUs, a higher standard not only lets you use older PCIe slots, it also lets you put a GPU into a PCIe 5.0 x8 or even x4 slot while losing a minimal amount of performance.
The same is true of older enterprise NICs built for PCIe 3.0, which can run at full speed either in 2.0 x8 slots or in 3.0 x4 slots; 3.0 x4 is easy to find these days, while dedicating 8 lanes just to a dual-port NIC feels excessive.
Similarly, the latest Aquantia 10G chips only need PCIe 4.0 x1 or PCIe 3.0 x4 to run at full speed (quick math below), which is great if you only have a single x1 slot left on the motherboard, and the latest motherboards often run those slots at 4.0 speeds through the chipset.
So in all cases I would choose the GPU/NIC/whatever that supports the higher PCIe version rather than the lower one, even if it technically doesn't require it today.
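The 10G NIC math, for reference (1.25 GB/s is just 10 Gbit/s divided by 8, ignoring Ethernet overhead; per-lane figures are the usual rounded values):

```python
# Why a single 10G port fits in one Gen4 lane: line rate vs. usable PCIe bandwidth.
PER_LANE_GB_S = {3: 0.985, 4: 1.969, 5: 3.938}
needed_gb_s = 10 / 8   # 10 Gbit/s Ethernet ~ 1.25 GB/s of payload per direction

for gen, lanes in [(3, 1), (4, 1), (3, 4)]:
    link = PER_LANE_GB_S[gen] * lanes
    verdict = "enough" if link >= needed_gb_s else "too slow"
    print(f"Gen{gen} x{lanes}: {link:.2f} GB/s -> {verdict}")
# Gen3 x1 falls short, while Gen4 x1 and Gen3 x4 both cover a 10G port comfortably.
```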