If you want to skip the SIMD basics and get straight to AVX512 instructions: 10:22.
(Mic setup is great this time around. Looking forward to the next video!)
Cheers mate! Glad the sound is better :)
Much needed, instead of the usual tripe people say about avx512 (that it's 'just 2x avx2 + a lot of mess')
Creel here points out that AVX-512 is only marginally better at floats and really shines in integer performance. Meanwhile Linus Torvalds went on a rant about how AVX-512 is Intel creating magic instructions to improve performance in float heavy benchmarks and he'd rather them focus on integer performance. Is Linus just stupid?
@@Ang3lUki The man started and helps maintain the biggest and most-used open-source kernel in the world, he is clearly not stupid; his _opinion_ of AVX stems from his experience as a kernel dev, where massively parallel floating point calculations are just not important
@@RaaynML Not only that, but the cost of saving and loading register state, especially for massive register files like SIMD, can get expensive in the kernel when context switching. As a result, SIMD as a whole is often just completely disabled during kernel development outside of context switches.
2 years later, Intel doesn't support AVX512 and AMD does (on consumer). How the times have changed
Unfortunately, consumers in particular will not benefit from this as long as most CPUs do not support AVX512. Even if they have an AVX512 capable CPU. Computer game manufacturers in particular will only support AVX512 when the market is big enough for it. Until then, consumers and gamers will have to be content with AVX2. This is likely to be the lowest common denominator over the next few years.
Of course there are exceptions. As a programmer, I don't have to buy a server CPU to use AVX512 if I can buy a consumer processor from AMD with AVX512 support. And performance-hungry end user programs, where the effort for extra binaries or plugins is worthwhile, will probably also be able to offer AVX512 support.
"Why didn't we have this for ten years?"
Intel: _laughs in monopoly_
We didn’t have it because it makes your CPU so hot it can trigger nuclear fusion
Now I'm going to watch this following video and see what it thinks. --> Linus Torvalds hopes: "AVX512 Dies A Painful Death"
@Czekot with how uncompetitive AMD has been in the past, it might as well be.
@@semicolontransistor well... now he has a problem with an uncompetitive Intel
Your channel is definitely one of the best ways to catch up on interesting technologies and programming methods out there.
You put out some excellent work right here 🤩😍
I am an assembly fan. I love the way you explain it. No matter how many videos it takes, just explain it through. Very few youtubers do that. Keep it up
Ah yes more instructions whose names I won't be able to remember
Thank God we don't have to write in assembly
@@Jaker788 What? You mean there's something other than assembly?
StackOverFlow, Google, Reddit:
Am i a joke to you?
you’ll remember them when you use them
@@clementpoon120 so never
Man!! What rare content on YouTube. Thank you very much. YOU deserve more subscribers and views.
Totally right about AVX-512's power draw. Using it tends to make the CPU clock down, even on cores not running AVX-512. Gotta love dynamic boosting.
For new viewers: AMD added AVX512 support beginning with their Ryzen 7000 series.
As of yesterday, August 29th 2022, AMD has announced that their Zen 4 7000-series desktop CPUs will have AVX-512. I am excited since Pixar RenderMan takes advantage of this instruction set.
My next CPU that i will buy will definitely have AVX512 support.
Zen 4 is supposed to support AVX512. Don't know if that's for Epyc only or Threadripper and Ryzen too.
That's great info! I'll check it out, thanks for sharing :)
@@WhatsACreel Just remembered it was part of Moore's Law is Dead video ua-cam.com/video/x7dtqdbJQW8/v-deo.html . So it was leak/rumour/speculation. So it's not fact until AMD say it is. Not sure SIMD is necessary on a CPU since a GPU is pretty much used for this kind of computation. This may be AMD's stance and would use the extra transistors on future CPU architectures for something else.
@@maxwellsmart3156 simd on cpus can still be good for things that need a somewhat high throughput with lower latency than gpus.
@@loserface3962 Plus a lot of stuff really can't be pushed to the GPU for a bit and then results returned for the rest of the program to work on. Maybe with an iGPU and unified memory address space, but still.
I'll be interested in AMD's plans for chiplet FPGAs, although it's probably not gonna happen on desktop and probably only on specific Epyc SKUs
The leak is for Genoa, or Zen 4 EPYC. And what it says is it supports AVX3-512 and I believe it expands the instruction set but I'm not positive.
This is more important for server than PC, considering their large core count CPUs can be used for render servers and large data processing. The rumor is also a core count up to 96 cores, so I can see why AVX-512 would go nicely with that. It's a hell of a lot of compute power having 192 cores on a single server board.
I'd say power would be an issue, but since this is TSMC N5 which is a 25 - 30% power reduction, that helps to offset going up to 96 cores.
I'm more curious to see if AMD starts producing more specialized chips. They kind of hinted at this a couple months ago in a video talking about future server products.
And I'll stick with AMD here over a variety of products. I think they should drop the chiplet design for desktop, for CPUs that are 8 core and less since the transistor density of N5 is a substantial gain over N7. Maybe they have a slightly higher failure rate, although TSMC has been putting out really low defect wafers. The benefit is getting rid of the complexity of a chiplet design and no need for an IO die. In fact if they did that they could leave a connection on that chip to add another 8 core chiplet directly to the die with both 8 cores and IO. They would take them down to 2 chips for higher core desktop parts. For server though, they'd have to maintain an IO die. I DO think they should be pushing 12 cores/chiplet for server, otherwise too many interconnections to go up to 96 cores with an 8 core chiplet.
It will be interesting to see.
I'm loving the content. I watched through the video on set associative caches and it didn't really click until I saw your video explaining the topic. Thank you!
What's odd to me is that Intel has taken AVX-512 back out of its 13th and announced 14th gen CPUs. Now it's the opposite of the situation you spoke about in 2020: AMD has it and Intel does not. I would love to hear a follow-up video to this series explaining why you think Intel removed it from its current gen CPUs. Thanks for these videos. I really enjoyed them!
Have you heard of the RISC-V vector extension? The concept seems to be pretty cool, because it allows for arbitrary vector sizes while maintaining binary compatibility. So basically generic vectorized functions.
Here is a good introductory video ua-cam.com/video/GzZ-8bHsD5s/v-deo.html.
@@oj0024 Thanks!
ua-cam.com/video/9e9LCYt3hoc/v-deo.html is also really good
I have, yes! A lot of really cool ideas! I haven't kept up for a little while. They were talking about SIMD possibilities when I was reading up. I think x86/64 is so complicated nowadays, it is only a good thing to invent new architectures :)
RISC-V is a hot mess and will never get anywhere as is
If anyone's wondering, AMD supports AVX-512 as of Zen4 (Ryzen 7000 series)
i really don't know how i came here, but i like your channel.
Thanks for this, I love how you just knew the question that popped in my head at 4:31. What could be a use case for MISD? BAM, error checking
Expectation: AVX-512 is the future of Intel, AMD will only keep adding more cores. Reality: Intel removed AVX-512 to get more cores, AMD added a power-efficient implementation of AVX-512. Who would have thought!
How many more tabs can I open in Chrome though?
512 😁
I have 96 tabs open right now... No joke.
that depends on your ram, not your registers :P
Chrome will crash.
AVX 512 is like doomslayer and chrome is like demons.
Single-cycle barrel shifters are a similar order of magnitude in physical size to a wallace network (aka pipelined multiplier used for fma), so you better make sure you really want one before you add it. Obviously very useful though.
Edit: well, maybe not quite the same order. You might fit 4-5 barrel shifters in the space of a 64->128 bit wallace network. Still a huge chunk of silicon you can't allocate to registers or store buffer entries or whatever.
I'll have a pint of bitter please.
Wait, sorry, got thrown off by the accent. I meant to say your videos are great, keep up the hard work - it's appreciated.
Great stuff.
It's true... Dave is a legend!
Question / request for confirmation for possible advantage of AVX512:
My understanding is that (without AVX512) on cache write miss, data must be read from RAM even if you will write the whole cache line later. Will AVX512 eliminate this problem? I mean if you can fill a whole cache line with one store instruction there's no need to read data from RAM (or higher level cache). The way I avoided this extra read with SSE/AVX was to use stream store instructions.
I would think the advantage with AVX512 really comes in this aspect when the output both needs to be processed further immediately (but the data needed for the next step does not fit in the registers) and stored in memory for later usage. It may be the blind spot of pre-AVX512 streaming vs non-streaming stores. Where am I wrong? :)
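For anyone wanting to experiment with the question above, here is a minimal sketch (not from the video) of the two store flavours, using x86 intrinsics from <immintrin.h>; the function names are made up and dst is assumed to be 64-byte aligned:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Fill a buffer with non-temporal (streaming) stores, which bypass the cache
// and avoid reading the line first.
void fill_streaming(uint8_t* dst, size_t bytes, uint8_t value) {
    __m512i v = _mm512_set1_epi8((char)value);            // broadcast the byte to all 64 lanes
    for (size_t i = 0; i + 64 <= bytes; i += 64)
        _mm512_stream_si512((__m512i*)(dst + i), v);      // non-temporal 512-bit store
    _mm_sfence();                                         // make the streaming stores globally visible
}

// The same fill with ordinary 512-bit stores; whether writing a full cache line
// this way still triggers a read-for-ownership is exactly the question above.
void fill_cached(uint8_t* dst, size_t bytes, uint8_t value) {
    __m512i v = _mm512_set1_epi8((char)value);
    for (size_t i = 0; i + 64 <= bytes; i += 64)
        _mm512_store_si512((__m512i*)(dst + i), v);       // regular aligned 512-bit store
}
```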
Just an FYI, the recently announced Ryzen 7000 series CPUs will support AVX512 at launch. It would be interesting to see a simple program in AVX512 running on two high-end platforms, AMD vs Intel. It would be even more neat if the ASM could be compared; I wonder if AMD CPUs have fused ops?
I think even older ryzens have a few FMA 'cores'
Best youtube channel
AVX-512? More like "Awesome video; well done!" 👍
animation work is pretty dope and matches the content 👌
So glad we have this taking up lots of space for limited use. Especially when there’s other ways to achieve this.
Totally agree. Intel has made a poor decision on this AVX-512 stuff. It is not useful for 90% plus of its users. They should not implement it in the mainstream product.
@@catchnkill These instructions should be implemented on an accelerator chip/card for this kind of operation. Or well... create a parallel line of processors with and without these instructions, for the people who do or don't need them. I'm sure that without these instructions, you could fit another core that would help most people more.
@@catchnkill I highly disagree. I hope AVX-512 will become mainstream in every x86-64 CPU.
Reason:
A highly optimizing compiler will use it when it's available, even in cases where a normal programmer would never think to use the SIMD unit for it. Special accelerator chips/cards are expensive, usually have a small market share, access to them is slow due to the bus, and programs have to be written specially for them. Because of the latter, compilers cannot use them for normal programs written for the CPU. Normal program code does not benefit at all from such external accelerator chips. With AVX512 it's different because of compiler optimization. The only prerequisite is wide adoption of AVX512 in all x86-64 CPUs.
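As a concrete illustration of the "the compiler just uses it" point (my example, not from the comment above): a completely ordinary loop like the one below can be auto-vectorized to AVX-512 when built with something like g++ -O3 -march=x86-64-v4 (or -mavx512f), with no intrinsics in the source:

```cpp
// Plain scalar source; the compiler decides whether to use xmm, ymm or zmm registers.
void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   // vectorizes into fused multiply-adds when FMA/AVX-512 is enabled
}
```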
@@OpenGL4ever It is a poor move that has left Intel far behind Apple Silicon. They put AVX-512 into every chip in the last few generations while less than 0.1% of buyers actually use the feature. The feature is there. It takes silicon area. It consumes energy while doing nothing. The wasted die area could be used to put more small cores in. I do not agree that Intel should put a nearly-never-used feature into every chip. In fact, they have started to disable AVX-512 in manufacturing to save energy. And they also require PC manufacturers to turn off AVX-512 by default in the BIOS.
@@catchnkill Apple was only faster for a very short time. Current x86-64 CPUs have long been faster than Apple CPUs again and this applies to both AMD and Intel CPUs.
There's nothing wrong with that.
It takes time to adapt the compilers to AVX-512. It also took a long time until x86-64 was properly supported and the extended registers of the 64-bit long mode could be used. Compilers are not created overnight, they need time to evolve.
No, AVX-512 doesn't use energy if it is not used. Energy-saving features shut down the units and transistors that are not needed.
I disagree, more cores will not improve single thread performance. But AVX-512 does when it's in use.
As soon as AVX-512 becomes more widespread, i.e. there are more CPUs with this feature among end users, applications will use AVX-512 in the same way that SSE2 is used for many things today. And today's compiler uses SSE2 for things where a human programmer programming directly in assembler would never think of using SSE2.
So ordinary tasks.
This only affects certain CPU models. They probably made a mistake here and there, but future CPU generations won't have that problem.
Damn, I'm so sorry to the person who had to write an x86 emulator.
That was a great overview, thanks!
I see a lot of gains for crypto, especially in checking blocks, from the massive use of calculations on strings.
Do I think I will ever write anything in assembly? Only if I end up at AMD or IBM and am actively debugging a CPU I'm working on the logic for (while that's my goal, I don't consider it a guarantee, and certainly not for the next 4 years). Will I still watch through this because it's just fascinating? Absolutely
Zen 4 will have AVX512 aaaaaaaaand Intel just disabled it by default because of lack of support in E-cores. I miss MMX and 3DNow! - can't you grab an old AMD chip and give 3DNow! a bit of a spin for us?
- What did it cost?
- Two cores...
Back in the day, I programmed the Z-80 and 8080 8-bit CPUs for a living. More recently, I designed a Minimal Instruction Set with only 16 instructions. I don't see how a larger instruction set is necessarily a good thing, because there's always a learning curve. For my money, I like a very reduced instruction set, and a smart compiler with pseudo instructions and a nice API for fancy things like 64-bit multiply and divide. All good wishes!
haha, I see we both know Mr EEVblog Dave. I also like the DaveCAD system too :)
Love your videos!
Awesome video
What's the overhead like for explicitly including rounding per instruction?
The playlist is in the wrong order.
Just an FYI.
love your channel!
Imagine AVX512 with 128 threads; you could write a real-time path tracer without a GPU.
well said
It's not fully supported in hardware though, i.e. it does not execute any faster because Intel didn't add the extra ALUs needed; it merely supports it so developers can write mainframe code on desktops.
Oh, that's an interesting thought! This would explain why the floating point performance I measured was the same speed as AVX. I figured they would work on the speed as time went on. I think it was similar in Sandy Bridge, with the original AVX. Must say, the integer performance was great in AVX512, and certainly the masking and auto-broadcast etc. are all really flexible additions too! Anywho, cheers for the info and thanks for watching :)
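For readers who haven't met the masking and broadcast features mentioned above, a tiny sketch with AVX-512 intrinsics (my own illustration, not code from the video): add 10 only to the lanes that are greater than zero, leaving the other lanes untouched:

```cpp
#include <immintrin.h>

// Masked add: lanes where the mask bit is 0 keep the value from 'data' (the src operand).
__m512i add_ten_where_positive(__m512i data) {
    __mmask16 positive = _mm512_cmpgt_epi32_mask(data, _mm512_setzero_si512()); // one mask bit per int lane
    __m512i ten = _mm512_set1_epi32(10);                                        // broadcast a scalar to all 16 lanes
    return _mm512_mask_add_epi32(data, positive, data, ten);                    // data[i] += 10 only where positive
}
```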
I love your videos!!!
WOOOOOOOOOOOO
Thanks
Don't know man, MMX should be enough for everyone.
I would love to have an AVX512 CPU just for the fun stuff in programming.
please elaborate
@@FsimulatorX more instructions to play with, more elaborate algorithms, etc.
True Survivor!
love this
Thanks for the video you're awesome ;)
No problem 😊
does memset use the broadcast technique?
Memset certainly does use broadcasting! I'm not aware of any that use the AVX512 broadcast specifically, but they definitely use AVX and SSE broadcast.
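Roughly what such a broadcast-based fill loop looks like, as a minimal sketch with AVX2 intrinsics (real memset implementations also handle alignment, small sizes and non-temporal paths; the function name here is made up):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Simplified memset core: broadcast the byte once, then store 32 bytes per iteration.
void memset_avx2_sketch(void* dst, int value, size_t bytes) {
    __m256i v = _mm256_set1_epi8((char)value);         // broadcast the byte into all 32 lanes
    uint8_t* p = (uint8_t*)dst;
    size_t i = 0;
    for (; i + 32 <= bytes; i += 32)
        _mm256_storeu_si256((__m256i*)(p + i), v);     // unaligned 256-bit store
    for (; i < bytes; ++i)                             // scalar tail for the leftover bytes
        p[i] = (uint8_t)value;
}
```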
Hi, I have a noob question
If our computer has a GPU, why do we bother using AVX? is it just about the cost of moving data between CPU and GPU or is there other factors?
Essentially yes, the latency between the GPU and the CPU is extremely high in terms of CPU time (thousands, if not tens of thousands of clock cycles). For raw throughput, GPUs are often more performant. But sometimes you want to calculate more stuff and don't want to wait that long. There are also lots of tasks CPUs are really good at but GPUs are terrible at, because CPUs are designed to do stuff really fast, while GPUs are designed to do lots of stuff in parallel, efficiently. This kind of brings the CPU a bit closer to the GPU realm, which can be handy for certain applications.
@@spicybaguette7706 Thank you for the response :)
These are great!
Like others are saying, Zen 4 (7000 series Ryzen) does support AVX512, but the CPU doesn't actually have 512-bit registers. As I understand it they're double pumping the 256-bit registers to achieve the effect of AVX512, but it probably doesn't have the same performance as an Intel CPU. Most consumer Intel chips don't support AVX512 at all anymore though... There was a way to get early samples of Alder Lake working, but afaik it's impossible with Raptor Lake
Edit: Also, I'm pretty sure early Ryzens did the same thing. Double pumped SSE registers to achieve AVX2 support
AMD announced AVX512 in Ryzen 7000 series in May'22
MISD is not rare at all, that's what microcode is all about actually. On the contrary it's just so common that we forget it's a thing.
But can it help my 7GHz OC'ed, nitrogen-cooled CPU run Crysis??
Thanks so much for your videos, massively helpful!
Welcome mate! Cheers for watching :)
Hello everyone. Do you know which of these CPUs will perform better in machine learning and data science tasks? Needless to say, I would use an Nvidia GPU like a 3070 alongside the CPU, but I wanna choose an appropriate CPU for these types of tasks. These are my choices:
1. 5900X: $250 (used)
2. 13600KF: $400 (used)
3. 13700KF: $500 (new)
But as you know there is another important factor: the GPU. If I choose the 5900X then I could spend the extra money on a better GPU. To summarize, the CPU and GPU configurations I can afford are these three options:
1. 5900X + 3080Ti
2. 13600KF + 3070Ti
3. 13700KF + 3060 / 3060Ti
Which one would be the better combination?
That depends on whether you need to move a lot of data back and forth between RAM, GPU, and CPU when using the GPU solution.
The bus is the slowest link and carries the highest performance penalty.
That's why sometimes it's better to do everything on the CPU. The bus does not come close to the data throughput and clock rate of the CPU.
The situation is different if you can move the data to the VRAM of the dedicated GPU and the data stays there for a long time and is only read out at the end. Then the GPU makes a lot of sense. And then of course there are the cases where the CPU is significantly slower than the GPU. You then have to weigh up whether the penalties for pushing the data over the slow bus are worth it.
Another way would be to just put the money in a strong CPU and upgrade the GPU later when you have money again. Until then, you can use the old GPU.
After compilation how does software know if it’s running on a cpu which can handle the new instructions? If it’s not running on an equipped model, how does the compiler generate alternate code to compensate? Where does the differentiation take place? At compile time by generating alternate code in the object file or does the program have to branch to an alternate code block at runtime to compensate for the missing instruction capability?
There are compiler flags to tell the compiler what CPU to target. If you then try running the binary on a CPU that doesn't support the instructions, it'll probably crash or throw an error. If you want runtime detection and branching, you'll have to do that manually.
@@weirddan455 It doesn't crash, the OS will kill it. Politely.
@@weirddan455 Then these new instructions seem pointless if compilers, linkers, and installers can't take advantage by producing object modules that support the various architecture features and then deciding which version of the object code to install.
@@lohphat As AVX-512 becomes more widespread, programmers can be more confident that AVX-512 will be supported. And then at some point the support of AVX-512 will be chosen as the compile target as the lowest common denominator. Today this is the case with SSE2, for example, because every 64-bit x86 CPU can handle SSE2.
What you can also do is simply create several different binaries with AVX-2, AVX-512 and without AVX support and then start a small program first when the user wants to start the program. This small program then checks which features the CPU supports. And only then is the corresponding binary started.
This was done, for example, for the computer game "The Chronicles of Riddick: Escape from Butcher Bay" from 2004. AVX didn't exist at the time, but MMX, SSE, SSE2 and 3dNow! did. The corresponding binary was thus started according to their capabilities.
Of course, this is only a small effort if the code is written in a high-level language and the compiler is advanced enough to optimize for the corresponding SIMD units. Otherwise the code or at least parts of it would have to be written manually for each SIMD unit type and that takes a lot of time and therefore money.
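A small sketch of the check-then-dispatch idea from this thread, using the GCC/Clang builtin __builtin_cpu_supports (the kernel functions are hypothetical stand-ins; in a real project each would live in its own translation unit built with -mavx512f, -mavx2, or no extra flags):

```cpp
#include <cstdio>

// Hypothetical builds of the same routine.
void kernel_avx512(float* data, int n) { std::puts("AVX-512 path"); }
void kernel_avx2(float* data, int n)   { std::puts("AVX2 path"); }
void kernel_scalar(float* data, int n) { std::puts("scalar path"); }

using Kernel = void (*)(float*, int);

// Pick the widest implementation this CPU actually reports support for.
Kernel pick_kernel() {
    if (__builtin_cpu_supports("avx512f")) return kernel_avx512;
    if (__builtin_cpu_supports("avx2"))    return kernel_avx2;
    return kernel_scalar;
}

int main() {
    float data[8] = {};
    pick_kernel()(data, 8);   // runs the AVX-512 path only if the CPU supports it
}
```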
Of course he's a Kung Fury fan and nuts.wad is his favorite map
The 512-bit vector register size will no longer be considered crazy once the latest CPUs go to 2048-bit vector registers. 640K is enough, right?
I agree and we already have several MiB as 2nd Level Cache.
Has AMD adopted the 512 version?
Till Zen 2 they didn't. Don't know about Zen 3 though.
Guess what, now *only* AMD supports AVX512...
Well well, how the turntables!
A lot of these seem like things that would be better done on a GPU / can already be done in greater volume on a GPU. Why add these capabilities to CPUs? Doesn't that spend die space and instruction caching efficiency that impacts everything else?
1. Doing work on a dedicated GPU comes with a penalty. The bus is slow, thus data exchange between GPU, CPU and RAM is slow and latency is high. For example, in games, things like physics effects that should affect the gameplay and not only be eye candy are better done on the CPU.
2. General code does not benefit from dedicated units like a GPU. If code is to use a dedicated unit, it must always be written specifically for the dedicated unit being used.
It's different with AVX-512: here it is enough to recompile the general code with an optimizing compiler and it can already benefit from AVX-512 wherever AVX-512 can be used profitably.
Every PC expert should finally learn that the clock unit is not „Mhz“ or „MHZ“, but „MHz“ only. 😊
Basically make a cpu more like a gpu
What's a creel name change?
Well spied! Changed it to be simpler. Thanks for watching :)
@@WhatsACreel I still don't know what a creel is haha
@@billowen3285 is it a fish? It sounds like it would be a fish. I m replying before I Google it ofc.. It's tradition
@@arditm2178 In Scotland, it is a device used to catch lobsters. They are basically a wooden and cement-bottomed net structure (usually a curved-roofed barn shape). The "walls" and "roof" are made from rope/net - wide enough for fish to swim through - which is also used to create a funneled entrance into the centre of the creel. The lobster swims into the funnel of the creel, but can't swim out of it again due to magic or a cult. You go out in your boat, and throw the creels into the sea, all tethered together with a buoy at the end for finding them again. On collection, you should have a lobster or two (not in each one, we're not crazy), just waiting there in the creel. Occasionally you'll get some dickhead crab who thinks it's hilarious to jump into the lobster's seat. A quick five minutes in boiling water does the trick.
Kinda funny how AMD supports AVX-512 in their latest processors while Intel doesn't
Part 2: ua-cam.com/video/I3efQKLgsjM/v-deo.html
intel AND amd now :-)
Intel adds more, very complex, instructions while the world steadily migrates to RISC based CPUs.
OK ! hard core c++ ONLI K .
(AMD supports it two years later)... lol
You really look like Mark Knopfler
Check out AMD's new Zen 3 architecture with their new Ryzen 5000 reveal. Pretty impressive stuff! Lots of limitations between cores eliminated etc.... there's a video about it here you might like...
ua-cam.com/video/5uWXfoX1x3A/v-deo.html
Is zen 3 going to introduce avx 512?
So far they haven't mentioned anything about that.
@@bakulboro2292 I don't have a clue. I was referring mainly to how the cores communicate in it. Which is what I said if you read my comment.
Too bad consumer CPUs barely even support AVX1 still. Only SSE2 is always available.
AVX2 support started about 10 years ago with Haswell. I don't know about AMD.
As soon as Windows 10 is no longer supported with updates, you can rely on AVX2 as a common denominator, because Windows 11 does not officially support most of the old CPUs broken by SPECTRE and MELTDOWN. So if you want to use Windows 11 with official CPU support, you have to upgrade your CPU as a customer anyway. And all new CPUs at least support AVX2.
What are the chances that any of this will ever be used? Could you express anything in a language that would tell the code generator to use this stuff and use it well? Do the language developers advance at the rate of the instruction set developers? Languages from their beginning have always used a tiny part of the instruction set. If you use these new instructions, aren't you incompatible with everything else? How many versions of code do you write and support? You can always access everything from assembly. I write only assembly (for 50 years) for embedded applications, but most programmers today can't spell assembly. I love the idea of massively parallel fast architectures that I would program in assembly. But I wonder how many programmers are out there ready to do it. Intel poured millions into this, so they must think it is important. I wonder if it will hit 0.01% of applications?
You do an excellent job of covering these detailed topics and you support assembly very well. Thanks.
They'll probably be used in a few places in a few heavily used libraries, but yeah you have a point
The compiler will support it and produce code that uses it. As more CPUs support AVX512, there will be more applications using it.
2:18 didnt age well
I think Intel really missed an opportunity, however, in that AVX-512 should have used a VLIW model, and should have transitioned to a decimal floating point arithmetic format for that instruction set.
Also: AMD's Zen 4 architecture is planned to include AVX-512.
P.S.: Your playlist is currently arranged backwards.
Intel burned their fingers with VLIW on Itanium 1 and 2. VLIW was therefore a dead end, just like the optimization to super high clock rates with the Pentium 4.
Adding new instructions is kind of a bad approach compared to adding cores. Old software compiled years ago will get no benefit from new instructions, and software compiled today with the new instruction set will not run on old hardware. A multithreaded application, meanwhile, can create as many threads at runtime as you need, will still work on old hardware, and benefits from new hardware with more cores.
Planned obsolescence, and support for lazy programmers who want to write single-threaded applications...