Colorspace conversion from (8-bit) RGB to limited-range YUV and back to RGB: after those conversions you'll only get roughly 15% of the exact colors you started with because of rounding errors, even with 4:4:4 subsampling. There's another step they might have omitted, which is how 4:2:0 is arrived at/extrapolated. For example, if you losslessly captured the same file played back with different media players (with dithering turned off), or with ffmpeg, they all might slightly disagree about how to display the same frame.
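If you want to see that roundtrip loss yourself, here's a minimal numpy sketch (assuming BT.709 matrix coefficients and 8-bit limited-range quantization - the comment above doesn't say which matrix or range it measured, so treat the exact percentage as illustrative):

```python
import numpy as np

# All 8-bit RGB values on a coarse grid (the full 256^3 cube also works, just slower).
vals = np.arange(0, 256, 4, dtype=np.float64)
r, g, b = np.meshgrid(vals, vals, vals, indexing="ij")
rgb = np.stack([r, g, b], axis=-1).reshape(-1, 3) / 255.0

# BT.709 RGB -> Y'CbCr, quantized to 8-bit limited range (Y: 16-235, C: 16-240).
y  = 0.2126 * rgb[:, 0] + 0.7152 * rgb[:, 1] + 0.0722 * rgb[:, 2]
cb = (rgb[:, 2] - y) / 1.8556
cr = (rgb[:, 0] - y) / 1.5748
yq  = np.round(16 + 219 * y)
cbq = np.round(128 + 224 * cb)
crq = np.round(128 + 224 * cr)

# Back to RGB and re-quantize to 8 bits.
y2  = (yq - 16) / 219
cb2 = (cbq - 128) / 224
cr2 = (crq - 128) / 224
r2 = y2 + 1.5748 * cr2
b2 = y2 + 1.8556 * cb2
g2 = (y2 - 0.2126 * r2 - 0.0722 * b2) / 0.7152
back = np.clip(np.round(np.stack([r2, g2, b2], axis=-1) * 255), 0, 255)

orig = np.round(rgb * 255)
exact = np.all(back == orig, axis=-1).mean()
print(f"exact roundtrip matches: {exact:.1%}")
```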
Thanks for these deep dives. If you want to ignore my pedantic nitpicks, feel free to ignore the rest of this (knowing GN viewers, this warning will only attract you). It is true that rods do not differentiate color and that cones do, but the slide is suggesting that rods are responsible for luminance and cones are for chrominance. This is not the case -- rods are for low-light "scotopic" vision, and are rarely used in modern life. If there is any appreciable illumination, then the cones do all the sensing. YUV is a very old colorspace from the NTSC analog broadcast days and there are other better choices nowadays, but the point of having more resolution on the luminance/grayscale part of the image than on the chrominance is generally common among them. However, the subsampling rates of 4:4:4 or 4:2:0 do not refer to the bits used, and tbh I forget what they originally meant (they were related to some analog-centric way of color transmission). 4:4:4 has colors at full rez, 4:2:0 has color at half-by-half rez, as was briefly mentioned. This terminology is also used in JPEG image compression. I do not think any modern codecs are using pre-determined Huffman coding for symbols. They either use something adaptive and denser (e.g. h264 had CABAC, an adaptive arithmetic encoder) or a simpler encoding.
Seldom do people get insights like this from industry professionals that can explain extremely complicated topics in a somewhat simple manner. Please keep making these! They are amazing.
This is the very reason you can study this stuff at college. So many things we take for granted have a massive rabbit hole behind them. I deep-dived a few years ago when transcoding videos from 24p to 60p (basically HFR with frame interpolation). Back then there was a lazy way to do it, which is on-the-fly transcoding every time you watched a movie, or a real conversion. The perceived quality increase is massive going from 24 frames to 60 in a movie. Sharp motions look insane. Somehow people hated it back then when The Hobbit did it, but it just feels more immersive/real. Watching The Matrix in HFR was so much more fun. Just imagine all slow movement being sharp instead of blurry as it currently is.
I've ALWAYS wanted to know how a service like YouTube can exist, how so much DATA can just sit there piling up on servers. This can maybe answer some of that!
They exist by burning money as a way to keep people invested in a larger ecosystem; YouTube and Google go hand in hand, much like Twitch and Amazon Prime have entangled perks. It's commonly called a loss leader.
...and in the long game, you gather absolutely unrivaled amounts of media that you can feed into your AI systems. And nobody can stop you from accessing it.
Right click tab>Bookmark Tab...>Add new folder to bookmarks tab>save to "yt" folder along with L1 Show to watch later because attention is absolutely necessary. 🤘❤
I ignore Intel videos and skip Intel timestamps in videos as their hardware is meh...but I will watch anything with Tom Peterson in it. Just an expert in his field talking about the cool things he's working on and explaining it. I'm here for it.
Excellent video! I'll point friends who are new to video/image compression this way. One thing to note, from haunting video compression forums for many years: while the fixed-function decode hardware on GPUs has been very fast and as fully featured as software decode since 2006 or so, and of course the same quality as software decoding, the same can't be said for encode even today. The x264 guys, to my best recollection, viewed GPU encode as a marketing exercise, and saw very little speedup in leveraging GPU hardware themselves in comparison to other possible optimizations. Still today I'm not aware of any leveraging of GPU hardware by x265 or rav1e. NVIDIA has the best GPU hw encode quality currently, and they're all somewhat better now, but not by much. At least on NVIDIA and AMD, the encoders use much more of the actual programmable shader hardware than decode does (slowing games, etc.), they can't use many of the advanced features of the H.264/HEVC/AV1 formats, and they can struggle to compete on quality with 2-4 modern CPU cores running e.g. x264 --preset veryfast. If you can, try isolating a few cores and running software encode - you may be surprised, especially since you can still use the decode hardware on the input video if it's compressed.
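If you want to try that comparison at home, a rough sketch along these lines works (assuming ffmpeg is on your PATH with libx264 and NVENC support compiled in; "input.mp4" is a placeholder for your own clip, and the bitrate is arbitrary):

```python
import subprocess

SRC = "input.mp4"  # placeholder clip; swap in your own footage

# CPU encode: x264 at a fixed target bitrate, audio dropped to keep it simple.
subprocess.run([
    "ffmpeg", "-y", "-i", SRC,
    "-c:v", "libx264", "-preset", "veryfast", "-b:v", "6M", "-an",
    "cpu_x264.mp4",
], check=True)

# GPU encode: NVENC at the same target bitrate, letting ffmpeg pick a hardware decoder for the input.
subprocess.run([
    "ffmpeg", "-y", "-hwaccel", "auto", "-i", SRC,
    "-c:v", "h264_nvenc", "-b:v", "6M", "-an",
    "gpu_nvenc.mp4",
], check=True)

# Compare the two outputs with your metric of choice (or just eyeball them).
```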
A good way to explain what the 2D DCT is, is to make an analogy with sound and a spectrum analyzer. As the pitch of a sound rises while maintaining the volume, the bars of the analyzer shift to the right without changing the height, and if you have a sound consisting of two tones combined, the analyzer displays two separate spikes. The 2D DCT is exactly the same as a spectrum graph, except it's on two dimensions (width and height) rather than just one (time). Both images and sounds are signals that can be processed.
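To make the 2D version concrete, here's a minimal numpy sketch of the 8x8 DCT-II that this family of codecs builds on (written out directly in its orthonormal form rather than calling a library, and using a made-up test block):

```python
import numpy as np

N = 8

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix: row k is the k-th cosine basis vector."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    d = np.cos(np.pi * (2 * x + 1) * k / (2 * n)) * np.sqrt(2 / n)
    d[0, :] /= np.sqrt(2)  # DC row gets the extra 1/sqrt(2) scale
    return d

D = dct_matrix(N)

# A toy 8x8 block: a smooth horizontal ramp plus a little noise.
rng = np.random.default_rng(0)
block = np.tile(np.linspace(0, 255, N), (N, 1)) + rng.normal(0, 2, (N, N))

# 2D DCT = apply the 1D transform along rows and along columns.
coeffs = D @ block @ D.T
print(np.round(coeffs).astype(int))  # energy piles up in the top-left (low) frequencies
```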
This stuff always amazes me! But then at the same time, it's not like it just magically came together all at once. Breaking it down step by step puts it into much much simpler and realistic perspective.
Just last year I had a class in university about video encoding. 40 seconds into this video it was already worth it, the 2D discrete cosine transform was part of the exam. The class covered everything up to the Huffman encoding although of course in much more detail. The Huffman encoding was part of a different class. Exciting to see my university classes have some real life uses.
I feel like if I didn't have an education that involved Fourier transforms on images and the frequency space then all of this would've been lost on me. It's really awesome to see something this in-depth on YouTube and in the real world, and not a hypothetical scholastic video.
Probably one of the best introductions to "how does encoding work" I've seen. Sadly it gets A LOT more complicated once you have to deal with the media formats. You can easily "destroy" the color information in a video without even realising it and lose a lot of quality, because you converted from A to B to C and back to A again, etc. Very common on YouTube, since 99% of all YouTubers have no idea what they are doing when they record video, process it in something like Premiere / DaVinci, and upload it to YouTube.
Really cool video, I don't see enough talk online about the genius of video coding (and even less that is as well explained). I know this is for the sake of simplification, but having literally done a PhD on inter-picture prediction, I find any claims of how much this step reduces size to be weird, because you're missing out on how prediction is all about the compromise between accurate prediction with good motion vectors and cheaply coded ones. For example, a good encoder will not take the best prediction (the one that results in the smallest residual), it will take one that is good enough for the level of quality but is cheaper to code (by being less precise or being the same as the block next to it). The symbol coding is really one of the most complex and genius parts of modern methods. First, Huffman has been dead for years (only used in like JPEG and low-power AVC); everyone uses arithmetic coding, which means that symbols can take less than one bit and the encoder basically figures it out with some complex math. But the real shit is how, instead of the fixed probabilities/coding that were implied in the video, it is adaptive and will "learn" depending on what you have been feeding it so far. So if, let's say, you keep having the same direction for your motion vector, it will cost fewer bits if you use it multiple times. There are now dozens of bins that store how they have been used and adapt the coding through the image. The residual part, even if it looked hard in the video with that good old DCT, is actually the easiest thing and pretty easy to implement, as it is pretty straightforward. Getting the right motion vectors takes a huge amount of time, and hardware has to make a lot of compromises to limit the search space, since it typically requires real-time performance (and at least consistent performance) and can't afford to spend extra time on an image where it doesn't find something good easily. There are many papers on how to optimize the search, and for hardware in cameras I have also seen the use of motion sensors as side-channel information to guide the search (nice for a camera that moves a lot, like a GoPro). Last note, which is more a pet peeve of mine: chroma subsampling currently has no reason to be used outside of being nicer on hardware. You will always lose quality and not even get any actual efficiency; 4:4:4 with a bit more aggressive quantization on the chroma components usually looks better than 4:2:0 at the same bitrate, but it does use more memory and hardware circuits (though nowhere near double like memory; a lot of hardware just uses luma for everything intensive, then chroma gets quantized and dealt with minimal care). Considering it was done by someone with very little experience in the field, I was still very pleasantly surprised at how accurate this was for such a short explanation. Just a disclaimer on the hardware claims: I have not worked with Intel encoding/decoding hardware and will not mention the ones I used because of NDA (including some projects that aren't yet released/were canned).
11:40 A note about chroma subsampling: it's primarily useful for live video transfer, where you need to get raw data across a studio or building. When it comes to actual video compression, the heavy lifting is done by the encoder, so chroma subsampling doesn't actually do much. This is why I have been frustrated at our insistence on using 4:2:0 this whole time versus 4:4:4, or even just staying in RGB. You can often have a 4:4:4 full-range 10-bit video be *barely* larger than a 4:2:0 partial-range 8-bit video, but it can store SO much more information and be more accurate to the source material. It would be way better for VR streaming since the color range would be better and colors would be sharper.
That was amazing. I feel like I just went back to uni and my favourite lecturer explained to us some of the most obtuse information in a way we could all understand. Bravo Tom, Steve and Gamers Nexus.
16:33 The frequency quantization is really complicated; the best way I've heard it explained is as a map of comparison. Basically you're comparing each pixel value to the pixels next to it to create a map of where the biggest differences in contrast are in the photo. It's basically a mathematical way of figuring out what the most important data in the image is, and then you can discard the information below a certain threshold, and when you do the math backwards, converting it back to an image, it looks like almost nothing has happened.
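A rough sketch of that 'discard below a threshold' step, using a single flat quantizer step size instead of a real codec's tables (so purely illustrative):

```python
import numpy as np

def quantize(coeffs: np.ndarray, q: float) -> np.ndarray:
    """Divide transform coefficients by a step size and round; small values collapse to 0."""
    return np.round(coeffs / q)

def dequantize(qcoeffs: np.ndarray, q: float) -> np.ndarray:
    return qcoeffs * q

# Example 8x8 coefficient block: one big DC term, small high-frequency terms.
rng = np.random.default_rng(1)
coeffs = rng.normal(0, 4, (8, 8))
coeffs[0, 0] = 900.0

q = 16.0
qc = quantize(coeffs, q)
print("nonzero coefficients kept:", np.count_nonzero(qc), "of", qc.size)
print("max reconstruction error:", np.abs(dequantize(qc, q) - coeffs).max())
```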
One thing that was not mentioned when they talked about temporal difference is "sun goes down": a shifting light source *changes the lighting* all over the scene (as opposed to "car drives by static house w/ambient lighting"). In other words, temporal prediction is more about motion/positional differences than overall lighting. I understand why they did not deep dive into that, as it is a whole tangent - but good to be aware of. Very well done though!
A subject close to my heart. My entire niche is providing the best possible video quality for viewers. Working through how YouTube compresses things to get the best out of the other end has been a long process. This examination of the pipeline is amazing 😊 I now film in all-intra 4:2:2 10-bit, export in 100,000 kbps H.265, and YouTube seems to do OK with that. The only frontier left is the banding that comes with low-gradient fills as they transcode to 8-bit. Otherwise you can see the results for yourself. Pick 4K quality if you do.
A slight correction on chroma subsampling: it's not the bits that are subsampled, but the resolution of the chroma image components. 4:2:2 means that the resolution of the chroma image components is halved horizontally compared to the original resolution, and 4:2:0 means it's halved in both directions. Which means for e.g. 4:2:0 - which is the common consumer distribution format - you don't have, say, a 1920x1080 image where each pixel is 12 bits instead of 24, but one 1920x1080 black and white image where each pixel is 8 bits, and then two images for the chroma components, where pixels are also 8 bits but where the resolution is 960x540 for each. The chroma images then have to be upscaled back at decoding to match the luma resolution (and just for fun, different formats also have different specs about how the upscaled chroma pixels are aligned with the full-resolution luma ones). Some video renderers will allow you to choose between different algorithms for that chroma upscaling step. And an interesting bit of trivia: one of Tom's slides shows 4 bytes/pixel for 10-bit HDR video, but the P010 pixel format most commonly used by GPUs for that (at least on Windows) is actually 10 bits of actual sub-pixel information *padded to 16*! So if for example you use copy-back decoding mode (where the decoded image is transferred back to the CPU), 10-bit HDR video actually uses *two times* the bandwidth of 8-bit video instead of 1.25 times!
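To put numbers on those plane sizes, a quick sketch of bytes per uncompressed 1080p frame for a few common layouts (ignoring any row padding/stride a driver might add):

```python
def frame_bytes(width: int, height: int, fmt: str) -> int:
    luma = width * height
    chroma = (width // 2) * (height // 2)  # one quarter-size plane each for Cb and Cr in 4:2:0
    if fmt == "RGB24":          # 3 bytes per pixel, no subsampling
        return luma * 3
    if fmt == "YUV444_8":       # full-resolution Y, Cb, Cr at 1 byte each
        return luma * 3
    if fmt == "NV12":           # 8-bit 4:2:0 -> 1.5 bytes per pixel
        return luma + 2 * chroma
    if fmt == "P010":           # 10-bit 4:2:0, but every sample is padded to 16 bits
        return 2 * (luma + 2 * chroma)
    raise ValueError(fmt)

for fmt in ("RGB24", "YUV444_8", "NV12", "P010"):
    print(f"{fmt:9s} 1920x1080 frame: {frame_bytes(1920, 1080, fmt) / 1e6:.2f} MB")
```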
Thanks Steve! That was FASCINATING! I do not do low level coding. I do not do data compression nor transforms. My signal analysis basis is crude at best. I had some calc very, very long ago.... STILL that was amazing and accessible at a 30,000 foot view! Thanks to Tom and to you and the entire GN crew. Peaceful Skies.
Hey GN! I think a "How They're Made" series on all the components of a PC would be cool. Seeing how air coolers are manufactured and the science behind them, how a motherboard starts as just the PCB, etc.
Fantastic video to finish the series. For all the folks who are wondering what these compression techniques look like quantified with an example - Resident Evil 2 on the Playstation clocked in at 1.5 GB in size. Angel Studios made that fit onto a 64MB Nintendo 64 cartridge.
Tom Scott has made a great video explaining Huffman encoding. The gist of it is that the more common a pattern is, the fewer bits we use to encode an occurrence of that pattern. It's a clever way to generate an encoding for each pattern in such a way that it achieves that. To really understand the frequency domain and quantization part, you kinda need to know the mathematical background of Fourier series and the Fourier transform (the discrete cosine transform used here is very similar to the "real" Fourier transform as far as I understand it, same concept). 3blue1brown has made great videos explaining those. The basic idea is that we can think of any data as being generated by a sum of an infinite number of sine waves or cosine waves, or in practice you'd use a finite number of waves; the more waves, the closer you get to the original data. So you have for example a 1 Hz wave multiplied with some factor a, a 2 Hz wave multiplied with some factor b, a 3 Hz wave multiplied with some factor c, etc. For each frequency you have a value of how strong that wave's influence on the generated data is; you just multiply each wave with that factor and then add all of them together to get the data (getting back the original data from the frequency domain data like this is called an inverse Fourier transform). If you were working in 1 dimension, such as with (mono) audio, the frequency domain graph would have frequency on the x axis (e.g. 0.1 Hz to 10000 Hz or whatever) and, on the y axis for each frequency, the "strength" of that frequency, the multiplier. In this case of working with images, you need to do this in 2 dimensions of course, so I assume you have waves going in the x direction and waves going in the y direction. So the frequency domain graph that Tom showed had the frequencies on x and y, and the strength of the frequencies was indicated by the brightness (darker = stronger). We apply the Fourier transform to get this frequency domain representation, then we can manipulate it based on the frequencies, in this case cut off all the high frequencies (effectively removing the small, high-frequency details of the image data), then apply the inverse Fourier transform to get image data back (in this case of course it's not the image itself, it's the residuals). Exactly the same thing happens when you apply a low-pass filter in an audio editing program: it uses the Fourier transform, then manipulates the data in the frequency domain, then uses the inverse Fourier transform.
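For the Huffman part specifically, here's a tiny sketch of the classic construction (greedily merging the two least-frequent symbols), just to show the "common pattern gets a short code" idea in action - real codecs, as noted elsewhere in the comments, mostly use arithmetic coding instead:

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(data: str) -> dict[str, str]:
    """Build a prefix code where frequent symbols get shorter bit strings."""
    freq = Counter(data)
    if len(freq) == 1:                      # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    tie = itertools.count()                 # tiebreaker so the heap never compares dicts
    heap = [(n, next(tie), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)     # two least-frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]

codes = huffman_codes("aaaaaaaabbbccd")
for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(sym, code)   # 'a' gets the shortest code, rare symbols get longer ones
```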
I literally just watched a 30-minute video from a channel named Theo of him reading an article about how H.264 works, and it couldn't be a more appropriate thing to watch to be prepared for this video you guys just posted.
Hey everyone! This is our third and final installment of educational deep dives with Tom! You can watch our previous two below. We'll have some other industry engineer videos from NVIDIA and case manufacturers coming up. I'm working on booking something technical with AMD hopefully in the near future as well! Aside from the big 3 silicon companies, what other engineering professions within the industry would you like to see on this channel? Even if I'm not familiar with the subject matter, I can study enough of it to at least interview someone for the basics like this!
Watch our educational video on graphics/video drivers and game optimization: ua-cam.com/video/Qp3BGu3vixk/v-deo.html
Watch the video on Simulation Time Error & Presentmon: ua-cam.com/video/C_RO8bJop8o/v-deo.html
Very cool!
Cerebras wafer scale engine for ai and scientific computing
Since Nvidia is already planned and AMD too, I hope you bring in Microsoft; DirectStorage would be nice. We haven't had many updates past 1.2. A deep dive would be very cool. I want to see how they will maximize performance and utilize modern NVMe SSDs, because currently we haven't seen much advancement.
Please, head up to Accursed Farms and watch the last video from Ross, this is adjacent to GN, but it's still a topic you guys should cover.
Would be good to hear from id Software to know how their games are so scalable and amazing.
TAP is the perfect example of why vendors should let their engineers talk to the buyers. The way he makes you understand horribly complex topics is awesome. And it makes you appreciate their products more than any marketing BS.
It really does seem like the companies are allowing engineers in front of our cameras more and more! It's been great for deep dives on technicals we're not familiar with!
@@GamersNexus The only people capable of making complicated topics seem simple, or easy to understand, are professionals. I love these kinds of videos, keep it up👍
I think part of it is that it's easier to teach an engineer how to speak publicly than it is to teach a public speaker engineering.
@@DJFIRESTONE92 This be the truth.
This is the right marketing for the technically inclined part of the crowd. Good job Intel for figuring this out
Switching to 144p for increased immersion.
It'll really help relate the topic back!
Make sure you enable 1440 Hz refresh rate as well
make sure you sit on a gaming chair
upto interpretation type shit
It's all "retina" if your eyesight is bad enough 🤔
Please don't give up on those technical interviews. They are what we need.
We may need better affordable sw and hw, but in the meantime we can find interesting actual information being presented to us.
Agreed!
Small correction: YUV 4:4:4 / 4:2:2 / 4:2:0 doesn't describe bits, it describes how many chroma samples are stored. The first '4' says that we are talking about rows of 4 luma samples, the second number describes how many chroma samples are stored in the first of two lines, and the third number how many chroma samples are stored in the second line.
That means a 4×2 block of luma samples contains
• 8 pairs of chroma samples in YUV 4:4:4
• 4 pairs of chroma samples in YUV 4:2:2
• 2 pairs of chroma samples in YUV 4:2:0
Another way to think about it is that in YUV 4:4:4, each luma sample has its own pair of chroma samples, in YUV 4:2:2, each 2×1 block of luma samples shares a pair of chroma samples (the chroma planes have full vertical but half horizontal resolution) and in YUV 4:2:0, each 2×2 block of luma samples shares a pair of chroma samples (the chroma planes are half horizontal and half vertical resolution).
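A tiny sketch of that bookkeeping, mapping the J:a:b notation to chroma samples stored per 4x2 luma block (only the three common schemes are covered):

```python
def chroma_samples_per_4x2_block(j: int, a: int, b: int) -> int:
    """Chroma sample pairs stored for a J-wide, 2-row luma block (J is normally 4)."""
    return a + b   # 'a' pairs in the first row, 'b' pairs in the second row

for scheme in ((4, 4, 4), (4, 2, 2), (4, 2, 0)):
    luma = scheme[0] * 2
    chroma = chroma_samples_per_4x2_block(*scheme)
    print(f"{scheme[0]}:{scheme[1]}:{scheme[2]} -> {luma} luma samples, "
          f"{chroma} chroma sample pairs per 4x2 block")
```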
4:2:0 is truly a nonsensical shorthand made by an insane person. Because 4:2:2 and 4:2:0 are the only ones that realistically exist, you could just as easily describe the stored chroma resolution of a 2x2 square of chroma pixels instead of the insane self-referential sample numbers, for example:
YUV 2x2 = YUV 4:4:4
YUV 1x2 = YUV 4:2:2
YUV 1x1 = YUV 4:2:0
Even if you argue that you must keep the 4x2 rectangle, describing the stored resolution still works and is way less insane than the "number of changes of chrominance samples between first and second row" like bruh
This comment needs more attention. It’s good knowledge
@@krakow10 It is a truly bizarre naming convention; I've heard it explained multiple times and never fail to promptly forget what it means. It's easier to just remember one is full res, then half res, then quarter res.
@@tiarkrezar Indeed. It makes no sense because its roots are in analog TV and the numbers are not bits but factors of carrier frequencies intermingled with interlace logic.
Thanks for this! I asked Intel (since, like I said in the video, I know nothing about this field). Intel said this:
"Good catch! Although there is a correlation between how many bits and how many chroma samples you store, the numbers represent the actual amount of chroma samples."
13:00 That's why it's so hard to compress confetti, snow, or other super small moving parts in a video. There's even a term, "compression nightmare," for these scenarios. Videos appear to be at a low bitrate, and internet usage spikes, as well as CPU utilization.
The Slow Mo Guys have shown this well with their glitter stuff; Gavin is actually very knowledgeable about this stuff.
I remember part of the reddit blackout protest was to upload videos of static that are essentially impossible to compress.
Yup... a lot of new information gets introduced and removed by the next frame, which makes it impossible for most algorithms to deal with this situation...
But the good news is H265/HEVC/VP9/AV1 encoder/decoders can deal with these situations A LOT better compared to the old ones such as Xvid/DivX/h264/AVC/etc...
@@SkidrowTheThird by blurring them out lmao, gotta love sao in x265
And it's also why common things like a gradual fade in a video are actually quite difficult for compression to deal with; there's very little per-pixel frame-to-frame stability.
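A quick numpy sketch of why that kind of content hurts: compare the frame-to-frame residual of a nearly static scene with confetti-like noise. This is plain frame differencing with no motion search, so it's a big simplification of what an encoder actually does, but the gap is the point:

```python
import numpy as np

rng = np.random.default_rng(42)
H, W = 720, 1280

# Nearly static scene: the same frame plus a tiny bit of sensor noise.
base = rng.integers(0, 256, (H, W)).astype(np.float64)
static_prev, static_cur = base, base + rng.normal(0, 1, (H, W))

# "Confetti": every frame is essentially new random content.
noisy_prev = rng.integers(0, 256, (H, W)).astype(np.float64)
noisy_cur = rng.integers(0, 256, (H, W)).astype(np.float64)

def residual_energy(prev: np.ndarray, cur: np.ndarray) -> float:
    """Mean absolute difference: a crude proxy for how much the encoder must send."""
    return float(np.abs(cur - prev).mean())

print("static scene residual:  ", residual_energy(static_prev, static_cur))
print("confetti-like residual: ", residual_energy(noisy_prev, noisy_cur))
```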
As others have pointed out, these videos with Tom have been fantastic. I think the information is presented in a way that is not only valuable for gamers, but also for many ComSci students as well.
Thanks to everyone involved, and hopefully we can see Tom back in the channel on another occasion!😄
Thank you! And looking forward to the next guests also!
Just sitting here watching this, eating some dinner, and halfway through it just becomes immediately apparent to me that real, tangible people figured all of this stuff out and continue persevering and innovating on greater ideas and technologies. It just blows me away how intelligent the people were who designed and produced this stuff. I guess it's just very impressive. I mean, we didn't even have the first television 100 years ago.
Early compression was simple stuff, mostly without the temporal stuff, but yes, as he said, it's magic. Honestly, people forget how much tech is involved in just getting a single video delivered in real time, from the undersea cables, the high-speed fibre and relays, to all the tech in Chrome/YouTube etc.
Something that amazes me even more is how all of this took many, many different people contributing to just one great thing out of the millions created. There should be a class in schools dedicated to kids working together.
Or maybe the school itself should be organized in a way to encourage and teach kids to work together.
Me press button, moving photo come out.
Me no understand how works but me happy
WRT frequency domain on images.
Picture it like this: the corner (0,0) is 0 oscillations -- constant value. The pixel at (0,1) has zero horizontal frequency, but 1 oscillation on the vertical, meaning it starts at 1, goes to 0, then back to 1, sine wave style (well, cosine actually, but you get the idea, it's smooth and connects cyclically end to end). The pixel at (0,2) is the same but has two oscillations vertically, and so on. This step is usually performed on small blocks, 8x8 or 16x16. So on a block of 8x8, the frequency pixel at (7,7) is a checkerboard, and (0,7) is a series of 8 horizontal lines, black white black white etc. The bottom-right frequency pixel on any block size ALWAYS coincides with the pattern that gives you a checkerboard.
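If you want to actually look at those patterns, here's a short sketch that generates the cosine basis pattern for an 8x8 block; the (7,7) one comes out as the alternating checkerboard described above:

```python
import numpy as np

def dct_basis(u: int, v: int, n: int = 8) -> np.ndarray:
    """The (u, v) 2D DCT-II basis pattern for an n x n block (unnormalized).
    u is the vertical frequency index, v the horizontal one."""
    x = np.arange(n)
    col = np.cos(np.pi * (2 * x + 1) * v / (2 * n))   # horizontal frequency v
    row = np.cos(np.pi * (2 * x + 1) * u / (2 * n))   # vertical frequency u
    return np.outer(row, col)

print(np.sign(dct_basis(7, 7)).astype(int))   # +1/-1 checkerboard
print(np.sign(dct_basis(7, 0)).astype(int))   # stripes that alternate from row to row
```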
Edit: Welp, I tried to explain frequency quantization and symbol coding when TAP did it better... When they talk about doing it on the residual, that's another layer of optimization in newer video codecs, as it works the same as with JPEG images.
Just to expand on it, as this was a moment of realisation for me when studying this JPEG compression:
For each block of 8x8 pixels of the original image, we assign concrete "checkerboard/frequency" coefficient values. Meaning, if the first 8x8 block is mostly black, almost all of its energy lands in the (0,0) coefficient. As this is a standard, we know that if we receive (0,0) this can be reconstructed directly to a set of 8x8 pixels. If you zoom in all the way, you can appreciate the DCT patterns.
We can apply a low-pass filter (reducing sharpness) and then compress further by assigning compression algorithms to how many bits are needed for each coefficient, meaning that coefficients that are more common (low frequencies) will use fewer bits than higher-frequency ones.
I probably have some of this mixed up as this was some time ago, but it was cool understanding how videos are compressed.
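Here's a rough sketch of that low-pass idea on a single 8x8 block: forward DCT, zero the higher-frequency coefficients past an arbitrary cutoff, then invert the transform (the cutoff and the test block are made up):

```python
import numpy as np

N = 8
k = np.arange(N).reshape(-1, 1)
x = np.arange(N).reshape(1, -1)
D = np.cos(np.pi * (2 * x + 1) * k / (2 * N)) * np.sqrt(2 / N)
D[0, :] /= np.sqrt(2)          # orthonormal DCT-II matrix: its inverse is just D.T

rng = np.random.default_rng(3)
block = np.tile(np.linspace(0, 255, N), (N, 1)) + rng.normal(0, 8, (N, N))

coeffs = D @ block @ D.T       # forward 2D DCT
u, v = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
coeffs[u + v > 4] = 0          # crude low-pass: drop the high-frequency corner

smoothed = D.T @ coeffs @ D    # inverse 2D DCT
print("max |difference| after low-pass:", np.abs(smoothed - block).max())
```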
Computerphile did an introduction on DCT a few years ago going into more detail on the math and intuition of the algorithm, in their 3-part series covering JPEG compression. For those interested, it is a series worth a watch, as video compression seems to be very similar to JPEG compression applied to the differences between frames.
HEIF, the format that is likely going to replace JPEG everywhere, is in fact just a single frame of H265 video. It just makes sense to reuse the same format for many reasons.
@@Finder245 same case with AVIF, but HEIF will definitely not supersede jpegs
3 Blue 1 Brown did an EXCELLENT video on the Fourier Transform which explains how these sinusoid transforms work, and I think explained the tricks they use for the FFT.
@@simulping4371 why not? Apple already uses it in place of JPEG for pictures taken using iPhones.
Veritasium also did a pretty good video on the Fast Fourier Transform that underlies DCT.
It is so great to see a technology channel that actually talks about tech (instead of making funny, reality show-esque videos with graphics cards).
The videos you guys made a while back about latency/input lag and GPU drivers were amazing as well.
These discussions and presentations have been fantastic. Thank you (everyone involved) for producing this.
We'll make sure Tom knows this sentiment! He's very understanding that we want to minimize marketing and maximize engineering. Looking forward to our next discussions with other engineers in the industry as well!
This was AWESOME!
Tom is such an awesome guy he deserves his success 100% really appreciate him doing things like this.
I could watch videos of you and Tom all day and not get bored, and learn many new things. These technical series are fantastic.
Yes, I'm definitely not bored, but my brain may overheat from trying to parse all the data
Video compression + ffmpeg is a modern marvel that powers so much without users knowing.
I just realised they wrote L2 cache as L2$ and I'm rolling.
3:10
Edit: I wrote L2 cash at first and was told it was a mistake, so I changed it to Cache.
Thanks to everyone.
that’s actually pretty common short hand. it is funny tho
Cache, but yeah.
Cache rules everything around me.
It's not cash, it's cache, and that's how it's commonly referred to.
@@dojelnotmyrealname4018 Fixed it, thanks!
What makes GN interviews with specialists and engineers so engaging is that Steve can keep up. Brilliant communicator that can translate the info into ELI5 for us idiots.
I can't wait for AV1 to truly take off, so that 8K, 120 fps, HDR, 12-bit color, Rec. 2100, 4:4:4 chroma and all that jazz can become common.
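For a sense of why the codec has to work so hard at that point, the raw numbers before any compression (just arithmetic - nobody ships a stream like this):

```python
# 8K, 120 fps, 12-bit 4:4:4 -> three 12-bit samples per pixel, uncompressed.
width, height, fps = 7680, 4320, 120
bits_per_pixel = 3 * 12

raw_bps = width * height * fps * bits_per_pixel
print(f"raw: {raw_bps / 1e9:.1f} Gbit/s")              # roughly 143 Gbit/s
print(f"per hour: {raw_bps * 3600 / 8 / 1e12:.1f} TB")  # tens of terabytes per hour
```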
Steve saying "I'm coming to this with very little knowledge" is rare and really shows how humble and ready to learn a new thing he is. Love this channel and how the mindset goes. I hate people who act automatically as if they know exactly what someone is talking about when they actually barely have a superficial idea of what the subject could be.
Agreed! I have more respect for humility than bluffing bravado. Steve, you know more than you realized or vocalized. Very impressed with your modesty.
This guy is great. Thanks for collaborating…
Bro releases a top-class uni masterclass and uploads it to YouTube for free. As a Computer/Telecommunications uni student, this is really interesting and amazing.
It's almost impossible to get bored with GN.
Also starting with the bandwidth YouTube would need is crazy.
Imagine how much less bullcrap would come out of YouTube if there were that many fewer videos... maybe democracy would be thriving instead of being on the verge of collapse.
I always see LTT fanboys saying Gamers Nexus is "boring"... they're insane lol
This type of content really makes me appreciate the existence of this channel!
Frequency domain analysis is an extremely fun branch of math that has applications in so much stuff
Circuits, sound, images, video
You can view any information as a combination of several waves and instead of analyzing the signal you study its frequency components
One of the steps there is the same as doing a low-pass filter in audio, except high frequencies in images correspond to sudden changes in values. Clipping them blurs the image (or whatever the equivalent is in YUV)
Doing a high-pass filter meanwhile is a useful way to get at edges, which is useful for image recognition algorithms/AI but makes the image look like a normal map (that's tangent space, not frequency space, but hey)
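A minimal sketch of that high-pass-as-edge-detector idea: blur with a small box filter and subtract from the original, on a synthetic image with one sharp-edged square (the filter radius is arbitrary):

```python
import numpy as np

# Synthetic image: flat background with a bright square (i.e., sharp edges).
img = np.zeros((64, 64))
img[20:44, 20:44] = 1.0

def box_blur(a: np.ndarray, r: int = 2) -> np.ndarray:
    """Very crude box blur: average over a (2r+1)^2 neighborhood via wrap-around shifts."""
    out = np.zeros_like(a)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
    return out / (2 * r + 1) ** 2

low_pass = box_blur(img)          # keeps the smooth parts
high_pass = img - low_pass        # what's left is concentrated at the edges
print("high-pass energy along the square's border:", np.abs(high_pass[20, 20:44]).sum())
print("high-pass energy in the flat middle:       ", np.abs(high_pass[32, 25:39]).sum())
```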
This was incredible. I've always wanted to know more about compression and although I knew the basics, the step-by-step process overview was super helpful to get a greater understanding of how cool compression is. It's one of those mostly invisible technologies that most people don't know exist but are absolutely essential to keeping everything functioning.
It’s always a great day when we get to see Tom and Steve in the same video
Shout out to the video editor, incredible transitions from video footage to the slide deck!
Crazy informative video series
Thank you! Love hosting these because we learn a lot from them also. Now we just need to figure out what topics and companies to work on next!
Sony with their version of dlss
Capcom, micro transactions.
@@GamersNexus see if you can work with some game devs and break down each step of development (storyboarding, writing, coding, modelling, rigging, texturing, lighting, etc.)
should be pretty relevant and would give viewers a better understanding of what goes into the games they play
@@chillnspace777 pisser
I love that GN does this content. It doesn't have to appeal to everyone. Just genuinely nerdy content that few fully understand (including me) is great.
I work in a place with a lot of nerds. Some love to talk about stuff they probably shouldn't, and I love listening to them, even if I don't really get it. They are so passionate about what they do and it's great.
Refreshing to see some more in-depth presentations about how it all works instead of the usual high-level marketing slides, really enjoyed this series.
I look forward to more in the TTAS (Tom Talks About Stuff) series of videos.
I'm throwing "TTT (Tom's Tech-Talk or Tom Talks Tech)" in there. And if snippets should get published on TikTok it would be a "Tom's Tech-Talk TikTok". 😄
Talking Tom
@@scrittle I apologize but I cannot help the 14 year old creeping around inside me thinking aloud: "I love TT's".... ***snicker***
I do love the Tom Talks and want more of them. ***sigh***
Sometimes I disappoint myself...
Just an outstanding series of videos, real best-of-youtube stuff. Talking to customers in a non-marketing way by showing the breadth, depth and enthusiasm for the subject and how they think about their products. It's advertising that's actually worth something to the consumer. Outstanding.
Video & audio encoding (lossy) is absolutely wild with modern formats. Hats off to the people that came up with it & those that somehow still squeeze more out of it.
As someone who does computer graphics programming for a living, Tom has defined things that I can't even TRY to explain to someone else. Steve, MORE videos with Tom! I would love for him to show how non-realtime applications like Maya and Blender interact with the hardware. Heck, Tom, start a series on Intel's YouTube channel or something where you explain more CG stuff. There is a serious lack of good resources for learning CG.
I used to love when Anandtech and The Tech Report would post technical articles like this back in the day. I'm glad you guys are continuing this tradition on YouTube.
I loved this 'trilogy' with Tom Petersen, he's such a good presenter and explainer, even for such complex topics and ideas. I hope we see more of him in the future; these content-heavy videos are really interesting to say the least
I can't even tell how much I enjoy these educational pieces with Tom Peterson and the GN team!
Thx a lot!
These Tech-Talks with Tom are incredible. Such a wealth of information delivered in a way that even a layman such as myself can understand. Please keep these coming. 👍🏻
Every video with you and Tom is an absolute delight. Thank you all for the hard work to make these topics approachable. The passion from everyone involved really comes through and means a lot!
As someone who has recently learned how video compression works on a technical level for my job, it is actually quite interesting to have an engineer explain it.
7:44 *”YouTube is suckin’ down the bits”*
Thanks, Tom!
Tom Petersen is such a gem. A really good representative of the industry in general and Intel in particular. I hope PR departments take note
Seriously, modern compression is amazing.
It's crazy to see images with 5 megapixels (like 2000x2500 px) that have barely more than 200 KB file size. At 3 bytes per pixel, that's 15 MB raw.
And if you edit and re-encode a YouTube video with older or generally worse settings, you both lose quality and get like 5x the file size. The processes on larger media platforms are super impressive.
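The back-of-the-envelope version of those numbers, just spelled out:

```python
width, height = 2000, 2500          # ~5 megapixels
raw_bytes = width * height * 3      # 3 bytes per pixel, uncompressed RGB
jpeg_bytes = 200 * 1024             # the ~200 KB file from the example

print(f"raw:   {raw_bytes / 1e6:.1f} MB")
print(f"ratio: about {raw_bytes / jpeg_bytes:.0f}:1")
```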
You two did a great job going through that. He is a great guest, able to explain things pretty simply.
This is not just interesting but super helpful for learning how encoding/decoding works under the hood. Tom's explanation of colorspace is very easy to understand and perhaps miles better than any text article.
I really enjoy these conversations with Tom. Most of it is way over my head but it does give me some insight into what is happening behind the scenes.
Thanks for these enlightening videos.
I don't understand half of them, and I barely understand the other half, but these are easily some of the best videos on the channel (and the rest are great). Tom is singlehandedly carrying my perception of Intel
I love it when big tech allows the engineers to talk. This is a great overview on a really complex topic.
Please continue pushing for engineering talks, they're by far the best marketing possible. :)
Love to see FFmpeg and gstreamer get a shout-out. Was *not* expecting them to be such prominent parts of the *Windows* media stack.
Tom is so darn smart and a great presenter, we don't deserve him
E: since this does turn into a product pitch for Arc, I wonder if there could be a use case in the future (if it's possible) for streamers to use a dual video card setup, one card for the game they're playing and one for video encoding and compression in the same PC. I'd have to guess that this is currently possible with a dual-PC setup, but it's certainly a curiosity
Pretty sure some already do this, using something like an Arc A380 in the same PC just to put all the encoding tasks on when streaming.
This stuff is so interesting to me. In a past life, I did systems level and driver programming (back in the MS-DOS days). It's so interesting to see that the video compression stuff is hardware agnostic - it applies to all hardware - but then the video driver takes that information and makes it specific to (or translates it for) the hardware.
This is some nerdy stuff, right here! ❤
F***in love educational UA-cam, dang. Nothing better than listening to an expert talk about their schtick.
GN - the only tech channel that gets me excited about tech.
I only have a very high level understanding of video encode/decode, so it always melts my brain just how insanely complex it all is, and just how smart everyone is to not only come up with the theory, but then turn that into actual silicon and software to seamlessly perform these tasks
This video did a great job of getting deeper into the weeds of it all without being overwhelming and still very interesting, but then again, I'm pretty sure Tom reading a phone directory would be just as fascinating 😂
It's kind of insane how far compression technology has come. I still remember back in the 90s playing with a compression program written in QBasic that could cut your files in half, and that was amazing back then; now a video file can be reduced to a literal hundredth of its original size. The one inaccuracy in what Tom said was that it was lossless. Parts of the compression are, but the majority of it is lossy, and information about the image data is lost. It's not as much as it used to be, but enough that multiple edits can lead to compression artifacts becoming more prominent than they otherwise would have been.
How this is lower in view count is beyond me. One of the most descriptive and interesting explanations as to the reason we watch what we watch. Petersen was fu&*ing awesome, to be fair.
I don't know how popular this video will be, but you have my gratitude for doing this. I'm pretty deep into video compression with AV1, but I've never really understood the basics of RGB and YUV (although I had a rough idea of what YUV 4:4:4 does).
So thanks for the very helpful insight, and please never stop making these kinds of videos!
The YUV explanation is wrong!
The 4:4:4, 4:2:2 and 4:2:0 does NOT mean the amount of bits you take for luma and chroma; it represents the downscaling factor of the chrominance planes relative to the luminance plane. The least bad way to explain it would be to say that the first number is the number of luma columns in the reference block, the second is the number of chroma samples in the first row and the third number is the number of chroma samples in the second row. E.g., 4:2:2 means that for every 4 luma samples you get 2 chroma samples in both rows, so basically 2:1 compression for the chroma planes. 4:2:0 means that you get 2 chroma samples in the first row and 0 for the second, so basically 1 chroma sample per 4 luma samples (or 4:1 compression for the chroma planes). But ANY of these values are typically encoded using 8, 10 or 12 bits each!!! So this has nothing to do with the bits per pixel, but with how many chrominance and luminance samples you store per pixel.
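To make those ratios concrete, here's a tiny Python sketch (illustrative only; the resolution and function name are just examples) that counts stored samples per frame for the three common schemes:

```python
# Samples stored per frame for the common chroma subsampling schemes.
# The resolution is arbitrary; only the ratios matter.
def yuv_sample_counts(width, height, scheme):
    luma = width * height
    if scheme == "4:4:4":        # every luma sample gets its own chroma pair
        chroma = width * height
    elif scheme == "4:2:2":      # chroma halved horizontally only
        chroma = (width // 2) * height
    elif scheme == "4:2:0":      # chroma halved horizontally and vertically
        chroma = (width // 2) * (height // 2)
    else:
        raise ValueError(scheme)
    return luma, chroma          # chroma count is per plane (Cb and Cr each)

for scheme in ("4:4:4", "4:2:2", "4:2:0"):
    y, c = yuv_sample_counts(1920, 1080, scheme)
    print(f"{scheme}: {(y + 2 * c) / y:.2f} bytes per pixel at 8 bits per sample")
# Prints 3.00, 2.00 and 1.50 bytes per pixel respectively.
```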
@@luisjalabert8366 The amount of unreflective, adulatory brown-nosing I need to plough through to get to the one informed comment/reply from someone who immediately notices the same goofs as I do becomes increasingly baffling with every video... Usually I would take the same approach as you to get the point across, but to simplify it even further for here, I'd put it as: count how many chroma samples are stored for each 4x2 block of luma pixels.
Colorspace conversion from (8-bit) RGB to limited-range YUV and back to RGB: after those conversions you'll only get roughly 15% of the exact colors you started with, because of rounding errors, even with 4:4:4 subsampling (a rough sketch of the round trip follows below).
There's another step they might have omitted, which is how the 4:2:0 chroma gets reconstructed for display. For example, if you losslessly captured the same file played back with different media players (with dithering turned off), or with ffmpeg, they all might slightly disagree on how to display the same frame.
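For anyone curious about the round-trip losses mentioned above, here's a rough Python/numpy sketch (my own illustration, not anything from the video): it converts random 8-bit RGB values to limited-range 8-bit YCbCr using the usual BT.601 coefficients and back, then counts how many colors survive exactly. The exact percentage depends on the rounding and matrix used, so don't read too much into the specific number.

```python
# 8-bit RGB -> limited-range 8-bit YCbCr (BT.601) -> RGB round trip.
# Even with full 4:4:4 chroma (no subsampling at all), quantising to 8 bits
# in the YCbCr domain means many colors do not come back exactly.
import numpy as np

M = np.array([[ 65.481, 128.553,  24.966],   # Y
              [-37.797, -74.203, 112.000],   # Cb
              [112.000, -93.786, -18.214]])  # Cr
offset = np.array([16.0, 128.0, 128.0])

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(100_000, 3))

ycbcr = np.rint(rgb / 255.0 @ M.T + offset).clip(0, 255)        # forward, quantised to 8 bits
back = np.rint(((ycbcr - offset) @ np.linalg.inv(M).T) * 255)   # exact matrix inverse
back = back.clip(0, 255)

exact = np.all(back == rgb, axis=1).mean()
print(f"colors that round-trip exactly: {exact:.1%}")
```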
Thanks for these deep dives. If you want to avoid my pedantic nitpicks, feel free to ignore the rest of this (knowing GN viewers, this warning will only attract you).
It is true that rods do not differentiate color and that cones do, but the slide suggests that rods are responsible for luminance and cones for chrominance. This is not the case -- rods are for low-light "scotopic" vision, which is rarely used in modern life. If there is any appreciable illumination, the cones do all the sensing.
YUV is a very old colorspace from the NTSC analog broadcast days and there are better choices nowadays, but the point of having more resolution on the luminance/grayscale part of the image than on the chrominance is generally common among them. However, the subsampling rates of 4:4:4 or 4:2:0 do not refer to the bits used, and tbh I forget what they originally meant (they were related to some analog-centric way of color transmission). 4:4:4 has color at full rez, 4:2:0 has color at half-by-half rez, as was briefly mentioned. This terminology is also used in JPEG image compression.
I do not think any modern codecs are using pre-determined Huffman coding for symbols. They either use something adaptive and denser (e.g., H.264 has CABAC, an adaptive arithmetic encoder) or a simpler encoding.
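A small illustration (mine, not from the video) of why adaptive coding wins: an ideal arithmetic coder spends about -log2(p) bits on a symbol with model probability p, and the model adapts to what it has already seen. This only does the cost accounting; it is not a real encoder.

```python
# Cost of coding a skewed symbol stream under an adaptive model vs a fixed code.
import math
from collections import Counter

def adaptive_cost_bits(symbols, alphabet):
    counts = Counter({s: 1 for s in alphabet})   # Laplace-style initial counts
    total_bits = 0.0
    for s in symbols:
        p = counts[s] / sum(counts.values())
        total_bits += -math.log2(p)              # ideal cost of coding s under the model
        counts[s] += 1                           # adapt: this symbol is now more likely
    return total_bits

# A motion-vector-like stream that keeps repeating the same direction:
stream = ["right"] * 90 + ["up"] * 5 + ["left"] * 5
alphabet = ["right", "left", "up", "down"]

print(f"adaptive model  : {adaptive_cost_bits(stream, alphabet):.1f} bits")
print(f"fixed 2-bit code: {len(stream) * 2} bits")
# The adaptive model ends up spending well under 1 bit per symbol on the common case.
```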
Seldom do people get insights like this from industry professionals that can explain extremely complicated topics in a somewhat simple manner. Please keep making these! They are amazing.
Saw Tom talking about this exact topic on Intel Arc's channel a few weeks ago.
This is the very reason you can study this stuff at college. So many things we take for granted have a massive rabbit hole behind them.
I did a deep dive a few years ago when transcoding videos from 24p to 60p (basically HFR with frame interpolation).
Back then there was a lazy way to do it, which was interpolating on the fly every time you watched a movie, or you could do a real one-time conversion (a rough sketch of the idea is below).
The perceived quality increase is massive going from 24 frames to 60 in a movie. Sharp motions look insane.
Somehow people hated it back when The Hobbit did it, but it just feels more immersive/real.
Watching The Matrix in HFR was so much more fun. Just imagine all slow movement being sharp instead of blurry as it currently is.
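A minimal Python sketch of the 24 to 60 fps timestamp mapping (my own toy example; real interpolators estimate motion and warp pixels along it instead of doing this simple cross-fade):

```python
# Naive 24 -> 60 fps conversion: each output frame is a weighted blend of the
# two nearest source frames. Blending just cross-fades; motion compensation is
# what actually makes interpolated frames look sharp.
import numpy as np

def interpolate_fps(frames, src_fps=24, dst_fps=60):
    """frames: list of HxWx3 float arrays. Returns blended frames at dst_fps."""
    duration = len(frames) / src_fps
    out = []
    for i in range(int(duration * dst_fps)):
        t = i / dst_fps * src_fps                 # position in source-frame units
        a = min(int(t), len(frames) - 1)
        b = min(a + 1, len(frames) - 1)
        w = t - a                                 # 0 -> frame a, 1 -> frame b
        out.append((1 - w) * frames[a] + w * frames[b])
    return out

# Toy example: 24 tiny random "frames" become 60 blended ones.
frames = [np.random.rand(4, 4, 3) for _ in range(24)]
print(len(interpolate_fps(frames)))  # 60
```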
Great content, can't wait for the next episode and my own testing of PresentMon.
Please, please continue these; there is so little technical information available online that is presented this well.
I've ALWAYS wanted to know how a service like UA-cam can exist, how so much DATA can just sit there piling up on servers. This can maybe answer some of that!
They exist by burning money as a way to keep people invested in a larger ecosystem; YouTube and Google go hand in hand, much like Twitch and Amazon Prime have entangled perks. It's commonly called a loss leader.
...and in the long game, you gather absolutely unrivaled amounts of media that you can feed into your AI systems. And nobody can stop you from accessing it.
@@Lishtenbird and yet their AI systems are a joke xD
That presentation he showed was insanely helpful and well designed
Great presentation; really enjoyed the dive into the compression process!
Right click tab>Bookmark Tab...>Add new folder to bookmarks tab>save to "yt" folder along with L1 Show to watch later because attention is absolutely necessary. 🤘❤
I ignore Intel videos and skip Intel timestamps in videos as their hardware is meh...but I will watch anything with Tom Peterson in it. Just an expert in his field talking about the cool things he's working on and explaining it. I'm here for it.
Excellent video! I'll point friends who are new to video/image compression this way.
One thing to note, from haunting video compression forums for many years: while the fixed-function decode hardware on GPUs has been very fast and as fully featured as software decode since 2006 or so, and of course the same quality as software decoding, the same can't be said for encode even today. The x264 guys, to my best recollection, viewed GPU encode as a marketing exercise and saw very little speedup in leveraging GPU hardware themselves in comparison to other possible optimizations. Still today I'm not aware of any leveraging of GPU hardware by x265 or rav1e.
NVIDIA has the best GPU hw encode quality currently, and they're all somewhat better now, but not much. At least on NVIDIA and AMD, the encoders use much more of the actual programmable shader hardware than decode does (slowing games, etc.), they can't use many of the advanced features of the H.264/HEVC/AV1 formats, and they can struggle to compete on quality with 2-4 modern CPU cores running e.g. x264 --preset veryfast. If you can, try isolating a few cores and running software encode - you may be surprised, especially since you can still use the decode hardware on the input video if it's compressed.
A good way to explain what the 2D DCT is, is to make an analogy with sound and a spectrum analyzer. As the pitch of a sound rises while maintaining the volume, the bars of the analyzer shift to the right without changing height, and if you have a sound consisting of two tones combined, the analyzer displays two separate spikes. The 2D DCT is exactly the same as a spectrum graph, except it's in two dimensions (width and height) rather than just one (time). Both images and sounds are signals that can be processed.
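To see the analogy in code, here's a quick numpy/scipy sketch (illustrative only, using a made-up smooth 8x8 block): the 2D DCT of a smooth block concentrates almost all of its energy in the low-frequency, top-left coefficients, just like a low-pitched tone shows up at the left of a spectrum analyzer.

```python
import numpy as np
from scipy.fft import dctn  # 2D (separable) type-II DCT

# A smooth gradient block, like a patch of sky, plus a little noise.
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 100 + 5 * x + 3 * y + np.random.default_rng(0).normal(0, 1, (8, 8))

coeffs = dctn(block, norm="ortho")
energy = coeffs**2
print(f"energy in the 4 lowest-frequency coefficients: {energy[:2, :2].sum() / energy.sum():.1%}")
# For a smooth block this is ~99%; the remaining high-frequency coefficients
# are exactly what quantization throws away later.
```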
This stuff always amazes me! But then at the same time, it's not like it just magically came together all at once. Breaking it down step by step puts it into much much simpler and realistic perspective.
Just last year I had a class in university about video encoding. 40 seconds into this video it was already worth it, the 2D discrete cosine transform was part of the exam. The class covered everything up to the Huffman encoding although of course in much more detail. The Huffman encoding was part of a different class.
Exciting to see my university classes have some real life uses.
I feel like if I didn't have an education that involved Fourier transforms on images and the frequency space then all of this would've been lost on me. It's really awesome to see something this in-depth on UA-cam and in the real world and not a hypothetical scholastic video.
This upload/discussion is priceless. I had wondered about YT's compression. Thank you for sharing this discussion. 🌸
Probably one of the best introductions to "how does encoding work" I've seen. Sadly it gets A LOT more complicated once you have to deal with the media formats.
You can easily "destroy" the color information in a video without even realising it and lose a lot of quality, because you converted from A to B to C and back to A again, etc. Very common on UA-cam, since 99% of all UA-camrs have no idea what they are doing when they record video, process it in something like Premiere / Davinci, and upload it to UA-cam.
Really cool video; I don't see enough talk online about the genius of video coding (and even less that is as well explained).
I know this is for the sake of simplification, but having literally done a PhD on inter-picture prediction, I find any claims of how much this step reduces size to be weird, because you're missing out on how prediction is all about the compromise between accurate prediction with good motion vectors and cheaply coded ones. For example, a good encoder will not take the best prediction (the one that results in the smallest residual); it will take one that is good enough for the level of quality but is cheaper to code (by being less precise or being the same as the block next to it).
The symbol coding is really one of the most complex and genius parts of modern methods. First, Huffman has been dead for years (only used in the likes of JPEG and low-power AVC); everyone uses arithmetic coding, which means that symbols can take less than one bit and the encoder basically figures it out with some complex math. But the real shit is how, instead of the fixed probabilities/coding that were implied in the video, it is adaptive and will "learn" depending on what you have been feeding it so far. So if, let's say, you keep having the same direction for your motion vector, it will cost fewer bits if you use it multiple times. There are now dozens of bins that store how they have been used and adapt the coding through the image.
The residual part, even if it looked hard in the video with that good old DCT, is actually the easiest thing to implement, as it is pretty straightforward. Getting the right motion vectors takes a huge amount of time, and hardware has to make a lot of compromises to limit the search space, since it typically requires real-time (and at least consistent) performance and can't afford to spend extra time on an image where it doesn't find something good easily (a toy full-search version is sketched after this comment). There are many papers on how to optimize the search, and for hardware in cameras I have also seen the use of motion sensors as side-channel information to guide the search (nice for a camera that moves a lot, like a GoPro).
Last note, which is more of a pet peeve of mine: chroma subsampling currently has no reason to be used outside of being nicer on hardware. You will always lose quality and not even get any actual efficiency; 4:4:4 with a bit more aggressive quantization on the chroma components usually looks better than 4:2:0 at the same bitrate, but it does use more memory and hardware circuits (though the circuit cost is nowhere near double, unlike the memory; a lot of hardware just uses luma for everything intensive, then chroma gets quantized and dealt with with minimal care).
Considering it was done by someone with very little experience in the field, I was still very pleasantly surprised at how accurate this was for such a short explanation.
Just a disclaimer on the hardware claims: I have not worked with Intel encoding/decoding hardware and will not mention the ones I used because of NDA (including some projects that aren't yet released / were canned).
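For readers who want to see what a (very naive) motion-vector search looks like, here is a toy full-search block matcher in Python, as referenced above. Real encoders prune the search heavily and weigh residual quality against the cost of coding the vector; this sketch just minimizes SAD.

```python
import numpy as np

def best_motion_vector(prev, cur, bx, by, bsize=16, search=8):
    """Full-search SAD block matching for the block at (bx, by) in `cur`."""
    block = cur[by:by + bsize, bx:bx + bsize].astype(np.int64)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = by + dy, bx + dx
            if y0 < 0 or x0 < 0 or y0 + bsize > prev.shape[0] or x0 + bsize > prev.shape[1]:
                continue  # candidate block would fall outside the reference frame
            sad = np.abs(block - prev[y0:y0 + bsize, x0:x0 + bsize]).sum()
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

# Toy frames: "cur" is "prev" shifted right by 3 pixels, so the best vector is (-3, 0).
rng = np.random.default_rng(0)
prev = rng.integers(0, 256, (64, 64))
cur = np.roll(prev, 3, axis=1)
print(best_motion_vector(prev, cur, 32, 32))  # ((-3, 0), 0)
```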
Super informative - brilliant work Tom and Steve!
11:40 A note about chroma subsampling: it's primarily useful for live video transfer, where you need to get raw data across a studio or building. When it comes to actual video compression, the heavy lifting is done by the encoder, so chroma subsampling doesn't actually do much. This is why I have been frustrated at our insistence on using 4:2:0 this whole time versus 4:4:4, or even just staying in RGB. You can often have a 4:4:4 full-range 10-bit video be *barely* larger than a 4:2:0 limited-range 8-bit video, but it can store SO much more information and be more accurate to the source material. It would be way better for VR streaming, since the color range would be better and colors would be sharper.
That was amazing. I feel like I just went back to uni and my favourite lecturer explained some of the most obtuse information in a way we could all understand. Bravo Tom, Steve and Gamers Nexus.
16:33 The frequency quantization is really complicated; the best way I've heard it explained is as a map of comparison. Basically you're comparing each pixel value to the pixels next to it to create a map of where the biggest differences in contrast are in the photo. It's basically a mathematical way of figuring out what the most important data in the image is; then you can discard the information below a certain threshold, and when you do the math backwards, converting it back to an image, it looks like almost nothing has happened.
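A small sketch of that quantization idea (my own toy example, not from the video): transform a smooth 8x8 block, divide the coefficients by a step size and round (this rounding is the lossy part), then transform back and see how little the pixels changed. Real codecs use a whole matrix of step sizes, weighted by how visible each frequency is.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(1)
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 100 + 4 * x + 2 * y + rng.normal(0, 2, (8, 8))  # smooth-ish 8x8 block

coeffs = dctn(block, norm="ortho")
step = 10.0
quantized = np.round(coeffs / step)      # rounding here is where information gets discarded
print(f"coefficients that became zero: {int((quantized == 0).sum())} / 64")

reconstructed = idctn(quantized * step, norm="ortho")
print(f"max pixel error after the round trip: {np.abs(reconstructed - block).max():.2f}")
```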
This is greatly informative. I'm halfway through the video, and it's helped demystify concepts that were previously very fuzzy or esoteric. Thanks.
One thing that was not mentioned when they talked about temporal difference is "sun goes down". A shifting light source *changes the lighting* all over the scene (as opposed to "car drives by static house with ambient lighting"). In other words, temporal prediction is more about motion/positional differences than overall lighting changes. I understand why they did not deep dive into that, as it is a whole tangent, but it's good to be aware of. Very well done though!
Loving your teaching videos. The info we've all been looking for. Brilliant work, mate. Love from Aus.
A subject close to my heart. My entire niche is providing the best possible video quality for viewers. Working through how UA-cam compresses things to get the best out of the other end has been a long process. This examination of the pipeline is amazing 😊
I now film in all-intra 4:2:2 10-bit, export at 100,000 kbps H.265, and UA-cam seems to do OK with that. The only frontier left is the banding that comes with shallow gradients as they transcode to 8-bit.
Otherwise you can see the results for yourself. Pick 4k quality if you do.
A slight correction on chroma subsampling: it's not the bits that are subsampled, but the resolution of the chroma image components. 4:2:2 means that the resolution of the chroma components is halved horizontally compared to the original, and 4:2:0 means it's halved in both directions.
Which means for e.g. 4:2:0 - which is the common consumer distribution format - you don't have, say, a 1920x1080 image where each pixel is 12 bits instead of 24, but one 1920x1080 black-and-white image where each pixel is 8 bits, and then two images for the chroma components, where pixels are also 8 bits but where the resolution is 960x540 for each.
The chroma images then have to be upscaled back at decoding time to match the luma resolution (and just for fun, different formats also have different specs about how the upscaled chroma pixels are aligned with the full-resolution luma ones).
Some video renderers will allow you to choose between different algorithms for that chroma upscaling step.
And an interesting bit of trivia: one of Tom's slides shows 4 bytes/pixel for 10-bit HDR video, but the P010 pixel format most commonly used by GPUs for that (at least on Windows) actually stores 10 bits of sub-pixel information *padded to 16*! So if, for example, you use copy-back decoding mode (where the decoded image is transferred back to the CPU), 10-bit HDR video actually uses *two times* the bandwidth of 8-bit video instead of 1.25 times!
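A quick back-of-the-envelope check of that claim (just arithmetic for planar 4:2:0 layouts; NV12 is the usual 8-bit format, while P010 pads each 10-bit sample to 16 bits):

```python
def planar_420_bytes(width, height, bytes_per_sample):
    luma = width * height * bytes_per_sample
    chroma = 2 * (width // 2) * (height // 2) * bytes_per_sample  # Cb + Cr, quarter res each
    return luma + chroma

nv12 = planar_420_bytes(1920, 1080, 1)  # 8-bit samples
p010 = planar_420_bytes(1920, 1080, 2)  # 10-bit samples stored in 16 bits
print(nv12, p010, p010 / nv12)          # 3110400 6220800 2.0
```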
Thanks Steve!
That was FASCINATING!
I do not do low level coding. I do not do data compression nor transforms. My signal analysis basis is crude at best. I had some calc very, very long ago.... STILL that was amazing and accessible at a 30,000 foot view!
Thanks to Tom and to you and the entire GN crew.
Peaceful Skies.
We need more of these with Tom. Make him come on more often please
Loving all of the Tom Peterson deep dive content
This series with Tom Petersen has been fantastic. I would definitely be interested in more of this kind of thing.
16 times the compression, it just works. It's pretty cool.
Hey GN! I think a "How They're Made" series on all the components of a PC would be cool. Seeing how air coolers are manufactured and the science behind them, how a motherboard starts as just a PCB, etc.
Fantastic video to finish the series. For all the folks wondering what these compression techniques look like quantified with an example: Resident Evil 2 on the PlayStation clocked in at 1.5 GB. Angel Studios made that fit onto a 64 MB Nintendo 64 cartridge.
Tom Scott has made a great video explaining Huffman encoding. The gist of it is that the more common a pattern is, the fewer bits we use to encode an occurrence of that pattern. It's a clever way to generate an encoding for each pattern such that it achieves that.
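For the curious, here's a compact (and simplified) Huffman code builder in Python, just to show the "common patterns get shorter codes" mechanics; real formats use canonical codes and, as noted elsewhere in the comments, modern codecs mostly use arithmetic coding instead.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    # Each heap entry: [total frequency, tie-breaker, {symbol: code-so-far}]
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # least frequent subtree gets a leading "0"
        hi = heapq.heappop(heap)   # next least frequent gets a leading "1"
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tie, merged])
        tie += 1
    return heap[0][2]

print(huffman_codes("aaaaaaaabbbccd"))
# 'a' (most common) gets a 1-bit code; the rare symbols get 3-bit codes.
```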
To really understand the frequency domain and quantization part, you kinda need to know the mathematical background of Fourier series and the Fourier transform (the discrete cosine transform used here is very similar to the "real" Fourier transform as far as I understand it, same concept). 3blue1brown has made great videos explaining those.
The basic idea is that we can think of any data as being generated by a sum of an infinite number of sine or cosine waves; in practice you'd use a finite number of waves, and the more waves you use, the closer you get to the original data. So you have, for example, a 1 Hz wave multiplied by some factor a, a 2 Hz wave multiplied by some factor b, a 3 Hz wave multiplied by some factor c, etc. For each frequency you have a value for how strong that wave's influence on the generated data is; you just multiply each wave by its factor and add them all together to get the data (getting back the original data from the frequency-domain data like this is called an inverse Fourier transform).
If you were working in one dimension, such as with (mono) audio, the frequency-domain graph would have frequency on the x axis (e.g. 0.1 Hz to 10000 Hz or whatever) and, on the y axis, the "strength" of each frequency, the multiplier. When working with images, you need to do this in two dimensions, so you have waves going in the x direction and waves going in the y direction. The frequency-domain graph that Tom showed had the frequencies on x and y, and the strength of each frequency was indicated by the brightness (darker = stronger).
We apply the Fourier transform to get this frequency-domain representation, then we can manipulate it based on the frequencies - in this case cut off all the high frequencies (effectively removing the small, high-frequency details of the image data) - then apply the inverse Fourier transform to get image data back (in this case, of course, it's not the image itself, it's the residuals). Exactly the same thing happens when you apply a low-pass filter in an audio editing program: it uses the Fourier transform, manipulates the data in the frequency domain, then uses the inverse Fourier transform.
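Here's a tiny 1D version of that in Python (my own illustration, using the DCT rather than the full Fourier transform, same idea): transform a signal, keep only the first few frequency coefficients, invert, and watch the approximation improve as more waves are kept.

```python
import numpy as np
from scipy.fft import dct, idct

n = 64
t = np.linspace(0, 1, n)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 11 * t)

coeffs = dct(signal, norm="ortho")           # strength of each cosine "wave"
for keep in (4, 8, 16, 32):
    truncated = np.zeros_like(coeffs)
    truncated[:keep] = coeffs[:keep]         # throw away the higher frequencies
    approx = idct(truncated, norm="ortho")   # rebuild the signal from fewer waves
    print(f"keep {keep:2d} of {n} coefficients -> max error {np.max(np.abs(approx - signal)):.3f}")
```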
This and the other recent vids with TAP were seriously awesome, I hope to see more of him on the channel again!
I've personally waited decades for this video. Thank you Steve, thank you Tom!
This is incredible stuff, thank you for putting these conversations together!
Learning about compression and encoding/decoding requirements when I started running Plex and Jellyfin almost broke me.
These sessions with engineers like Tom Petersen are awesome...
I literally just watched a 30-minute video from a channel named Theo of him reading an article about how H.264 works, and it couldn't have been a more appropriate thing to watch to prepare for this video you guys just posted.