*The new Compressor Head episode is finally here!*
We know you've been waiting on pins and needles for the next episode of Compressor Head, and it's finally here! Colt McAnlis returns as the algorithm-obsessed geek who loves to share his knowledge with the programmers of the world. He's joined by fellow Googler Mike Burrows, co-inventor of the Burrows-Wheeler transform, an almost magical algorithm that will turn your head inside out.
Stay tuned to the Google Developers UA-cam channel (ua-cam.com/users/googledevelopers) for the soon-to-be-released episodes 5 and 6, coming your way Friday mornings from the Googleplex in Mountain View, CA.
#CompressorHead #PerfMatters
Nice
chrome-dome nerd
Compressor Head is BACK! And I'm going "bananas" over Episode 4!
(....you won't get the joke until you watch the episode....)
Karthik Kumar Viswanathan I've ranted about that exact problem before. There's a plethora of existing formats out there to handle this; devs just aren't embracing them (Protocol Buffers, gRPC, Cap'n Proto, FlatBuffers, etc.)
Great video, one error:
Magnus does the sort wrong, and the final matrix comes out wrong as a result:
at 7:58, for example, with 3 characters in each row, Magnus sorts NAN before NAB. The final matrix (8:17) has last column BNNAAA (which doesn't match the original last column); it should be NNBAAA if the sorts are done correctly in the invert phase.
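For the record, a quick sketch of the naive sort-based transform and inverse confirms the commenter's correction (function names are my own, not from the video):

```python
def bwt(s):
    """Forward BWT: sort all rotations of s, take the last column,
    and remember which row holds the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations), rotations.index(s)

def inverse_bwt_naive(last_col, index):
    """Invert by repeatedly prepending the last column and re-sorting,
    exactly as demonstrated in the episode."""
    table = [""] * len(last_col)
    for _ in range(len(last_col)):
        table = sorted(c + row for c, row in zip(last_col, table))
    return table[index]

last, idx = bwt("BANANA")
print(last, idx)                      # NNBAAA 3
print(inverse_bwt_naive(last, idx))   # BANANA
```

With a correct lexicographic sort the last column of BANANA's rotation matrix really is NNBAAA, and the naive invert recovers the original string from it.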
that's why Compressor Head said: "I don't think he got a shot at that one".
These videos are amazing. I really like the amount of detail and the length that they have. Keep up the good work!
I remember reading that article in Dr. Dobb's Journal, and implementing this algorithm was a great exercise. It's a very special algorithm.
I _loved_ that story about declining to re-submit the paper, and saying "it's my policy not to explain these decisions." Golden!
I can't find the words to describe this. This is so interesting and fun. You're such an amazing head man! Thanks.
Thanks, Google! Glad to hear words directly from Sir Michael Burrows.
This was released 8 years ago and I am still able to learn from it. Thank you, team Google!
Entropy doesn't "fail to take into account the order of the symbols"; that's a problem with character codes, not entropy. Entropy is defined for the joint distribution, order and all.
what was the point of 2:55?
My left ear loved that intro.
Another awesome episode!
Why is the entropy of 0 to 9 said to be 4 at 1:28? 3.32 is what I calculated (base 2).
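For what it's worth, both numbers have a reading: the Shannon entropy of ten equally likely symbols is log2(10) ≈ 3.32 bits, while a fixed-length binary code (which may be what the slide means) needs ceil(log2 10) = 4 bits per symbol:

```python
import math

symbols = 10                     # digits 0-9, assumed equally likely
entropy = math.log2(symbols)     # Shannon entropy: ~3.32 bits per symbol
fixed_bits = math.ceil(entropy)  # fixed-length binary code: 4 bits per symbol
print(round(entropy, 2), fixed_bits)  # 3.32 4
```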
Will have to put #houseofcards on hold for a bit to watch Compressor Head!
Whoa, this stuff is cool! I was following along with the delta coding part, and I might be totally wrong about this, but isn't the delta coding of [9,2,1,3,4,8,0,6,7,5] equal to [9,-7,-1,2,1,4,-8,6,1,-2]? Your calculation has -1 at the end. Great video, BWT really is something else.
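The commenter's arithmetic checks out; a minimal delta-coding sketch (helper names are my own):

```python
def delta_encode(values):
    """Store the first value, then each successive difference."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original list by running sums."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

data = [9, 2, 1, 3, 4, 8, 0, 6, 7, 5]
print(delta_encode(data))  # [9, -7, -1, 2, 1, 4, -8, 6, 1, -2]
assert delta_decode(delta_encode(data)) == data
```

The last delta is 5 - 7 = -2, as the commenter says.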
Hey man, great video, really entertaining!
Also... "Competetive sorter" is misspelled; the correct spelling is "Competitive".
Error around 7:50: Linux implementation uses Huffman, not arithmetic coding. Great video.
It originally did use Fenwick's arithmetic coding, but the good news is that bwtc was invented, which omits the first RLE step of bzip1 and bzip2 and uses arithmetic coding by default.
Great job!! I love Compressor Head. Keep feeding my grandkids. Hey, wasn't Magnus riding a bike last we saw? I/O Bytes? Is he going for a triathlon thingy?? GO Mag ~(8^(!)~
8:37 The last letter of the first row, 'B', should be swapped with the last letter of the 3rd row, 'N'.
Great to know about BWT!
From Coursera Algorithms II; the specification of the last assignment is pretty arcane...
One part I did not understand about the sort was which way he was sorting the data. Does it have to be sorted using a specific sorting method (first to last, in clusters, both ends at the same time) to work? Or does it just have to be the same sorting method used in sorting and unsorting?
You want a lexicographical sort (A-Z sorting). It doesn't matter what method you use, as long as it lexicographically results in the same matrix.
There are other crazy variants of BWT that use different types of sort methods in order to achieve different results (cryptographic, etc.), but we didn't have time to go into those ;)
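To illustrate the point: any sorting algorithm works as long as the resulting matrix is in lexicographic order. A sketch comparing Python's built-in sort against a hand-rolled insertion sort (names are my own):

```python
def rotations(s):
    """All cyclic rotations of s."""
    return [s[i:] + s[:i] for i in range(len(s))]

def insertion_sort(rows):
    """A deliberately different sort algorithm: the BWT only cares that
    the output is in lexicographic (A-Z) order, not how you got there."""
    out = []
    for row in rows:
        i = 0
        while i < len(out) and out[i] < row:
            i += 1
        out.insert(i, row)
    return out

rows = rotations("BANANA")
assert insertion_sort(rows) == sorted(rows)  # identical matrix either way
print(insertion_sort(rows))
```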
Thanks
Colt McAnlis I think there is a better way of inverse-transforming the BWT output by following the index, rather than sorting it over and over again. It would be great if you shared that in the next episode! :)
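The index-following idea being hinted at here is usually called the LF mapping; a minimal sketch, assuming the row index of the original string is transmitted alongside the last column (function names are my own):

```python
def inverse_bwt_lf(last_col, index):
    """Invert the BWT with one sort plus a linear walk, instead of
    re-sorting the whole matrix n times."""
    # Sorting (char, position) pairs lines the last column up with the
    # first column; links[row] is the last-column position holding the
    # same character occurrence as row's first character.
    links = [pos for _, pos in sorted((c, i) for i, c in enumerate(last_col))]
    out = []
    row = index  # row of the original string in the sorted matrix
    for _ in range(len(last_col)):
        row = links[row]
        out.append(last_col[row])
    return "".join(out)

print(inverse_bwt_lf("NNBAAA", 3))  # BANANA
```

One O(n log n) sort and a linear chase replace the n full re-sorts of the naive invert, which addresses the Harry-Potter-sized decoding concern raised further down the thread.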
Jelou odsa Lots of perf optimizations/approaches that we didn't get to talk about in this episode. My favorite has to be RadixZip.
At 7:58 I think the sorting was wrong: 1. [ABA] 2. [ANA] 3. [ANA] 4. [BAN] 5. [NAN] 6. [NAB]. It should be: 1. [ABA] 2. [ANA] 3. [ANA] 4. [BAN] 5. [NAB] 6. [NAN]. So if you continue the sorting, the result will be: 1. [NABANA] 2. [NANABA] 3. [BANANA] 4. [ABANAN] 5. [ANABAN] 6. [ANANAB]. BANANA is in the third line, but when you do the lexical sorting at 4:10, you have BANANA on the fourth line. Sorry for my English :)
I agree with you.
8:09 The third row doesn't have a 'B' in it?
This shit is Bananas
I am 'gelous', and not because it's a pun :) :)
Can't explain how they came up with it = they killed someone and stole it
BWT would be a great way to prep data for RLE. I know RLE; am I in kindergarten or what?
bzip1, bzip2, and bwtc do use RLE after BWT, while bzip1 and bzip2 also have RLE as the first step.
I'm sorry, but I don't think that sorting everything back N times is very efficient; if I wanted to compress the whole Harry Potter book, it would take a lot of time to decode.
And again, thank you Google. Go free education about cool stuff that teachers don't understand themselves!
Why Move-To-Front rather than RLE after the BW transform? It seems like the clustering of BWT makes it a perfect candidate for applying RLE, followed by e.g. Huffman encoding?!
Casper Bang
RLE works best as a statistical encoder when there are long runs of similar characters that aren't oddly broken up.
Remember that BWT doesn't output homogeneous runs of characters. There's no guarantee of that. You may (luckily) end up with a run of 20 of the same character, only to be disrupted by a 20-character run of individual, non-run characters.
MTF takes these kinds of statistical changes into account. If you get 20 "E"s broken up by a single "Z", you don't lose as much encoding efficiency as you would having to restart a run in RLE mode.
And to be honest, there are really cool ways to merge MTF and RLE together to make some fancy algorithms. RadixZip introduced something like that: citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3977
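The "20 E's broken by a single Z" scenario is easy to see in a minimal move-to-front sketch (my own code, not from RadixZip): the interloper costs one large index instead of breaking a run in two.

```python
def mtf_encode(data, alphabet):
    """Move-to-front: emit each symbol's current position in the table,
    then move that symbol to the front."""
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

# A run of E's broken by a single Z: almost everything maps to 0,
# and the Z costs exactly one large index.
print(mtf_encode("EEEEZEEEE", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
# [4, 0, 0, 0, 25, 1, 0, 0, 0]
```

A stream dominated by zeros like this is exactly what a statistical coder (Huffman, arithmetic) compresses well.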
Colt McAnlis Ah, I see, thanks for the info - just used my ACM subscription to get to the RadixZip paper. It's the second time I've heard you mention it, so it must be a worthy read. :) Just hope it isn't too encumbered by patents and the like.
The strongest encoding methods for BWT do not use any secondary transform at all. M03, for instance, generally achieves the strongest compression of all the BWT encoding techniques, and it uses no secondary transforms whatsoever. Instead, the symbols are encoded as they appear in the BWT, but using the statistics gathered from the contexts under which each symbol appeared within the original stream.
see:
www.michael-maniscalco.com/download.htm
www.juergen-abel.info/files/preprints/preprint_post_bwt_stages.pdf
This was actually painful to watch. lol