Burrows-Wheeler Transform (Ep 4, Compressor Head) Google

Поділитися
Вставка
  • Опубліковано 28 жов 2024

КОМЕНТАРІ • 50

  • @GoogleDevelopers
    @GoogleDevelopers  9 років тому +96

    *The new Compressor Head episode is finally here!*
    We know you've been waiting on pins and needles for the next episode of Compressor Head, and it's finally here! Colt McAnlis returns as the algorithm obsessed geek who loves to share his knowledge with the programmers of the world. He’s joined by fellow Googler Mike Burrows, co-inventor of the Burrows-Wheeler transform, an almost magical algorithm that will turn your head inside out.
    Stay tuned to the Google Developers UA-cam channel (ua-cam.com/users/googledevelopers) for the soon to be released episodes 5 and 6, coming your way Friday mornings from the Googleplex in Mountain View CA.
    #CompressorHead #PerfMatters

  • @ColtMcAnlis
    @ColtMcAnlis 9 років тому +41

    Compressor Head is BACK! And I'm going "bananas" over Episode 4!
    (....you won't get the joke until you watch the episode....)

    • @ColtMcAnlis
      @ColtMcAnlis 9 років тому

      Karthik Kumar Viswanathan I've ranted about that exact problem before. There's a plethora of existing formats out there to handle this, devs just aren't embracing them (protobuffs, gRPC, captn proto, flatbuffers etc)

  • @danielschreck9083
    @danielschreck9083 4 роки тому +7

    great video, one error:
    Magnus does the sort wrong, and the final matrix comes out wrong as a result:
    at 7:58, for example, with 3 characters in each row, Magnus sorts NAN before NAB. The final matrix (8:17) has last column BNNAAA (doesn't match the original last column), should be NNBAAA if sorts are done correctly in the invert phase.

  • @jackninja1
    @jackninja1 9 років тому +2

    These video's are amazing, really like the amount of detail and the length that they have, keep up the good work!

  • @vonkruel
    @vonkruel 9 років тому +10

    I remember reading that article in Dr. Dobb's Journal, and implementing this algorithm was a great exercise. It's a very special algorithm.
    I _loved_ that story about declining to re-submit the paper, and saying "it's my policy not to explain these decisions." Golden!

  • @mgehad
    @mgehad 8 років тому +2

    I can't find the words to describe this. this such so interesting and fun. You're such an amazing head man! Thanks.

  • @yuvrajgarg3921
    @yuvrajgarg3921 5 років тому +1

    Thanks Google Glad to hear words directly from Sir Michael Burrows
    .

  • @amitsubhash1586
    @amitsubhash1586 Рік тому

    This was released 8 years aho and I am still able to learn from it , thank you team google!

  • @silkwurm
    @silkwurm 4 роки тому

    Entropy doesn’t “fail to take into account the order of the symbols”, that’s a problem with character codes, not entropy. Entropy is defined for the joint distribution, order and all

  • @matthewpobox
    @matthewpobox 3 роки тому +1

    what was the point of 2:55?

  • @BryceDoesLife
    @BryceDoesLife 7 років тому +1

    My left ear loved that intro.

  • @asedaaboagye5877
    @asedaaboagye5877 9 років тому

    Another awesome episode!

  • @simonchou4597
    @simonchou4597 2 роки тому

    Why the entropy from 0 to 9 is 4 at 1:28? 3.32 is what I calculated(base 2)

  • @runlaketahoe
    @runlaketahoe 9 років тому +2

    Will have to put #houseofcards on hold for a bit to watch Compressor Head!

  • @yohanrizk3398
    @yohanrizk3398 9 років тому +2

    Woah this stuff is coool! I was following along the delta coding part, and I might be totally be wrong about this , but isn't the delta coding of [9,2,1,3,4,8,0,6,7,5] equal to [9,-7,-1,2,1,4,-8,6,1,-2] ? your calculation has -1 at the end. Great Video, BWT really is something else.

  • @boushaft3459
    @boushaft3459 5 років тому

    Hey man, great video, really entertaining!

  • @yohanrizk3398
    @yohanrizk3398 8 років тому

    Also... "Competetive sorter" is misspelled , the correct spelling is Competitive

  • @heidimetzler
    @heidimetzler 6 років тому

    Error around 7:50: Linux implementation uses Huffman, not arithmetic coding. Great video.

    • @stgigamovement
      @stgigamovement 5 років тому +1

      It originally did use Fenwick arithmetic coding, but the good news is that bwtc was invented which omits the first RLE step of bzip1 and bzip2 and uses arithmetic coding by default.

  • @Duhdad
    @Duhdad 9 років тому +3

    Great job!! I Love Compressor Head. Keep feeding my grandkids. .Hey. Wasn't Magus riding a bike last we saw?? I/O Bytes? Is he going for a triathalon thingy?? GO Mag ~(8^(!)~

  • @rustix3
    @rustix3 3 роки тому

    8:37 The last letter of the first row 'B' should be switched with last letter of the 3rd row 'N'.

  • @RakhiDhavale
    @RakhiDhavale 9 років тому

    Great to know about BWT !

  • @zihuatanejo7741
    @zihuatanejo7741 4 роки тому

    from coursera algorithm Ⅱ,the specification of the last assignment is pretty arcane...

  • @josh2731
    @josh2731 9 років тому

    One part I did not understand about the sort was what way was he sorting the data. Does it have to be sorted using a specific sorting method(first to last, in clusters, both ends same time) to work. Or as long as it is the same sorting method used in sorting and unsorting

    • @ColtMcAnlis
      @ColtMcAnlis 9 років тому +1

      You want a lexicographical sort (A-Z sorting). Doesn't matter what method you use, as long as it lexicographically results in the same matrix.
      There's other crazy variants of BWT that use different types of sort methods in order to achieve different results (cryptographic, etc) but we didn't have time to go into those ;)

  • @diegonayalazo
    @diegonayalazo 3 роки тому

    Thanks

  • @jelouodsa1395
    @jelouodsa1395 9 років тому

    Colt McAnlis I think there is a better way when doing inverse transforming the BWT output by following the index ... than sorting it over and over again.... It would be great if you share that in the next episode... :)

    • @ColtMcAnlis
      @ColtMcAnlis 9 років тому +1

      Jelou odsa Lots of perf optimizations/approaches that we didn't get to talk about in this episode. My favorite has to be RadixZip.

  • @petergrajciar1504
    @petergrajciar1504 8 років тому

    In 7:58 I think , that it was wrong sorting. 1, [ABA] 2, [ANA] 3, [ANA] , 4, [BAN] 5, [NAN] 6, [NAB]. It must be: 1, [ABA] 2, [ANA] 3, [ANA] , 4, [BAN] 5, [NAB] 6, [NAN]. So if you will continue in sorting , than result it will be: 1, [NABANA] 2, [NANABA] 3, [BANANA] , 4, [ABANAN] 5, [ANABAN] 6, [ANANAB]. BANANA is in the third line , but when you doing lexical sorting on 4:10 , you have BANANA on fourth line. Sorry for English :)

  • @chestergreen8895
    @chestergreen8895 7 років тому

    8:09 third row doesnt have a 'B' in it?

  • @zrouth
    @zrouth 9 років тому

    This shit is Bananas

  • @SiddharthKulkarniN
    @SiddharthKulkarniN 9 років тому +1

    I am 'gelous', and not because it's a pun :) :)

  • @shawn576
    @shawn576 7 років тому

    Can't explain how they came up with it = they killed someone and stole it

  • @blueskyresearch6701
    @blueskyresearch6701 9 років тому

    BWT would be a great way to prep data for RLE. I know RLE am I in kindergarten or what.

    • @stgigamovement
      @stgigamovement 4 роки тому

      Bzip1, bzip2, and bwtc do use rle after bwt, while bzip1 and bzip2 also have rle as the first step

  • @delucabruno
    @delucabruno 8 років тому

    Im sory but I don't think that sorting everything back N times is very efficient, cause if I want to compress the whole Harry Potter book, it would take a lot of time to decode..

  • @Wandelaarke
    @Wandelaarke 9 років тому +2

    And again, thank you Google. Go free education about cool stuff that teachers don't understand themselves!

  • @CasperBang
    @CasperBang 9 років тому

    Why Move-To-Front rather than RLE after the BW transform? Seems like the clustering of BW makes it a perfect candidate for applying RLE to it, followed by i.e. Huffman encoding?!

    • @ColtMcAnlis
      @ColtMcAnlis 9 років тому +4

      Casper Bang
      RLE works best as an statistical encoder when there's long runs of similar characters, that aren't oddly broken up.
      Remember that BWT doesn't output homogenous runs of characters. There's no guarantee of that. You may (luckily) end up with a run of 20 of the same character, only to be disrupted by a 20 character run of individual, non-run characters.
      MTF takes this type of statistical changes into account. If you get 20 "E"s broken up by a single "Z" you don't lose as much encoding efficiency as you would having to restart a run in RLE mode.
      And to be honest, there's really cool ways to merge MTF and RLE together to make some fancy algorithms. RadixZip introduced something like that : citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3977

    • @CasperBang
      @CasperBang 9 років тому

      Colt McAnlis Ah I see, thanks for the info - just used my ACM subscription to get to the RadixZip paper, it's the second time I hear you mention it so it must be a worthy read. :) Just hope it isn't too encompassed by patents and the like.

    • @Michael_19056
      @Michael_19056 7 років тому

      The strongest encoding methods for BWT do not use any secondary transform at all. M03, for instance, generally achieves the strongest compression of all the BWT encoding techniques and it uses no secondary transforms whatsoever. Instead, the symbols are encoded as they appear in the BWT but using the statistics gathered from the contexts under which that symbol appeared within the original steam.
      see:
      www.michael-maniscalco.com/download.htm
      www.juergen-abel.info/files/preprints/preprint_post_bwt_stages.pdf

  • @vatofichor
    @vatofichor 9 років тому +4

    This was actually painful to watch. lol

  • @RakhiDhavale
    @RakhiDhavale 9 років тому

    Great to know about BWT !