Elegant Compression in Text (The LZ 77 Method) - Computerphile

  • Published 12 Sep 2024

COMMENTS • 309

  • @grayswandir47
    @grayswandir47 9 years ago +258

    I did some experimentation. If you open a zip or gz file with a hex editor there isn't anything human readable. You have to compress using an older format such as .lzo (Not sure anything in Windows will do this but it's a compress option in Fedora Linux.) Opening a .lzo file with a hex editor (I used ghex) shows much of the original text and it's easy to see where text is replaced by pointers as discussed in the video. You can see more frequent pointers as you move farther along in the document. The pointers in the .lzo file were two, three and four bytes long. I could not see any telltale that identifies a pointer versus text. There has to be some method for the decompression algorithm to distinguish between original text and a pointer. Thanks for posting the video. It stirred my curiosity.

  • @AlexanderKrivacsSchrder
    @AlexanderKrivacsSchrder 11 years ago +36

    I remember in my computer science classes on compression, we were taught Lempel-Ziv-Welch (LZW). I thought that one was pretty cool, because instead of looking for specific things it has seen, it just constantly makes a dictionary of combinations of characters it has already seen and can refer back to those as it goes along. The best part is that it doesn't have to store the dictionary anywhere, because the decompressor can rebuild it while it's decompressing, based on the compressed data.
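The LZW behaviour described in this comment can be sketched in a few lines (my own illustration, not from the video; the function names are hypothetical):

```python
# Toy LZW: the compressor grows a dictionary of strings it has seen;
# the decompressor rebuilds the identical dictionary from the codes alone,
# so the dictionary is never transmitted.

def lzw_compress(text):
    dictionary = {chr(i): i for i in range(256)}  # start with single bytes
    current, codes = "", []
    for ch in text:
        if current + ch in dictionary:
            current += ch                          # extend the current match
        else:
            codes.append(dictionary[current])
            dictionary[current + ch] = len(dictionary)  # learn a new string
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    previous = dictionary[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        # A code can refer to the entry being defined at this very moment,
        # in which case it must be previous + its own first character.
        entry = dictionary.get(code, previous + previous[0])
        pieces.append(entry)
        dictionary[len(dictionary)] = previous + entry[0]  # mirror compressor
        previous = entry
    return "".join(pieces)
```

Round-tripping a repetitive string such as `"TOBEORNOTTOBEORTOBEORNOT"` through these two functions reproduces the input while emitting fewer codes than there are characters.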

  • @zandaya
    @zandaya 10 years ago +353

    What's with all the hate here?? Can't you just appreciate his work? He's better than my university lecturers and I thank him for that.

  • @TheDave000
    @TheDave000 9 years ago +88

    Can I please have an evening in a pub with Professor Brailsford and just have him explain things to me? I wish he were my grandfather, or at the very least a mentor of some sort. He's just such an all-round excellent bloke. I think his computer science pupils are very lucky to have such a clear-speaking and warm-hearted teacher.
    Professor Brailsford is the best!

  • @ffggddss
    @ffggddss 7 years ago +160

    Actually, wouldn't that stored string - "computer" - include the leading space?
    (" computer")
    Because that, too, occurs in both instances.
    Also, it occurs to me that you'll never want to tag & remember a repeated string of 0, 1, or even 2 characters, because the "compressed" version of any string in this scheme, is 2 characters (bytes) long, so you're not actually saving anything until you have a string of 3 or more characters.
    This allows "encoding" the 4 bits that encode length to mean the numbers from 3 - 18, rather than the usual 0 - 15.
    These, if they are implemented in L-Z, are refinements that would properly be omitted from this overview, but I thought they might be worth bringing up here in the comments.
    And BTW, thank you for a very clear and concise explanation of how this scheme works!
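The re-mapping suggested in this comment is a tiny transformation; a sketch (the names and constants are mine, for illustration):

```python
# Matches shorter than 3 bytes never pay for a 2-byte pointer, so a 4-bit
# length field can be biased to cover 3..18 instead of 0..15.

MIN_MATCH = 3

def encode_length(length):
    """Map a match length of 3..18 onto the 4-bit field values 0..15."""
    assert MIN_MATCH <= length <= MIN_MATCH + 0xF
    return length - MIN_MATCH

def decode_length(field):
    """Map the 4-bit field values 0..15 back onto lengths 3..18."""
    return field + MIN_MATCH
```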

  • @jakovmatosic4890
    @jakovmatosic4890 9 months ago +5

    I love it so much when there's a computerphile episode for something I need to learn, makes it so much better than a random ppt tutorial.

  • @grandadie
    @grandadie 8 years ago +199

    Could a pointer point to a pointer instead of having to remember another instance of the word?

  • @Doniazade
    @Doniazade 9 years ago +89

    Wouldn't it be more efficient to have j be the distance to the END of the string you are referencing? That way you can reference strings (l-1) characters farther back with the same number of bits available for j.
    Basically, go j characters back and then take that character and the preceding (l-1) ones.

  • @etmax1
    @etmax1 8 years ago +46

    People often say that learning times tables etc. in this day and age is a waste of time, but I disagree. If you don't need to use a calculator, it means that you both intrinsically understand the numbers and you can be much faster. I knew an accountant who could add up 3-digit columns as fast as I can mentally do single digits. In his job that would have been worth gold. Knowing powers of 2 in computer science is similar.

  • @ruddyrdx
    @ruddyrdx 8 years ago +18

    Oh, how beautifully he explained the concept! Totally impressed! Now I only wish our professor in college had explained LZW this elaborately, but he did not.

  • @Computerphile
    @Computerphile  11 years ago +5

    Hopefully fixed now, thanks for spotting! >Sean

  • @seanld444
    @seanld444 8 years ago +3

    It's hilarious that people at school always say: "you shouldn't have to use your brain during summer!", when I am watching Computerphile, and enjoying learning about compression and stuff. I love this!

  • @Computerphile
    @Computerphile  11 years ago +6

    You can thank Sean for that!
    >Brady

  • @isaac10231
    @isaac10231 11 years ago +1

    I am absolutely loving this channel. Anywhere I look online for an explanation of something computer- or network-related, like TCP, I can't understand any of it. But this channel gives me a foundation to work from, to understand stuff like this.

  • @phookadude
    @phookadude 2 years ago +6

    One important thing missing there: you need some sort of identification that the 2 bytes are a reference and not just more characters. You could set the highest bit of the two-byte pointer to mark it, but that shows another problem: most text only uses the lower 7 bits of an 8-bit byte. If you want to compress things, it seems odd to leave that bit unused, since every character is 12% bigger than it needs to be.

  • @bobbyaldol
    @bobbyaldol 9 years ago +131

    Feels like I am listening to David Attenborough teach Computer Science.

  • @felineboy
    @felineboy 11 years ago +4

    Actually, along with the jump/length tuple, the LZ77 algorithm outputs an optional literal character. So, for example, the very first output of the compressor would be a jump/length tuple with a length of 0, followed by the first byte of the decompressed data.
    Implementations differ on how they specify when there's a literal character or how the length/pair tuple is encoded.
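The jump/length-plus-literal output described in this comment can be sketched like so (a toy illustration with made-up limits, not any particular real implementation):

```python
# Toy LZ77: each output item is (jump, length, literal). A length of 0 means
# "no match found"; a literal byte always follows, as the comment describes.

def lz77_compress(data, window=4095, max_len=15):
    i, out = 0, []
    while i < len(data):
        best_jump, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Compare until mismatch; stop early so a literal byte remains.
            while (length < max_len and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_jump, best_len = i - j, length
        out.append((best_jump, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    out = bytearray()
    for jump, length, literal in triples:
        pos = len(out) - jump
        for k in range(length):            # byte-by-byte so overlaps work
            out.append(out[pos + k])
        out.append(literal)
    return bytes(out)
```

Decompression simply replays the matches from its own output buffer and appends each literal, so the two functions round-trip any byte string.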

  • @fredhair
    @fredhair 6 years ago

    I think it speaks volumes about Prof Brailsford that he can simplify some of the most complex topics and explain them in a way that is a joy to listen to. However, when pressed for more details and information, he will gladly oblige you and probably tell you all there is to know on the subject! Brilliant man; I could listen to him talking about computers all day.

  • @MrCOPYPASTE
    @MrCOPYPASTE 11 years ago

    First of all I would like to thank you and your peers for the time you invest in these videos.
    One thing I didn't grasp was whether the implementation of the search for a given match is biased (prone to a given design) or a "closed analytic solution", and what the creators' approach was. Thank you for your time.

  • @Chris_Dare
    @Chris_Dare 9 years ago +27

    Question: what's stopping you from referencing the previous pointer if the 'jump' gets too long? Couldn't you point to the previous pointer and grab whatever that is pointing to? Assuming you want maximum compression and don't care about how long it takes to uncompress.

  • @BGBTech
    @BGBTech 11 years ago +2

    typically by flagging them somehow.
    in many variants, this is done by symbol range, like a certain range of symbols will be treated as special;
    some others stuff in extra bits to distinguish between raw characters and runs;
    a few also make raw characters the special case, where you either have a dictionary reference, or 1 to N raw characters.
    often, LZ77 is combined with Huffman, as in Deflate (example: ZIP and GZIP), or arithmetic coding (7zip / LZMA, at least sort of...).

  • @Bovineprogrammer
    @Bovineprogrammer 11 years ago

    I absolutely love LZ compression algorithms. Beautiful to code, beautiful to see in motion, and pretty darn effective too. Not only that, but it's easily tweaked to work with whatever data you're using just by changing a few constants.
    Fantastic.

  • @MECKENICALROBOT
    @MECKENICALROBOT 10 years ago +3

    This sounds literally like how we use our language and even our creative process, constantly referencing previous instances of a given word, idea, or concept... I'm really enjoying this channel.

  • @mandolinic
    @mandolinic 9 years ago +19

    Replacing some characters with a two-byte combined pointer is only half of the design challenge - the decoder still needs to know whether it's processing text or processing a pointer. This means that either you've got to put a special byte marker before the pointer, effectively changing a two-byte pointer to three bytes and reducing the compression ratio, or you've got to sacrifice some of the bits in the pointer to make a pattern that isn't going to occur normally in the text, which has further implications for the compression ratio.
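One common resolution of this trade-off (used by LZSS-style compressors; the code below is my own sketch, not a real file format) is to spend one flag *bit* per item rather than a whole marker byte, grouping eight flags into a single byte:

```python
# Eight items share one flag byte: bit k says whether item k is a literal
# byte (0) or a 12-bit-jump / 4-bit-length pointer (1). The overhead is one
# bit per item instead of a full marker byte per pointer.

def pack_flags(items):
    """items: sequence of ('lit', byte) or ('ptr', jump, length) tuples."""
    out = bytearray()
    for start in range(0, len(items), 8):
        group = items[start:start + 8]
        flags = 0
        for bit, item in enumerate(group):
            if item[0] == 'ptr':
                flags |= 1 << bit              # mark pointers in the flag byte
        out.append(flags)
        for item in group:
            if item[0] == 'lit':
                out.append(item[1])
            else:
                _, jump, length = item         # 12-bit jump, 4-bit length
                out.append(jump >> 4)
                out.append(((jump & 0xF) << 4) | length)
    return bytes(out)
```

Packing `[('lit', 65), ('ptr', 10, 3), ('lit', 66)]` yields five bytes: one flag byte (0b010), the literal 'A', the two pointer bytes, and the literal 'B'.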

  • @luketimothy
    @luketimothy 11 years ago +2

    I hope you have more of these Information Theory episodes coming! This was my favourite area of study at uni. (I actually do something related to it for a job now, but I still find these fun to watch) :)

  • @fredericweisbecker845
    @fredericweisbecker845 9 years ago +24

    I want to go back to university with this man as my professor!

  • @Arkalius80
    @Arkalius80 10 years ago +6

    The common Deflate algorithm used in the popular zip and gzip compression formats uses LZ77, followed by Huffman coding, which essentially encodes more common byte values as shorter bit strings, with a trade-off of uncommon byte values having longer than 8-bit bit strings, but ultimately leading to a net savings. This is why a zip file doesn't have any readable content in it. All of the data has been re-coded in a way that isn't readable.
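This is easy to observe with Python's standard `zlib` module, which implements Deflate:

```python
import zlib

text = b"to be or not to be, that is the question " * 20
packed = zlib.compress(text)

print(len(text), "->", len(packed))   # highly repetitive input shrinks a lot

# None of the original words survive: after Huffman coding, byte boundaries
# no longer line up with character boundaries, so the output looks like noise.
assert b"to be" not in packed
assert zlib.decompress(packed) == text
```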

  • @iabervon
    @iabervon 11 years ago +1

    There's actually a whole family of Lempel-Ziv-based techniques. The basic technique is to have a table of character sequences from the document that you can reuse when they come up again. LZ77 is a particular way of constructing the table, but there are others. Some use relative positioning, some absolute (with a "start anew from here" code, in case the file switches languages in the middle). There are also clever tricks with how you write lengths, and how you store the codes.

  • @powerhouseofthecell9758
    @powerhouseofthecell9758 10 years ago +16

    What Edawan said, could you make an episode about distinguishing pointers from the rest of the text?

  • @stephenhorton
    @stephenhorton 11 years ago +1

    Thanks for the video! Very interesting! To make the pointer even better: You'd never point to something less than 3 bytes, so the length part of the pointer could be seen as the length of the string minus 3, so you can represent up to 10 in 3 bits. Also why point to the start of the string? You're always going backwards, so point to the end, the rest can be worked out, meaning you can stretch 8202 characters back (13 bits + max length).

  • @Humineral
    @Humineral 11 years ago

    This is such a quality channel.

  • @ChrisWalshZX
    @ChrisWalshZX 11 years ago

    I've not looked into LZ but I am familiar with a lot of compression techniques.
    You are of course correct that there has to be some way of differentiating. This can be designed in two different ways.
    METHOD 1: Define a header exactly as you say which marks the following (say 16 bits) as a pointer. You will however need to 'escape' any chance of the same header appearing in a normal run of the text string. Escaping can be done with a different header.

  • @MathAndComputers
    @MathAndComputers 11 years ago +2

    You can take a bitmap image file and put it in a ZIP file to try it out, yourself. :) Some image files, like logos or diagrams, tend to compress quite well with this method, which is why PNG files use a similar compression method, but some image files, like photographs, tend to compress quite poorly with this method, which is why JPG files use a very different (and "lossy") compression method.

  • @styleisaweapon
    @styleisaweapon 11 years ago

    More compression requires either more time or more auxiliary memory. The most common compression method in use today is called DEFLATE (a combination of LZ77 and Huffman), popularized by the ZIP archive format. In the case of DEFLATE, the "extra compression" flag deals with the Huffman part of the algorithm and doesn't have a significant time cost, but it does have a significant memory cost because it uses a forest of Huffman binary trees rather than a single Huffman binary tree.

  • @fllthdcrb
    @fllthdcrb 11 years ago

    Good point. It illustrates a difference between the way we humans perceive the text and how the computer perceives it. We tend to see whole words and things starting on word boundaries, but a compressor usually matches things a naive person doesn't expect.

  • @lucdestrube
    @lucdestrube 11 years ago +2

    Love the addition of the forward slash at the end :)

  • @IceMetalPunk
    @IceMetalPunk 11 years ago +10

    This is interesting :) The only thing I wondered throughout this video is, when reading the compressed file, how do you know whether the byte you've read is the start of a 2-byte pointer or just a literal 1-byte character?

  • @stupossibleify
    @stupossibleify 9 years ago +18

    I'm not a computer scientist, but I still see a problem here which isn't dealt with in the video: you would surely have to discriminate the pointer from the other information in the file. In this case, is the binary encoding ASCII or an LZ jump vector, for example? It's an obvious question which needs answering. Great video nonetheless.

  • @steveohbandito
    @steveohbandito 10 years ago +8

    Learning your powers of 2 and 16 times table will help immensely. Also, to all the highschool students going to university/college for computer science, PLEASE, if you do anything, start programming assignments ASAP. CS assignments are usually never something that can be done the night before and quite often take many hours to complete. I'm only a 2nd year CS student but it's something that I've learned is very important to do.

  • @fllthdcrb
    @fllthdcrb 11 years ago

    Incidentally, a little off-topic, but there's another trick here. If we allow the copied string to overlap the current position, we can compress short repeated patterns fairly well. That is to say, the decompressor, when finding a backreference, can simply look back in its output buffer, start reading bytes, and copy them onto the end. If it runs over what was the current position when it started, it will pick up the start of the string and copy it again, thus repeating it for a while. (cont...)
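The overlap trick reads well as code; a sketch (my own, assuming a byte-at-a-time decoder):

```python
# A backreference may overlap the current end of the output. Copying one
# byte at a time re-reads bytes this same copy just produced, which turns
# a short pattern into cheap run-length encoding.

def copy_back(out, jump, length):
    pos = len(out) - jump
    for k in range(length):
        out.append(out[pos + k])      # may read a byte appended moments ago

buf = bytearray(b"ab")
copy_back(buf, 2, 6)                  # jump 2, copy 6: "ab" repeated
assert buf == bytearray(b"abababab")
```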

  • @KelvinShadewing
    @KelvinShadewing 8 years ago +8

    So, what if you were to use a sort of nested pointer system where a pointer refers to a previous one and so on until the first word that got compressed? Is that what would be done if the original word is more than 4096 spaces away? Like, say that first pointer to "computer" is decoded, then the next one is out of reach of the first instance but can reach the second one. Would it just take the word from the second one? Or read its location to jump further back to the first?
    Because I'm wondering, if you were to do a nested pointer like this, would the length of the pointer being shorter than the length of the original word affect the data of previous pointers? Or maybe it goes through, counts the words, and then replaces them with pointers later, so that when they're decompressed in order, the string lengths match up again?

  • @iabervon
    @iabervon 11 years ago

    In general, LZ produces a bit stream with longer-than-8-bit codes, of which some are 9-bit codes for 8-bit source bytes and some are 17-bit codes for "this is a pointer" plus 16 bits of pointer value. Then it takes the bit stream and chunks it into bytes for putting in compressed files. This is why LZ-compressed files look like random noise if you try looking at them as characters, while UTF-8 files look only slightly wrong if you don't know the encoding.

  • @TomatoBreadOrgasm
    @TomatoBreadOrgasm 11 years ago

    That story about why the professor learned his 16 times tables really goes to show that you never know what knowledge will be useful to you in the future. So many kids complain that they will "never use any of this", but they really don't know that.

  • @Celrador
    @Celrador 11 years ago +3

    At a certain length of document, the number of bytes needed to address the jump would be too big.
    You want to compress, after all. So with relative positioning you can still have infinitely long documents.
    If you take absolute positioning, you will at some point have to start using jump addresses that get bigger than the words you want to compress with them.

  • @typograf62
    @typograf62 9 years ago +19

    The IBM System/370 Reference Summary from 1984 has an error in the Hex conversion table. It says 4069 rather than 4096.
    If one notices that then it is probably time for a vacation.

  • @fllthdcrb
    @fllthdcrb 11 years ago +1

    Well, one must make a distinction between the compressed stream and the uncompressed stream. In the scheme presented, everything is in terms of the uncompressed stream, where there are no pointers (I'll call them backreferences). It's possible, at least theoretically, to reference things in terms of the compressed stream, or at least some form of it. But this would require the compressor to recursively resolve who knows how many backreferences. It might be fairly slow. (cont...)

  • @KubGov
    @KubGov 8 years ago +6

    What is the meaning of the 77 in LZ77?

  • @Grarrgle
    @Grarrgle 11 years ago +1

    This professor is so lovely. What a nice fella.

  • @Morturious
    @Morturious 11 years ago

    Good question. Next step -- google the difference between UTF-8 encoding and LZ compression. Generally speaking, this information is (as far as I am aware, anyway) in some way stated in the header of the file.

  • @Vazzible_gaming
    @Vazzible_gaming 25 days ago

    Guys who are in high school preparing for this kind of stuff are so lucky; I didn't know I wanted to get into tech. I was out of high school after taking wood shop every year (biggest mistake ever), and now I want to write operating systems programs in C. I literally screamed when I realized I didn't get anything out of going to high school. I learned how to glue wood together when I should've been learning pointers and Java. I still feel ignorant and stupid for not taking any tech classes or preparing for college in any way. Last 2 years of college now, and it feels like I'm climbing up a cliff that sits at a negative slope. No one ever pushed me into learning technology, so I had to do it by myself; I basically have no support structure.

  • @JohnSmith-sw1ng
    @JohnSmith-sw1ng 10 years ago +11

    I like how he doesn't want to waste the paper with the holes on the side, which hasn't been used since the 80's....

  • @Salabar_
    @Salabar_ 11 years ago

    We can add a single indicator bit before each symbol. So use "0%UNICODE" to represent characters (they will actually be 17-bit characters) or "1%POINTER". It will add insignificant overhead, but for most texts, compression will go pretty well.

  • @BGBTech
    @BGBTech 11 years ago

    short answer: better? it depends.
    also short answer: LZMA essentially is LZ77, just with a fairly large window and arithmetic/range coding.
    basically, many variants of LZ77 exist, differing in specifics like encoding, maximum length and pointer distance, and whether or not entropy coding is used. these can be used to balance speed, memory use, compression ratio, and other properties.
    usually, one tries to strike a good balance for the specific use case.

  • @Ma-pz2fy
    @Ma-pz2fy 7 years ago +14

    Can you create a pointer to a pointer?

  • @fllthdcrb
    @fllthdcrb 11 years ago

    In the end, your needs make a difference. If you're desperate to squeeze as much information into a given space as possible and don't care if it takes a while/a lot of computing power, you should try all sorts of tricks. Otherwise, you should trade off compression for speed.
    LZMA (see 7-Zip) and its derivatives are an example of really complex techniques to gain compression, although it's also more normally usable. RAR too, I suppose, but that's proprietary, so few know just what it does.

  • @DigGil3
    @DigGil3 11 years ago

    Pay attention to the text at 8:23 ! In the HTML coding, tags end with «

  • @eddiegerbais-nief7745
    @eddiegerbais-nief7745 7 years ago +8

    How do you know it's a pointer and not a regular character? You may need a header value and an escape character, no?

  • @AlexSimpsonsPage
    @AlexSimpsonsPage 11 years ago

    Not always the case, e.g. zip, gzip, etc. is lossless.
    There is usually a memory and computational overhead to the higher compressions, so the higher compressions for a particular algorithm will take longer to run.
    For the example in this video, the dictionary might be bigger and cover a longer range of characters and combinations. Looking everything up to see if it matches in a very very large dictionary could take longer.

  • @styleisaweapon
    @styleisaweapon 11 years ago

    Relative is better because there are strong correlations in locality. Consider an encyclopedia: very few articles will use the term "Space Shuttle" but the ones that do will often use it frequently. As the encoder/decoder progresses, Space Shuttle is more likely to be seen if Space Shuttle was recently seen but very unlikely to be seen if its been a long time since it was last seen. Exploiting correlations in locality is part of the broader topic of Adaptive Compression which is often superior.

  • @Ferrus91
    @Ferrus91 11 years ago

    I was taught LZW at uni, it was kind of interesting to see its predecessor explained

  • @ThePC007
    @ThePC007 2 years ago

    @Vix
    Not necessarily. By referencing the left end of the string you're referring to, you basically get run-length encoding for free.
    If I were to encode the string “AAAAA”, I'd then be able to do it as follows: “A”
    The algorithm would then copy the newly placed characters and decode the string properly.

  • @AsbjornGrandt
    @AsbjornGrandt 10 years ago +111

    He forgot the space character preceding "computer".
    But this is a brief and simplified explanation of how compression can work.
    The devil is in the details, of course.

  • @BGBTech
    @BGBTech 11 years ago

    Deflate: also used in GZIP, PNG, various network protocols, several video codecs, Minecraft region files, ... so, yeah, pretty common.
    internally, it uses 0-255 for literal bytes, and (IIRC) 256-285 for special things (EOB, also LZ77 runs). for runs, the Huffman-coded symbol (essentially) works as a prefix indicating how many bits follow, with the prefix+bits encoding a value (run length and distance).
    and, also nifty: it Huffman encodes the Huffman table.

  • @Tasarran
    @Tasarran 11 years ago

    With lossy compression, the computer isn't trying to match the match string exactly, it just tries to get close.
    That percentage is usually the amount of similarity allowed; the looser a 'match' is allowed to be, the more compression you will have, and the more loss.
    It's usually acceptable in things like images, where the human eye can't really tell the difference between, for example, 255,255,255 (white) and 253,254,255 (nearly white)...

  • @JerehmiaBoaz
    @JerehmiaBoaz 11 years ago

    There are different ways of distinguishing a literal from a pointer during compression. The easiest way is to always put out a pointer even if it points to a zero length sequence, followed by the first byte that wasn't part of the match. While decompressing you read the pointer, extract the sequence if its length is greater than zero, then you read the literal byte after the pointer, copy to its destination after the extracted sequence, and you're good to read the next 3-byte pointer + literal.

  • @LeethLee1
    @LeethLee1 6 years ago +1

    This was amazing!! Thank you again for sharing all this wonderful knowledge. This is just a good thing for the world, and individuals, especially as a launching-off point to learn more about certain areas in detail.

  • @styleisaweapon
    @styleisaweapon 11 years ago

    In classic LZ77 a literal always follows a [j, l] pair. Multiple literals in a row require a [j, l] pair with a length of 0. LZ78 solves the problem by pre-initializing the dictionary with every symbol in the alphabet and specifically protecting this dictionary block (so no literals are needed, only [j, l] pairs). The LZH variant is based on LZ77 rather than LZ78, and instead uses standard Huffman codes for each literal plus a unique code to indicate that a [j, l] pair is to follow.

  • @eideticex
    @eideticex 11 years ago

    Well out of the 256 possible characters in a 1 byte character, there are at most 127 that will actually appear in the document. For example one would likely not use the bell character in a text file, so that's 1 value we have that can serve as the header for a pointer.

  • @TimVerweij
    @TimVerweij 11 years ago

    If each composite pointer is 2 bytes, you can assume that the word you are pointing to has a minimum length of 3 bytes, otherwise there's no point in creating a pointer there. So the 4 bits you're using to specify the length could mean a length of 3-18 bytes (instead of 0-15 or 1-16).

  • @nex
    @nex 11 years ago

    Pretty much, except the flag bit goes either before a character or a pointer+length, the latter of which are usually longer than a byte, so it won't end up being one bit per byte.
    In the meantime I got around to looking it up: LZSS does what I just described. The original LZ77 _always_ outputs the next literal character after a pointer+length (which is a null pointer in the case of no match found), so it's always clear what's what and no flags are needed.

  • @ItsThe1300
    @ItsThe1300 11 years ago

    I love the way Professor Brailsford says "compression"!

  • @Kyuubi840
    @Kyuubi840 11 years ago

    HTML tags start with their name inside < » and end with the same name, but with an added slash:

  • @maartendj2724
    @maartendj2724 8 years ago +1

    Is it fair to say that for every combination of at least 3 characters that appears in the text at least twice, a pointer should be created? Because the pointer 'costs' 2 bytes of storage, and obviously a 3-character combination costs 3 bytes, so this should be the minimal length for which creating a pointer translates to less memory in use. So every 3-letter word like 'the' or 'and' should be pointed to as soon as it reappears, but in that case, even ' or' (note I put a space in front) should be, because a space is a character with a byte code as well.

  • @vkillion
    @vkillion 6 years ago +1

    I wonder if you could use one of these pointers to refer to a previous pointer, in the case where the referred to entry is farther than the maximum jump possible (4096 in his example), but another reference using a pointer was within range. This would in essence be recursive pointers.

  • @AlexanderBollbach
    @AlexanderBollbach 9 years ago +39

    make a video on "middle-out" compression from silicon valley lol

  • @kevnar
    @kevnar 9 years ago +14

    Wouldn't it be easier to just make a dictionary of the most commonly used words in a given language that are longer than two characters and index them with a two-byte key?

    • @JohnHollowell
      @JohnHollowell 9 years ago +16

      But then the dictionary takes up a lot of space, and most of its entries would not pertain to the given text. Also, if you look at code (say HTML), you use the "

    • @redmercer4158
      @redmercer4158 9 years ago +15

      kevnar Not really, and it wouldn't even be any more efficient because text is not necessarily simply made up of dictionary words. Like code

    • @mustangrt8866
      @mustangrt8866 9 years ago +3

      kevnar Wouldn't it be easier to reduce the dictionary to only the words used in the file?

    • @fluffycritter
      @fluffycritter 9 years ago +6

      Mustang Rt Well now you have to transmit which information you need from the dictionary first. And if your dictionary is large enough then that information takes just as much space as the dictionary itself. So now you're back to square 1.
      The way that I've approached this problem in production systems is to make a dictionary for the text, compress that dictionary with LZ, and then use the dictionary to do a dictionary compression of the text (which can then be compressed further with LZ because there'll often be patterns in phrases and so on). But it can still only do so much, and it's incredibly domain-specific (in this case I was using it to compress large single works of fiction). In the general case, straight-up LZMA still performs better.

  • @fllthdcrb
    @fllthdcrb 11 years ago

    That's a sensible possibility. In reality, you'd want to add/subtract offsets on the bitfields. Consider: is it useful to copy a 1-character string? No, because it costs 2 bytes to encode it; that's no longer compression. 2 characters is also pointless. So, why not copy strings of minimum 3 characters at a minimum offset of 3. 0 for the offset field is still reserved to signal literal 0, so subtract 2 from the offset and 3 from the length, and you gain a little range.

  • @VamPDX
    @VamPDX 11 years ago

    It's also possible to include the blanks in the compression, so short words like "a" with a blank in the front and in the back could be used to save space.

  • @nex
    @nex 11 years ago

    Yeah I kept waiting for the professor to address this issue ...
    You don't need a whole byte, however; a single bit is enough. Another variation is to use a jump distance of 0 combined with a literal character in place of the string length. In that case, 0x00 does indeed signify the start of a literal string, but it doesn't take up an extra byte.
    Over the years quite a few variations have accumulated; unfortunately I haven't got the time right now to look up how the original LZ77 does it ;\

  • @JanCRefsgaard
    @JanCRefsgaard 11 years ago

    1) probably uses 1xxxxxxx for ASCII and 10xxxxxx for UTF-8 (see my other comment)
    2) yes, it sees the documents as ones and zeros, it doesn't go word by word
    3) very good idea! I don't know if the original implementation can, but it could easily be implemented into an algorithm; your pointer could look like this:
    UTF-8: 10xxxxxx(xxxxxxxx)10xxxxxx
    ASCII: 1xxxxxxx(xxxxxxxx)1xxxxxxx

  • @luketimothy
    @luketimothy 11 years ago

    Well, when you decompress the data, you will read it as bytes... so it's not UTF-8 at all.
    Also, I believe this assumes the text is ASCII encoding... in which case you have an extra bit, as ASCII only makes use of 7 bits. This extra bit is usually a parity bit, but in this case, you can say that it denotes a character or pointer data.
    Set this extra bit to a 0 for a character, and 1 for pointer data. Of course, this means you would lose possible dictionary/string size.

  • @SnowRaptor
    @SnowRaptor 11 years ago

    Essentially compressing and decompressing speed. Lower compression decompresses faster. Depending on the application, you might prefer a file that is 20% larger but can be decoded 20% quicker.

  • @deividux12
    @deividux12 11 years ago

    Because when you "write down" the start and the length, you can even reference the middle of a word. For example, the word "press" could be referenced inside the word "compression" (comp press ion).

  • @JanCRefsgaard
@JanCRefsgaard 11 years ago

    yep, you are the second to point that flaw out :), but it's nice that the internet is full of smart people :)

  • @JanCRefsgaard
@JanCRefsgaard 11 years ago

I think it depends on the type of data. If your data is written by a human, words would probably cluster into sections such that relative pointers could capture most words with few bits.
If your data is the human genome, then absolute pointers might be best, because the patterns will be scattered all over your document.

  • @BlueRavenGT
@BlueRavenGT 11 years ago

    Even then you can't tell just by looking that two bytes is a pointer instead of two characters. The simplest solution is to just place a character or sequence of characters before the pointer that doesn't otherwise occur in the resulting bitstream. Look up "escape character" to learn more.

  • @xZise
@xZise 11 years ago

Although this will reduce the width of the pointer to 14 bits (with the ASCII method) or 12 bits (with the UTF-8 method). But you can tweak this, as you could allow all possibilities for the second byte. So for ASCII it would be 1xxxxxxx xxxxxxxx and for UTF-8 10yyyyyy yyyyyyyy. This gives you 15 bits for ASCII and 14 bits for UTF-8.

  • @solhsa
@solhsa 11 years ago

    Consider a trivial look-behind procedure; if you, at every byte, scan backwards 64k instead of 256 bytes, you get better compression but it takes a lot more time to do so. (This is a trivialized example, but you should see the point anyway..)

  • @Niosus
@Niosus 11 years ago

Because the pointer size is the limiting factor. If you have a book of many millions of characters, you'd already need well over 30 bits just to make up the pointer. Any word under 4-5 characters could not be compressed efficiently. It turns out that those short words are actually used most often and very close together.
The big point to learn from this is that no compression algorithm is perfect in every situation. It is even mathematically impossible. Yours might work better in other cases.
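Some back-of-envelope arithmetic (my numbers, matching the 30-bit figure above): a pointer only pays for itself once it replaces more characters than it costs to store.

```python
import math

def break_even_length(pointer_bits: int, literal_bits: int = 8) -> int:
    """Smallest match length (in characters) at which a pointer of
    pointer_bits is no bigger than spelling the characters out.
    Rough estimate only; real formats add flag bits per token."""
    return math.ceil(pointer_bits / literal_bits)

# A 16-bit pointer pays off at 2 characters; a 30-bit one needs 4.
print(break_even_length(16), break_even_length(30))   # -> 2 4
```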

  • @henke37
@henke37 10 years ago +1

    I am disappointed that the trick to do RLE wasn't brought up, it is a quite important trick in LZ77.
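The trick being referred to here, as I understand it: LZ77 allows the match length to exceed the backward distance, so a byte-by-byte copy overlaps its own output and repeats the last few characters, which gives run-length encoding for free. A minimal sketch:

```python
def copy_match(out: list, distance: int, length: int) -> None:
    """Copy 'length' characters starting 'distance' back, one at a time.
    When length > distance the copy reads characters it has just written,
    repeating the last 'distance' characters -- free run-length encoding."""
    start = len(out) - distance
    for k in range(length):
        out.append(out[start + k])

out = list("ab")
copy_match(out, distance=1, length=8)   # back 1, copy 8: repeats 'b'
print("".join(out))                     # -> 'a' followed by nine 'b's
```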

  • @ihrbekommtmeinenrichtigennamen
@ihrbekommtmeinenrichtigennamen 11 years ago

Escaping with a zero byte would make sense. When there is a zero in the output stream, it means the next two bytes are a pointer. And if there had been an actual zero in the input stream, there are two zeros in the output stream, which works because a backwards jump of 0 bytes doesn't make any sense. So two zeros in a row can only be an escaped zero byte from the input stream.
Although I don't know how it is actually implemented.
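One way the zero-byte escape described here could look (a sketch of the commenter's idea, not how any real format does it):

```python
def decode(data: bytes) -> bytes:
    """Toy decoder for a zero-byte escape scheme:
       0x00 0x00       -> one literal zero byte
       0x00 dist len   -> copy 'len' bytes from 'dist' bytes back
       anything else   -> one literal byte
    A jump distance of 0 is meaningless, so 0x00 0x00 is unambiguous."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != 0:             # ordinary literal byte
            out.append(data[i])
            i += 1
        elif data[i + 1] == 0:       # escaped zero byte
            out.append(0)
            i += 2
        else:                        # pointer: 0x00, distance, length
            dist, length = data[i + 1], data[i + 2]
            start = len(out) - dist
            for k in range(length):  # byte-by-byte so overlapping copies work
                out.append(out[start + k])
            i += 3
    return bytes(out)

# "ban" spelled out, then a pointer: back 2 bytes, copy 3 (overlapping copy).
print(decode(b"ban" + bytes([0, 2, 3])))   # -> b'banana'
```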

  • @fllthdcrb
@fllthdcrb 11 years ago

    The professor's example was a bit simplistic, as pedagogical examples tend to be. In reality, you'd need to further encode these backreferences. To add to what Jason Wilkins said, with ASCII you could reserve the high bit to signal a backreference. For UTF-8 and such, it would be more of a problem. In the real world, LZ77 is often combined with some other compression scheme (like Huffman, which combination looks to be covered in the future), and backreference encoding is pretty much a non-issue.

  • @griv2000
@griv2000 11 years ago

The implementation is left as an exercise for the reader, but most implementations use dictionaries and hashes to find the matches. That does not always give the best compression (because of hash collisions), but it is a lot faster than brute force.
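A rough sketch of that hash-based search (simplified; real matchers also cap how far they walk each chain): index every short prefix by its hash, then compare only against earlier positions that hashed the same, instead of scanning the whole window.

```python
from collections import defaultdict

def find_matches(data: str, min_len: int = 3):
    """For each position, look for an earlier occurrence of its first
    min_len characters via a hash table of candidate positions.
    Simplified sketch of the hash-chain approach."""
    chains = defaultdict(list)   # hash of prefix -> earlier positions
    matches = []
    for i in range(len(data) - min_len + 1):
        key = hash(data[i:i + min_len])
        for j in chains[key]:    # only candidates that hashed the same
            # Verify the match: different strings can share a hash.
            if data[j:j + min_len] == data[i:i + min_len]:
                matches.append((i, i - j, min_len))   # (pos, distance, length)
                break
        chains[key].append(i)
    return matches

print(find_matches("to be or not to be"))
```

Without the hash table, each position would have to be compared against every earlier position in the window, which is the brute force the comment mentions.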

  • @JanCRefsgaard
@JanCRefsgaard 11 years ago

The C++; I know basic C++ and program Python in my day job, but I am pretty sure that 4 videos in I will start learning new/forgotten things :)

  • @JuddMan03
@JuddMan03 11 years ago

Because maximum can mean the difference between 30 minutes and 2 hours when compressing enormous files, while the difference in output size between normal and maximum may be really small, such as 5%.

  • @luketimothy
@luketimothy 11 years ago

    In any case, this algorithm is basically what ZIP uses... so when you zip up a folder, this is (part of) what it does, and it doesn't just compress text.
    However, ZIP uses a more advanced version of this.

  • @Tasarran
@Tasarran 11 years ago

    He specifically says in the video that there would not really be enclosing brackets, that it would be a binary string.

  • @jeremyelliot4831
@jeremyelliot4831 8 years ago +11

    If the pointers are embedded in the data stream, how does the decompressor know where the pointers are? Is there a dictionary of pointer locations?

  • @Tupster
@Tupster 11 years ago

Pretty sure this code was invented to compress 7-bit ASCII; to extend it to other encodings you would have to invent some way to escape that encoding, similarly to how you can escape ASCII by setting the 8th bit.

  • @ivansanchez1988
@ivansanchez1988 10 years ago +1

Thank you, professor, great explanation... Good advice: learn your powers of 2 and 16...