Expanding the UTF-8 Character Set to Infinity

COMMENTS • 16

  • @MatheusAugustoGames 3 years ago +19

    OK, I just want to point out the genius that was the creation of UTF-8. Old software that used NUL-terminated strings would treat a byte with all 8 bits set to 0 as the end of the string. The UTF-8 bit pattern guarantees that this will never happen accidentally.
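
The claim above can be checked by brute force. The following is a minimal Python sketch, not anything from the video; the helper name `encodings_with_zero_byte` is made up for illustration. It encodes every Unicode code point and collects the ones whose UTF-8 form contains a zero byte.

```python
def encodings_with_zero_byte(code_points):
    """Return the code points whose UTF-8 encoding contains a 0x00 byte."""
    offenders = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if 0 in chr(cp).encode("utf-8"):
            offenders.append(cp)
    return offenders


if __name__ == "__main__":
    # Checks the whole Unicode range; takes a moment to run.
    print(encodings_with_zero_byte(range(0x110000)))
```

The only code point that should show up is U+0000 itself, which is the NUL character and therefore not an accident.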

  • @ybungalobill 2 years ago +29

    The proposed scheme breaks another genius property of UTF-8: it's self-synchronizing. You can always determine whether a byte is the beginning of a character just by looking at it. This is crucial not only for iterating back and forth through the string, but also for being able to search for substrings using a simple strstr. You can fix your scheme by putting those ones into the x's of 10xxxxxx bytes. E.g.:
    11111111 10111111 10111111 10111111 10110xxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx ...
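
To make the self-synchronizing property concrete, here is a small Python sketch for standard UTF-8 (not for the extended scheme). The function names are hypothetical. A byte of the form 10xxxxxx is always a continuation byte, so stepping backwards until a byte is not of that form lands on the start of the character.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx."""
    return byte & 0xC0 == 0x80


def char_start(data, index):
    """Return the index of the first byte of the character covering `index`."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


if __name__ == "__main__":
    data = "aé€😀".encode("utf-8")  # 1 + 2 + 3 + 4 bytes
    print([char_start(data, i) for i in range(len(data))])
    # A plain byte-wise search can only match at a character boundary.
    print(data.find("€".encode("utf-8")))
```

The sketch assumes `data` is valid UTF-8; it does no validation.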

  • @lelouchvibritannia69yearsa78 2 years ago +12

    The beginning of a Legendary Game Developer's journey!

  • @sarahdehart1027 5 years ago +18

    Lol! That ending was epic! Loved it!

  • @PC_YouTube_Channel 2 years ago

    lmao amazing ending. your channel really gives off some Tom 7 vibes.

  • @luca__3044 2 years ago +1

    Can't wait to express my feelings in a 420-bit alien language!

  • @halftwins 2 years ago +3

    I see a couple of problems with this, mainly that there is no way to tell whether a byte has just started a character or is preceded by 11111111. Maybe there's something I'm not noticing, but it seems like for it to really last forever an ending sequence of some kind would be needed(?) Anyway, the video was great and early congrats on 1k!

    • @Magnogen 2 years ago +3

      That's a good point, I was half expecting him to say that if the byte started with 0, then _that_ would be the terminating byte. Something like
      1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx
      would then be the corresponding utf-infinity code, and ascii would be the base case of just 0xxxxxxx. Backwards compatibility and all.
      But hey, that's just a thought. I'm not sure how feasible it'd be in practice, as I don't tend to work with memory allocation, but I'd like to know how well it'd work/if it'd work at all.
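
For what it's worth, the idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 7-bit groups are written most significant first, the function names are invented, and truncated input is not handled gracefully (it raises `IndexError`). Every byte except the last has its top bit set, and a top bit of 0 marks the final byte, so plain ASCII is the one-byte base case.

```python
def encode_vlq(value):
    """Encode a non-negative int as 1xxxxxxx ... 1xxxxxxx 0xxxxxxx."""
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = [value & 0x7F]  # final byte, top bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes, top bit set
        value >>= 7
    return bytes(reversed(groups))


def decode_vlq(data, pos=0):
    """Decode one value starting at `pos`; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # top bit clear: this was the last byte
            return value, pos


if __name__ == "__main__":
    print(encode_vlq(ord("A")))
    print(encode_vlq(0x1F600).hex())
    print(decode_vlq(encode_vlq(2**420)))
```

Because Python integers are unbounded, the sketch has no upper limit on the value. The trade-off is that a byte in the middle of a character looks the same as a first byte, so the encoding is not self-synchronizing the way UTF-8 is.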

    • @BGBTech 2 years ago

      @Magnogen That scheme is actually used for encoding numbers in some file formats.
      One other scheme I have used in some of my formats is:
      0xxxxxxx (0-127), 10xxxxxx-xxxxxxxx (0-16K), 110xxxxx-xxxxxxxx-xxxxxxxx (0-2M), ...
      A lot depends on what properties one wants. There are also various ways these schemes can be extended for signed numbers, to encode variable-length floating point values, ...
      OTOH, while UTF-8 doesn't have the most efficient representation, it does allow re-synchronizing. In a few odd cases non-standard encodings are also possible: for example, I have used "transposed UTF-8" values in string tables to encode string length prefixes. It is possible to unambiguously distinguish normal from transposed encodings, and in some cases it is preferable to be able to encode an explicit string length rather than count characters until the NUL byte.
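
A rough Python sketch of the prefix-length scheme listed above, covering only the three forms that are spelled out (7, 14 and 21 payload bits). The byte order within a value (most significant first) and the function names are assumptions made for illustration; longer forms are left out rather than guessed.

```python
def encode_prefixed(value):
    """Encode using 0xxxxxxx, 10xxxxxx+1 byte, or 110xxxxx+2 bytes."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 1 << 7:
        return bytes([value])
    if value < 1 << 14:
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value < 1 << 21:
        return bytes([0xC0 | (value >> 16), (value >> 8) & 0xFF, value & 0xFF])
    raise ValueError("longer forms are not covered by this sketch")


def decode_prefixed(data, pos=0):
    """Decode one value starting at `pos`; return (value, next_pos)."""
    first = data[pos]
    if first & 0x80 == 0x00:  # 0xxxxxxx
        return first, pos + 1
    if first & 0xC0 == 0x80:  # 10xxxxxx xxxxxxxx
        return ((first & 0x3F) << 8) | data[pos + 1], pos + 2
    if first & 0xE0 == 0xC0:  # 110xxxxx xxxxxxxx xxxxxxxx
        value = ((first & 0x1F) << 16) | (data[pos + 1] << 8) | data[pos + 2]
        return value, pos + 3
    raise ValueError("longer forms are not covered by this sketch")


if __name__ == "__main__":
    for n in (127, 128, 16383, 16384, (1 << 21) - 1):
        print(n, encode_prefixed(n).hex())
```

Unlike UTF-8, the trailing bytes carry a full 8 bits of payload, which is denser but means a trailing byte cannot be told apart from a leading one.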

  • @sullivanbarnett6904 5 years ago +1

    Thank you jacob!

  • @TimJSwan 2 years ago +1

    lol 256 bits enough? more than all the Planck lengths in the universe represented...

  • @bored_person 2 years ago +1

    Patents expire after 20 years.

  • @robloxxer593 2 years ago

    Wait, why tf are they adding four entire 1s? Two characters already had 4 combinations, and wouldn't you know when it ends from the bits that told you how long it is? What's the point of the bits at the front of the byte?

    • @decare696 2 years ago +3

      it's so that a byte that's in the middle of some character can't be mistaken for a correct ascii byte by old or bad/lazy software
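
A quick way to see this is to list the ASCII-range bytes (below 0x80) that show up in an encoded string. The sketch below is illustrative only; the helper name is invented, and UTF-16 is brought in purely as a contrast because it has no such marker bits.

```python
def ascii_bytes_inside(text, encoding):
    """Return the bytes below 0x80 found in the encoded form of `text`."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


if __name__ == "__main__":
    # U+0141 (Ł) is not an ASCII character.
    print(ascii_bytes_inside("Ł", "utf-8"))      # every byte has the top bit set
    print(ascii_bytes_inside("Ł", "utf-16-le"))  # contains the byte for "A"
```

In UTF-8 the marker bits keep every byte of a multi-byte character at 0x80 or above, so software that only understands ASCII never sees a stray letter, slash or quote in the middle of one.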

    • @robloxxer593 2 years ago

      @decare696 stupid lazy old software