Code Pages and Kohuepts: The Chaos of 8 Bit Extended ASCII

  • Published 15 Jan 2025

COMMENTS • 165

  • @mrmimeisfunny
    @mrmimeisfunny 5 months ago +50

    Neat fact about the keyboard layout thing. In the 2002 movie "The Bourne Identity" the protagonist assumes the fake identity of a Russian citizen named "Foma Kiniaev". He gets a fake Russian passport, but the Cyrillic in it says "Ащьф Лштшфум". The prop department just set their keyboard to Russian and typed "Foma Kiniaev" as if it were a QWERTY keyboard.
    Turns out it was actually quite realistic. A few years back a guy tried to present a fake Israeli passport in Barbados under the name "Assulin Hormoz", but instead of "Hormoz" his surname in Hebrew in the passport was also typed as if it were Latin, so it became "יםרצםז", which was further mangled by being rendered backwards as "זםצרםי" (bidirectional text is something you haven't covered and it's a whole other can of worms). There were also several other Hebrew mistakes in the passport, such as text rendered upside down or similar-looking letters being mixed up.
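
The keyboard mix-up described above can be reproduced with a simple character table. This is only a rough sketch: it assumes the standard Russian ЙЦУКЕН layout laid over a US QWERTY keyboard, maps only the main letter keys, and the helper names are made up for illustration.

```python
# Hypothetical helper: simulate typing Latin text while the Russian
# (ЙЦУКЕН) layout is active. Only the main letter keys are mapped.
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
JCUKEN = "йцукенгшщзхъфывапролджэячсмитьбю"


def build_table(src: str, dst: str) -> dict:
    """Map each key position in one layout to the same position in the other."""
    table = {}
    for s, d in zip(src, dst):
        table[ord(s)] = d
        if s.isalpha():
            # A shifted letter key yields the upper-case Cyrillic letter.
            table[ord(s.upper())] = d.upper()
    return table


TO_RUSSIAN = build_table(QWERTY, JCUKEN)


def typed_on_russian_layout(text: str) -> str:
    return text.translate(TO_RUSSIAN)


print(typed_on_russian_layout("Foma Kiniaev"))  # Ащьф Лштшфум
```

Swapping the two layout strings gives the reverse table, which is presumably close to what a search engine needs in order to guess what was meant.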

  • @vektracaslermd743
    @vektracaslermd743 5 months ago +19

    Dylan is easily one of the best presenters I've ever seen. Fantastic work.

  • @TeVolt805
    @TeVolt805 5 months ago +39

    Excellent. Can't wait to see what you say about UTF-8.

    • @pleappleappleap
      @pleappleappleap 5 months ago +6

      Or UTF-7 even.

    • @thetj8243
      @thetj8243 5 months ago +3

      There is an excellent talk by Dylan about "plain text" which, as he says in this video, is the basis for it. You can find a recording of the talk on YouTube.

    • @mrmimeisfunny
      @mrmimeisfunny 5 months ago +4

      Probably something about having Chinese in the event logs.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago +1

      @@pleappleappleap What about UTF-9 and UTF-18? :) :)

    • @enterrr
      @enterrr 5 months ago +2

      UTF-7 is a crime against humanity!
      And so are UTF-16 and UTF-32, while we are at it
      UTF-8 FTW!

  • @clasqm
    @clasqm 5 months ago +16

    This brought back memories of typing Romanized Sanskrit letters into my dissertation. I had WordPerfect macros typing the letter, then backspacing one position and typing the diacritical mark.

  • @MusicEngineeer
    @MusicEngineeer 5 months ago +23

    These little anecdotes from the world of computer history are super entertaining!

  • @GeraldWaters
    @GeraldWaters 5 months ago +3

    In the early 1980s, during one of my unemployment phases, I read various books in the nearby university science library - thus I happened to read about the committee processes for establishing ASCII. No idea now what the exact book was. Some of the contentions were whether and what to have for things like logical NOT and OR (as AND was obviously covered by &). I have a more vague memory that collation order was also much debated. Like some other commenters here, I've also written about character encoding history and issues, see my bio for links.

  • @Cmanorange
    @Cmanorange 5 months ago +43

    that google tidbit is hilarious 😂

    • @realGBx64
      @realGBx64 5 months ago +3

      Same thing works in Korean, too.
      The funny thing was when they used this strategy in the first Bourne movie to write the main character’s name in Cyrillic lol

  • @WilliamHostman
    @WilliamHostman 5 months ago +6

    The octothorpe (#) was used in the late 19th and early 20th C for the pound avoirdupois (weight) in the US, especially for pounds of goods sold by the pound... So while it may not have been a pound sign in the UK, in the US, as a postfix, it indicated weight pounds (not to be confused with pounds force, pounds thrust aka poundals, pounds mass, nor pounds sterling aka £), and when prefixed, it starts a numeric sequence.
    I encountered this use a lot in late 19th and early 20th C US federal records from the then Territory of Alaska (now a US state) and Hawaii (also now a US state), especially for the pounds of supplies ordered and delivered. It is still used in the US to indicate either numericity of the following characters, or a weight in pounds of the preceding digits.
    As for Cyrillic, it is used in Alaska for Russian, and some dialects of Yupik and Inungan... (most Alaska Natives have now switched to using accented Latin...).

    • @BuildWall
      @BuildWall 4 months ago +1

      You'll still see it occasionally to this day at smaller retailers like farmers markets etc.

    • @Roxor128
      @Roxor128 4 months ago

      That whole "not to be confused with" section had me wincing and oh, so very glad Australia finished switching to metric before I started school!

    • @BuildWall
      @BuildWall 4 months ago

      @@Roxor128 so very glad the US never switched to that backward system

  • @eliavrad2845
    @eliavrad2845 5 months ago +7

    It's not just forgetting to change keyboards: sometimes it doesn't switch, sometimes you try to switch but it was already on the right language, sometimes the operating system gives you an extra keyboard or two for fun...

  • @mrJety89
    @mrJety89 5 months ago +14

    Well, you've been ASCIIing for it

    • @edgeeffect
      @edgeeffect 5 months ago

      Uuuugh! Dad Joke! :)

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago

      ASCII kind of sounds like someone sneezing :D

    • @TheUtuber999
      @TheUtuber999 5 months ago +2

      That joke is as bad ASCIIt gets.

  • @euromicelli5970
    @euromicelli5970 5 months ago +12

    I had never encountered “Kohuept” until I heard of it in Tom Scott’s “Lateral”. Now I can’t _unsee it_ and it seems to pop up somewhere at least once a month.

  • @edwardallenthree
    @edwardallenthree 4 months ago +1

    The Russian postal worker who translated that code page mistake was doing the Lord's work.

  • @bauckrob
    @bauckrob 5 months ago +7

    There was also ISO 646, which to us Norwegians meant that we could find words like bl}b{rsyltet|y.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago

      ISO 646 was super common in all "West bloc" European countries that had additional characters.
      R{ksm|rg}s!
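
For readers who never met ISO 646, the effect in the comments above can be sketched as a six-character substitution. The tables below are written from memory of the Norwegian (NS 4551-1) and Swedish (SEN 85 02 00) national variants and cover only the letter positions, so treat them as an illustration, not a complete codec.

```python
# National ISO 646 variants reuse seven-bit codes that US-ASCII assigns to
# brackets, backslash, braces and the vertical bar. Only those six codes
# are mapped here; the remaining national differences are ignored.
NORWEGIAN = str.maketrans("[\\]{|}", "ÆØÅæøå")
SWEDISH = str.maketrans("[\\]{|}", "ÄÖÅäöå")


def read_as(text: str, table: dict) -> str:
    """Show ASCII-looking text the way a national-variant terminal would."""
    return text.translate(table)


print(read_as("bl}b{rsyltet|y", NORWEGIAN))  # blåbærsyltetøy
print(read_as("R{ksm|rg}s", SWEDISH))        # Räksmörgås
```

The same table explains the reverse annoyance: C source code shown on such a terminal has its braces and brackets replaced by accented letters.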

  • @MeriaDuck
    @MeriaDuck 5 months ago +11

    Of all your stories, that Harry Potter one is one of the very best 😂

    • @edgeeffect
      @edgeeffect 5 months ago +2

      I used to have a whole collection of pictures of "mojibake" and that one was always my favourite.

  • @DragoniteSpam
    @DragoniteSpam 5 months ago +5

    Growing up in Zimbabwe was not a piece of Dylan Beattie Lore I expected to learn today

  • @edgeeffect
    @edgeeffect 5 months ago +4

    That WordStar screenshot is such a goldmine of nostalgia, I used a lot of different CP/M and DOS machines back in the olden days and they all had their differences and "killer apps"... but WordStar was the ONE constant. At college we had realised that the CP/M text editor was the ninth circle of hell and some bright spark realised you could use WordStar in "non document mode" as quite a decent text editor and so, until Microsoft put a cut down version of QBASIC in DOS-5 and called it `EDIT`, WordStar followed me around for many years.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago +1

      I never used Wordstar but Turbo Pascal and the other Turbo products used the same commands. In particular the old Turbo Pascal 3 really required that you knew the Wordstar commands. The end result is that they stuck in my brain and for casual text mode editing in *ix systems I use JOE, which uses Wordstar editing commands.

    • @edgeeffect
      @edgeeffect 5 months ago

      @@Thesecret101-te1lm yeah... Turbo C++ and Delphi had lovely editors.

    • @nickwallette6201
      @nickwallette6201 5 months ago +2

      IIRC, Edit wasn't a cut-down QBASIC, because QBASIC didn't exist until MS-DOS 5 either. The Edit executable required QBASIC because the latter actually contained the code for the text editing functionality, and EDIT.COM was just a stub that launched it in text-editing mode. This changed with the release of Win9x, where, I guess, they decided the extra few dozen KB didn't matter anymore, and having a BASIC interpreter wasn't high on the priority list either.

  • @ArduinoRR
    @ArduinoRR 5 months ago +3

    Lovely historical info trove. In the 1960s I grew up on Dartmouth Timesharing Basic on a TeleType ASR 33, so I got to know 6-bit ASCII pretty well. Fast forward to 2004, trying to maintain a Spanish website on JDK 1.4, which didn't support UTF-8 in property files. Had to copy and paste UTF-8 from Word documents into an app that converted UTF-8 to Unicode backslash escape characters. You've nicely covered quite an historical odyssey from Baudot to ASCII to EBCDIC to Code Pages and finally Unicode. Thank you, sir!
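
The conversion step mentioned above (turning non-ASCII text into backslash escapes that an ASCII-only properties file can carry) might look roughly like this. The function name is hypothetical, and the sketch only assumes that the escapes are `\uXXXX` sequences of UTF-16 code units; it does not reproduce any particular tool.

```python
def to_property_escapes(text: str) -> str:
    """Replace every non-ASCII character with \\uXXXX escapes (UTF-16 units)."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        # Characters outside the Basic Multilingual Plane need two
        # escapes, one per UTF-16 surrogate.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


print(to_property_escapes("greeting=¡Buenos días, señor!"))
```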

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago

      6-bit???

    • @ArduinoRR
      @ArduinoRR 5 months ago

      @@Thesecret101-te1lm Actually, yes. The TeleType ASR-33 didn't print the lowercase letters. I also programmed the 12-bit PDP-8, which packed two 6-bit characters in a word. Seems strange now that you mention it.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago

      @@ArduinoRR Interesting that it counts as 6-bit, as the teletype needed 7 bits to handle control characters.

  • @Wishbone1977
    @Wishbone1977 5 months ago +7

    Ah, encodings... I work in integration, and let me tell you, the chaos is still with us to this day. I have written lengthy articles about various encoding problems, but here I will just touch on a single issue, the "MS Office character replacement problem".
    First, a bit of history. ISO-8859-1 is a single-byte text encoding scheme which extends the 7-bit ASCII character set with most of the characters used in western European languages. When Microsoft made Windows 1.0, they decided to copy this encoding, but rename it Windows-1252. Ever since then, this has been the default encoding on most Windows machines in the western world. Since Windows was for many years only a workstation OS (there was no server version of Windows initially), this led to a situation where a lot of text data was being produced on Windows machines but would eventually have to be processed by other operating systems. Since Windows-1252 was not an official international standard encoding, other operating systems did not have support for it. However, since Windows-1252 was initially identical to ISO-8859-1 which other operating systems _did_ support, it became common for data written in Windows-1252 to be marked as ISO-8859-1. This allowed other OS's to read Windows-1252 data with no problems, and it seemed like a good idea at the time...
    Now, ISO-8859-1 has a gap in its printable character definitions. The byte values 7F-9F (33 characters in all) are undefined. When Microsoft developed Windows 2.0, someone had the thought that it would be great to have a few more characters available, and wouldn't you know it, here were a bunch of character codes that weren't used for anything. So they added a few more character definitions in the space unused by ISO-8859-1. Then they did it again for Windows 3.1 and a final time for Windows 98, so that today all but 5 of the original 33 unused byte values in ISO-8859-1 have character definitions in Windows-1252. As a result, text data written in Windows-1252 can now potentially contain quite a lot of byte values which are undefined in ISO-8859-1. So what are these extra characters? Well, they are mostly typographical characters, by which I mean characters meant to make text "prettier" than the standard characters in ISO-8859-1 allow for. These are things like left and right hand versions of both single and double quotes, two dash characters of different lengths, a bullet point and an ellipsis (three dots). Recall that Windows-1252 encoded data has historically often been intentionally mislabeled as being ISO-8859-1 data, and we begin to see how this could potentially lead to problems.
    Then one glorious day, someone at Microsoft had the brilliant idea of helping end users write prettier text. How, you ask? By having all the MS Office programs (Word, Excel, Outlook, etc.) automatically replace some of the characters the users were typing _as they typed them_ with the "prettier" versions added to Windows-1252. And not as a function people had to switch on if they decided they _wanted_ this to happen to their text, no they did it as a function which was switched on by default when Office was installed and you had to manually find the setting and switch it off if you _didn't_ want it. Unsurprisingly, this aggravated the problem enormously, since so much text data was produced using MS Office. Instead of there being a mere _possibility_ that Windows-1252 encoded data might be decoded as ISO-8859-1 _and_ might contain characters not present in that encoding, it now became _highly probable_ that this would happen. And it did. A lot. And still does. All the time. And then I'm the one who has to fix it 😞
    There is _a lot_ more I could say on this subject, but I think this is probably enough for a YouTube comment 😀
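
To make the mislabeling problem concrete, here is a small sketch that decodes the same bytes under both labels using Python's built-in `cp1252` and `latin-1` codecs. The sample bytes are invented for the example; the list of undefined positions reflects how Python's `cp1252` codec behaves, which may differ from other implementations.

```python
# Bytes as Office-style "pretty" text would be stored in Windows-1252:
# curly quotes (0x93, 0x94), an en dash (0x96) and an ellipsis (0x85).
data = b"\x93smart quotes\x94 \x96 and an ellipsis\x85"

as_cp1252 = data.decode("cp1252")    # what the author meant
as_latin1 = data.decode("latin-1")   # what a strict ISO-8859-1 reader sees

# Which bytes in 0x80-0x9F have no character in Python's cp1252 codec?
undefined = []
for byte in range(0x80, 0xA0):
    try:
        bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(byte)

print(as_cp1252)                      # “smart quotes” – and an ellipsis…
print(repr(as_latin1))                # the same positions become C1 control codes
print([hex(b) for b in undefined])    # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Decoding as `latin-1` never fails, which is exactly why the damage tends to go unnoticed until the text is displayed or re-encoded somewhere else.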

    • @nickwallette6201
      @nickwallette6201 5 months ago +1

      You touched on this, but it bears mentioning explicitly: Outlook, at least for a long time, uses/used the Word engine for the email editor. (Maybe it still does? I dunno, I use a Mac for my day-to-day business needs. While Office technically exists on Mac, it's basically "the version of Office that our intern wrote as a summer project" with enormous chunks missing. Same price though. So that's fun. Anyway...)
      Working in a technical field, I cannot begin to count the number of times someone would send code or configuration snippets that had been through the "pretty text" filter. Naturally, C compilers, bash script interpreters, and network appliance configuration parsers have absolutely no idea what to do with complementary opening and closing quote marks, or command line switches with em-dashes, or passwords with at symbols converted to email addresses. On that topic, why is it that every Office application copies email addresses with a mailto prefix, despite no Office application being smart enough to remove the mailto prefix when you paste it somewhere an email address is expected?
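
A defensive cleanup for pasted snippets could look something like the sketch below. The replacement table is an assumption about what the autocorrect most likely started from (for instance that an em dash was originally a typed `--`); it cannot recover the original text with certainty.

```python
# Hypothetical cleanup table: typographic characters back to plain ASCII.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash, assumed to stand for one hyphen
    "\u2014": "--",   # em dash, assumed to come from a typed "--"
    "\u2026": "...",  # ellipsis
})


def plain(text: str) -> str:
    return text.translate(SMART_TO_PLAIN)


print(plain("grep \u2014color \u201cfoo\u201d access.log"))  # grep --color "foo" access.log
```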

    • @pjl22222
      @pjl22222 5 months ago

      Another thing you could say about it is why those characters were unassigned. They were unassigned because they were the same as the control codes but with the high bit set. So if your text goes through a seven bit only system that strips the high bit you now might have a bunch of random control codes in your text instead of just getting the wrong characters like what would happen with the assigned code points.
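
The high-bit point is easy to check with a few bytes. This sketch simulates a seven-bit channel by masking each byte with `0x7F`; the sample values are chosen for illustration.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a seven-bit-only system that drops the top bit of each byte."""
    return bytes(b & 0x7F for b in data)


# An assigned Latin-1 letter degrades into a wrong but printable character.
print(strip_high_bit("café".encode("latin-1")))   # b'cafi'

# A byte from the 0x80-0x9F block lands in the ASCII control range instead:
# 0x93 becomes 0x13 (DC3, also known as XOFF).
print(strip_high_bit(b"\x93"))                    # b'\x13'
```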

    • @Wishbone1977
      @Wishbone1977 5 months ago

      @@nickwallette6201 Yes, the automatic character replacement functionality of Office has wreaked havoc in many different contexts over the years.
      Interestingly, in order to combat this specific issue, the official HTML5 specification explicitly calls for all pages that state they use ISO-8859-1 to be decoded using Windows-1252. As such, for internet browsers the problem has now been permanently "fixed".
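
The browser behaviour described above amounts to a label override before decoding. A minimal sketch, with only a handful of labels listed and a hypothetical function name; real browsers follow a much longer label table and also handle the five unassigned bytes differently from Python's codec.

```python
# A few declared labels that get treated as Windows-1252 (partial list,
# written for illustration only).
LABEL_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
    "us-ascii": "cp1252",
    "ascii": "cp1252",
}


def decode_like_a_browser(data: bytes, declared_label: str) -> str:
    label = declared_label.strip().lower()
    codec = LABEL_OVERRIDES.get(label, label)
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return data.decode(codec, errors="replace")


print(decode_like_a_browser(b"\x93hi\x94", "ISO-8859-1"))  # “hi”
```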

    • @chachachi-hh1ks
      @chachachi-hh1ks 1 month ago

      OP, where can I read your articles?

    • @Wishbone1977
      @Wishbone1977 1 month ago

      @@chachachi-hh1ks Nowhere at the moment, I'm afraid. They used to be on my company's web page, but are unfortunately no longer available there. I have considered starting a blog, but I have no experience with that sort of thing so I don't really know how to get it going.

  • @kupferdrachevideosfurdich8733
    @kupferdrachevideosfurdich8733 5 months ago +6

    It is nearly as hilarious as working with dates and timestamps.

    • @realGBx64
      @realGBx64 5 months ago +3

      And mixing dot and comma as the decimal separator in the two languages you usually use.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago +1

      Time zones.
      Every time some software doesn't handle time zones correctly, I as a European feel some schadenfreude as Americans feel some pain that Europeans don't, in contrast to every piece of American software that has trouble with international characters :)

  • @Merrinen
    @Merrinen 5 months ago +3

    I'm just happy I never have to do a latin-1 to utf-8 database conversion again.
    I'm also happy I never have to fix utf-8 stored with latin-1 connection to utf-8 stored with utf-8 connection again.

    • @edwardallenthree
      @edwardallenthree 4 months ago

      I still find problems in an old database. Every time it gets backed up and restored, I'll find a new place where the Unicode has finally devolved into an error. I think my record was 12 characters between the "n" and the "t" in a "don't."
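
One plausible mechanism for that kind of growth (an assumption, since the database in question is not described) is UTF-8 text being read back as Latin-1 on every round trip. The sketch below shows a single typographic apostrophe turning into 12 characters after three such rounds, and how the damage can be undone as long as no bytes were lost.

```python
def mangle(text: str) -> str:
    """Write as UTF-8, then read the bytes back as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo one round of mangle()."""
    return text.encode("latin-1").decode("utf-8")


word = "don\u2019t"          # "don’t" with a typographic apostrophe
damaged = word
for _ in range(3):
    damaged = mangle(damaged)

# The apostrophe grows 1 -> 3 -> 6 -> 12 characters.
print(len(damaged) - len("dont"))  # 12

restored = damaged
for _ in range(3):
    restored = repair(restored)
print(restored == word)            # True
```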

  • @cdreesbach
    @cdreesbach 5 months ago +9

    Man, I do _NOT_ miss the old codepage mess AT ALL! Thx for a great trip down memory lane. ;]
    Also, did not know about kohuept - neat! 😂

  • @pebbleschan6085
    @pebbleschan6085 5 months ago +3

    Wordstar also worked with non-document text files without affecting the MSBit. It was used often for source code.

    • @edgeeffect
      @edgeeffect 5 months ago +1

      Yay... a fellow "non doc mode" aficionado!

  • @JanMichalSzulew
    @JanMichalSzulew 5 months ago +10

    9:07 you threw in an extra "T" between "N" (Н) and "Ts" (Ц) that isn't there

    • @DylanBeattie
      @DylanBeattie 5 months ago +4

      D'oh. This is what happens when you're concentrating so hard on pronunciation your brain throws in extra letters which aren't there. My bad.

  • @ThomasKnott
    @ThomasKnott 5 months ago +9

    And still today in Germany many systems tell you to not use Umlauts (ä, ö, ü) in your username or even your password. Even more weird when they also require the password to contain special characters

    • @martinba9629
      @martinba9629 5 months ago +2

      And they're right. On Windows you still end up fighting Win1252 / UTF-8 mismatches fairly often.

    • @Pystro
      @Pystro 5 months ago +1

      Special characters in passwords are just there so people don't use "password1", "pa$$word1" or other similar things that are essentially a single word.
      The more annoying part about that is not that it forces me to put a symbol into passphrases that are already secure enough without symbols, but that it still doesn't force people to use secure passwords.
      And forcing actually secure passwords wouldn't really be that difficult. Your browser would need access to a dictionary sorted by word "frequency" (or really probability of being in a password) for every language you might type a password in, plus a globally valid character replacement dictionary. And then any password prompt in the browser would just display a prompt "if you are SETTING a password click here" that then sums up how many bits of entropy are in the password.
      And finally, avoiding codepage-dependent or keyboard-layout-dependent symbols is also useful for when you have to log into things from an internet cafe in a foreign country, or from the computer of a host company on a business trip.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago +2

      In Sweden it's super common to encounter character encoding problems with everything that has anything remotely to do with embedded systems, label printers and whatnot.
      Receipts even at large chains like hamburger chains and whatnot commonly have incorrect encoding...

    • @johnrehwinkel7241
      @johnrehwinkel7241 5 months ago +2

      I've run into that "special characters, but not TOO special" more than once. Often it isn't even documented. What's wrong with “⁋assword”?

  • @YoutubeBorkedMyOldHandle_why
    @YoutubeBorkedMyOldHandle_why 5 months ago

    This is great. I've been programming computers since the 1970's. Looks like after a few more of these videos, I might finally start to understand some of this stuff.

  • @Kobold666
    @Kobold666 5 months ago +6

    It still is a problem with compilers and editors if you use non-ASCII characters (like ä, ö, ü, ß, copyright, trademark etc.) in string constants (or even comments). The editor might automatically switch to UTF-8 (Notepad++ does that), which the compiler takes for standard ASCII and chokes. Usually you get garbage at some point. I got used to embedding such characters as hexadecimal escape codes to avoid that pitfall.

    • @rogo7330
      @rogo7330 5 months ago

      I believe most parsers today are capable of reading UTF-8 automatically, since UTF-8 just looks like ASCII plus bytes with values 128 to 255. You just search for your special ASCII characters and treat all other characters as "alphabet" (even if they are invalid UTF-8, since it does not matter and it is your fault).
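
The property the reply relies on is that every byte of a multi-byte UTF-8 sequence has its high bit set, so a parser that only looks for ASCII delimiters never splits a character in half. A small sketch with made-up sample text, which also shows the escape-code workaround from the parent comment:

```python
def multibyte_bytes_are_high(text: str) -> bool:
    """True if every byte of every non-ASCII character is >= 0x80 in UTF-8."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )


line = 'title="Größe"; note="naïve ©"'.encode("utf-8")

# Splitting the raw bytes on an ASCII delimiter is safe: each piece is
# still valid UTF-8 on its own.
parts = [piece.strip().decode("utf-8") for piece in line.split(b";")]
print(parts)

# Escape codes keep the source file pure ASCII while producing the same string.
print("Gr\u00f6\u00dfe" == "Größe")  # True
```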

  • @sponge1234ify
    @sponge1234ify 5 months ago +3

    On the keyboard trickery, the video game _Library of Ruina_ has a mid-to-late-game boss whose name, in the original Korean, is a jumbled mess of Latin characters. And in English, their name is a jumbled mess of Korean letters. As you can guess, their name is concealed using the "type in the wrong keyboard" method, with the other translations, Chinese and Japanese, _also_ using the same method in their own keyboards, so that their name is gibberish-but-"typable" in _all_ languages.

    • @realGBx64
      @realGBx64 5 months ago +1

      This is the coolest thing I heard in the last 20 minutes!

  • @pleappleappleap
    @pleappleappleap 5 months ago +31

    "Kohuept" reminds me of people calling "Moscow" as "Mockba".

    • @mrJety89
      @mrJety89 5 months ago +3

      Moszkva /hungarian/

    • @musiqtee
      @musiqtee 5 months ago +3

      Moskva /Norwegian/

    • @pleappleappleap
      @pleappleappleap 5 months ago +3

      @@mrJety89 Yes. The point being that the Cyrillic letter that looks like "C" sounds like "S", and the letter that looks like "B" sounds like "V".

    • @sponge1234ify
      @sponge1234ify 5 months ago +7

      The joke I've heard is Americans and Brits wondering why there are so many PECTOPAHs around.

    • @musiqtee
      @musiqtee 5 months ago +1

      @@sponge1234ify Suddenly hungry, wonder why…🤓
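
The "Kohuept" effect in this thread is purely visual: Cyrillic capitals being read as the Latin letters they resemble. A rough sketch of that misreading, with a lookalike table that is a subjective selection and not any standard mapping:

```python
# Cyrillic capitals and the Latin letters they are commonly mistaken for.
# The pairing (especially Ц read as U) is a judgment call for illustration.
LOOKALIKES = str.maketrans("АВЕКМНОРСТУХЦ", "ABEKMHOPCTYXU")


def misread_as_latin(cyrillic: str) -> str:
    return cyrillic.translate(LOOKALIKES)


for word in ("КОНЦЕРТ", "МОСКВА", "РЕСТОРАН"):
    print(word, "->", misread_as_latin(word))
# КОНЦЕРТ -> KOHUEPT
# МОСКВА -> MOCKBA
# РЕСТОРАН -> PECTOPAH
```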

  • @DrCoomerHvH
    @DrCoomerHvH 5 months ago +3

    I love these miniature bites of your talks

  • @Posiman
    @Posiman 4 months ago

    When my father studied chemistry in 1970s Czechoslovakia he had an assignment he did not know how to do. The teacher told him to look up a book by American physicist Walter D. Knight. My father returned desperate that no library in the whole city has that book.
    "Oh, you were looking under K?"
    The book was not translated to Czech, only to Russian (which every student spoke) and by Russian standards they transliterated the name phonetically as "Найт," therefore every library sorted it according to Czech transliteration under N as in "Najt"

  • @AxlefublrMain
    @AxlefublrMain 4 months ago

    I'm russian, and am really happy at you actually pronouncing russian correctly! incredibly rare :D

  • @roeniss
    @roeniss 4 months ago

    I love this Harry Potter story so much.

  • @TakeTheRedPill_Now
    @TakeTheRedPill_Now 4 months ago

    Superb! Thank you.

  • @imarioable
    @imarioable 5 months ago +2

    Looking forward to your Unicode series now! 😅

  • @lareolanKFP
    @lareolanKFP 3 months ago

    This was amazingly educational. So much I just accepted for granted but didn't really understand the reasons behind.

  • @Dominik-K
    @Dominik-K 5 months ago +1

    That Google and Taylor Swift tidbit had me 😂 so much, it just makes so much sense haha

  • @DanielHauser
    @DanielHauser 5 months ago

    At work we're fighting with an old system that has its own bespoke character encoding. It encodes various other charsets, such as one of the ISO-8859 subsets or corresponding Windows codepages. It even has specific codes for various text formatting operations, like bold, italic, underline and even blinking. But that's all single byte. The encoding also supports Asian languages (Japanese, Chinese, Korean) in half and full width. Those are encoded in multiple bytes, but sadly the charsets they are based on are not documented. Since this encoding is proprietary and there are no libraries to tame it, good luck converting it to UTF-8 and back.

  • @FredrikHistherRasch
    @FredrikHistherRasch 4 months ago

    Just FYI, for Norwegian and Danish IBM 437 was actually sufficient. æåÆÅ are present in IBM 437, and the Greek phi was printed as a circle with a stroke through it, making it very similar to øØ
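
The claim above can be checked against Python's built-in `cp437` and `cp865` codecs. This is only a quick probe of the codec tables; how a given printer or display drew byte 0xED is outside what the code can show.

```python
# æ, å, Æ and Å all have positions in code page 437.
print("æåÆÅ".encode("cp437"))          # b'\x91\x86\x92\x8f'

# ø does not, so encoding it fails.
try:
    "ø".encode("cp437")
    has_o_slash = True
except UnicodeEncodeError:
    has_o_slash = False
print(has_o_slash)                      # False

# The stand-in: byte 0xED is the small Greek phi in cp437.
print(bytes([0xED]).decode("cp437"))    # φ

# The Nordic code page 865 gives ø and Ø real positions.
print("øØ".encode("cp865"))             # b'\x9b\x9d'
```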

  • @TheJamesM
    @TheJamesM 5 months ago

    I’m guessing the Swedish passport/US plane ticket fact will be that Swedish passports have a kind of canonical spelling for names using only the modern Swedish alphabet, whereas in everyday life Swedes will often use the traditional spelling of a name. For example, a person with the surname Wallberg would have it appear as Vallberg on their passport. They might also use the fallback spellings for the additional vowels: å = aa, ä = ae, ö = oe (as those were the letter combinations those characters originally represented).
    In countries unfamiliar with these conventions, it appears that the name the ticket is registered under doesn’t match the passport, which obviously can cause issues.

  • @lforlight
    @lforlight 4 months ago

    That last example with Taylor Swift is very relatable. I had a friend with whom I chatted a lot via text. He noticed that writing the Hebrew word for "correct" - נכון - in the English layout by mistake makes "bfui". One day I asked him a question and he answered "bfui"... but transliterated to Hebrew - בפוי. After a bit I figured out that he meant "correct", transliterated from the wrong keyboard layout.
    Nowadays, knowing Google does accept these layout mix-ups, when it happens to me and I notice it halfway through, I need to stop myself from deleting the query and writing it all over again, and continue writing it wrong. It's not perfect, and many times it'll either not suggest the Hebrew layout equivalent, or it'll swap the layouts despite searching an obscure English string such as an error code or a mysterious executable's name. It may also provide gibberish results of pages where Hebrew is written backwards because it goes right to left and that's a whole can of worms...

  • @pquirk99
    @pquirk99 5 months ago +1

    A couple of gaps that are worth covering in another video. 1. You didn't cover how applications switched code pages. 2. You made a brief reference to ISO 8859 but didn't discuss the locales that were added to this family of single-byte encodings, and the machinations to standardize collating sequences. In Spanish, the ll and ch digraphs had to be treated as single characters for collating purposes. This is worth discussing as you prepare the audience for Unicode.

  • @greasedweasel8087
    @greasedweasel8087 5 months ago +1

    4:38 I was excavating old* servers and ran into one with a BIOS from 2001 running contemporary BSD. Tried to ctrl-C a command and got a smiley face instead

  • @setlonnert
    @setlonnert 5 months ago +1

    Yes, before codepages we had some "adjustments" for e.g. Swedish. One such was ISO ESC 2/8 4/7, which actually included the Swedish variants "inside" of ASCII, in the Swedish computer ABC80. And if I remember correctly the same code was used in what was equivalent to the English Prestel, videotext or whatever they were called. Yes, that also had consequences. Early in the internet, when we still had that awful ugly quoted-printable hack, the mess got so bad that we gave up and spelled our texts with a and o instead of å, ä and ö. Mostly texts were still readable, due to context ...

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago

      Also known as ISO 646
      In addition to being used in all sorts of computers predating the IBM PC, it was also used by teletext (text-tv in Swedish). ABC 80 used a character/font ROM intended for teletext.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago

      Btw a pet peeve re ISO 646 is that it put the Swedish alphabet in the wrong order. (I assume this was true for Norwegian and Danish too?)
      I assume that it was in order to have Ä and Ö on the same codes as for German. TBH it would have been better if the Germans had had to suck it up and have ß between Z and Ä, and have it share code with Å in the Scandinavian languages.
      Perhaps not the biggest problem in the world, but still annoying to have to have a special case for alphabetical sorting.

    • @pjl22222
      @pjl22222 5 months ago

      But then whose alphabetical sorting are you referring to? Some languages sort accented letters after their non-accented versions, some mix them in with their non-accented versions as if they didn't have accents, others put them at the end of the alphabet.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago

      @@pjl22222 Well, the variant of ISO 646 that has åäöÅÄÖ was only used in Finland and Sweden and those countries sort åäöÅÄÖ after the regular letters. (Also they aren't accented but rather umlaut and ring).

  • @McDuffington
    @McDuffington 5 months ago

    Great stuff! Hope more is to come.

  • @ralfbaechle
    @ralfbaechle 5 months ago +8

    Well done, Dylan!
    ASCII was a good solution considering the technical constraints of the time. It just shouldn't have lived that long. We went through Commodore ASCII (aka not really ASCII at all) and a few other proprietary variants, more official extensions such as the three dozen variants of ISO-8859-random_number plus a bunch of national standards and of course the code pages, Amiga ASCII, ATASCII aka Atari ASCII, EBCDIC (punch card compatible but not ASCII-like) and more. After having survived Baudot code and what not in the teleprinter age. ASCII was always a standard that was typographically impoverished, just barely good enough - it doesn't even fully cover the character set used in an average newspaper, such as proper “quotes”, the cent symbol ¢ and more.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 months ago +1

      Commodore/"PETSCII" IS actually ASCII, but it's 1963 ASCII rather than the way more common 1967 ASCII that almost everything else is based on. This is the reason for having an up arrow and a left arrow symbol.

    • @greggoog7559
      @greggoog7559 5 months ago

      The best AND worst thing about Baudot code was the integrated code page switching (can't remember what it's called). If you missed such a character, the rest of the transmission was garbage 🥴

    • @ralfbaechle
      @ralfbaechle 5 months ago

      @@greggoog7559 The Baudot issue was (is!) very annoying with radio teletypes. Some of the UTF-8 designers may have been aware of this issue. UTF-8 is designed to quickly resync after a byte has been lost or missed.
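
The resynchronisation property mentioned above comes from UTF-8 continuation bytes all having the bit pattern `10xxxxxx`, so a decoder that lands mid-character can skip forward to the next byte that is not a continuation. A small sketch with an invented sample string and a hypothetical `resync` helper:

```python
def resync(data: bytes, pos: int) -> int:
    """Advance pos past any continuation bytes (10xxxxxx) to a character start."""
    while pos < len(data) and data[pos] & 0xC0 == 0x80:
        pos += 1
    return pos


data = "héllo wörld".encode("utf-8")

# Lose one byte in transit: the lead byte of the two-byte sequence for é.
damaged = data[:1] + data[2:]

# Only the damaged character is replaced; everything after it survives.
print(damaged.decode("utf-8", errors="replace"))  # h�llo wörld

# The orphaned continuation byte at index 1 is skipped; decoding resumes at 2.
print(resync(damaged, 1))                         # 2
```

Contrast that with a shift-based code such as Baudot, where missing one shift character changes the meaning of every code that follows.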

  • @FlameRat_YehLon
    @FlameRat_YehLon 5 months ago

    So you are the plain text guy that shows up multiple times in my feed😂. Anyway, I've recently heard a new story about this. For some context, there used to be a popular game genre on the Chinese-language internet called 魔塔 (Magic Tower), and one of the famous games in mainland China was made in Hong Kong. Because back then the mainland mostly used GBK encoding while HK used Big5, the text turned into gibberish, and somehow someone on the internet was dedicated enough to just memorize and read such gibberish in order to progress in the game.
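
The Big5 versus GBK clash can be reproduced with Python's built-in codecs. The sketch below encodes the genre name as Big5 and reads the bytes back as GBK; the exact gibberish depends on the codec tables, so no particular output is claimed here.

```python
title = "魔塔"

# What the Hong Kong game would have stored on disk.
big5_bytes = title.encode("big5")

# What a mainland system assuming GBK would have displayed. Both encodings
# use two bytes per character here, so the result is other characters
# and not an error.
misread = big5_bytes.decode("gbk", errors="replace")

print(big5_bytes.hex(" "))
print(misread)
print(misread == title)  # False
```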

  • @louisreinitz5642
    @louisreinitz5642 5 місяців тому +2

    PECTOPAH == RESTORAN (Restaurant)

    • @edwardallenthree
      @edwardallenthree 4 місяці тому

      Your comment has a translate to English button (for me, US English user), which ironically only normalizes the space around the double equals.

  • @m4rt_
    @m4rt_ 5 місяців тому +2

    3:45, Actually Swedish doesn't have æ Æ, they use ä Ä
    but ø Ø is missing, but you could use the Swedish version, ö Ö

  • @dj196301
    @dj196301 5 місяців тому +1

    Riveting!
    2024 and I'm mired in mojibake (文字化け).

    • @MaddTheSane
      @MaddTheSane 5 місяців тому

      Blame that on the three different encoding standards used by the Japanese computer industry. Where it's easier to fax a document than have to worry about the code page used by the other computer.
      You'd think Japan would move to Unicode…

  • @orbik_fin
    @orbik_fin 5 місяців тому

    4:47 More likely garbage would be written directly to video memory (A0000h-AFFFFh) than standard output which is part of DOS API (Int 21h).

  • @kevinmcnamee6006
    @kevinmcnamee6006 5 місяців тому

    Great video. These problems are still with us. I recently bought a new laptop and during the installation process it seemed to think I had a UK keyboard and rendered the @ as a ", which made it very difficult for me to enter my email address. I figured it out.

  • @mynameisben123
    @mynameisben123 4 місяці тому

    I’m sure when making ASCII they didn’t envision such a grand scope, but rather they probably just focused on their application at the then present time.

  • @I.____.....__...__
    @I.____.....__...__ 5 місяців тому +1

    11:11 On a related note, some systems (eg search-engines, auto-correct, etc.) can detect other errors like when you type something with your fingers shifted a key to the left or right. It's a lot of work to encode all of the possible mistakes that could be made to accommodate errors seamlessly, but perhaps it's a job well-suited for machine-learning. 🤔 (On the other hand, look at the mess that Microsoft made by making IE correct for sloppy web-developers. 😒)

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 місяців тому

      My impression is that autocorrect on iOS sucks at handling some non-US ASCII characters compared to US ASCII characters. At least it's clueless if you make a typo involving the Swedish åäöÅÄÖ, even though it handles typos well for other characters in Swedish words.

  • @probablypablito
    @probablypablito 5 місяців тому

    Amazing! How do you find all of this? This is an incredible amount of knowledge being presented (and very well I might add!)

    • @DylanBeattie
      @DylanBeattie  5 місяців тому

      Travel the world, go to a lot of tech conferences, ask people how computers specifically don't work in their part of the world. 30 years of staring at computer screens going "...but why doesn't it work?!?" also helped.

  • @andreydeev4342
    @andreydeev4342 5 місяців тому

    Как всегда жжёшь! You rock it, as usual =)

  • @joe_z
    @joe_z 5 місяців тому

    I learned about Kohuept from Tom Scott's *Lateral* podcast!

  • @maxmuster7003
    @maxmuster7003 4 місяці тому

    I like to use the extended ASCII characters to display 1-bit pixel art animation on a VGA text screen.

  • @SojournerDidimus
    @SojournerDidimus 4 місяці тому

    Yay for utf-8!

  • @WooShell
    @WooShell 5 місяців тому

    holy cow.. I knew that non-US ASCII was a mess, but I had no idea it went to such extents

    • @MaddTheSane
      @MaddTheSane 5 місяців тому +1

      Unicode was a gift. UTF-8 even more so.

  • @TheUtuber999
    @TheUtuber999 5 місяців тому

    Your Microline 320 printer from 2001 should have been perfectly capable of printing the UK Pound symbol (£). All you needed to do is hold the Alt key on your keyboard and then type 156 on the numeric keypad in your word processor. Then it should just print normally and if not, you could have sent the control codes "ESC ! 0" to your printer to select the standard character set.
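As a small illustration of why byte 156 only prints as £ when both ends agree on the character set, this Python sketch decodes the same byte under a few code pages from the standard library. The helper name `show_byte` is made up for the example.

```python
def show_byte(value: int, codec: str) -> str:
    """Decode a single byte value under the given code page."""
    return bytes([value]).decode(codec)


# Alt+156 on a DOS-era PC sends byte 0x9C.
for codec in ("cp437", "cp850", "cp1252"):
    print(f"byte 156 in {codec}: {show_byte(156, codec)!r}")

# Going the other way: Latin-1 puts the pound sign at 0xA3,
# which CP437 displays as a different character.
print("£".encode("latin-1"), repr(show_byte(0xA3, "cp437")))
```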

  • @cmyk8964
    @cmyk8964 5 місяців тому

    Wow, I wonder how much Google will convert between JCUKEN and QWERTY. I’m guessing it’s just a few common words and names, and I’d be surprised if it were a thing for every word.
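The conversion itself is a plain key-for-key lookup, so it can in principle be applied to any word rather than a fixed list. Here is a minimal Python sketch assuming the standard Russian JCUKEN layout and US QWERTY, covering only letters and a few punctuation keys; it is not how Google does it, and the function names are invented for the example.

```python
LATIN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC = "йцукенгшщзхъфывапролджэячсмитьбю"

# Forward table: the physical key labelled with a Latin character
# produces this Cyrillic character when the Russian layout is active.
TO_JCUKEN = {}
for lat, cyr in zip(LATIN, CYRILLIC):
    TO_JCUKEN[ord(lat)] = cyr
    if lat.isalpha():  # Shift + letter key gives the capital letter
        TO_JCUKEN[ord(lat.upper())] = cyr.upper()

# Reverse table for text typed with the Russian layout left on.
TO_QWERTY = {ord(cyr): chr(lat) for lat, cyr in TO_JCUKEN.items()}


def to_jcuken(text: str) -> str:
    return text.translate(TO_JCUKEN)


def to_qwerty(text: str) -> str:
    return text.translate(TO_QWERTY)


print(to_jcuken("Foma Kiniaev"))
print(to_qwerty("Ащьф Лштшфум"))
```

The hard part for a search engine is not the mapping but deciding when to apply it, since both spellings are valid strings.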

  • @david.mcmahan
    @david.mcmahan 5 місяців тому

    Totally understand your view on "pound sign". But as an American of Gen X age and having had a grandmother who worked for a "baby" Bell telephone company, # is the "pound key" to me. AT&T and the Bell system instructed us to use the "pound key" for specific dialing situations with touch tone phones.

    • @caerphoto
      @caerphoto 5 місяців тому +1

      A lot of UK companies use American phone menu systems that sometimes tell us to press the Pound key, which obviously doesn't make much sense here. I think most people are aware enough of American culture to figure out what it means, though.

    • @Thesecret101-te1lm
      @Thesecret101-te1lm 5 місяців тому

      TBH there is a certain imperial aura of having a special symbol for money. Afaik it's only dollar, pound, rubles and yen that has special symbols. And then there is the international symbol ¤
      This also leads to some ridiculous things like the screen keyboard on an iPhone when set to Swedish having a "kr" key that just prints the letter k and r, which is the standard abbreviation for "kronor". Zero need for that button, but someone at Apple decided that there should be a button for money at that place I guess?
      P.S. in Sweden # is "square" or uncommonly "lumber yard". :D

    • @pjl22222
      @pjl22222 5 місяців тому

      Back in the day people writing, for instance, a list of goods bought or sold by weight would use # to mean pounds. Like if you bought three pounds of apples then maybe one line of your handwritten receipt would say "apples 3#" and if they were 50¢ per pound it might even say "apples 3# @ 50¢ $1.50"

  • @TheRowi62
    @TheRowi62 5 місяців тому

    There is even a 7Bit ASCII Table for German, where []{}\| and ~ have been used for the umlauts and ß
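A small Python sketch of what that substitution did to source code on such a terminal. It assumes the DIN 66003 assignment of those seven code points (brackets and backslash to capital umlauts, braces and pipe to lowercase umlauts, tilde to ß); the function name is made up.

```python
# Same 7-bit code points, different glyphs on a German terminal.
GERMAN_7BIT = str.maketrans("[\\]{|}~", "ÄÖÜäöüß")


def as_german_terminal(text: str) -> str:
    """Show how US-ASCII text would look with the German national variant."""
    return text.translate(GERMAN_7BIT)


print(as_german_terminal("if (a[0] | b) { return ~x; }"))
```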

  • @thejonte
    @thejonte 5 місяців тому

    Love your work! You're a very good speaker.
    I've however got one question: why do you spend your time recording your NDC talks in a studio? Can't you get permission to repost the talks on here as well?

  • @Huntracony
    @Huntracony 5 місяців тому +2

    I must admit, I had no idea Cyrillic had capital letters, I thought that was unique to the Latin alphabet. I should learn about the history of capital letters sometime, it's kinda weird that (at least?) two alphabets have two versions of every letter.

    • @0LoneTech
      @0LoneTech 5 місяців тому +3

      It didn't surprise me since Greek also does.

    • @fullfungo
      @fullfungo 5 місяців тому +5

      Actually, lowercase letters are a very recent invention. They did not exist for centuries until the printing process became wide-spread.
      So you got it the wrong way round (twice)

    • @sponge1234ify
      @sponge1234ify 5 місяців тому +1

      @@Huntracony a lowercase-uppercase system is actually pretty rare, numerically speaking. It's mostly Latin, Greek, and alphabets descended from them, and the rest of them having a high level of correlation that suggest influence (Old Hungarian Runes, Zaghawa, Adlam, Warang Citi, etc.)

  • @danielrhouck
    @danielrhouck 5 місяців тому

    Is this series going to build up to Pike Matchbox?

  • @diakritika
    @diakritika 5 місяців тому

    Thank god for Unicode…

  • @soumen_pradhan
    @soumen_pradhan 5 місяців тому

    Wait a min, Dylan is a Rhodesian! New lore drop.

    • @DylanBeattie
      @DylanBeattie  5 місяців тому +1

      Nah, Dylan's British. Born in Kenya, lived in Zimbabwe 1981-1988, never set foot in any of the various bits of the world that were known as Rhodesia while they were still called that. (But Mum was born in Zambia while it was still called Northern Rhodesia, so maybe that counts.)
      You ever notice that if they'd named the country after Rhodes' first name instead of his last name, it would have been called Cecilia?

    • @soumen_pradhan
      @soumen_pradhan 5 місяців тому

      @@DylanBeattie Well, I stand corrected.

    • @DylanBeattie
      @DylanBeattie  5 місяців тому +1

      @@soumen_pradhan There is a whole gnarly tangled mess of what databases should do with people whose place of birth is a country that no longer exists. That'll probably make a fun topic for another video. :)

  • @The4096Tile
    @The4096Tile 5 місяців тому

    I just got done watching the previous video lol

  • @edgeeffect
    @edgeeffect 5 місяців тому

    Hi @Dylan... I've heard you talk on this subject a couple'a times now... but you never use the term "Mojibake" (文字化け) and I've always thought you should... 'cus it's a cool word. ;)

  • @pnadk
    @pnadk 4 місяці тому

    The Japanese have a word for what happened to the Russian address typed in France, they call it Mojibake

  • @marloelefant7500
    @marloelefant7500 5 місяців тому

    Actually, that last part I'm using for some of my passwords. I'm typing something in with the English keyboard layout, but assuming another layout in my mind. The result is a more complicated, but easy to remember password (btw, my passwords also comprise multiple words).

    • @pihungliu35
      @pihungliu35 5 місяців тому +3

      But be careful: do not use any common underlying phrase for your password, as many other people have the same idea as you and know which other layout you would use. An (in)famous example is a common password-related phrase in Chinese which, when typed assuming the Taiwanese IME layout, produces a random-looking combination of letters and digits, but it is used so much by my fellow Taiwanese people that it appears near the top of commonly used password lists like HIBP.
      (An addendum since other comment mentioned Tom Scott's Lateral: this thing also showed up as a question once!)

    • @peterwmdavis
      @peterwmdavis 5 місяців тому +1

      Right, this is just a basic (and predictable) substitution cypher. A password manager and truly random, long passwords would be much better. Or even a long but memorable multi-word sentence.

    • @marloelefant7500
      @marloelefant7500 5 місяців тому +1

      @@pihungliu35 My passwords usually comprise 2-3 words, together more than 30 characters. I'm pretty sure, not many people have passwords of similar length.

  • @redoktopus3047
    @redoktopus3047 5 місяців тому

    Had no idea Dylan Beattie's parents were Whenwes

    • @DylanBeattie
      @DylanBeattie  5 місяців тому

      Dad was. Mum was born and grew up in Zambia.

  • @pjl22222
    @pjl22222 5 місяців тому

    When the USSR fell apart and Russia decided to start using the standard European license plates they had a problem: you have to use Latin, not Cyrillic, letters but most of their computer systems were only set to handle Cyrillic. Their solution: realizing that a lot of Cyrillic letters look just like Latin letters, they only assigned license plates with those letters. Still Cyrillic for the Russian computers but looks like Latin for all the other European countries they might drive to.

    • @DylanBeattie
      @DylanBeattie  5 місяців тому

      @@pjl22222 PIKE MATCHBOX ftw!
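The lookalike trick can be sketched in Python as a table from Cyrillic capitals to the Latin letters they resemble. The twelve letters below are the ones used on Russian plates (А В Е К М Н О Р С Т У Х); the function names are invented for this example.

```python
# Cyrillic capital -> visually similar Latin capital.
LOOKALIKES = {
    "\u0410": "A",  # А
    "\u0412": "B",  # В
    "\u0415": "E",  # Е
    "\u041a": "K",  # К
    "\u041c": "M",  # М
    "\u041d": "H",  # Н
    "\u041e": "O",  # О
    "\u0420": "P",  # Р
    "\u0421": "C",  # С
    "\u0422": "T",  # Т
    "\u0423": "Y",  # У
    "\u0425": "X",  # Х
}


def is_plate_safe(text: str) -> bool:
    """True if every letter has a Latin lookalike (digits and spaces pass)."""
    return all(ch in LOOKALIKES or not ch.isalpha() for ch in text)


def as_latin_reader_sees_it(text: str) -> str:
    """Replace each Cyrillic lookalike with the Latin letter it resembles."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


restaurant = "\u0420\u0415\u0421\u0422\u041e\u0420\u0410\u041d"  # РЕСТОРАН
print(is_plate_safe(restaurant), as_latin_reader_sees_it(restaurant))
```

The same table explains the PECTOPAH reading elsewhere in this thread: every letter of the Russian word happens to have a Latin twin.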

  • @Colaholiker
    @Colaholiker 5 місяців тому

    And you'd think that with UTF-8 these days all of this was just a funny note in history, right?
    Nope. At least not at my workplace.
    Being someone who prefers efficiency over fancy presentation, I have set my email client to send plain-text mails by default. Mostly because Outlook totally messes up when I paste any code snippet from VS Code into an email and forget to paste it as text only. Of course it is set to UTF-8 encoding, as anything should be today. Being a German person working with Germans who all speak German, I use all the letters that the German alphabet has. With UTF-8, this shouldn't be a problem, right?
    Enter our IT department (some Star Wars Imperial March would work here)
    They put some tool on our mail server to add our signature - for internal purposes just name and contact information, for external purposes also the required legal boilerplate. Apparently, this widget must be from the code page days, because once my mail passes through this filter, it totally messes up all characters that you wouldn't find in 7-bit ASCII. And of course this is what any recipient of the email, both internal and external would get to see...🤣
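The comment does not say which encoding the signature tool assumes, so the following Python sketch is only a guess at the classic failure: UTF-8 bytes being re-read as Windows-1252. The function names and the sample greeting are made up.

```python
def mangle(text: str, wrong_codec: str = "cp1252") -> str:
    """Encode as UTF-8, then misread the bytes with a single-byte code page."""
    return text.encode("utf-8").decode(wrong_codec)


def unmangle(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo the damage, provided no bytes were dropped along the way."""
    return text.encode(wrong_codec).decode("utf-8")


greeting = "Grüße"
broken = mangle(greeting)
print(broken)            # each umlaut becomes two unrelated characters
print(unmangle(broken))  # reversible as long as every byte survived
```

Plain 7-bit ASCII letters pass through unchanged, which matches the observation that only the non-ASCII characters get messed up.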

  • @ailaG
    @ailaG 4 місяці тому

    Ah, encodings. I grew up in the 1990s mostly, and had the role of explaining to relatives and friends of family how to have their documents opened correctly. And when webpages started, the people who just uploaded from Notepad - between Netscape and IE, if my memory serves me right, one guessed the encoding and one didn't. So a webmaster (!) would upload an html file, see that it works fine for them, and then for other people it didn't.
    I live in Israel, btw, so that was because of Hebrew encoding.
    I'm still not done watching the videos but it would be nice to know
    1 if there's any logic behind encoding names (e.g. Hebrew Windows-1255)
    2 if there's anything interesting in the way computers draw final letters in Hebrew ("מ" in the middle of a word, "ם" same letter but in the end of one) and more tricky - how they connect letters in Arabic, which can have up to 4 forms of a letter depending on the place in the word (start, end, middle, single) and has some other special rules, eg some letters will be end letters in the middle of a word.
    There has to be something interesting behind turning that into computer stuff.
    3. If you haven't done so yet, there's some stuff to look at re RTL and LTR. Some zero width characters like "show Latin letters right-to-left from here on" (another reason to filter user input. It can be used for impersonation)
    And interesting bugs to tackle. Say, suppose you write this string
    AAAאאאaaa
    And you want to add a Hebrew letter before the existing Hebrew. You even have a mouse. Where do you click? The beginning of the אאא? The beginning of the aaa? They're the same!
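For the RTL/LTR points above, Python's standard `unicodedata` module exposes the bidirectional class of each character, which is enough to sketch both the mixed-direction string and a filter for the invisible direction controls. The function names are hypothetical, and the filter only removes explicit embedding, override and isolate controls; a real input sanitizer would need a policy decision on more than that.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_classes(text: str) -> list[str]:
    """Return the bidirectional class of every character in the string."""
    return [unicodedata.bidirectional(ch) for ch in text]


def strip_bidi_controls(text: str) -> str:
    """Drop invisible direction controls, e.g. U+202E RIGHT-TO-LEFT OVERRIDE."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in BIDI_CONTROL_CLASSES
    )


# The string from the comment: stored in typing order, displayed in runs.
print(bidi_classes("AAA\u05d0\u05d0\u05d0aaa"))

# A file name that displays differently from how it is stored.
spoofed = "invoice_\u202efdp.exe"
print(repr(strip_bidi_controls(spoofed)))
```

The cursor question in the comment comes from the same split: the stored order has one position between the Hebrew run and "aaa", but on screen that position appears at two different places.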

    • @ailaG
      @ailaG 4 місяці тому

      Re keyboard layout switching, there's a large Israeli news site called ynet. But if you're on Hebrew layout and not paying attention you'll type טמקא, gibberish, pronounced Tamka. Some internet folks just call it Tamka jokingly...
      There's also some brilliant things because Hebrew uses shorter words and the vowels aren't typed, so there's a higher probability for these things.
      So when I type API on the wrong layout I get שפן, hare.
      When doing so re CSS you'll get בדד, which means alone.
      Another anecdote re something you've mentioned. I've heard it here and there so it may not be true.
      When I moved to Mac back in 2008, I tried mixing Hebrew and English and it didn't work. Our normal method for doing so fast was, you keep your layout on Hebrew, and with the shift key you can still type uppercase Latin. And that's fine, Hebrew doesn't have uppercase and lowercase.
      But in Macs they typed nothing, or added dots and dashes (vowels) on letters. Why?
      Well, the fanatics over at the local forum claimed that the shift method - that's because Microsoft never prepared for languages without the need for the shift key, so on Windows it didn't know what to do when you pressed shift and a Hebrew letter. So it'd default to the American keyboard instead. Macs didn't have the bug and, said the puritans, that's good.
      ... So I went ahead and one of the first things I've done on my Mac was build a new keyboard layout, with the shift bug coded in.
      Apple does it natively now.

  • @jtsiomb
    @jtsiomb 5 місяців тому +1

    For Greek we had codepage 737, or ISO 8859-7, but since the whole thing was a mess, you had to use a Greek font, run a VGA glyph-replacement program, and earlier computers were ASCII-only anyway, my generation of computer geeks used to type Greek with Latin characters. We called it "greeklish" and that's what we de facto used online to communicate. In later years subsequent generations started taking offence at people using greeklish on forums since they grew up with Unicode and never had to get used to reading greeklish, but for some of us, having to switch keyboard layouts mid-sentence to type an English term is just unbearable, so we keep using greeklish :) Also I type at half the rate in Greek, since I never got used to it.... unbearable.

  • @ALeXKazik
    @ALeXKazik 5 місяців тому

    Luckily I had an Amiga (with ISO-8859-1) and later Mac OS X (with Unicode) and never had to deal with those PC code pages.

  • @butwhytho6522
    @butwhytho6522 5 місяців тому +3

    Windows and the Byte Order Mark. Open the text file - looks normal. Open the text file on Linux - hey there's extra bytes at the start. Thanks again Microsoft!

    • @kalleguld
      @kalleguld 5 місяців тому +1

      What do you mean "On Windows"? Which editor? It's the editor that inserts a BOM, not the OS
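For anyone hitting this, a short Python sketch of what those extra bytes are and one way to read such a file regardless of which editor wrote it. It uses the standard `utf-8-sig` codec, which drops a leading BOM if present; the sample content is made up.

```python
import codecs

# What an editor that writes "UTF-8 with BOM" puts on disk.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(raw)

# A strict UTF-8 reader keeps the BOM as an invisible first character.
as_plain_utf8 = raw.decode("utf-8")
print(repr(as_plain_utf8))

# The utf-8-sig codec strips the BOM when there is one and is harmless otherwise.
print(repr(raw.decode("utf-8-sig")))
print(repr(b"hello".decode("utf-8-sig")))
```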

  • @PixelOutlaw
    @PixelOutlaw 5 місяців тому +12

    They took away my beloved box drawing characters on Linux because some European needed 16 versions of the letter 'e'.

  • @Chris-op7yt
    @Chris-op7yt 5 місяців тому

    v. good

  • @Thesecret101-te1lm
    @Thesecret101-te1lm 5 місяців тому

    Two things re code pages:
    You need at least EGA graphics on a PC to use code pages.
    Microsoft/IBM forced users in Sweden to do all the "load code page" crap to be able to select the correct keyboard layout, even though the default "CP437" already had all characters that we really needed. Everything about code pages in DOS has an aura of "pointy haired boss"...

  • @people9178
    @people9178 5 місяців тому

    Ш щаеут ащкпуе ещ срфтпу дфтпгфпу уізусшфддн црут Ш іуфкср штащкьфешщт щт Пщщпду. Щр тщ тще фпфшт!

  • @vanhetgoor
    @vanhetgoor 5 місяців тому

    Klumsy amateurs at IBM, they better skip the I from their name and go further as Local American Business Machines. This cheap skate solution has traumatised the complete computer industry for many years. Luckily the Apple Mac had from the beginning on an international set up that worked. If this would not have been done, international publishing with the help of DTP would have been seriously delayed for five to ten years. This stupid mistake of IBM was the glorious moment of triumph for the Mac.

  • @marloelefant7500
    @marloelefant7500 5 місяців тому

    This is the fourth comment.

    • @DylanBeattie
      @DylanBeattie  5 місяців тому

      Sixth. But who's counting, eh? 😉

    • @marloelefant7500
      @marloelefant7500 5 місяців тому

      @@DylanBeattie UA-cam only showed me 3 other comments at that time, but I guess that's eventual consistency. Or I'm an LLM, who knows.

  • @SteinGauslaaStrindhaug
    @SteinGauslaaStrindhaug 3 місяці тому

    6:30 That's really clever!