23:30 That is the most beautiful thing about human beings that I've heard in a long, long while. God bless that postman who really cared for his job and even was smart enough to figure out that problem. This will make me happy the rest of the day :D
I was a little bit skeptical: how can anyone give a one-hour talk speaking just about 'plain text'? But I have to admit: it was simply AMAZING! Well done!!!
Dylan Beattie can make pretty much anything interesting. I think he likes to challenge himself that way.
Same here. I was expecting to go to something else within a minute or two, but stayed for the whole thing.
@@tharfagreinir His Art of Code talk is one of my all-time favorite talks. This one is also way up there in the top 5 or so.
This ASCII issue is also a cause of cultural tension in (Republic of) Ireland and (Northern) Ireland, where birth registrations at some hospitals are refused or incorrectly assigned when a child's parents opt to use a Gaelic name, which often includes a bunch of non-ASCII chars. Hospital software is usually pretty archaic and predates a lot of the elegance of UTF.
Also. Amazing talk. Funny and interesting the whole way through. Dylan Beattie is a legend!
it's like Slavic names in Germany
This is one of the best presentations I've seen in a long time. Amazing content!
One of my favourite sorting rules is that for Scottish surnames "Mac" and "Mc", both with and without following space, are considered the same letter that comes after L but before M
There’s a similar quirk with English genealogical documents, such as old church birth registers and ships’ passenger lists. They’ll often use abbreviations of common personal names (and even some surnames) to save space, and when these are sorted, whether in the text itself or later on by a computer, it may be according to what the abbreviation stands for, not the letters themselves.
So you have to just know, for example, that “Hy.” might appear before “Herb.”, because “Henry” comes before “Herbert”. Moreover, some of the abbreviations are based on a Latin and/or Greek transliteration of the name, such as “Iabus” = “Iacobus” = “Jacob” or “Xpr” = “Christopher”.
@@EvincarOfAutumn Interesting! Just wondering why Jacob was abbreviated with another 5-letter word? 🤔
I’ve also noticed it’s a free-for-all on whether the word “the” is ignored when sorting lists of names. Steam doesn’t ignore it, for example, and I think Google Play Music used to but YouTube Music doesn’t. But to me, it’s correct to ignore it and incorrect not to!
@@paulwesley3862 In that case, the person’s name in everyday life would’ve been Jacob, but if the church records are (partially) in Latin, it’s the Latin form that’s abbreviated. I think just “Iab.” is attested as well, though I’m not sure.
At the risk of being one of those YouTube comments shown in your next talk, the diacritic you discuss at 29:18 is a diaeresis, not an umlaut. They look the same and are encoded with the same code points, but they serve different functions. An umlaut changes the quality of the vowel, and can appear on lone vowels in any language that uses them. A diaeresis tells readers that the second of two vowels is not to be read as a diphthong, but as a separate vowel. That's why English has one on, for example, naïve (nigh-eve, not knave). Coöperation is co + op, not co͞op.
Something Nordic readers of Tolkien would do well to be aware of -- I'm referring to Eärendil etc.
Well done. I am not sufficiently educated to know if you are right, but the criticism is concise and I recognize the words if not what they mean.
Saved me posting the same, but having to look up all the terms to double check myself. Thanks!
and that specific diaeresis is called a trema in almost every other language that uses it...
Also, the umlaut is used in German and originated thereabouts (idk the history too well), whereas the diaeresis or trema evolved independently and is notably used in French to mark vowels that might usually be silent but should be pronounced. This is similar to its use in coöperation, basically saying that the second o is pronounced distinctly. The two symbols were developed independently.
Watching this for the second time (I watched the video referenced several times in this talk). Absolutely brilliant and I learned a lot
Recently, a student of mine opened a text file and it was all Chinese gibberish.
I remembered your talk and switched the encoding from UTF-8 to UTF-16 or vice versa, and there was a readable file again :)
Meanwhile, in areas where people actually use Chinese, well, time to try all the encodings.
The 7-bit encoding for SMS messages in GSM is the same as ASCII for most characters, but many of the control characters have been replaced with text characters that were missing from ASCII. In particular it does not have NUL; 0 encodes the '@' character. So, as one of my colleagues at Ericsson found out the hard way, you cannot use C NUL-terminated strings to process SMS messages.
Interesting!
This video also explained to me why SMS becomes converted to MMS if just put in a few emojis. Because the emojis take so many bytes.
wait, what about ASCII 0x40? Isn't that an @?
@@Architector_4 In the GSM 7-bit encoding, 0x40 is an inverted exclamation mark, one of the characters missing from ASCII. No idea why they didn't use 0 for this and keep @ where it was.
@@chascuk ...huh. That's fun, thank you lol
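A minimal sketch of the trap described in this thread, using just two entries from the GSM 03.38 default alphabet (0x00 is '@', 0x40 is '¡'); the tiny lookup table and the strlen stand-in below are illustrative, not a real SMS decoder.

```python
# Two entries of the GSM 03.38 default alphabet, enough to show the problem:
# 0x00 means '@' (it is NOT a terminator) and 0x40 means '¡' (not '@').
GSM_DEFAULT = {0x00: "@", 0x40: "\u00A1"}

def c_style_strlen(septets: bytes) -> int:
    """Length as a naive C strlen() would report it: stop at the first 0x00."""
    length = 0
    for value in septets:
        if value == 0x00:   # in GSM 7-bit this value is '@', not end-of-string
            break
        length += 1
    return length

message = bytes([0x48, 0x69, 0x00, 0x48, 0x69])   # "Hi@Hi" as GSM septet values
print(c_style_strlen(message))                    # 2 -- the '@' and everything after it is lost
```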
Absolutely one of the best presentations that I've seen and it was a total shock. I watched this because I'm a geek and I like Dylan Beattie. I never expected it to be this awesome!
The historical context of all Dylan's talks is simply incredible! Born in the early 2000s, I didn't know any of this, and a lot of it makes sense today.
Since Dylan does read comments, here's one of my favorite examples, in Polish: "Zrób mi łaskę" means "do me a favor". Most of the characters can be turned into their ASCII lookalikes without any issue whatsoever. Except one. "Zrób mi laskę" is asking for a specific sexual act. Just turning ł into l changes the entire meaning of the whole sentence.
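Incidentally, ł is also the letter that defeats the usual "decompose and drop the accents" trick, because U+0142 has no Unicode decomposition. A small sketch (strip_diacritics is just an illustrative helper, not a recommended transliterator):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Naive ASCII folding: decompose, then throw away anything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_diacritics("Zrób mi łaskę"))
# 'Zrob mi aske' -- ó and ę fold to o and e, but ł (U+0142) has no
# decomposition, so it is silently dropped instead of becoming 'l'.
```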
The YouTube comment near the beginning of this updated version of his previous presentation illustrates the point of the talk powerfully. Dylan is always amazing, but this talk from him is perhaps uniquely important to everyone in the field! From first-year associates to the most seasoned senior architect, plain text is always less than plain.
Plain text but the 'l' is silent
44:14 Glad that my comment in the previous talk video was found helpful :)
Very good talk. Regarding ASCII and punchcards, it's unlikely they would ever meet in the first place. You do course correct a bit w/r/t the DEL character, but punch cards were originally in 6-bit BCDIC (binary-coded decimal interchange code). This was extended to 8-bits to become "Extended" BCDIC, or EBCDIC. The layout of the character set aligned w/ the rows of the punchcard, such that all alphabetic chars were x1 - x9, so in late variants 'A' is 0x11 and 'Z' is 0x39. To get 3 rows of 9 columns to line up, there's a "/" at the start of the last row, 0x31.
Interestingly, ASCII was created by Bob Bemer at IBM to solve interop problems between the BCDICs. However, IBM was in so deep w/ their card-based (E)BCDIC, they couldn't use it in any of their operating systems. Note also, EBCDIC is still very much in use.
Finally, Multics did not influence Unix, except to serve as a counter-example of design principles.
I've always wondered how come EBCDIC was "extended", thanks for that.
Yay I love this guy, I binged all his talks like a month ago
You have impeccable taste, bravo!
I have been a software developer for 20 years and it's only in the last 5 years that I began to realize the actual complexities of good old plain text. Once I realized how complex this issue actually is, I began to wonder why many of the systems I had worked on even WORKED. It's not something they talk about at university or anywhere, so it was nice to see this get so many views. I haven't watched it yet but I'm sure it will open many people's eyes.
I spent a fair part of my career designing and implementing serial terminals and emulators of the same. For terminals from DEC starting with the VT100 (and other "ANSI" terminals), there was something called "code extension", along with character set designators, graphic sets, and shifts (both locking and single) that were used to mix text from multiple character sets on one screen/page using either 7 or 8 bits per character. This was fine on terminals and printers that had the same character sets available, but caused a lot of grief when a device receiving the text didn't support all of the character sets used. Also, very few editors at the time could handle storing such text.
It was a mess, but at least it was better than what it replaced, which was National Replacement Character Sets (NRCS), where it was 7-bit ASCII with the glyphs for some of the code points replaced. There was no way to tell which NRCS had been selected when the file was created, even with a hex editor.
30:00 ij is a Dutch letter, not a typesetter's ligature! It's in the extra block at 19:50, left of Ö. Most fonts don't support it, and ASCII led to it being written as 2 letters (i and j) because it was the only non-ASCII letter in Dutch, but all Dutch typewriters before PCs were popularized had a dedicated key for it. Fonts that turn it into a ligature often run into problems with words like minijack, Beijing and bijoux. It used to have the same problem as å, with some people turning it into a Y (most famously Cruijff) until it got standardized as I+J.
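Unicode does carry dedicated code points for it (U+0132/U+0133), but classifies them as compatibility ligatures, which is why normalization splits them back into i + j; a quick check:

```python
import unicodedata

ij = "\u0133"                                          # U+0133
print(unicodedata.name(ij))                            # LATIN SMALL LIGATURE IJ
print(unicodedata.normalize("NFKC", ij))               # 'ij' -- two plain ASCII letters
print(unicodedata.normalize("NFKC", "mini\u0133ack"))  # 'minijack' -- what you get if the ligature code point sneaks into a word
```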
Also, for future talks you may find the "Greeklish" system interesting: en.wikipedia.org/wiki/Greeklish
Basically, before the Greek language was fully supported, Greek people interacting with electronics came up with mappings between ASCII and Greek.
These mappings were unofficial and there are several variations.
Even after UTF-8 was implemented and got more and more adoption, lots of young people still utilized Greeklish in SMSs to send messages to each other because you'd get charged by the number of bytes you used (in groups of bytes) and not by the actual number of characters used.
This is also an issue in a lot of fields that have a byte limit instead of a character limit.
On a parallel note...
If you do a bit of time travel, and go to Greek villages in Anatolia during the time of the Ottoman Empire, you'll find the Greek alphabet being used to write Turkish text: en.wikipedia.org/wiki/Karamanli_Turkish
That sounds similar to what many Arabic speakers use, numbers in place of characters.
This is an amazing talk. I already knew some of this, but it's still nice to get a reminder on this stuff :)
Fantastic content. It ought to be general knowledge for everyone working in IT and development.
Having recently delved into utf8, unicode, etc, I knew a lot of this, but learned a few new things as well, either way it was thoroughly interesting. Well done!
Lovely talk, like going down memory lane. Spent a lot of time dealing with this. From writing xmodem and ymodem, to parsing csv files, converting bin to text, and back.
I love them YT commentators. The world would be a much more imperfect place without them.
Btw I thought I knew a lot about plaintext, but it turns out I knew only something about plaintext. Thank you!
Late to the party, but I enjoyed that. So much so that I started watching at about 1am, thinking I'd just catch the intro before I went to sleep to determine if it's something I want to keep watching, and ended up watching over half of it before finally deciding I was too tired. Also, I learned something new that will solve an issue with one of my applications, so that was a bonus!
Now this was incredibly funny, entertaining, intelligent and interesting. I was not expecting this. Incredibly done. We need more talks like this in IT, instead of the serious and boring ones. Well done :)
In the days of 7-bit ASCII, there were lots of workarounds in non-English-speaking countries. For example, in order to be able to print umlauts, printers had special character sets that had umlauts where the characters {, [, ], }, \ and | normally were, because nobody needed those when writing a letter.
However, if a C or C++ programmer used such a printer, their code would look quite funny. In part that's the reason why some languages have special replacements for these characters, called digraphs and trigraphs. This all sounds like multiple layers of duct tape put on top of one another, but it kind of worked.
Pleonasm. We’re still in the days of 7-bit ASCII. ASCII is 7 bits. Forever.
This popped up at the right time; while messing around with Notepad++ I looked up the purpose of carriage return, line feed, and tricks like *bolding,* underlining, and -strikethrough- with typewriters and teletext.
I've since come across resources like Typography for Lawyers that, apart from being an excellent reference for general formatting, advocate the end of shortcuts picked up from typewriters and a return to form for good typefaces and typesetting.
At first glance the headline of this video/presentation seems dull, but it ended up being extremely interesting! Very good video and very informative!
The reason why you get smiley faces when DOS crashes is not that there is something trying to generate the stop character. The reason is that it often starts executing random garbage or tries to print a message that became random garbage due to memory corruption. In a piece of program data the values 1 and 2 would be quite common if you have some counters that did not fit into your registers, and maybe they encode some common x86 instruction as well. The string terminator in the common OS interface for printing strings on the DOS operating system was the dollar sign rather than NUL. The dollar sign is much less common in random garbage than NUL and the smiley values, so you will likely get some smiley faces printed.
Note also that 'plain text' is just a binary format (or more precisely a family of binary formats: ASCII, EBCDIC, various code pages, JIS, Big5, GB 18030, UCS-2, UTF-7, UTF-8, big-endian and little-endian UTF-16/UTF-32, ...) for which there happen to be a lot of editors and viewers. In the end it's all binary bits. One specific property that 'plain text' has over many other binary formats is that it has very little structure and can still be of some use when some bits are flipped or bytes are missing, as opposed to, say, a compressed JPEG image, with the caveat that the multibyte encodings are much more fragile.
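To make the smiley part concrete: in the IBM PC's code page 437 glyph table, byte values 1 and 2 are drawn as ☺ and ☻, and DOS's string-print service (INT 21h, AH=09h) stops at '$' rather than at NUL. A rough sketch, with a deliberately tiny glyph table rather than the full code page:

```python
# A few hardware glyphs from code page 437; the real table has 256 entries.
CP437_GLYPHS = {0x01: "\u263A", 0x02: "\u263B", 0x03: "\u2665", 0x24: "$"}

def dos_print(garbage: bytes) -> str:
    out = []
    for value in garbage:
        if value == 0x24:                    # INT 21h AH=09h stops at '$', not NUL
            break
        out.append(CP437_GLYPHS.get(value, "."))
    return "".join(out)

print(dos_print(bytes([0x01, 0x02, 0x01, 0x03, 0x24, 0x41])))   # ☺☻☺♥
```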
2:57 My favourite part of this is your youtube suggested videos are all ones I've watched!
Well, that was another exceptional video from the master. I found it extremely enjoyable and informative. Unsurprisingly, I didn't know a lot of the history.
God, I have homework, but now I have an irresistible urge to research Unicode because this was fascinating. It's amazing how clever some of their solutions are.
It's the funniest IT conference talk I've seen in years!
I'm famous. I vaguely remember the train of thought I had with that WWIII joke. That you posted a meme on twitter that was so funny that it prevented WWIII, and with you erased from existence by time travel shenanigans, that meme never gets posted and thus WWIII happens. I know I can get long winded especially when I talk about technical stuff, which is probably why I put that joke in there at the end. It's like a reward for sitting down and reading all that stuff about base64 and how vim fucks up binary encoding.
Also, how dare you say the End Of Transmission character, Ctrl-D, is unimportant. How else would I log out of my Linux terminal in one keystroke?
Rewatching this talk proved very useful today.
Currently dealing with the lexer for my programming language project failing unit tests on the Windows runner for GitHub Actions. Wanna guess why? I’ll give you a hint: newline tokens report their span to be exactly one character later than expected.
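Presumably the classic CRLF-on-checkout issue; one way to keep token spans platform-independent is to normalize line endings before the lexer ever sees the text (read_source below is just a sketch, not part of any particular project):

```python
def read_source(path: str) -> str:
    # Read raw bytes and normalize line endings ourselves, so a lexer that
    # assumes "\n" reports the same token spans whether the file was checked
    # out with LF or CRLF endings.
    with open(path, "rb") as f:
        text = f.read().decode("utf-8")
    return text.replace("\r\n", "\n").replace("\r", "\n")
```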
You know, featuring the YouTube comments in the talk only emboldens us.
Dylan, thanks for the talk! It was really interesting to hear all these historical details and understand more about how Unicode works. And my gratitude for your support of Ukraine! Слава Україні!
53:42 According to Apple, Dylan was in Denmark. According to Microsoft, he supports Donkey Kong. Both very respectable!
Great lecture -- super fun and informative, thanks! And now I'd love to see a follow-up that touches upon those lovely grey areas of A) finding out the encoding of a given "plain" text file, and B) UTF-16 surrogate characters. Especially the latter is quite important, because I'd guess that 95% of all applications using UTF-16 are broken, in the sense of not being able to deal with any text that contains Unicode code points which cannot be encoded in a single 16-bit unit of UTF-16.
I couldn't imagine working at an airline, where I know for sure that names will be scrutinized in every detail, and deciding "eh, I'll just strip diacritics off of everything." Having scanned passports before, there are very well-publicised and clear standards for how to transliterate any Unicode character into that strip at the bottom.
You're probably not American or English, then, where diacritics are uncommon and used mostly by foreigners. Yes, if you think about it that's a bit parochial, but it shows the difference between programmers working for commercial companies with a certain market and the people who write standards like the one that allowed all those different forms in an email address.
Nothing wrong with writing JavaScript in Ukrainian:
1. It runs fine.
2. In a production build, the minifier will take it all out and replace it with single-character ASCII names.
3. Source maps will work fine.
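Python accepts the same thing under PEP 3131, for what it's worth; the Ukrainian names below are made up purely for illustration:

```python
# Non-ASCII identifiers are legal in Python 3 (PEP 3131), much as in modern JS.
кількість_студентів = 3                  # "number of students"

def подвоїти(значення):                  # "double(value)"
    return значення * 2

print(подвоїти(кількість_студентів))     # 6
```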
You should never underestimate things labeled “simple” or “plane” )) Thanks, Dylan! Appreciate so much everything you’re doing for the community.
I have never underestimated a plane. Be it a machine that can carry me to the sky, or an infinite flat set of points in 3D space, or a tool used to smooth wooden surfaces, they are always quite intimidating.
@@NeatNit 🤣🤣You got the point!
Good talk, I did a lot of research on this about 20 years ago but I always forget. BTW, the two dots in English are a diaresis, not umlaut.
Great presentation! I code systems that use control codes all the time for work; they are still widely used and accepted (receipt printers, barcode scanners, serial comms, etc).
When I was working with ASCII terminals, I liked to use BEL to sound the squeaky buzzer of the terminal.
UTF-8 is rarely slower to process than UTF-16, and because UTF-16 only has the BMP in a single code unit, you can't rely on that for counting codepoints anyway; furthermore, rarely do you want to count codepoints, you generally want to count graphemes.
UTF-16 generally sucks and was the bane of my existence for many years; thanks for nothing, Windows, as usual.
UTF-16 is actually just another hack to fix UCS-2, which is the fixed 16-bit Universal Coded Character Set. It was intended to contain all the codepoints until we discovered that 16 bits were actually too few bits to contain the set. It really is hacks and partial backwards compatibility all the way down. Windows extended their API to work with wide characters to support UCS-2 before UTF-16 or UTF-8 was a thing, and when UCS-2 died they were kinda screwed and couldn't update their design. So that's how we ended up here.
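A small illustration of both points in this thread: bytes, UTF-16 code units, code points and graphemes all give different answers, and only the last matches what a user sees as one "character". The grapheme count assumes the third-party regex package, whose \X pattern matches extended grapheme clusters.

```python
import regex   # third-party; pip install regex

s = "\U0001F1FA\U0001F1E6"                  # 🇺🇦 -- one flag on screen
print(len(s.encode("utf-8")))               # 8 bytes in UTF-8
print(len(s.encode("utf-16-le")) // 2)      # 4 UTF-16 code units (two surrogate pairs)
print(len(s))                               # 2 code points
print(len(regex.findall(r"\X", s)))         # 1 grapheme cluster
```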
I guess there's not much hope for doing a cameo in the next version of the presentation, but I'll try anyway.
Using Cyrillic, or any other local writing system in JavaScript is probably a bad idea in any production code, for sure, and it's universally frowned upon for a reason. Universality, you know - if you write science in Medieval Europe, use Latin, don't be a dick.
But, there's a "but"! Teaching programming to newbies with no STEM background whatsoever, who also don't happen to be fluent in English (you can imagine), I suddenly found allowing them to use the words of their native language as names in their source code very, very useful. Separation of concerns and cognitive load reduction, I guess. As a bonus, there's a clear distinction between library entities and the locally introduced ones, which is also a good thing for the newbies.
In fact, the role of English in international software development is a huge topic with a ton of practical consequences. Some Chinese developers have already stopped giving a shit about this "you must write everything in English" thing, and it's not gonna stop there.
I LOVE FiraCode, BTW!
36:09 I see this a lot in the large German company I work for, specifically this example of having to select a country from a dropdown list. The countries' English names are displayed, but ordered as if they're German names.
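The usual fix is to collate with the locale of the language actually being displayed instead of comparing raw code points; a sketch using Python's locale module, assuming an en_US.UTF-8 locale is installed on the machine:

```python
import locale

names = ["Zimbabwe", "Åland Islands", "Egypt", "Austria"]

locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
print(sorted(names, key=locale.strxfrm))   # 'Åland Islands' sorts in among the As

print(sorted(names))                       # raw code-point order: 'Å' lands after 'Z'
```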
Loved the talk. Well done, Dylan! 👌
I really wish Dylan would talk about Han unification.
Like, it's just such a cursed aspect of Unicode. I really wish more people knew about it.
43:35 Generally a very solid talk, but the section about UTF-16 was kinda inaccurate. UTF-16 is not actually a fixed-length encoding and you cannot get the number of bytes just from the number of contained characters (e.g. Emoji need two UTF-16 code units forming a surrogate pair). The actual reason that so many of these 90s systems use UTF-16 is that this was the time of the fixed-size 16 bit UCS-2 encoding ( "65k characters ought to be enough for everyone"), which was later expanded to become UTF-16 when they ran out of code points. Instead, the range of code points U+D800 to U+DFFF was permanently snapped out of existence, so that UTF-16 could use them to encode higher code points as multi-word sequences. This is also the reason why not every String in C#, Java, or JS is Unicode; these languages allow you to have unpaired surrogates which are not valid UTF-16 (they are not scalar values). See the "History" section of UTF-16 on Wikipedia.
And this entire paragraph was even without going into that dreaded word "character". If you take character to mean code point, then doubling the number of characters to get the number of bytes is almost correct (so long as you don't care about anything outside the BMP, aka basically all instant messaging, social media, ...). But as we've seen, one "character" can be made of many, many code points, and each of those code points can be multiple code units. And whether a sequence of code points is displayed as one "character" or several depends on the display technology you're ultimately using (wtf is an extended grapheme cluster?). In fact, the Unicode standard doesn't define what a character is. So, ultimately, there is no actual correspondence between the number of "characters" in a string and the number of UTF-16 code units, the concept of a character varies from use to use, and UTF-16 falls short of even the most charitable interpretation of "character = code point".
Additionally, the reason that UTF-8 stops at four bytes is actually because Unicode is a 21-bit scheme. Unicode has made guarantees that it will only ever go up to U+10FFFF and this, again, stems from the fact that they weren't able to squeeze more bits out of UCS-2.
In summary, UTF-16 is a weird legacy encoding resulting from expanding UCS-2 to a set of code points it was never meant for. In doing so, UTF-16 lost a key property of UCS-2 (being a fixed-length encoding for scalars), while only showing the lack of this property for (until recently) uncommon inputs. It now has both the disadvantages of UTF-8 (variable length) and UTF-32 (wasted space, ASCII incompatibility) while introducing additional drawbacks (byte order confusion, false belief in being fixed-size). Unicode has had to insert multiple hacks just to keep this mess going.
UTF-16 is Unicode's original sin. Every emoji broken by a Java developer using "char", every "Bush hid the facts" censored by IsTextUnicode, and every broken API call from mishandling wchar_t is a punishment from the tech gods themselves. In our hubris we believed that there were fewer than 2^16 characters, so now we must suffer forevermore.
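The surrogate-pair arithmetic itself is small enough to show in full; this sketch encodes the rocket from the talk (U+1F680) and checks the result against Python's own UTF-16 encoder:

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """UTF-16's escape hatch for code points above U+FFFF."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000                                      # 20 bits remain
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)  # high and low surrogate

print([hex(u) for u in to_surrogate_pair(0x1F680)])    # ['0xd83d', '0xde80']
print("\U0001F680".encode("utf-16-be").hex())          # 'd83dde80' -- same pair
```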
Thank you for this wonderful talk 🙏
A great, informative and oh-so-entertaining talk 🥰 !
Small comment from a Dane: Aarhus is at the start of the alphabet when spelled with a double aa, at least according to any convention I have seen in use here in Denmark. Even though aa and å represent the same letter, we still keep their alphabetical ordering distinct. Which implies that Aabenraa comes first in an alphabetically sorted list of city names in Denmark.
What you call a 'weird stylistic thing' for the word 'coöperation' is actually a common feature in Dutch. We call it a 'trema' and it is different from an umlaut (in its use, not in look obviously): it is used to indicate diaeresis, whereas an umlaut changes the sound of the vowel it is on (as others have already pointed out).
Great talk. I knew about the many control characters in ASCII but I never realised that that is where the use of ctrl-c to abort a program comes from. Also, that story about that postman was amazing, what a legend that guy.
Finally, as a programmer, thanks for ruining (ruïning?) the simplicity of alphabetical order for me. I thought it was mostly date/times that were a headache, guess I was wrong...
That postal worker deserved a raise lol.
Mistake in 50:23: the rocket emoji is U+1F680, not U+1F680D
This was a really fun talk, and very well-delivered.
Java now uses UTF-8 internally. They dropped UTF-16 when Java 8 came out. An hour on plain text? I would not have believed it until I watched it. Just awesome.
Emoji existed in the West long before iPhones did. They came to us with things like instant messaging platforms: ICQ, MSN Messenger, even Facebook.
The best one I watched last year!
Special thanks for supporting Ukraine! Pike matchbox!!!
I remember running echo ^G in DOS as a teen.
28:36 Æ is totally a letter in English. It's called the letter æsc, which sounds like "ash", because it represents the ash tree. And for completeness I should also mention the letter œthel, which sounds like Ethel, the personal name. They appear in obviously English words like encyclopædia, manœuvre and Cat7 UTP Æthernet cable.
… Not to mention archæologist. I may have cheated a little bit with one of mine, but why doesn't that count?
Laughed at Cat7 UTP Æthernet cable. And realised it's perfectly correct.
It’s obviously an English word, right? And everyone knows that’s a valid spelling for it.
The cheaty one is manœuvre, because that’s a French word. But I don’t get why he doesn’t count archæologist? Maybe in the same way that, because Latin has the letter K in only one word, it’s sometimes not considered part of the Roman alphabet. And to be fair, æsc and œthel don’t come up very often. Œstrogen is another one, but that’s basically a Latin word. I don’t know any non-borrowed words containing œ that are still in modern English. Unlike æther.
On alphabetical ordering in Finnish... back when I was in school in the 1990s, I was taught that V and W actually are considered equal in Finnish. So going through a list of Finnish surnames, Valli, Waris, Virtanen, Wirtanen (tiebreaker here, I suppose) would be in correct order. But having googled this a bit more, this is apparently nowadays (since 2000) somehow dependent on context -- mixed with foreign words and names such as Vanderbilt and Wolf, it's OK to sort them all V first, then W. So I don't know if even printed dictionaries use this sorting today.
I don't think this peculiarity is even well-known, IIRC this surprised many of my Finnish coworkers.
So, do computers ever deal with this or do they just sort V first, then W?
@@cameron7374 I've never noticed a system that would (probably in part because W appears in Finnish only in names, outside of possible loanwords, and even there it is very rare). But after a quick googling, apparently at least as of 2006 PostgreSQL allowed for this, at least in Swedish.
that was a very interesting watch, thank you!
Amazing content - mega cool Präsentation 🈶
I see what you did there.
@@JeremyAndersonBoise I was going to comment "I see what you did there".... but then I saw what YOU did THERE.... so couldn't.
Brilliant lecture!! They didn't teach this in the 1980's when I studied computer science. ☝🙃
I love watching different versions of the same talk... :)
Is there another version where it carries on past the intriguing statement 'and this is where the version for YouTube ends'?
I forgot about the ending. I've always known this as the Kohuept talk :D
I thought it was boring, but surprise! I watched it to the end. 😁
I've read the SO post, but I never knew there was a name for Zalgo Text! Fantastic talk.
Omg that was god level summarising at the end
Good talk. I was a bit disappointed that you did not even touch on the whole EBCDIC vs ASCII situation.
This guy is a true gem 💎.
As somebody from Aachen, I appreciate the choice of examples :D
It was interesting to see the origins of Dwarf Fortress UI!
ua-cam.com/video/gd5uJ7Nlvvo/v-deo.html
The Danish letters "æ" and "ø" are much older than the spelling reform in 1948. The only new letter that was introduced in that reform was "å". It is correct that the reform did make Danish orthography more distinct from German - but the main reason for this is that the reform removed the capitalization of nouns.
They sang Odoia at the Billy Joel concert, which is a Georgian folk song!!! It is written as Odoya at the beginning of the album shown... What the heck. I did not know of this. Cool!
Pike Matchbox is going to be one of those things, like when someone said Parachuting Buffaloes for lead on the Periodic Table; I'll never forget it because it is such a weird thing.
Great talk. I was wondering, what would hide behind that title, and I was not disappointed.
That Russian postal service anecdote is just so wholesome.
Great talk and thanks for supporting Ukraine! 🇺🇦
I'm on windows right now -- I assume that would be a flag on a different operating system after watching this talk! :)
11:42 Anyone else had to try this when viewing the video? - It works!
I wanted to try it out, but which key does he mean by "echo"?
@@theburner4522 Not a key. You open a shell (e.g. "cmd.exe") and literally type "echo", followed by a space, then the Ctrl key together with G, then Enter.
Awesome job..... blown away
this was beautifully interesting, thanks!
"your recording sounds great! What mic do you use?"
"Rødgrød med fløde"
Absolutely brilliant great speech
Awesome video Calum
Key takeaways :
1. Try out FIRA Code.
2. Gay Pirates are always Winning.
3. In Soviet Russia, Post Office fixes YOUR code-page mistakes.
54:10 I literally burst into hard laughter at Windows' statement "🏳🌈🏴☠🏁 Gay pirates are winning!", hilarious mate. Amazing :D
Did I really spend an hour listening to a guy talk about text formats in the middle of the night‽ Yes I did. What a fun and interesting presentation. Thank you Dylan!
Wow, that was a pretty interesting and fun talk!
Great talk! Thank you.
Line feed on its own is useful in dramatic texts.
The Wheatstone bridge was invented by Samuel Hunter Christie and improved and popularised by Sir Charles Wheatstone.
29:25 They don't spell it with an umlaut; they spell it with a dieresis. (Not sure I spelled that right. Look at the name of the Unicode code point).
The former is a shift in pronunciation, the latter means that the vowel is pronounced distinctly rather than being part of a digraph.
This was incredible
This is amazing!
Brilliant 👍