Plain Text - Dylan Beattie - NDC Copenhagen 2022

  • Published 4 Aug 2022
  • Software is complicated. Machine learning, microservice architectures, message queues... every few months there's another revolutionary idea to consider, another framework to learn. And underneath so many of these amazing ideas and abstractions is text.
    When you work in software, you spend your life working with text. Some of those text files are source code, some are configuration files, some of them are documentation. Editors, revision control systems, programming languages - everything from C# and HTML to Git and VS Code is based on the idea of "plain text files". But... what if I told you there's no such thing? When we say something is a "plain text file", we're relying on a huge number of assumptions - about operating systems, editors, file formats, language, culture, history... and, most of the time, that's OK. But when it goes wrong, "plain text" can lead to some of the weirdest bugs you've ever seen... why is there Chinese in the event logs? Why is the city of Aarhus in the wrong place? And why does Magnus Mårtensson always have trouble getting into the USA? Join Dylan Beattie for a fascinating look into the hidden world of text files - from the history of mechanical teletypes to encodings, collations and code pages. We'll look at some memorable bugs, some golden rules for working with plain text - and we'll even find out the story behind the mysterious phrase "pike matchbox" and what it has to do with driving in Belarus.
    Check out more of our featured speakers and talks at
    www.ndcconferences.com
    ndccopenhagen.com/
  • Science and technology

COMMENTS • 268

  • @jonnilazzerini9085 1 year ago +429

    I was a little bit skeptical: how can anyone give a one-hour talk speaking just about 'plain text'? But I have to admit: it was simply AMAZING! Well done!!!

    • @tharfagreinir 1 year ago +15

      Dylan Beattie can make pretty much anything interesting. I think he likes to challenge himself that way.

    • @hansbaeker9769 8 months ago +2

      Same here. I was expecting to go to something else within a minute or two, but stayed for the whole thing.

    • @crax83 4 months ago +2

      @tharfagreinir his Art of Code talk is one of my all-time favorite talks. This one is also way up there in the top 5 or so.

  • @f.d.3289 8 months ago +40

    23:30 That is the most beautiful thing about human beings that I've heard in a long, long while. God bless that postman who really cared for his job and even was smart enough to figure out that problem. This will make me happy the rest of the day :D

  • @merthyr1831 10 months ago +47

    This ascii issue is also a cause of cultural tension in (Republic of) Ireland and (Northern) Ireland, where birth registrations at some hospitals are refused or incorrectly assigned when a child's parents opt to use a Gaelic name, which often includes a bunch of non-ASCII chars. Hospital software is usually pretty archaic and predates a lot of the elegance of UTF.
    Also. Amazing talk. Funny and interesting the whole way through. Dylan Beattie is a legend!

  • @malcolmhutchison 1 year ago +79

    One of my favourite sorting rules is that for Scottish surnames "Mac" and "Mc", both with and without following space, are considered the same letter that comes after L but before M

    • @EvincarOfAutumn 9 months ago +7

      There’s a similar quirk with English genealogical documents, such as old church birth registers and ships’ passenger lists. They’ll often use abbreviations of common personal names (and even some surnames) to save space, and when these are sorted (whether in the text itself, or later on by a computer), it may be according to what the abbreviation stands for, not the letters themselves.
      So you have to just know, for example, that “Hy.” might appear before “Herb.”, because “Henry” comes before “Herbert”. Moreover, some of the abbreviations are based on a Latin and/or Greek transliteration of the name, such as “Iabus” = “Iacobus” = “Jacob” or “Xpr” = “Christopher”.

    • @paulwesley3862 8 months ago +3

      @EvincarOfAutumn interesting! Just wondering why Jacob was abbreviated with another 5-letter word? 🤔

    • @altreusplays 8 months ago +2

      I’ve also noticed it’s a free-for-all on whether the word “the” is ignored when sorting lists of names. Steam doesn’t ignore it, for example, and I think Google Play Music used to but YouTube Music doesn’t. But to me, it’s correct to ignore it and incorrect not to!

    • @EvincarOfAutumn 8 months ago +2

      @paulwesley3862 In that case, the person’s name in everyday life would’ve been Jacob, but if the church records are (partially) in Latin, it’s the Latin form that’s abbreviated. I think just “Iab.” is attested as well, though I’m not sure.

  • @NicholasShanks 1 year ago +189

    At the risk of being one of those YouTube comments shown in your next talk, the diacritic you discuss at 29:18 is a diaeresis, not an umlaut. They look the same and are encoded with the same codepoints, but are pronounced differently. An umlaut changes the quality of the vowel, and can appear on lone vowels in any language that uses them. A diaeresis tells readers that the second of two vowels is not to be read as a diphthong, but as a separate vowel. That's why English has one on, for example, naïve (nigh-eve, not knave). Coöperation is co + op, not co͞op.

    • @jkollin4875683F 1 year ago +17

      Something Nordic readers of Tolkien would do well to be aware of -- I'm referring to Eärendil etc.

    • @EricChipko 1 year ago +14

      Well done. I am not sufficiently educated to know if you are right, but the criticism is concise and I recognize the words if not what they mean.

    • @stevecarter8810 1 year ago +5

      Saved me posting the same, but having to look up all the terms to double check myself. Thanks!

    • @TonyCoyle 1 year ago +9

      and that specific diaeresis is called a trema in almost every other language that uses it...

    • @Shack263 11 months ago +5

      Also, the umlaut is used in German and was derived from roundabout there (idk the history too well), whereas the diaeresis or trema is notably used in French to mark vowels that may usually be silent, but should be pronounced. This is similar to its use in coöperation, to basically say that the second o is pronounced distinctly. The two symbols were developed independently.

  • @nsulikow 1 year ago +122

    This is one of the best presentations I've seen in a long time. Amazing content!

  • @chascuk 1 year ago +129

    The 7-bit encoding for SMS messages in GSM is the same as ASCII for most characters, but many of the control characters have been replaced by text characters that were missing from ASCII. In particular it does not have NUL: 0 encodes the '@' character. So, as one of my colleagues at Ericsson found out the hard way, you cannot use C NUL-terminated strings to process SMS messages.

    • @UliTroyo 11 months ago +5

      Interesting!

    • @flammungous3068 8 months ago +3

      This video also explained to me why an SMS gets converted to MMS if you just put in a few emojis: the emojis take so many bytes.

    • @Architector_4 7 months ago +1

      wait, what about ASCII 0x40? Isn't that an @?

    • @chascuk 7 months ago +4

      @Architector_4 In GSM 7-bit encoding 0x40 is the inverted exclamation mark, one of the characters missing from ASCII. No idea why they didn't use 0 for this and keep @ where it was.

    • @Architector_4 7 months ago

      @chascuk ...huh. That's fun, thank you lol
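
The pitfall in this thread can be sketched in a few lines of Python. The table below holds only the two GSM 7-bit values mentioned in the comments (0x00 is '@', 0x40 is '¡'); the sketch sticks to lowercase letters, which keep their ASCII values. The helper names `gsm7_decode` and `c_strlen` are hypothetical, and `c_strlen` merely imitates how C's `strlen` stops at the first zero byte.

```python
# Partial GSM 7-bit default alphabet: only the two code points discussed above.
# The sketch otherwise uses lowercase letters, which match their ASCII values.
GSM_OVERRIDES = {0x00: "@", 0x40: "¡"}


def gsm7_decode(septets: bytes) -> str:
    """Decode unpacked GSM 7-bit values (one septet stored per byte)."""
    return "".join(GSM_OVERRIDES.get(b, chr(b)) for b in septets)


def c_strlen(buf: bytes) -> int:
    """Imitate C's strlen(): count bytes up to the first 0x00."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


# "me@home" as GSM septets: the '@' is stored as the value 0.
message = bytes([ord("m"), ord("e"), 0x00, ord("h"), ord("o"), ord("m"), ord("e")])

decoded = gsm7_decode(message)
truncated_len = c_strlen(message)

print(decoded)        # the full text, '@' included
print(truncated_len)  # where a NUL-terminated view of the same buffer would stop
```

A length-prefixed buffer (as `bytes` is in Python) avoids the problem, because the length no longer depends on any byte value inside the message.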

  • @notthedroidsyourelookingfo4026 1 year ago +22

    Recently, a student of mine opened a text file and it was all Chinese gibberish.
    I remembered your talk and switched the encoding from UTF-8 to UTF-16 or vice versa, and there was a readable file again :)

    • @FlameRat_YehLon 1 year ago +3

      Meanwhile, in areas where people actually use Chinese, well, time to try all the encodings.
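
A small Python sketch of how this kind of "Chinese gibberish" can arise, assuming the file held ASCII-range text saved as UTF-8 and the viewer guessed little-endian UTF-16. The sample sentence is just a convenient 18-byte string; any even-length ASCII text behaves similarly.

```python
text = "Bush hid the facts"          # 18 ASCII characters, so 18 bytes in UTF-8
raw = text.encode("utf-8")

# Misread the same bytes as little-endian UTF-16: each pair of ASCII bytes
# is fused into a single 16-bit code unit.
garbled = raw.decode("utf-16-le")

# For this sample, every resulting code point lands in the
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in garbled)

# Nothing was lost: re-encoding and decoding with the right codec restores it.
restored = garbled.encode("utf-16-le").decode("utf-8")

print(len(text), len(garbled), all_cjk, restored == text)
```

The bytes never change; only the interpretation does, which is why switching the editor's encoding setting is enough to get the readable file back.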

  • @HasanSIM14 1 year ago +34

    Watching this for the second time (I watched the video referenced several times in this talk). Absolutely brilliant and I learned a lot

  • @jandorniak6473 1 year ago +6

    Since Dylan does read comments, here's one of my favorite examples, in Polish: "Zrób mi łaskę" means do me a favor. Most of the characters can be turned to their ASCII lookalikes without any issue whatsoever. Except one. "Zrób mi laskę" is asking for a specific sexual act. Just turning ł into l changes the entire meaning of the whole sentence.
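
As a rough illustration of why "just strip the diacritics" is not a single well-defined operation, here is a Python sketch using only the standard `unicodedata` module. The helper name `strip_marks` is hypothetical. It shows that the common decompose-and-drop-marks approach treats 'ł' differently from 'ó' and 'ę', so any mapping of 'ł' to 'l' has to be an explicit, deliberate choice.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (naive diacritic removal)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


phrase = "Zr\u00f3b mi \u0142ask\u0119"   # "Zrób mi łaskę", written with escapes

# 'ó' and 'ę' decompose into a base letter plus a combining mark, but
# 'ł' (U+0142) has no decomposition, so it survives this step untouched.
stripped = strip_marks(phrase)

# Forcing the result into ASCII then silently deletes the 'ł' altogether.
ascii_only = stripped.encode("ascii", errors="ignore").decode("ascii")

print(stripped)
print(ascii_only)
```

Neither output is the sentence the writer intended, and a hand-written 'ł' to 'l' table produces the other sentence from the comment above.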

  • @drullo 1 year ago +19

    Absolutely one of the best presentations that I've seen and it was a total shock. I watched this because I'm a geek and I like Dylan Beattie. I never expected it to be this awesome!

  • @NicolasChanCSY 1 year ago +11

    44:14 Glad that my comment in the previous talk video was found helpful :)

  • @jeberle1 1 year ago +37

    Very good talk. Regarding ASCII and punchcards, it's unlikely they would ever meet in the first place. You do course correct a bit w/r/t the DEL character, but punch cards were originally in 6-bit BCDIC (binary-coded decimal interchange code). This was extended to 8-bits to become "Extended" BCDIC, or EBCDIC. The layout of the character set aligned w/ the rows of the punchcard, such that all alphabetic chars were x1 - x9, so in late variants 'A' is 0x11 and 'Z' is 0x39. To get 3 rows of 9 columns to line up, there's a "/" at the start of the last row, 0x31.
    Interestingly, ASCII was created by Bob Bemer at IBM to solve interop problems between the BCDICs. However, IBM was in so deep w/ their card-based (E)BCDIC, they couldn't use it in any of their operating systems. Note also, EBCDIC is still very much in use.
    Finally, Multics did not influence Unix, except to serve as a counter-example of design principles.

    • @edgeeffect 1 year ago +4

      I've always wondered how come EBCDIC was "extended", thanks for that.
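
The row structure described above can still be seen in the EBCDIC code pages that ship with Python. This sketch uses `cp037`, one such code page; the byte values it prints belong to that code page and differ from the BCDIC-era values quoted in the comment, so treat it only as an illustration of the "letters in rows of nine" layout.

```python
# cp037 is one EBCDIC code page available among Python's standard codecs.
def ebcdic(ch: str) -> int:
    """Return the cp037 byte value of a single character."""
    return ch.encode("cp037")[0]


rows = {
    "A-I": [ebcdic(c) for c in "ABCDEFGHI"],
    "J-R": [ebcdic(c) for c in "JKLMNOPQR"],
    "S-Z": [ebcdic(c) for c in "STUVWXYZ"],
}

for name, codes in rows.items():
    print(name, " ".join(f"{code:02X}" for code in codes))

# Unlike ASCII, the alphabet is not one contiguous run of values.
gap_i_to_j = ebcdic("J") - ebcdic("I")
print(gap_i_to_j)
```

One practical consequence is that a range check such as `'A' <= c <= 'Z'` also accepts non-letters on an EBCDIC system.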

  • @JeremyAndersonBoise 1 year ago +13

    The youtube comment near the beginning of this updated version of his previous presentation illustrates the point of the talk powerfully. Dylan is always amazing, but this talk from him is perhaps uniquely important to everyone in the field! From 1st year associates to the most seasoned senior architect, plain text is always less than plain.

  • @f.d.3289 8 months ago +4

    I have been a software developer for 20 years and it's only in the last 5 years that I began to realize the actual complexities of good old plain text. Once I realized how complex this issue actually is, I began to wonder why many of the systems I had worked on even WORKED. It's not something they talk about at university or anywhere, so it was nice to see this gets so many views. I haven't watched it yet but I'm sure it will open many people's eyes.

  • @ayle1312 10 months ago +7

    30:00 ij is a Dutch letter, not a typesetter's ligature! It's in the extra block at 19:50 left of Ö. Most fonts don't support it and ASCII led to it being written as 2 letters (i and j) because it was the only non-ASCII letter in Dutch, but all Dutch typewriters before PCs were popularized had a dedicated key for it. Fonts that turn it into a ligature often run into problems with words like minijack, Beijing and bijoux. It used to have the same problem as å, with some people turning it into a Y (most famously Cruijff) until it got standardized as I+J.

  • @braveatnight 1 year ago +9

    Yay I love this guy, I binged all his talks like a month ago

  • @feisty-trog-12345 11 months ago +2

    43:35 Generally a very solid talk, but the section about UTF-16 was kinda inaccurate. UTF-16 is not actually a fixed-length encoding and you cannot get the number of bytes just from the number of contained characters (e.g. Emoji need two UTF-16 code units forming a surrogate pair). The actual reason that so many of these 90s systems use UTF-16 is that this was the time of the fixed-size 16 bit UCS-2 encoding ( "65k characters ought to be enough for everyone"), which was later expanded to become UTF-16 when they ran out of code points. Instead, the range of code points U+D800 to U+DFFF was permanently snapped out of existence, so that UTF-16 could use them to encode higher code points as multi-word sequences. This is also the reason why not every String in C#, Java, or JS is Unicode; these languages allow you to have unpaired surrogates which are not valid UTF-16 (they are not scalar values). See the "History" section of UTF-16 on Wikipedia.
    And this entire paragraph was even without going into that dreaded word "character". If you take character to mean code point, then doubling the number of characters to get the number of bytes is almost correct (so long as you don't care about anything outside the BMP, aka basically all instant messaging, social media, ...). But as we've seen, one "character" can be made of many, many code points and each of those code points can be multiple code units. And whether a sequence of code points is displayed as one "character" or several depends on the display technology you're ultimately using (wtf is an extended grapheme cluster?). In fact, the Unicode standard doesn't define what a character is. So, ultimately, there is no actual correspondence between the number of "characters" in a string and the number of UTF-16 code units, the concept of a character varies from use to use, and UTF-16 falls short of even the most charitable interpretation of "character = code point".
    Additionally, the reason that UTF-8 stops at four bytes is actually because Unicode is a 21-bit scheme. Unicode has made guarantees that it will only ever go up to U+10FFFF and this, again, stems from the fact that they weren't able to squeeze more bits out of UCS-2.
    In summary, UTF-16 is a weird legacy encoding resulting from expanding UCS-2 to a set of code points it was never meant for. In doing so, UTF-16 has lost a key property of UCS-2 (being a fixed-length encoding for scalars), while only displaying the lack of this property for (until recently) uncommon inputs. It now has both the disadvantages of UTF-8 (variable length) and UTF-32 (wasted space, ASCII incompatibility) while introducing additional drawbacks (byte order confusion, false belief in being fixed-size). Unicode has had to insert multiple hacks just to keep this mess going.
    UTF-16 is Unicode's original sin. Every emoji broken by a Java developer using "char", every "Bush hid the facts" censored by IsTextUnicode, and every broken API call from mishandling wchar_t is a punishment from the tech gods themselves. In our hubris we believed that there were fewer than 2^16 characters, so now we must suffer forevermore.
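
The surrogate-pair mechanism described in this comment can be written out as a short Python sketch. The function name `utf16_code_units` is hypothetical; the arithmetic follows the standard UTF-16 scheme (subtract 0x10000, split the remaining 20 bits into two 10-bit halves), and the result is cross-checked against Python's own `utf-16-be` codec.

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Encode one Unicode scalar value as a list of UTF-16 code units."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        return [code_point]              # BMP: a single 16-bit unit
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return [high, low]


rocket = "\U0001F680"
units = utf16_code_units(ord(rocket))

# Cross-check against Python's own encoder (2 bytes per code unit).
encoded = rocket.encode("utf-16-be")
from_codec = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]

print([f"{u:04X}" for u in units], len(rocket), len(encoded))
```

One code point, two code units, four bytes: the "double the character count" rule already fails for a single emoji.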

  • @heinzk023 1 year ago +11

    In days of 7 bit ASCII, there were lots of workarounds in non-English speaking countries. For example, in order to be able to print umlauts, printers had special character sets that had umlauts where normally the characters {, [, ], }, \ and | were, because nobody needed them when writing a letter.
    However, if a C or C++ programmer used such a printer, his code would look quite funny. In part that's the reason why some languages have special replacements for these characters, called digraphs and trigraphs. This all sounds like multiple layers of duct tape put on top of one another, but it kind of worked.

  • @MrIkariaman 10 months ago +16

    Also, for future talks you may find the "Greeklish" system interesting: en.wikipedia.org/wiki/Greeklish
    Basically before Greek language was fully supported, Greek people interacting with electronics came up with mappings between ASCII and Greek.
    These mappings were unofficial and there are several variations.
    Even after UTF-8 was implemented and got more and more adoption, lots of young people still utilized Greeklish in SMSs to send messages to each other because you'd get charged by the number of bytes you used (in groups of bytes) and not by the actual number of characters used.
    This is also an issue in a lot of fields that have a byte limit instead of a character limit.
    On a parallel note...
    If you do a bit of time travel, and go to Greek villages in Anatolia during the time of the Ottoman Empire, you'll find the Greek alphabet being used to write Turkish text: en.wikipedia.org/wiki/Karamanli_Turkish

    • @deus_ex_machina_ 9 months ago +1

      That sounds similar to what many Arabic speakers use, numbers in place of characters.
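
The byte-limit point from this thread can be made concrete with a small Python sketch. UTF-8 and UTF-16 are used here only as stand-ins: as far as I know, real SMS uses its own 7-bit alphabet or UCS-2 rather than UTF-8, so the exact numbers are an assumption, not a model of any carrier's billing. The helper `fits` and the 10-byte limit are hypothetical.

```python
# "kalimera" ("good morning") in Greek script, written with escapes so the
# accented letter is guaranteed to be the precomposed U+03AD.
greek = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"
greeklish = "kalimera"    # one possible ASCII transliteration


def fits(text: str, byte_limit: int, encoding: str = "utf-8") -> bool:
    """Check a text against a limit expressed in bytes, not characters."""
    return len(text.encode(encoding)) <= byte_limit


for label, text in (("Greek", greek), ("Greeklish", greeklish)):
    print(label, len(text), len(text.encode("utf-8")), len(text.encode("utf-16-be")))

print(fits(greek, 10), fits(greeklish, 10))
```

Both strings have eight characters, but only the transliteration fits a 10-byte field, which is the same economic pressure the comment describes.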

  • @filker0 1 year ago +5

    I spent a fair part of my career designing and implementing serial terminals and emulators of the same. For terminals from DEC starting with the VT100 (and other "ANSI" terminals), there was something called "code extension", along with character set designators, graphic sets, and shifts (both locking and single) that were used to mix text from multiple character sets on one screen/page using either 7 or 8 bits per character. This was fine on terminals and printers that had the same character sets available, but caused a lot of grief when a device receiving the text didn't support all of the character sets used. Also, very few editors at the time could handle storing such text.
    It was a mess, but at least it was better than what it replaced, which was National Replacement Character Sets (NRCS), where it was 7-bit ASCII with the glyphs for some of the code points replaced. There was no way to tell which NRCS had been selected when the file was created, even with a hex editor.

  • @vincentvega7908 1 year ago +5

    The reason why you get smiley faces when DOS crashes is not because there is something trying to generate the stop character. The reason is that often it starts executing random garbage or tries to print a message that became random garbage due to memory corruption. In a piece of program data the values 1 and 2 would be quite common if you have some counters that did not fit into your registers, and maybe they encode some common x86 instruction as well. The string terminator in the common OS interface for printing strings was the dollar sign rather than nul on DOS operating system. The dollar sign is much less common than nul and smiley faces in random garbage so you will likely get some smiley faces printed.
    Note also that 'plain text' is just a binary format (or more precisely a family of binary formats with ASCII, EBCDIC, various code pages, JIS, BIG5, GB 18030, UCS-2, UTF-7, UTF-8, big endian and little endian UTF-16/UTF-32,...) for which there happens to be a lot of editors and viewers. In the end it's all binary bits. One specific property that 'plain text' has over many other binary formats is that it has very little structure and can still be of some use when some bits are flipped or bytes missing as opposed to, say a compressed JPEG image with the caveat that the multibyte encodings are much more fragile.

  • @sauliustb 1 year ago +3

    this is an amazing talk. i already knew some of this, but it still is nice to get a reminder on this stuff :)

  • @serpent77 1 year ago +4

    Having recently delved into utf8, unicode, etc, I knew a lot of this, but learned a few new things as well, either way it was thoroughly interesting. Well done!

  • @henrikholst7490 1 year ago +5

    Fantastic content. This should be general knowledge for everyone who works in IT and development.

  • @SiriusXification 1 year ago +6

    You know, featuring the YouTube comments in the talk only emboldens us.

  • @etmax1 1 year ago +1

    Well that was another exceptional video from the master. I found that extremely enjoyable and informative. Unsurprisingly I didn't know a lot of the histrionics

  • @bujin1977 11 months ago

    Late to the party, but I enjoyed that. So much so that I started watching at about 1am thinking I'd just catch the intro before I went to sleep to determine if it's something I want to keep watching, and ended up watching over half of it before finally deciding I was too tired. Also I learned something new that will solve an issue with one of my applications, so that was a bonus!

  • @zuao76 5 months ago +1

    Now this was incredibly funny, entertaining, intelligent and interesting. I was not expecting this. Incredibly well done. We need more talks like this in IT, and fewer that are so serious and boring. Well done :)

  • @BradenBest 1 year ago +4

    I'm famous. I vaguely remember the train of thought I had with that WWIII joke. That you posted a meme on twitter that was so funny that it prevented WWIII, and with you erased from existence by time travel shenanigans, that meme never gets posted and thus WWIII happens. I know I can get long winded especially when I talk about technical stuff, which is probably why I put that joke in there at the end. It's like a reward for sitting down and reading all that stuff about base64 and how vim fucks up binary encoding.
    Also, how dare you say the End Of Transmission character, Ctrl-D, is unimportant. How else would I log out of my Linux terminal in one keystroke?

  • @deus_ex_machina_ 9 months ago

    This popped up at the right time; while messing around with Notepad++ I looked up the purpose of carriage return, line feed, and tricks like *bolding,* underlining, and -strikethrough- with typewriters and teletext.
    I've since come across resources like Typography for Lawyers that, apart from being an excellent reference for general formatting, advocate the end of shortcuts picked up from typewriters and a return to form for good typefaces and typesetting.

  • @colinmaharaj 7 months ago

    Lovely talk, like going down memory lane. Spent a lot of time dealing with this. From writing xmodem and ymodem, to parsing csv files, converting bin to text, and back.

  • @user-oc3mi2ct6t 9 months ago +2

    Small comment from a Dane. Aarhus is at the start of the alphabet when spelled with a double aa, at least according to any convention I have seen in use here in Denmark. Even though aa and å represent the same letter, we still keep the alphabetic order distinct, which implies that Aabenraa comes first in an alphabetically sorted list of city names in Denmark.

  • @TooLazyToFail 7 months ago

    This was a really fun talk, and very well-delivered.

  • @Kitulous 2 months ago

    that was a very interesting watch, thank you!

  • @DerekCroxtonWestphalia 8 months ago +1

    Good talk, I did a lot of research on this about 20 years ago but I always forget. BTW, the two dots in English are a diaeresis, not an umlaut.

  • @Rx7man 1 year ago +1

    2:57 My favourite part of this is your youtube suggested videos are all ones I've watched!

  • @f.d.3289 8 months ago +1

    Great lecture -- super fun and informative, thanks! And now I'd love to see a follow-up that touches upon those lovely grey areas of A) finding out the encoding of a given "plain" text file, and B) UTF-16 surrogate characters. Especially the latter is quite important, because I'd guess that 95% of all applications using UTF-16 are broken, in the sense of not being able to deal with any text that contains Unicode codepoints which cannot be encoded in a single 16-bit unit of UTF-16.

  • @Carewolf 9 months ago +1

    Emoji existed in the West long before iPhones did. It came to us with things like instant messaging platforms. ICQ, MSN messenger, even facebook.

  • @JonathanPlasse 2 months ago

    Thank you for this wonderful talk 🙏

  • @dmurvihill 10 months ago +2

    I couldn't imagine working at an airline, where I know for sure that names will be scrutinized in every detail, and deciding "eh, I'll just strip diacritics off of everything." Having scanned passports before, there are very well-publicised and clear standards for how to transliterate any Unicode character into that strip at the bottom.

    • @theelmonk 7 months ago

      You're probably not American or English, then, where diacritics are uncommon and used only by foreigners. Yes, if you think about it that's a bit parochial but that shows the difference between programmers working for commercial companies with a certain market and the people who write standards like the one that allowed all those different forms in an email address.

  • @hfranke07 1 year ago

    Awesome job..... blown away

  • @dgsagoskis1851 9 months ago

    I love them YT commentators. World would be a much more imperfect place without them.
    Btw i thought i knew a lot about plaintext, but turns out i knew something about plaintext. Thank you!

  • @CRBarchager 1 year ago +4

    At first glance the headline of this video/presentation seems dull, but it ended up being extremely interesting! Very good video and very informative!

  • @BenjaminAster 1 year ago +2

    Mistake in 50:23: the rocket emoji is U+1F680, not U+1F680D
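
A quick Python check of this correction, using only built-ins and the standard `unicodedata` module. It relies on the fact that Unicode code points stop at U+10FFFF, so the six-digit value cannot name any character at all.

```python
import sys
import unicodedata

MAX_CODE_POINT = 0x10FFFF            # highest code point Unicode will ever assign

rocket = chr(0x1F680)
rocket_name = unicodedata.name(rocket)

try:
    chr(0x1F680D)                    # one hex digit too many
    accepted = True
except ValueError:
    accepted = False                 # 0x1F680D is larger than 0x10FFFF

print(f"U+{ord(rocket):X}", rocket_name, sys.maxunicode == MAX_CODE_POINT, accepted)
```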

  • @AshtonSnapp 1 year ago +3

    Rewatching this talk proved very useful today.
    Currently dealing with the lexer for my programming language project failing unit tests on the Windows runner for GitHub actions. Wanna guess why? I’ll give you a hint: newline tokens report their span to be exactly one character later than expected.
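
The comment leaves the answer as a riddle. One plausible cause, assumed here, is that the test sources are checked out with CRLF line endings on Windows, so every newline is two characters wide and later spans shift. The sketch below is a hypothetical stand-in for a lexer's newline handling, not the commenter's code.

```python
def newline_spans(source: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of newline tokens, treating CRLF as one token."""
    spans = []
    i = 0
    while i < len(source):
        if source.startswith("\r\n", i):
            spans.append((i, i + 2))     # Windows-style line ending
            i += 2
        elif source[i] in "\r\n":
            spans.append((i, i + 1))     # bare LF or bare CR
            i += 1
        else:
            i += 1
    return spans


unix_source = "let x\nlet y\n"
windows_source = "let x\r\nlet y\r\n"

unix_spans = newline_spans(unix_source)
windows_spans = newline_spans(windows_source)

print(unix_spans)
print(windows_spans)
```

The second newline starts one position later in the CRLF version, so a test that hard-codes the LF offsets fails only on the platform that rewrites line endings.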

  • @microcolonel 1 year ago +6

    UTF-8 is rarely slower to process than UTF-16, and because UTF-16 only has the BMP in a single code unit, you can't rely on that for counting codepoints anyway; furthermore, rarely do you want to count codepoints, you generally want to count graphemes.

    • @tappy8741 10 months ago +4

      UTF16 generally sucks and was the bane of my existence for many years, thanks for nothing windows as usual.

    • @Karreth 7 months ago +1

      UTF-16 is actually just another hack to fix UCS-2, which is the fixed 16-bit Universal Coded Character Set. It was intended to contain all the codepoints until we discovered that 16 bits were actually too few bits to contain the set. It really is hacks and partial backwards compatibility all the way down. Windows extended their API to work with wide characters to support UCS-2 before UTF-16 or UTF-8 was a thing, and when UCS-2 died they were kinda screwed and couldn't update their design. So that's how we ended up here.
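
To illustrate the point about counting made at the top of this thread, here is a Python sketch that measures the same text three ways. The helper name `sizes` is hypothetical. Python's standard library has no grapheme segmentation, so the sketch stops at code points; counting graphemes would need a third-party library.

```python
import unicodedata


def sizes(text: str) -> dict[str, int]:
    """Measure a string in code points, UTF-8 bytes and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


composed = unicodedata.normalize("NFC", "e\u0301")    # é as one code point
decomposed = unicodedata.normalize("NFD", composed)   # e + combining acute accent
rocket = "\U0001F680"

for label, text in (("composed", composed), ("decomposed", decomposed), ("rocket", rocket)):
    print(label, sizes(text))
```

The two forms of é look identical on screen yet differ in every count, and the rocket is one code point but two UTF-16 units, so none of these numbers answers "how many characters does the user see?".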

  • @rustkitty 6 months ago +1

    53:42 According to Apple, Dylan was in Denmark. According to Microsoft, he supports Donkey Kong. Both very respectable!

  • @fedormalyshkin 1 year ago +4

    It's the funniest IT conference talk I've seen in years!

  • @qm3ster 11 months ago +8

    Nothing wrong with writing JavaScript in Ukrainian:
    1. It runs fine.
    2. In production build, the minifier will take it all out and replace it with single-character ascii names.
    3. Source maps will work fine.

  • @pepijnkrijnsen4 1 year ago +2

    36:09 I see this a lot in the large German company I work for, specifically this example of having to select a country from a dropdown list. The countries' English names are displayed, but ordered as if they're German names.

  • @SerrinTheElf 10 months ago +3

    That postal worker deserved a raise lol.

  • @gbeziuk 8 months ago +1

    I guess there's not much hope for doing a cameo in the next version of the presentation, but I'll try anyway.
    Using Cyrillic, or any other local writing system in JavaScript is probably a bad idea in any production code, for sure, and it's universally frowned upon for a reason. Universality, you know - if you write science in Medieval Europe, use Latin, don't be a dick.
    But, there's a "but"! Teaching programming to newbies with no STEM background whatsoever, who also don't happen to be fluent in English (you can imagine), I suddenly found allowing them to use the words of their native language as names in their source code very, very useful. Separation of concerns and cognitive load reduction, I guess. As a bonus, there's a clear distinction between library entities and the locally introduced ones, which is also a good thing for the newbies.
    In fact, the role of English in international software development is a huge topic with a ton of practical consequences. Some Chinese have already stopped giving shit on this "you must write everything in English" thing, and it's not gonna stop there.
    I LOVE FiraCode, BTW!

  • @chernyshovandrew 1 year ago

    Great talk! Thank you.

  • @jalexanderdatkins 9 months ago +2

    28:36 Æ is totally a letter in English. It's called the letter æsc, which sounds like "ash", because it represents the tree ash. And for completeness I should also mention the letter œthel, which sounds like Ethel, the personal name. They appear in obviously English words like encyclopædia, manœuvre and Cat7 UTP Æthernet cable.
    … Not to mention archæologist. I may have cheated a little bit with one of mine, but why doesn't that count?

    • @theelmonk 7 months ago +1

      Laughed at Cat7 UTP Æthernet cable. And realised it's perfectly correct.

    • @jalexanderdatkins 7 months ago

      It’s obviously an English word, right? And everyone knows that’s a valid spelling for it.
      The cheaty one is manœuvre, because that’s a French word. But I don’t get why he doesn’t count archæologist? Maybe it’s in the same way as because Latin only has the letter K in one word, it’s not considered part of the Roman alphabet. And to be fair, Æsh and Œthel don’t come up very often. Œstrogen is another one, but that’s basically a Latin word. I don’t know any non-borrowed words containing œ that are still in modern English. Unlike æther.

  • @Jayderzomb 8 months ago

    this was beautifully interesting, thanks!

  • @GuildOfCalamity 1 year ago +2

    Great presentation! I code systems that use control codes all the time for work; they are still widely used and accepted (receipt printers, barcode scanners, serial comms, etc).

    • @heinzk023 1 year ago +1

      When I was working with ASCII terminals, I liked to use BEL to sound the squeaky buzzer of the terminal.

  • @helmanfrow 1 year ago

    Thanks, this was awesome!

  • @akirachisaka9997 11 months ago +2

    I really wish Dylan talks about Han Unification.
    Like, it's just such a cursed aspect of Unicode. I really wish more people know about it.

  • @jkollin4875683F 1 year ago +6

    On alphabetical ordering in Finnish... back when I was in school in the 1990s, I was taught that V and W actually are considered equal in Finnish. So going through a list of Finnish surnames, Valli, Waris, Virtanen, Wirtanen (tiebreaker here, I suppose) would be in correct order. But having googled this a bit more, this is apparently nowadays (since 2000) somehow dependent on context -- mixed with foreign words and names such as Vanderbilt and Wolf, it's OK to sort them all V first, then W. So I don't know if even printed dictionaries use this sorting today.
    I don't think this peculiarity is even well-known, IIRC this surprised many of my Finnish coworkers.

    • @cameron7374
      @cameron7374 1 year ago

      So, do computers ever deal with this or do they just sort V first, then W?

    • @jkollin4875683F
      @jkollin4875683F 1 year ago

      ​@@cameron7374 Never noticed a system that would (probably in part because W is in Finnish only in names (outside of possibly loanwords), and even there it is very rare). But after a quick googling, apparently at least in 2006 PostgreSQL allowed for this at least in Swedish.
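For anyone wondering what "V and W are equal" might look like in code, here is a minimal sketch. It assumes the rule as described above: W is treated as V for the main comparison, and the original spelling is only used as a tiebreaker. The function name `finnish_vw_key` is made up for illustration, and this is not a full Finnish collation (it ignores å, ä, ö and everything else a real locale-aware library would handle).

```python
def finnish_vw_key(name):
    """Sort key that treats W as V, using the real spelling only to break ties.

    Hypothetical helper for illustration; not a complete Finnish collation.
    """
    folded = name.casefold()
    primary = folded.replace("w", "v")  # V and W compare as equal here
    return (primary, folded)            # the original spelling decides ties


names = ["Wirtanen", "Virtanen", "Waris", "Valli"]

# Plain code point order keeps every V name ahead of every W name.
print(sorted(names))

# With the V/W rule, Waris lands between Valli and Virtanen.
print(sorted(names, key=finnish_vw_key))
```

The second list comes out in the order given in the comment: Valli, Waris, Virtanen, Wirtanen.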

  • @jensGC
    @jensGC 10 months ago +3

    ua-cam.com/video/gd5uJ7Nlvvo/v-deo.html
    The Danish letters "æ" and "ø" are much older than the spelling reform in 1948. The only new letter that was introduced in that reform was "å". It is correct that the reform did make Danish orthography more distinct from German - but the main reason for this is that the reform removed the capitalization of nouns.

  • @emmafountain2059
    @emmafountain2059 9 months ago +1

    God, I have homework, but now I have an irresistible urge to research Unicode 'cause this was fascinating. It's amazing how clever some of their solutions are.

  • @fieryscorpion
    @fieryscorpion 1 year ago

    Wow, that was a pretty interesting and fun talk!

  • @pyropunk51
    @pyropunk51 7 months ago

    Good talk. I was a bit disappointed that you did not even touch on the whole EBCDIC vs ASCII situation.

  • @zoltanreisz2228
    @zoltanreisz2228 1 year ago

    Thank you very much (mange tak) :D

  • @RoamingAdhocrat
    @RoamingAdhocrat 1 year ago

    Didn't the Amstrad 6128 come with a 7-bit proprietary printer cable, or was that just the 464? There was a DIP setting on Amstrad printers to use one of the 0xxx xxxx characters as £

  • @davidpetersonharvey
    @davidpetersonharvey 9 months ago

    This is amazing!

  • @warwickleahyssw4163
    @warwickleahyssw4163 9 months ago

    Awesome video Calum

  • @imranhussain8700
    @imranhussain8700 1 year ago +1

    This guy is a true gem 💎.

  • @nneddenn6207
    @nneddenn6207 9 months ago +1

    Dylan, thanks for the talk! It was really interesting to hear all these historical details and to understand more about how Unicode works. And my gratitude for your support of Ukraine! Слава Україні!

  • @stevecarter8810
    @stevecarter8810 1 year ago

    Omg that was god level summarising at the end

  • @acobster
    @acobster 8 months ago

    I've read the SO post, but I never knew there was a name for Zalgo Text! Fantastic talk.

  • @bommel88
    @bommel88 1 year ago

    As somebody from Aachen, I appreciate the choice of examples :D

  • @pawelhepnar1608
    @pawelhepnar1608 1 year ago

    Absolutely brilliant, great speech.

  • @illegalcoding
    @illegalcoding 11 months ago

    This was incredible

  • @Fetrovsky
    @Fetrovsky 1 year ago +2

    I remember running echo ^G in DOS as a teen.

  • @NonTwinBrothers
    @NonTwinBrothers 7 months ago

    I forgot about the ending. I've always known this as the Kohuept talk :D

  • @kevinfleischer2049
    @kevinfleischer2049 10 months ago

    Great talk. I was wondering, what would hide behind that title, and I was not disappointed.

  • @nikneumann1752
    @nikneumann1752 7 months ago

    I thought it was boring, but surprise! I watched it to the end. 😁

  • @KangoV
    @KangoV 8 months ago

    Java now uses UTF-8 internally. They dropped UTF-16 when Java 8 came out. An hour on plain text? I would not have believed it until I watched it. Just awesome.

  • @junestorm
    @junestorm 9 months ago

    Brilliant lecture!! They didn't teach this in the 1980's when I studied computer science. ☝🙃

  • @maximvoloshin7602
    @maximvoloshin7602 1 year ago +6

    You should never underestimate things labeled “simple” or “plane” )) Thanks, Dylan! Appreciate so much everything you’re doing for the community.

    • @NeatNit
      @NeatNit 11 months ago +5

      I have never underestimated a plane. Be it a machine that can carry me to the sky, or an infinite flat set of points in 3D space, or a tool used to smooth wooden surfaces, they are always quite intimidating.

    • @maximvoloshin7602
      @maximvoloshin7602 9 months ago

      @@NeatNit 🤣🤣You got the point!

  • @manuelvicente9614
    @manuelvicente9614 1 year ago

    Really interesting thanks

  • @MeriaDuck
    @MeriaDuck 9 months ago +1

    That Russian postal service anecdote is just so wholesome.

  • @secondengineer9814
    @secondengineer9814 11 months ago

    It was interesting to see the origins of Dwarf Fortress UI!

  • @sportundwein
    @sportundwein 1 year ago +4

    Amazing content - mega cool Präsentation 🈶

    • @JeremyAndersonBoise
      @JeremyAndersonBoise 1 year ago +1

      I see what you did there.

    • @edgeeffect
      @edgeeffect 1 year ago

      @@JeremyAndersonBoise I was going to comment "I see what you did there".... but then I saw what YOU did THERE.... so couldn't.

  • @richardtwyning
    @richardtwyning 8 months ago

    Brilliant 👍

  • @dr.c2195
    @dr.c2195 7 months ago

    What is sequal server? Is it like SQL server or is it a completely different product?

  • @daniilboiko
    @daniilboiko 9 months ago

    The best one I watched last year!
    Special thanks for supporting Ukraine! Pike matchbox!!!

  • @jmkok
    @jmkok 9 months ago +1

    A fantastic talk about letters. However, you use a font with an incorrect letter "g" in "Mange tak!" (58:42). Is this an accident or an Easter egg?

    • @awelotta
      @awelotta 9 months ago

      Good eye! The g should be single-story, or double-story with the bottom "reversed". Interesting. Maybe it's supposed to be a single-story g with a very loopy tail? Especially since the a's are single-story and it's slanted, i.e. it's cursive.

  • @bluenuttefly8813
    @bluenuttefly8813 9 months ago

    They sang Odoia at the Billy Joel concert, which is a Georgian folk song!!! It is listed as Odoya at the beginning of the album shown... What the heck. I did not know about this. Cool!

  • @JoseJimeniz
    @JoseJimeniz 7 months ago

    @33:14 Maybe it's just because it's the default in the en-US culture, but every install of SQL Server I've ever done (and I pre-date SQL Server supporting collation) has defaulted to *case-insensitive* (and accent-sensitive).
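To make the "case-insensitive but accent-sensitive" distinction concrete, here is a rough sketch of the two kinds of comparison. This is not SQL Server's actual collation algorithm, just an approximation built on Unicode normalization and case folding; the function names `equal_ci_as` and `equal_ci_ai` are invented for the example.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after decomposing, so 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ci_as(a, b):
    """Case-insensitive, accent-sensitive comparison (approximation only)."""
    # NFC first, so precomposed and decomposed spellings compare the same.
    norm_a = unicodedata.normalize("NFC", a).casefold()
    norm_b = unicodedata.normalize("NFC", b).casefold()
    return norm_a == norm_b


def equal_ci_ai(a, b):
    """Case-insensitive, accent-insensitive comparison (approximation only)."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


print(equal_ci_as("Aarhus", "AARHUS"))   # case differs only
print(equal_ci_as("resume", "résumé"))   # accents differ
print(equal_ci_ai("resume", "RÉSUMÉ"))   # accents and case both ignored
```

Under these assumptions the first and third comparisons are equal and the second is not, which is roughly the behaviour a case-insensitive, accent-sensitive default would give you.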

  • @Proppeti
    @Proppeti 1 month ago

    Amazing, informative and pretty entertaining!! 😮😅

  • @wagyourtai1
    @wagyourtai1 10 months ago

    I love watching different versions of the same talk... :)

    • @theelmonk
      @theelmonk 7 months ago

      Is there another version where it carries on past the intriguing statement 'and this is where the version for youtube ends'?

  • @JamesSmith-ix5jd
    @JamesSmith-ix5jd 11 months ago

    Why would you need a bell sequence? At first I thought it was used in different ttys to signal that a job had finished, but it doesn't seem to work like that in modern Linux.
    Was it an actual bell on a typewriter terminal in, like, the '60s? I couldn't find any information regarding that.

    • @Hauketal
      @Hauketal 10 months ago

      Yes, a real bell was included in teletypes. Terminal emulations often support the sound too, but have an option "visual bell", where text and background are inverted for a moment.
      Doesn't work in dialog boxes at all.
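If you want to try the bell yourself, here is a small sketch. It only writes the BEL control character (code 7, the same thing Ctrl-G sends) to a stream; whether you hear a beep, see a visual flash, or get nothing at all depends entirely on the terminal and its settings. The helper name `ring_bell` is made up for the example.

```python
import sys

BEL = "\a"  # ASCII control code 7, also written ^G or \x07


def ring_bell(stream=sys.stdout):
    """Write one BEL character; the terminal decides what to do with it."""
    stream.write(BEL)
    stream.flush()


# Ctrl-G is 'G' with bit 0x40 cleared, which is why BEL is written ^G.
ctrl_g = chr(ord("G") ^ 0x40)

if __name__ == "__main__":
    print(f"BEL is code point {ord(BEL)} (0x{ord(BEL):02X})")
    print("Ctrl-G is the same character:", ctrl_g == BEL)
    ring_bell()
```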

  • @fnige
    @fnige 9 months ago

    Very VERY minor nitpick but at 53:10, the highlighted flag on the right should be the flag to the left instead of the one currently highlighted

  • @byteseq
    @byteseq 7 months ago

    Brilliant!

  • @yugoprowers
    @yugoprowers 11 months ago

    Pike Matchbox is going to be one of those things, like when someone said Parachuting Buffaloes for lead on the Periodic Table. I'll never forget it because it is such a weird thing.

  • @lazykbys
    @lazykbys 8 months ago

    Just to add a bit more pedantry, ASCII is not in alphabetical order, since lowercase a comes after uppercase Z. I didn't realize this until I started typing a post to complain about how Windows 10 (unlike Windows 7) sorts Japanese hiragana and katakana, then noticed something similar happened with the English alphabet. Odd how things don't seem strange when you're used to them. :)
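A quick way to see this for yourself: in ASCII every uppercase letter (65 to 90) has a lower code than every lowercase letter (97 to 122), so a plain code point sort puts "Zebra" ahead of "apple". The sketch below compares that with a case-folded sort; the word list is just sample data.

```python
words = ["apple", "Zebra", "banana", "Apple"]

# Uppercase Z (90) still sorts before lowercase a (97).
print(ord("Z"), ord("a"))

# Plain sort compares code points, so all capitalised words come first.
by_code_point = sorted(words)
print(by_code_point)

# Folding case first gives something closer to dictionary order.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)
```

Note that case folding only fixes the uppercase/lowercase split; real alphabetical ordering for a given language still needs a locale-aware collation.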

  • @AlastairMontgomery
    @AlastairMontgomery 1 year ago

    Great talk