How do Spell Checkers work? Levenshtein Edit Distance

Поділитися
Вставка
  • Опубліковано 8 чер 2024
  • Support What's a Creel? on Patreon: / whatsacreel
    Office merch store: whats-a-creel-3.creator-sprin...
    FaceBook: / whatsacreel
    This video is all about an interesting algorithm commonly used in spell-checkers. It is called the Levenshtein Edit Distance algorithm and it determines the distance between two strings of symbols in terms of how many edits would be required to transform one string into another.
    The source code and accompanying e-book are available on Patreon.
    Wikipedia Levenshtein: en.wikipedia.org/wiki/Levensh...
    RosettaCode (featuring Martin Ettl's C++ single row version): rosettacode.org/wiki/Levensht...
    Source for the dictionary used in this video: www.gwicks.net/dictionaries.htm
    Software used to make this vid:
    Visual Studio 2019 Community: www.visualstudio.com/downloads/
    Audacity: www.audacityteam.org/
    OBS: obsproject.com/
    Davinci Resolve 16: www.blackmagicdesign.com/prod...
    OpenOffice: www.openoffice.org/
    Gimp: www.gimp.org/

КОМЕНТАРІ • 144

  • @shards1627
    @shards1627 3 роки тому +128

    "this right here is... ummm... code" why yes, yes it is creel, good job

    • @msoulforged
      @msoulforged 3 роки тому +18

      Hmm, yes...the code here is made of code.

  • @zactron1997
    @zactron1997 3 роки тому +140

    What's really interesting about the full-matrix version of this algorithm is when you compute the values you have the option to make a much smarter system for little extra work. Instead of weighting each edit operation equally, you can weight them based on the probability of the user having mistakenly performed the inverse operation. It's easy to type a "w" instead of and "e" on a QWERTY keyboard for example, but much harder to fumble an "m"
    Additionally, you could assign weights to the words themselves based on the probability the user would've typed the word at all. A user is far more like to type "absolutely" than "aardvark", for example.
    A very cool algorithm with lots of room for extension!

    • @WhatsACreel
      @WhatsACreel  3 роки тому +31

      Brilliant insight! I think the original Lev used different weights too. It cost 2 to perform the sub operation, if I'm not mistaken. Your suggestion is very flexible!! Cheers for watching, and cheers for sharing :)

    • @nate_d376
      @nate_d376 3 роки тому +6

      I was thinking about this very thing before I saw your comment. It breaks down a little if a user is guessing at the spelling though, instead of just a typo.
      Maybe flipping those weights if the weights get too large for common words, like if someone was guessing at spelling an uncommon word, thus it would do the comparison on them instead. Just a thought.

    • @mikec518
      @mikec518 3 роки тому +6

      Yeah, cool modification idea!
      One thing to note with variations on the original unit distances idea, is that Levenshtein distance will no longer behave as a metric. In other words, the computed distance will no longer satisfy the triangle inequality.

    • @gummansgubbe6225
      @gummansgubbe6225 3 роки тому +3

      People looking at protein sequences are doing that. One is called "Grantham's distance", but there are others.

    • @maxfinnian
      @maxfinnian 2 роки тому +3

      Another very useful modification of this is to weight each character by their visual similarity in some font, that way you can easily account for mistakes in a machine learning OCR model. Surprisingly though there is little resources online for creating this visual edit distance, but it can be fairly simply implemented by printing each character of the font to a small image and calculating the SSIM metric between each character and normalizing them into a lookup matrix. This matrix can then be used in a modified levenshtein function for weighting each swap distance (I found that using the quartic of this matrix is the most useful). Because of how it is set up it is effectively a normalized pixel swap distance between the current word and the target word. This makes working with OCR mistakes such as I->l, I->T, 1->I, etc. much less of a pain.

  • @TheTheHellbean
    @TheTheHellbean 3 роки тому +81

    For a spellchecker, since you are checking the same word against your entire dictionary, you can also optimize by reusing the first bits of the matrix up until the nth row, where n is the length of the common substring between the word you checked before this and the one you are checking now. No need to recompute the entire matrix after checking "dinosaur" if you are now checking "dinosaurs"

    • @WhatsACreel
      @WhatsACreel  3 роки тому +17

      Wow! That would certainly be much faster! Cheers for sharing this idea and cheers for watching :)

    • @mikec518
      @mikec518 3 роки тому +3

      This idea lends itself well to the use of a trie or finite state transducer.
      Andrew Gallant has a really cool write-up on this idea, regarding his transducer library: blog.burntsushi.net/transducers/

  • @Stelios.Posantzis
    @Stelios.Posantzis Рік тому +2

    Best quick explanation of the Ficher-Wagner method that is sufficiently detailed to convince one that it works and demonstrates how it works in less than ten minutes.

  • @bird536
    @bird536 3 роки тому +25

    I love this algorithm, the first time I used it was for comparing strands of DNA, which took a lot of memory without optimizations. We had to compare two strings that were 10000000+ characters. You do the math for the space required.

    • @WhatsACreel
      @WhatsACreel  3 роки тому +7

      Wow! Pure class mate! Well done ya :)

    • @idjles
      @idjles 3 роки тому +15

      @@WhatsACreel You can massively speed up Lenvenshtein for very large strings with this recursive approach that I invented (I use this for comparing texts between two versions of a book). It is 1000 times faster for 1,000,000 elements and uses 1,000,000 times less memory. And it spits out the subs/insert/del for each DNA step.
      1. Break the DNA strand into groups of 5 or 10 or whatever.
      2. Build a hash of these groups and remove from the hash any duplicates. O(n)
      3. Do the same with the other strand. O(n)
      4. Compare the two hashes and remove elements that are not in the other. O(n)
      5. You now have a hash of elements that are unique in each strand and are also in the other strand.
      6. Now find the longest stretches in both that are indentical. O(n)
      7. Split the strand into the left substrand, the middle identical substrand and the right substrand. We are going to recurse the left and right sub-strands.
      8. If the longest matching section of a substrand is less than some value X then perform Levensthein on that sequence. O(n^2) on short sections. otherwise recurse from step 6.
      9. Calculate that best backtracking path by prefering the step the gets you closer to the origin each step) through the matrix and store into the Output. O(n)
      So with basic Levensthein, you have 1,000,000 characters and 10^12 comparisons in a matrix of 10^12
      With my method you have 100,000 sections, and you can be sure that the sections that don't match are under 1000. so the comparisions are 1000^2*1000 = 10^9, and the matrix is only about 10^6. This is 1000 times faster than basic Levensthein, and uses a million times less memory.
      You MUST keep the entire Matrix

    • @dstgre
      @dstgre 3 місяці тому

      I am realizing that this operation can be also done piecemeal. And as someone else pointed out, the similar comparisons can be memoized out

  • @ir2001
    @ir2001 3 роки тому +16

    Aside, this video could also be a very good introduction for those who are looking for the whys and hows of Dynamic Programming

  • @nayjames123
    @nayjames123 3 роки тому +14

    There's also another optimization available for things like spell checks. If you have a maximum allowed edit distance, you can break out once you know the edit distance will be greater than that. A trivial short circuit is if the length of 2 strings is greater than that, then do no work, but you can work out the smallest value in the row and take that as the minimum possible edit distance.

  • @AnyVideo999
    @AnyVideo999 2 роки тому +2

    For anyone looking for the general lesson to apply to other algorithms, this is a classic case where the first version is easy to understand, but as it compares the execution to the problem on several small sub-executions, usually the accumulative recursion version works better.

  • @waaaaaaah5135
    @waaaaaaah5135 3 роки тому +8

    Levenshtein edit distance is how I got my current job :)

    • @mikec518
      @mikec518 3 роки тому

      Wow really? Did they ask it during your interview or something?

    • @waaaaaaah5135
      @waaaaaaah5135 3 роки тому +1

      @@mikec518 Nope, I was given a project to do, and the levenshtein edit distance was an important part of it

    • @mikec518
      @mikec518 3 роки тому +1

      Sound like a lot of fun, hope that project goes well

  • @DrakiniteOfficial
    @DrakiniteOfficial 3 роки тому +2

    Wow. I'm astonished that all that calculation can be done in 50 milliseconds. Modern processors are amazing.

  • @williamrutherford553
    @williamrutherford553 2 роки тому +1

    Looking at the recursive definition actually gives you insight into the three boxes you look at! Instead of doing a recursive call, it's just the value to the right, up, or diagonal. When the characters are equal, that's just the call where you return the distance of tail(a) and tail(b). It's not really important to knowing the algorithm, but it's a great way of understanding WHY it works.
    I think this is also a great example of why recursive algorithms can be so inefficient, and why iterative is almost always better. The number of repeated operations is massive. Generating a Fibonacci sequence is another great example, a recursive function goes from the top down, an iterative solution goes from the bottom up. Just like how you can do Levenshtein distance in two or one rows, you can calculate Fibonacci numbers using just the previous two numbers.

  • @RickeyBowers
    @RickeyBowers 3 роки тому +14

    Nice presentation.
    The compiler doesn't know that the edit distance will never be greater than the maximum string length (probably < 256). This fact increases the parallelization possible.
    Tarjan's SCC algorithm also reduces the problem to a single array - could make an interesting video as well. (c:

    • @mikec518
      @mikec518 3 роки тому +3

      The maximum possible edit distance for two sequences is the length of the longer. Maybe an assertion that both lengths are under 256 could help the compiler in that way?
      And that’s a really interesting fact about Tarjan’s SCC algorithm! I had no idea it was related

    • @meneldal
      @meneldal 3 роки тому +1

      The actual maximum edit distance doesn't really matter for a spell checker.
      You ought to abort when you go over 5, don't bother computing more than that.

  • @dstgre
    @dstgre 3 місяці тому

    I like this explanation much better as a lot of the others are just passing on rote information with missing component of real understanding

  • @zperk13
    @zperk13 3 роки тому +3

    I wrote the first function (levenshtein) in code and also added a thing to track stuff. I did the "cabbages" and "rabbit" like you did. The deepest depth of the recursion calls was 14 and the number of times the function was called was 7,749

  • @djnn22
    @djnn22 3 роки тому +5

    That was really interesting! Thank you for your enthusiasm on such a niche subject!

  • @black_platypus
    @black_platypus 10 місяців тому

    This is so satisfying! 🤩
    Abstracting away everything until we just need to check for equality of a character, this is giving me quite the software hard-on 😁

  • @uKhyta
    @uKhyta Рік тому

    The best explanation for the Levenshtein Edit Distance on the Interwebs. Kudos to you for explaining it so good!

  • @JohnnyWednesday
    @JohnnyWednesday Рік тому

    You're a fantastic teacher - master of analogies - thanks for sharing your hard earned knowledge :)

  • @msoulforged
    @msoulforged 3 роки тому +1

    This was very eye-opening and fun to watch, thanks! And we cannot wait for the priority queue and maybe few other important/common data structures video :)

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 роки тому

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @jeroenodb
    @jeroenodb 3 роки тому +10

    He Creel, loved this video! I think the time it takes to evaluate the original levenshtein recurrence is actually omega(3^n) when you have two words of n size, so quite a bit slower than n cubed.

    • @WhatsACreel
      @WhatsACreel  3 роки тому +5

      Ouch!! I knew it was slow, but that's really something! Cheers for sharing and watching :)

    • @gideonmaxmerling204
      @gideonmaxmerling204 3 роки тому +1

      well, you would usually write 2^n or e^n instead of 3^n since 3^n=e^{n ln(3)}

    • @jeroenodb
      @jeroenodb 3 роки тому +4

      @@gideonmaxmerling204 Actually with logarithms this works, but with Theta(a^n) and Theta( b^n) it does not work. These are really different complexities. You can't rewrite 3^n as C*2^n. 3^n will eventually outgrow c*2^n whichever constant c you choose. Also my original bound is not totally correct, but say you have the strings aa...aa and bb...bb. This is an example where the running time is omega(3^n)

    • @user-qq6si7zv3t
      @user-qq6si7zv3t 2 роки тому

      @@jeroenodb I'm not 100% certain, but as someone who has taken a course on algorithm design and analysis, I think it would be valid to simply write Theta(exp(n)) or Theta(e^n). The parameter of the exponential function is not important. All that matters is the pattern of growth. Another example, 3n is bigger than 2n, but we still just write Theta(n).

    • @klobiforpresident2254
      @klobiforpresident2254 2 роки тому

      @@user-qq6si7zv3t
      I'll just for the moment drop Theta and use O (since Theta follows from O and Omega that should be fine).
      The was this works is that if we have a function f(n) describing the class in O, commonly n, n^2, log(n), and so forth, and a function g(n) describing the algorithm's running time then the algorithm is in O(f(n)) if, and only if, we can find a constant C and an n\_{0} such that C f(n) ≥ g(n) for all n > n\_{0}.
      This means g(n) = 2n and g(n) = 3n will be in O(n) if we set C = 3 (here n\_{0} 1). Indeed they're also in O(n^2) since for C = 3 and n\_{0} = 1 the same logic works.
      For logarithms we can change the Bede simply. logb(x) = loga(x) / loga(b). This means we can set C = 1 / loga(b).
      I am unaware of such a system for exponents. A previous post here talks about a way of introducing a constant factor into the exponent. That is indeed possible but that unfortunately does not suffice.

  • @PaulaBean
    @PaulaBean 10 місяців тому

    Very informative, like many of your videos!

  • @tahacaliskanileenbasitseki4356
    @tahacaliskanileenbasitseki4356 26 днів тому

    thank you so much from Turkey

  • @jackkensik7002
    @jackkensik7002 3 роки тому +1

    great video, this is the best teaching of cs concepts I have seen keep it up!

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 роки тому

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @darske1
    @darske1 3 роки тому

    Man, your videos are awesome, thanks and keep it up!

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 роки тому

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @devtthompson3646
    @devtthompson3646 2 роки тому

    Well presented with a great explanation and detail. Nice one. :)

  • @ChrisHalden007
    @ChrisHalden007 10 місяців тому

    Great video. Thanks

  • @lcarliner
    @lcarliner 2 роки тому

    There is one non-word that would trip up so many early technology spelling checkers was “irregardless!”

  • @MrUtak
    @MrUtak 3 роки тому

    Man, I recently had to find a code of levenshtein Distance for excel because the quiz-level of my class didn't check for plagiarism. So I had to run it myself, on huge strings. And it works!!!

    • @MrUtak
      @MrUtak 3 роки тому

      Checking the levenshtein distance between a whole document takes a lot of brain power for excel, but it does work remarkably fast. all things considered. It must use one of the later two algorithms.

  • @coder2k
    @coder2k 3 роки тому

    Very interesting topic, great explanation as always! Thanks! :)

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 роки тому

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @Dave_thenerd
    @Dave_thenerd 2 роки тому +1

    This seems like it could be easily parallelized with Multi-Threading and AVX.

  • @jupjyotsinghkhosla5681
    @jupjyotsinghkhosla5681 2 роки тому

    Amazing Video!!!

  • @mikec518
    @mikec518 3 роки тому +2

    There are a few additional optimizations you can make to the Wagner-Fischer algorithm too!
    For instance, if you don’t need to keep track of the edit operations, you only need one row in memory at a time.
    Also, there is a threshold variation, requiring only distance subcomputations that are within the threshold away from the main diagonal.
    This is one of my favorite things to research in my spare time, thanks for making a video on it! :)

  • @CheezePie
    @CheezePie 2 місяці тому

    Great video.. you earned a subscriber!

  • @mapclickerandy
    @mapclickerandy 3 роки тому

    Brilliant video, keep it up !

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 роки тому

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @joaquinglez6889
    @joaquinglez6889 3 роки тому +1

    Wow! You alway inspire me! Looks like a great algorithm to create some sort of super-grep version for linux, where you type a word.and you might add as a parameter how close you want the word of the returned line be similar to your original word. Now I wonder why is it not even there

  • @phillies4eva
    @phillies4eva 2 роки тому +1

    Great video! Can you explain why Google nearly 100% of the time gives me the correct spelling when the default spell check fails?

  • @akashparhi7004
    @akashparhi7004 3 роки тому

    Thank you

  • @PeriOfTheGee
    @PeriOfTheGee 2 роки тому

    What I believe could speed up the full matrix version and the 1 row too to a comparable speed, minus cache locality, is if we reused the same matrix instance rather than allocating a new one for each call.

  • @briandhackney
    @briandhackney 2 роки тому

    Great content, love the channel, wish you did one a week ;)

  • @nightpicnicwii
    @nightpicnicwii 3 роки тому

    Very cool!

  • @ricos1497
    @ricos1497 3 роки тому

    Well, that was a good viewing. Thanks! Have you ever done a video on your background? Covering things like subjects you studied etc, the type of thing that has led you to do in your work and so forth. That would make for an interesting video.

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      I studied music and software engineering. My main area of interest is discrete maths, boolean algebra and sets, that kind of thing. Cheers for watching :)

    • @ricos1497
      @ricos1497 3 роки тому

      @@WhatsACreel thanks for the reply! Interesting. I've always been too indiscreet with my maths. People are always saying that about me. Put your maths away Rico, you're making a fool of yourself, they say. Now I know where I went wrong.

  • @quantumastrologer5599
    @quantumastrologer5599 Рік тому

    Thx!

  • @klobiforpresident2254
    @klobiforpresident2254 2 роки тому

    At 00:56 I, who is way too tired and forgot the video title, audibly wondered, "wouldn't they just use a weighted edit distance?"
    Yeah. Good night, folks.

  • @MagicGonads
    @MagicGonads 2 роки тому

    The checkerboard pattern made me think that "Spell Checkers" was a variant on the game of Checkers

  • @NicosLeben
    @NicosLeben 3 роки тому +3

    Can you modify it in a way that some substitutions have a higher score than others? For example if you have a typo you mostly hit a character in the neighbourhood of the character you were targeting for. It is not very likely to substitute a Q by a O, it is more like you tried to hit W, A or S. Of course this depends on the keyboard layout you are using.

    • @klobiforpresident2254
      @klobiforpresident2254 2 роки тому +1

      Yes, this is very much possible to do. Modern spellcheckers often do this. In theory it's as simple as giving adjacent keys (for example {r, t, d, g c} for f on a standard QWERTY or QWERTZ layout) a lower weight of e.g. half. More complex approaches exist but this should illustrate the point.

  • @willofirony
    @willofirony 3 роки тому +2

    Awesome! So, its fast enough for autocomplete on an edit box; especially as the dictionary count will be in the 100's rather than 10,000's. I have GOT to play with that. Thanks, Mate.

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      I reckon it would be excellent for that application!

  • @nate_d376
    @nate_d376 3 роки тому

    Very cool. Thank you for explaining this. I guess us engineers find this stuff interesting.

  • @-notakil
    @-notakil 3 роки тому

    Very interesting!

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 роки тому

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @nytrex2001
    @nytrex2001 2 роки тому

    So is this the algorithm they use for code-completion in IDE's when they list possible functions/commands you may be intending to type?

  • @incription
    @incription 11 місяців тому

    I always thought they calculated this based on the distance to the key on the keyboard, I guess this is the case more for autocorrect

  • @damian_smith
    @damian_smith 3 роки тому

    I enjoyed that - thank you. The algorithm assumes that dropping, adding, or mistyping to an arbitrary letter are all equally likely, but I suspect that's not a valid assumption and the actual error rates for the 3 classes of error, and for which letter gets swapped are different. Do other distance metrics for offering choices work by weighting those errors differently, or even via a different approach? I'm very impressed by the swiping keyboards on mobile phones, for example - I imagine they might be doing something like this?

    • @mikec518
      @mikec518 3 роки тому +3

      Yeah, you could probably adjust the value added on mismatch based on anything you wanted, like distance on a keyboard.
      Though, keep in mind that this will probably affect the triangle-inequality conforming property of the standard Levenshtein distance

  • @karoshi2
    @karoshi2 2 роки тому +1

    Did a full text indexing project once and actually had to implement a trie by hand. That's kind of a tree which you fill with all the words known from your dictionary: take the root node, link it to another one "consuming" the first letter, continue with the remaining string from then on and mark the final node for this word so you know it is contained in your dictionary.
    Once built up its rather simple to traverse: take the word you search for, see if there is a node following the first letter. If there is one, edit distance remains unchanged. Then follow all the other possible routes with increased distance.
    When you have a maximum edit distance it terminates quite fast: if the maximum distance is used up stop the process and terminate no result. Otherwise return all the results found so far.
    Recursive but quite easy to understand and implement, also somewhat object oriented.
    Was fun to implement and must admit I was quite proud as I came up with the algorithm and only then realised that it already had been invented the very same way but there was no implementation in the language we used at that time.
    Vanilla implementation that came with the search engine library did a brute force Levenshtein on every known word, then threw away those that were too distant. Took approx 30 seconds for every run. Trie took a few ms only.

  • @MegaMJoker
    @MegaMJoker 3 роки тому +1

    Ah, the magic of dynamic programming :)

  • @bart2019
    @bart2019 3 роки тому +2

    I was wondering, the scan of the dictionary itself is brute force, one word at a time.
    I assume there must be some way to scan the dictionary smarter, by subdividing it into groups of words that are rather similar, so you can eliminate a whole lot of them as not interesting simply because you found one word that is quite similar...?
    Well, I would hope so, though I would not immediately know how.

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p 2 роки тому

      I think alphabetical order would actually be a decent start, since it order words by initial substring, which you can then reuse, so effectively you're only calculating one or a couple new letter combinations per word (supposing a "dense" dictionary). Depending on the relative size of the words this also lets you know that the distance can only grow if the non-fixed word grows. But maybe there's some sort of pre-treatment that could help extend that?

  • @coomservative
    @coomservative Рік тому

    Can you do a video on how grammar checkers work? So much content about parsing/trees and I just want someone to walk through the problem first which is How Do You Make a Word Processor Detect Incorrect Grammar? The stuff out there just assumes familiarity with the concepts

  • @BlokeBritish
    @BlokeBritish 2 роки тому

    im building a google sheets app for my users to do a search.
    i have a database of around 2 million terms or sentences.
    the user types something and all closely matching terms from the list must be displayed.
    will this be the best technique to use ?
    i just tested this and it seems really quick.
    my only concern is that it doesnt account for swapped letters, probably mistyped.
    for eg. the edit distance between
    'the' and 'teh'
    returns '2'
    now the words are 2 letters off but really its just a mistyped swap.
    i was thinking of comparing the users entry by additionlly swapping their search words too.
    so if they searched for 'debit'
    then i will additionally search for
    edbit
    dbeit
    deibt
    debti
    so even if the mistakenly swapped 2 adjacent letters i will catch that first.
    if not caught ,
    only then compare the edit distance.
    in my case the search could be a set of words.
    'debit card declined'
    so the results must include all those which have at least one word from the searched term.
    so something like google search.
    i think this levenstein techinique will be best suitable for me with little tweaks

  • @redjr242
    @redjr242 3 роки тому +2

    This algorithm looks to be symmetric, in that d(a,b)=d(b,a). Shouldn't that mean that you only have to compute half the matrix?

    • @WhatsACreel
      @WhatsACreel  3 роки тому +2

      What a fantastic observation! It seems we could indeed get away with computing half the matrix! Nice one :)

  • @msclrhd
    @msclrhd 2 роки тому

    Given that you are comparing multiple words, I wonder if it is possible to store them in something like a trie data structure and reuse computed Levenshtein distances for the same prefixes.

  • @devinschlegel1763
    @devinschlegel1763 2 роки тому

    i know i am late, but if you have a massive dictionary, could you not spend time precomputing up to a certain length to save time?

  • @mohammedalsayed3508
    @mohammedalsayed3508 11 місяців тому

    I wish you were my professor

  • @tekhiun
    @tekhiun 3 роки тому

    Hey Creel, have you ever tried using cheat engine ?? I am actually using it to learn asm and it would be interesting to see at least a video where you used a decompiler or CE it self to look at the ASM code of some software or game and hear what you have to say about it such as is it optimal code or not. Turns out hacking games can be as much fun as playing the game and I think you could make some quite interesting videos on the subject.

  • @au7weeng534
    @au7weeng534 2 роки тому

    the second one looks like a memoized version of the first one

  • @familytp8390
    @familytp8390 2 роки тому

    I'm having problems running this code idk why

  • @theexplosionist2019
    @theexplosionist2019 3 роки тому

    62ms for "2 rows" for "aaaaaaaaaa" using std::string to store the dictionary.
    Using the "1 row" version and fixed size array chars instead of std::string still take 62ms.

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 роки тому

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @treyquattro
    @treyquattro 3 роки тому

    from now on I'm calling dinosaurs "noshers"! "Did you see that T-Rex? It was the king of the noshers." Good on yer, mate!

  • @aroncanapa5796
    @aroncanapa5796 2 роки тому

    👍

  • @Jkauppa
    @Jkauppa 2 роки тому

    dont need, just construct the possible trees to get the distance

  • @joinedupjon
    @joinedupjon 3 роки тому

    Some problems seem to lend themselves to a recursive solution very naturally. But that's not necessarily efficient. I don't think you've shown how to optimise recursion or convert a recursive algorithm to iteration, or if there are ways of writing high level language that make it more likely the compiler will optimise some grievous recursion away.

    • @joinedupjon
      @joinedupjon 3 роки тому

      Above is meant to be a suggestion for possible future episodes. Not trying to have a dig.
      I'm enjoying the series.

  • @IronCandyNotes
    @IronCandyNotes 3 роки тому

    neat

  • @IronCandyNotes
    @IronCandyNotes 3 роки тому +1

    ...so does the same work for images?

    • @mikec518
      @mikec518 3 роки тому

      It would take a very long time, but yeah!
      This works on any two sequences, where each elements in one is comparable with each element of the other.
      For example, you could represent both images as sequences of bytes. For two 500px x 500px images, each sequence would be 250k bytes long, and the table would have 62.5 billion cells lol

    • @IronCandyNotes
      @IronCandyNotes 3 роки тому

      @@mikec518 maybe using GPU or multiple cores? no idea just guessing here...

  • @bpansky
    @bpansky 3 роки тому

    spell checkers always tell me (even right now in this comment) that my writing of "dissapears" should be...dispersal...

  • @yash1152
    @yash1152 8 місяців тому

    10:53 will watch later
    it looks tooo similar to the LCS problem's solving algorithm

  • @burnsy96
    @burnsy96 3 місяці тому +1

    _colour_

  • @lx2222x
    @lx2222x 3 роки тому

    HLSL tutorial when? 😁

  • @lunalma
    @lunalma Рік тому

    I thought this was a video about magic on 'checkers' (the boardgame)...

  • @DFPercush
    @DFPercush 3 роки тому +1

    This could make a pretty good binary diff algorithm, could it not? Like when games update and need to patch huge files, this would be the way to do it, rather than downloading the entire gigabyte+ data.

    • @SolarShado
      @SolarShado 3 роки тому

      Distributing software updates as diffs can get tricky , unless you're certain that everyone will only ever be one version behind. You have to keep old diffs around, for example v1->v2, _and_ v2->v3, and so on, which can get expensive (disk space is cheap, but not free). But then what if someone's updating from v1 to v3? Should you also have a diff for that? It's more storage on the server end, but, for larger version gaps, possibly less bandwidth (which is both cheaper _and_ faster).
      And you'll probably want to compress... something... to save bandwidth and/or disk space. You can compress the diffs easily enough (though they may or may not compress well), but do you diff compressed or unpacked versions of the files? A small change in an unpacked file could potentially result in a (much) larger change between the compressed versions. And compression isn't just used "on the wire"; often times part of the reason a program (especially games) take as long as they do to open is that they're decompressing some assets from disk into ram.
      I guess what I'm getting at is that it's a more complicated problem than it may, at first, appear. Don't get me wrong, binary diffs are absolutely used in such systems, but they're far from a magic bullet.

    • @meneldal
      @meneldal 3 роки тому +1

      Binary diffs are more or less easy depending on what kind of files you have. Compression is where it tends to perform poorly. With a zip archive, as each file is compressed entirely independently, you can easily make a diff to fix just one file. But if the compression not per file but for the whole archive, unless you force your software to reuse the same dictionary, everything will be changed, so diffs aren't practical at all.
      For video assets, if you want to edit just a couple frames there are ways to avoid reencoding everything, but most people wouldn't bother so pretty much the whole video will have to be redownloaded.
      In the end, it depends a lot on the filesystem structure you use and how you compress your data.

    • @DFPercush
      @DFPercush 3 роки тому

      @@meneldal Of course. Zip the diff, don't diff the zip. :p That's what git does.

    • @meneldal
      @meneldal 3 роки тому

      @@DFPercush That's what makes sense for code, that doesn't mean it works best with all kinds of assets. You're probably not shipping code to consumers, it's compiled binaries.

    • @DFPercush
      @DFPercush 3 роки тому

      True, but suppose you want to update one texture in a game assets package, or just a couple of polygons in a mesh. The rest of the (uncompressed) data would remain unchanged, so a good binary diff would take care of that. If you want an example see the Blizzard/Battle.net updater.

  • @cmilkau
    @cmilkau 3 роки тому

    So an algorithm just applying dynamic programming to a pre-existing recursive definition has a name? Wow, I could get so many algorithms carry my name right now.
    EDIT: spelling

  • @alex_smallet
    @alex_smallet 3 роки тому

    Beethoven notes on the table adds credibility.

  • @KelvinNishikawa
    @KelvinNishikawa 3 роки тому +1

    Could you A* your way from top-left to bottom-right in the matrix?

    • @mikec518
      @mikec518 3 роки тому +1

      An interesting idea, but you don’t have the table values upfront.
      For each cell, you can keep track of which previous cell you used for the calculation, and once you reach the bottom right, simply follow the path backwards to get the edits necessary. No A* necessary :)

  • @cmilkau
    @cmilkau 3 роки тому

    What you actually want to do is build an index for the dictionary that plays nicely with the edit distance, don't you? Do you really scan the entire dictionary computing edit distances?

  • @DukePaprikar
    @DukePaprikar 3 роки тому

    11:40 Isn't it three edits?
    1) delete C
    2) delete the second B
    3) insert R
    How could you do it in two edits?
    Edit (pun intended): I forgot that one of the allowed operations is substitution. Btw, that's cheating.

  • @brianb2308
    @brianb2308 3 роки тому

    That was awesome
    "C, a, b, r, a, b, a, b, a, b, b, b" -creel
    When you dropping your next record?

  • @gideonmaxmerling204
    @gideonmaxmerling204 3 роки тому

    Hey creel, I was going through x86-64 instructions the other day for fun and saw this AVX512 instruction:
    www.felixcloutier.com/x86/vpternlogd:vpternlogq
    this instruction lets you perform custom 3-bit ternary logic on 3 zmmwords/ymmwords/xmmwords.
    do you have any idea for any good use case for this?
    I myself though that maybe a logic circuit simulator would be a ble to compile a circuit into a set of vpternlog operations and get some fucking intense performance but that would be very difficult to do and requires compiling code at runtime.
    do you have a better idea?

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      I reckon logic circuit simulations would be an excellent use!!
      I did a bunch of automaton simulations a while back. They worked with boolean/nand and an evolution algorithm. They generated code to divide and factor integers, and then killed each other or reproduced based on how accurate they were. Maybe this instruction would have helped that type of thing? It was very good fun, but my little automatons were very slow to train.
      It would be cool if there was a matching scattered load, or the imm8 could be an 8 bit reg.
      Great instruction :)

    • @gideonmaxmerling204
      @gideonmaxmerling204 3 роки тому

      @@WhatsACreel you could also modify the code at runtime.
      I thought that this would be useful for compiling circuits and running them intensely fast but that would not be easy and require runtime compilation.

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      @@gideonmaxmerling204 That's an excellent idea mate!! It would absolutely fly :)

    • @gideonmaxmerling204
      @gideonmaxmerling204 3 роки тому

      @@WhatsACreel the only other use for it that I saw online is for calculating md5 and sha1 hashes but why would you do that?
      might also be able to do sha2 but I'm not sure about that.

  • @casperes0912
    @casperes0912 3 роки тому

    Fancy name for just using dynamic programming on the original algorithm

  • @coshvjicujmlqef6047
    @coshvjicujmlqef6047 3 роки тому +1

    soviet union was the best country in history.

  • @diconicabastion5790
    @diconicabastion5790 2 роки тому

    Interesting kind of poor system though. A Phonetic comparison would be better.

  • @lemdixon01
    @lemdixon01 3 роки тому +1

    Annoyingly.