Actually I would be interested to hear you explaining _every_ non-trivial line of your solution in this much detail.
And it would most likely be entertaining if he had to explain in that level of detail the trivial ones too. Would be nice to see him pad the time.
I'm surprised to find I wasn't subscribed to the main channel, but one consequence is that I watched the second video first. 😀
Never expected to be the one to make the likes the funny number
WAIT! Bec was in the room?! I'm stunned.
False. Clearly it was advanced photography and 3D technology. Can't fool MY eyes!
I'm glad to see someone names their variables in the same way I do
My instinctive approach would be to use sets: start with one word, try to find a second one compatible, then when trying to find the third word, only continue from the point in the word list we had already reached, because we know the earlier words in the list were already deemed incompatible (sharing letters), etc. - never go back to the beginning of the list of words. So the list would be scanned only once for each first word.
On second thought, that would not give all the words: suppose words A, D, G, L and P form a set; the algorithm would miss a set made of A, F, K, M and Q, or A, D, G, N and R...
I had the same approach: build a set for each word containing all the words that are still 'valid' with it. Then pick a seed word and recursively restrict the sets based on what's available. Doing it this way automatically sets out the scope for each search, and you can apply the forward-only search within that scope.
Even better you can sort the search words at the outset by how many words would still be valid for each. Doing the forward-only search this way means the biggest set operations are left for the end, but by then you've already eliminated as many words as possible so the operations are still kept small.
I combined that sort with the forward-only search from the graph method, and got the time down to 450s from 900s.
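As an illustrative sketch only (not anyone's actual code from the video - the function and variable names are hypothetical), the recursive set-restriction with a forward-only scan described above might look like this in Python:

```python
def find_cliques(words, size=5):
    """Find all groups of `size` words sharing no letters, scanning forward only."""
    letters = [set(w) for w in words]
    results = []

    def extend(chain, candidates):
        if len(chain) == size:
            results.append([words[i] for i in chain])
            return
        for idx, j in enumerate(candidates):
            # Forward-only: consider only later candidates, restricted to
            # those still compatible with word j (the recursive set restriction).
            rest = [k for k in candidates[idx + 1:] if not (letters[j] & letters[k])]
            # Prune: stop if there can't be enough words left to finish a group.
            if len(chain) + 1 + len(rest) >= size:
                extend(chain + [j], rest)

    extend([], list(range(len(words))))
    return results

# Toy example with 2-letter "words" and groups of three:
print(find_cliques(["ab", "cd", "ef"], size=3))  # [['ab', 'cd', 'ef']]
```

Because each recursive call only looks at candidates after the current index, no pair is ever tested twice, which is the forward-only property the comment describes.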
You'd have thought the naive brute-force approach would be (after throwing out all the words with duplicate letters etc.) to sort the letters in each word into alphabetical order, e.g. GRAPE = AEGPR. At that point, the fact that anagrams all become the same word and should be discarded as duplicates falls straight out of the new representation.
From there, a trie is a good data structure for finding words in your set that do or don't have particular letters. Obviously for the first word you would just go through each word in turn. Say your first word was 'ABHOR', which happens to already be in alphabetical order. You then search your trie starting at C (because A and B are used), and you never descend into branches containing used letters. Note that because the letters are sorted, any word containing an A anywhere would start with A, so A is most definitely the first letter of the first sorted word of any solution. Similarly B, unless it's in the first word (in which case it would be the second letter), will be the first letter of the second sorted word... and so on. The point is, you don't traverse letters you've already used, so you won't need to create sets of 20 letters to compare, and you should bail out faster once you start to run out of letters. I'm not convinced that would speed it up sufficiently, though. I think you'd beat 31 days, but the graph theory approach is likely the best, because you're putting the information you want to find (words that don't share letters) directly into the data structure and then presumably searching that.
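The sorted-letters trick from the comment above can be shown in a few lines of Python (a minimal illustration; the word list is made up): anagrams normalise to the same key, so deduplication falls straight out of a dict.

```python
words = ["grape", "pager", "abhor", "gaper"]

# Sorting a word's letters gives a canonical key shared by all its anagrams.
unique = {}
for w in words:
    unique.setdefault("".join(sorted(w)), w)  # keep first word per letter set

print(sorted(unique.values()))  # ['abhor', 'grape']
```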
@@michael1 My idea, even if I did not mention it, was to have the letters sorted within each word, treating words as sets of letters instead of strings. In fact, I understand that is what Matt did.
And the list of words would be ordered alphabetically by their set of letters (all words containing an A in any position would be at the top of the list, then all words containing a B but no A, etc.)
I have often been accused that people can read inside my head. :)
In fact, my starting position is just a read-out of the graph, I think.
Enjoyed the main video immensely and very much appreciate every extra morsel
Using sets is very intuitive, but sets are slow data structures, which is probably slowing the algorithm down biiig time. A very good way to do this same check, which can potentially be a lot faster (I don't know about Python, but in C/C++ it certainly would be), is to encode the letters of a word as a bit field. It all fits in a normal 32-bit integer, and you can just bitwise-AND the values for each pair of words: if the result is 0 they don't share letters; if it's nonzero they share one or more letters.
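A minimal Python sketch of the bitmask encoding described above (illustrative only - the function name is made up):

```python
def mask(word):
    # Encode a word's letters as a 26-bit field: bit 0 = 'a', ..., bit 25 = 'z'.
    m = 0
    for ch in word:
        m |= 1 << (ord(ch) - ord('a'))
    return m

# Two words share letters iff the AND of their masks is nonzero.
print(mask("abhor") & mask("dumpy"))        # 0 -> no shared letters
print(mask("abhor") & mask("crane") != 0)   # True -> they share 'a' and 'r'
```

A single integer AND replaces a whole set intersection, which is why this tends to be dramatically faster.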
Oooh I didn't think of that. That's a cool idea.
I used that same method and it runs fast. Takes about 23s on my laptop (using PyPy and multiprocessing) for the full search.
bitwise-and, right? Is arithmetic-and actually a term?
But yeah, that's a brilliant way of representing this problem. The nonzero result even tells you which letter(s) they share. For those not impressed: bitwise-AND is about the fastest thing your CPU can do, so you can stack a lot of these up in a second.
i mean sets have O(1) lookup I thought? even in python. it's a hash lookup iirc. my bigger issue is SPACE. imagine how much memory you'd need on the stack/RAM because of that lol
@@pvic6959 Don't forget the constant term that is ignored in O(1). Bitwise operators are implemented in silicon in the ALU. I don't see how hashing can beat that in speed.
instead of calculating len of union you could have tested if intersection (&) is empty
Didn't get the notification for the first one, did for this one. Saw this one first, great explanation and super useful, thanks!
Shorter video generally gets watched first, so here I am. Almost got me to pause and go back to get context from the first video, but I stuck through it.
I think I have a better idea for filtering out anagrams:
What if you associated every letter with a unique prime number (like: a-> 2 b-> 3 c->5 ...) and for every word you multiplied together the associated numbers of its letters. This way you would get a unique number for each letter combination without considering the order. So you would put this created number in a list (or a hash map for better performance) and every time you have a new word, you would just have to convert it into a number this way and check if the number exists in the list. If it does, the word is an anagram, if not it is unique.
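A short Python sketch of the prime-signature idea above (illustrative only; the names are hypothetical). By unique factorisation, two words get the same product exactly when they use the same multiset of letters:

```python
# One prime per letter: 'a' -> 2, 'b' -> 3, ..., 'z' -> 101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
          53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def signature(word):
    # Multiply the primes for each letter: order-independent, anagrams collide.
    sig = 1
    for ch in word:
        sig *= PRIMES[ord(ch) - ord('a')]
    return sig

seen = set()
for word in ["grape", "pager", "stone"]:
    if signature(word) in seen:
        print(word, "is an anagram of an earlier word")  # fires for "pager"
    else:
        seen.add(signature(word))
```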
Prime multiplication covers the general case where repeats are allowed, but is overkill given that they are not. It's cheaper to just use a bit per character, which only requires 26 bits, less than a 32-bit "word" of data, whereas the product of the 22nd-26th primes is about 5.7 billion, or 33 bits. Bit twiddling is also much faster than multiplication.
An easier way is to use a 32-bit number and give each letter a bit. Cheaper than multiplying!
@@belg4mit On a modern system integer multiplication is fast enough, and it also covers larger alphabets like Hungarian's. Also, using bitmasks isn't necessarily easier from the developer's perspective.
But I like the idea.
But I do see a huge problem with numbers larger than the Long limit. Possible solutions: a Long[] used as a large bitmask, with a function that sets its bits and abstracts over the array; or just a longer boolean array, which shouldn't be too bad for performance. But in both cases we have the problem of the equality checks.
First watched this, now to get some context. For some reason YT only recommended this one and not the main
Oh no. I started watching this one before the first channel, but I realized it was for a video I hadn't watched so I left to go watch the first channel.
I can't believe I missed the chance to be called a maverick.
I watched this first, was very confused what was going on, so I watched the main channel, then went back after I understood the problem and noticed the modified ending. Do you always do that or is it new?
I think it's new, or he's been doing it for a while.
I would have used a 32-bit integer as a bitfield to represent a set of letters. Then bitwise OR to combine them, AND to test intersections, etc. It should be very fast even in python.
Based on the latest main channel video, it looks like you win
I watched second video first because I went through my subscriptions, adding everything to my "watch later" list from top to bottom, since I was on mobile, and that is reverse order :P
It's usually fine bc most channels only upload one video a day and they're rarely related to their previous video
I like you too! This popped up in my recommended without the original so now I'm gonna watch the other video.
Whaat? I was soo ssure that they were actual pictures and I was just imagining the little movement. Also, the perfectly timed giggles no longer look like magic acting skills anymore. So S.A.D.
But have you found any words with 25 unique letters? What is the longest string of unique letters that comprises a valid word?
I wonder if the YouTube algorithm is clever enough not to suggest videos that should be watched after something we haven't already watched. Fortunately your 2nd-channel one popped up after I'd seen the other, but for those who see the videos out of order... perhaps YouTube suggested them out of order, or perhaps the viewers deliberately chose to watch part 2 first.
Sometimes if a video pops up that has a part or episode number on it, I usually hunt for the previous videos in the series so I can get the whole story - unless they are stand-alone episodes, when it doesn't matter.
My Mum admitted she liked to read the back pages of a book first - perhaps lots of people do! I did have a look at the back as well, but only after reading a few chapters, then looking at the contents list and noticing something interesting was at the back!
Don't you mean "And if you've not watched this..." in the description?
The filter you put on when you pause the video looks exactly like the chromatic aberration I see in my glasses when I look off center to the left. So if I look off center to the right instead, I can compensate for it.
For a bit I thought the chromatic aberration was getting worse and was concerned.
oh wow! that's very cool! thx for making me try that!
"don't send me ways I could've improved it"
No wonder it took a month to run lol, set operations are typically inefficient because they have to do that uniqueness test on every op
I got the notification for this video and not the main channel so started this one and went wait a minute 😅
This is deffo something that would bog down the story. But it's great to hear details of how the code works for those who are interested.
I don't regret asking, but I am still nonetheless confused about everything you just said.
i'm just amazed you can explain code in under three minutes
I got the notification for the second channel video first, realised that I hadn't watched what you were talking about, and went to the first channel then.
Bec Hill was in the studio? You should have brought her on-stage instead of using the goofy photographs!
Ah yes, Matt's "Bec To The Studio Tour"...
No, wait - that was Jonathan Pie.
Alright then, on to the main channel.
Interesting, I hadn't considered it's possible to determine the intersection count by measuring the union count.
It's not very efficient though; the union operator does a lot of work in the background, including checking for duplicates.
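For illustration, here is how the two disjointness tests compare in Python (a toy example; Python's built-in `set.isdisjoint` can stop at the first shared element, while the union-length test must build the merged set first):

```python
a, b = set("abhor"), set("crane")

# Union-based test: 10 letters in total would mean no overlap.
disjoint_by_union = len(a | b) == 10      # False: union has only 8 letters
# Direct test: no merged set is ever built.
disjoint_direct = a.isdisjoint(b)          # False: they share 'a' and 'r'
print(disjoint_by_union, disjoint_direct)  # False False

c = set("dumpy")
print(a.isdisjoint(c), len(a | c) == 10)   # True True
```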
awesome!
loved bec in the video. all excellent. keep it up 😋
Are those freeze-frames supposed to be anaglyphs? I couldn't make it work with my red/cyan glasses.
Guess I'm not the only one watching the second channel first.
Happens when you go alphabetically
AS AN ENGINEER I WOULDA PROBABLY USED THE UNION METHOD TOO. ( I AM ON A CHROMEBOX AND CANNOT FIGURE OUT HOW TO GET RID OF CAPSLOCK SINCE I REMOVED MY KEYBOARD'S CAPSLOCK BUTTON AND DONT HAVE ANYTHING WITHIN REACH TO STICK IN THE HOLE)
there is a virtual keyboard you can use built into windows
@@DanielSDiggs my chromebox runs chrome OS
Give_it_a_try is a rough variable name.
"don't send me ways i could have improved it"
Too bad he edited that bit out of the video XD.
I know you said not to tell you how to improve it (even though someone already improved it), but I can't help but wonder: why not just keep looking for the next words in the same way?
First you take a word from scanA and check that it has no letter in common with a word from scanB. If they share no letters, you take a word from scanC, which starts after scanB's position, and so on, until you get to scanE.
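The nested forward scans described above (scanA through scanE, each resuming just past the previous one's position) might be sketched like this in Python - purely illustrative, with made-up names:

```python
def five_disjoint(words):
    """Yield every 5-tuple of words with pairwise-disjoint letters,
    each inner scan starting just after the previous scan's index."""
    L = [set(w) for w in words]
    n = len(words)
    for a in range(n):                       # scanA
        for b in range(a + 1, n):            # scanB
            if L[a] & L[b]:
                continue
            ab = L[a] | L[b]
            for c in range(b + 1, n):        # scanC starts after scanB
                if ab & L[c]:
                    continue
                abc = ab | L[c]
                for d in range(c + 1, n):    # scanD
                    if abc & L[d]:
                        continue
                    abcd = abc | L[d]
                    for e in range(d + 1, n):  # scanE
                        if not (abcd & L[e]):
                            yield (words[a], words[b], words[c],
                                   words[d], words[e])
```

Because each scan only moves forward, every combination is visited exactly once, with no need to restart from the top of the list.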
I watched this one first, because I do "FILO" on my YT subscription page. 🙂
This is fascinating and, no, I’m not made of cloth! LOL!
So the code is posted on GitHub you say... Interesting...
You said it could be optimised, but I believe it's quite good already (at least better than what I would have thought of).
I am a loyal second channel first enjoyer
I regret watching this. ;-)
But last time I did 2nd first you said it was wrong!
Hi
Imagine if there was a way to talk about the sorts of things you have in a program, more than just objects, but… the type of the things, and how one type of thing is different from another type of thing… some kind of “type system” perhaps.
Seriously Matt, I will personally teach you Haskell so you can maths while you program and program while you maths. I’ll even do it in person next time you’re in 🇦🇺
I like code 👍😁
matt, i think you couldve improved it!
Son of Maverick. ie Imagineverick.
I did watch this one first :)
C programmers be crying
Where are the five twenty-five-letter words the title promised?
Well - if you re-arrange it to 'Words: five, with twenty-five letters', then everything after the colon is self-descriptive. Otherwise, no clue.
wow im a maverick lol
2nd channel first
21st
YouTube thinks this comment needs to be translated into English 🤔
You are good at coding Matt.