>We could do whole videos about those two topics
And you should, imho. Knowing (roughly) how search engines work greatly helps in finding stuff on the internet, which at least I think is incredibly important regardless of profession (even for hobbyist stuff).
Nice explanation of search. Max covered inverted indexes, TF-IDF, stop-word removal, stemming, ranking, proximity, conceptual search, etc. in only 10 minutes. Well done.
look at my horse, my horse is amazing.
I couldn't stop thinking about this song.
Slithereenn You commented this before I could!
ok?
@@Triantalex bruh, 9 years... I'd have to rewatch the video to remember what this was all about
I would personally love to see an in-depth video on the language models used for search engines, and more about e.g. what Google calls 'neural networks'.
Tony2dH It's not just what Google calls them; computer scientists call them that too.
Indeed, time for part 2 with a chat on machine-learning vs. algorithmic approaches.
"Libraries were a big place full of books you wanted to find [...]"
Nice ...
This guy is awesome. Simple, clear and straight to the point. Well done mate.
Interesting video. Thanks for creating it.
Cool topic, and awesome video production.
6:19 How do you pre-index relative word locations for the nearness approach? If you assign bonuses based on word combinations, the number of combinations in most documents would make the index metadocument hard to read. Likewise, if word locations are recorded in the index, the metadocument would be HUGE!
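One standard answer (hedged: this is textbook information retrieval, not necessarily what any given engine does) is a positional index: you don't store bonuses for word combinations at all, just each word's positions within each document, and compute nearness bonuses at query time. The index then grows roughly linearly with document length rather than combinatorially. A minimal Python sketch:

```python
from collections import defaultdict

def build_positional_index(docs):
    """For each word, record the positions where it occurs in each document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

docs = {1: "my horse is amazing", 2: "my lovely horse ran through the field"}
index = build_positional_index(docs)
print(dict(index["horse"]))  # {1: [1], 2: [2]}: one entry per occurrence, not per pair
```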
***** What would, IMHO, be a pertinent side-topic to explore is: by what algorithms do search engines decide which results are relevant to you, personally.
There seems to be a somewhat disturbing trend toward an echo-chamber effect: two people using the exact same search terms are likely to see different results, both in which pages are shown and in the order they appear.
It makes it difficult, or at least more difficult than in a brick-and-mortar library, to find contrary points of view and/or conflicting information on a given topic.
For anyone interested in elevating discussion on fora such as UA-cam comments, this may well be more than a trivial matter.
At 5:58 Dr. Wilson mentions stemming as a way to find documents based on the root word. When programming this, do large search engines use a specific set of rules based on the English language, or do they pick these things up through sequences they see often, using machine learning? Sorry if this comment didn't really make sense; I'm just trying to figure out how that would be programmed.
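Classic stemmers are actually hand-written rule sets rather than machine-learned: the Porter stemmer, for instance, applies a fixed cascade of English suffix-stripping rules. A quick illustration using NLTK's implementation (assuming the nltk package is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "ran", "ponies", "caresses"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, ponies -> poni, caresses -> caress;
# note "ran" stays "ran": pure suffix rules miss irregular forms, which is
# where dictionary-based lemmatization or learned models come in.
```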
Google isn't really a 'secret' formula. It is the product of having a LOT of information applied towards their optimizations. PageRank can perhaps be seen as the magic formula that made them good enough to get where they are today, but it is only a small piece of the puzzle. The real star of the show is relevance equivalence between search terms, which requires a lot of data to achieve accurately.
Beyond what is discussed in this video, where each word is given an importance value, each word is also put into a 'group of equality'. Such a group combines a variety of things we would likely consider fundamentally different, but with respect to what matters for a search, that fundamental difference is worthless; what is valuable is whether something is relevant to the search. The engine may find, for instance, that returning fishing-rod results for a boat search yields positive relevancy, so boats and fishing rods could then be treated as 80% the same thing. When someone searches for boats, fishing-rod results could be returned with 80% of the importance factor given to boats, producing better results.
This leaves the 'relevancy synonyms' up to the engine to assign autonomously in the most statistically optimized way.
In this way, Google has a search algorithm far more effective than anything humans could write by hand, because it automatically self-optimizes its results without human intervention, with no care for what 'should' or 'shouldn't' technically be categorized together.
Beyond the initial sorting of a site into its group of completely different but relevantly equal things, the engine then rates the site on how loyal it is to its predicted relevancy, by tracking whether people found what they needed there or immediately jumped back to their search to try again. If a site is deemed, say, 30% relevant but in practice is actually _more_ or less, that is a sign the site or the search terms are poorly defined, and the estimate is adjusted. More would improve the specificity of its relevancy (or add a relevancy synonym); less would leave it stuck further down the list until people stop wasting their time clicking on it.
This makes exploiting the system a challenging, if not impossible, task for spam sites to sustain, because their results will earn a poor relevancy rating, making them less and less likely to appear anywhere near the top for any string of text you search, short of some very specific searching, in which case you were probably looking for them anyway.
Edit: I see at 8:00 you went over latent semantic analysis, so never mind xD I should remember to finish watching a video before commenting. Ah well.
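For what it's worth, the latent-semantic idea the video reaches at 8:00 can be sketched with off-the-shelf tools: factor the term-document matrix and compare documents in the reduced 'concept' space. A toy illustration with scikit-learn (nothing like Google's actual pipeline, just the principle):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["boats and fishing rods", "fishing rods reels and boats",
        "my lovely horse", "my horse is amazing"]
tfidf = TfidfVectorizer().fit_transform(docs)                  # term-document weights
concepts = TruncatedSVD(n_components=2).fit_transform(tfidf)   # low-rank 'concept' space
print(cosine_similarity(concepts).round(2))  # boat docs pair up, horse docs pair up
```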
Richard Smith Still, your comment was more interesting and informative than 99.9% of YT comments. As a beginner in the world of data and machine learning, I enjoyed reading it very much ^_^
What about measuring correlation between words with predictability? Like, if you have horse, there's a 20% chance the result also contains pet, and if you have pet, there's a 6% chance the document mentions horse. I don't think you could explicitly group words, because the relation isn't symmetric, but you could have some kind of minimum threshold of probable connection required for a word to be considered a real associate.
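That asymmetry (20% one way, 6% the other) is exactly what document co-occurrence counts give you, and the thresholding idea works without ever forming explicit groups. A toy sketch with made-up documents:

```python
def cooccurrence(docs, a, b):
    """Estimate P(b appears | a appears) over a document collection."""
    with_a = [d for d in docs if a in d.split()]
    with_both = [d for d in with_a if b in d.split()]
    return len(with_both) / len(with_a) if with_a else 0.0

docs = ["my horse is a pet", "my horse runs", "a pet dog", "a pet cat", "pet food"]
print(cooccurrence(docs, "horse", "pet"))  # 0.5: horse docs often mention pet
print(cooccurrence(docs, "pet", "horse"))  # 0.25: pet docs rarely mention horse
```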
rngwrldngnr If you watch the whole video, you'll hear him say that it is much more complex than that and does measure probability.
You missed a trick here by not having the "like" mug in the background show a link when you hover over it that takes you to the computerphile facebook page.
When you do stemming, you need to run the words through a dictionary. In that case, you can also read off whether a word is an adjective or a determiner, and treat all adjectives and determiners as having the same distance from their headword, so that 'my horse' and 'my lovely horse' (and 'horse of mine') would be treated as equally relevant.
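That kind of grammatical normalization is easy to prototype with a part-of-speech tagger. A sketch of the idea above using NLTK (assuming nltk and its tagger data are installed, and that the tagger labels 'my' as possessive and 'lovely' as adjective, which it normally does):

```python
import nltk  # requires the 'averaged_perceptron_tagger' data to be downloaded

def content_words(text):
    """Drop determiners (DT), possessive pronouns (PRP$) and adjectives (JJ*),
    so modifiers don't add distance between query words."""
    tagged = nltk.pos_tag(text.lower().split())
    return [w for w, tag in tagged
            if tag not in ("DT", "PRP$") and not tag.startswith("JJ")]

print(content_words("my lovely horse"))  # ['horse']
print(content_words("my horse"))         # ['horse']: both phrases now look the same
```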
Great video. I'd love to see some followups on the probabilistic and language approaches.
Nice little intro to TF-IDF Max :)
8:40 The first problem that came to mind with your (granted, simplistic) explanation is that a document containing "my field" 40 times would rank way higher than a page with "my pony" 6 times.
What I am curious about is how the index still factors in. As I understand it, the index indexes all words in order to improve search speed, but to make the search better you have to look for clumps of words. Do you have to re-index every page with two-word and three-word groups? That seems inefficient: "my - 6, horse - 3, my horse - 2, horse is - 2, my horse is - 1" etc...
In other words, how do you catalog the relationship between words?
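One common answer (again textbook IR, not any specific engine) is that you don't re-index every two- and three-word group: you store single-word positions once and intersect the position lists at query time. A sketch, assuming a positional index shaped like {word: {doc_id: [positions]}} as in the earlier comment:

```python
def phrase_matches(index, first, second):
    """Documents where `second` occurs immediately after `first`."""
    hits = {}
    docs1, docs2 = index.get(first, {}), index.get(second, {})
    for doc_id in docs1.keys() & docs2.keys():   # only docs containing both words
        positions2 = set(docs2[doc_id])
        starts = [p for p in docs1[doc_id] if p + 1 in positions2]
        if starts:
            hits[doc_id] = starts
    return hits

# "my horse" matches "my horse is amazing" but not "my lovely horse", where an
# intervening word breaks adjacency; relaxing `p + 1` to a window of k positions
# gives a proximity bonus instead of an exact phrase match.
```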
***** Yes, but I'm wondering about the speed of these methods in conjunction with the basic index.
You could pull articles through the index based on the words alone, and then use your other algorithms to sort the resulting list, but that seems inefficient, similar to the pre-index search: i.e. going through each document and counting which words are close together, based on a third resource that has all the keywords logically organised by concepts or what have you.
And all that in mere seconds... it's an amazing world we take for granted.
I've got a question.
How does one search through an index quickly to assign these scores?
Do you sort it roughly somehow and just assign scores to the first few elements? It seems like when the index gets large, that would become the speed limiter.
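A common simplification of how this is handled: you never sweep the whole index. Each query term's postings list gives a (usually small) candidate set, and scores accumulate only over those documents. A sketch, assuming an index of {term: {doc_id: tf}} and precomputed IDF weights:

```python
from collections import defaultdict

def score_query(index, idf, query):
    """Term-at-a-time scoring: only documents containing a query term are touched."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():  # one postings lookup, no full scan
            scores[doc_id] += tf * idf.get(term, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# The sort runs over the candidate set only, which for rare terms is tiny;
# index size mostly affects storage, not per-query work.
```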
Very basic stuff but awesomely explained =)
I remember that the big thing about Google was how fast it was compared to the others; now any increase in search-engine speed seems pretty trivial, since they all return results very fast. I guess Google still has the edge on returning the most relevant results, though.
Awesome video!! I would love to see more Computerphile videos on Semantic Web topics. I am doing a research project on ontology alignment and mapping at the moment, and the topics of this video were very relevant to what I have been looking at. Thanks for making this!!
He seems to really like bouncing up and down THE WHOLE VIDEO!
I think we all have our tics. I used to literally tremble when giving explanations of things I was interested in and studying.
I am totally aware of that fact. Nonetheless, it was very annoying once I noticed it.
+Cr42yguy I have the same problem; it bothers a lot of people. They always get annoyed, and some tell me to stop bouncing my legs, but I can't help it. The moment you stop forcing yourself to remain still, it starts happening again.
+Cr42yguy You should see him in a lecture xD
+Cr42yguy Noticed there are two coffee cups? That might be a good reason for the bouncing, as well as a good amount of interest in the topic.
Where can I find a copy of those 50 lines of Python he mentions at 4:20?
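The video's code doesn't seem to be published anywhere, but a toy TF-IDF searcher of the kind he describes genuinely fits in well under 50 lines. A sketch along those lines (illustrative only, not the actual code from the video):

```python
import math
from collections import Counter, defaultdict

class TinySearch:
    def __init__(self, docs):
        self.index = defaultdict(dict)          # {term: {doc_id: term frequency}}
        for doc_id, text in docs.items():
            for term, tf in Counter(text.lower().split()).items():
                self.index[term][doc_id] = tf
        n = len(docs)
        # rare terms get high weight, common ones low (inverse document frequency)
        self.idf = {t: math.log(n / len(p)) for t, p in self.index.items()}

    def search(self, query):
        scores = defaultdict(float)
        for term in query.lower().split():
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf * self.idf[term]
        return sorted(scores, key=scores.get, reverse=True)

engine = TinySearch({1: "look at my horse my horse is amazing",
                     2: "my lovely horse",
                     3: "a field full of grass"})
print(engine.search("my horse"))  # [1, 2]: doc 1 repeats both terms, doc 3 never matches
```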
More videos on this topic please! I'm curious to know in what ways Google search is superior to other search engines.
What about the stationary distribution of a Markov chain, i.e. PageRank?
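It is, yes: PageRank is the stationary distribution of the 'random surfer' Markov chain, usually computed by power iteration with a damping factor of 0.85, the value from the original Brin and Page paper. A minimal sketch, assuming every page has at least one outgoing link:

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Power-iterate toward the stationary distribution."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}  # random teleport share
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)       # surfer follows a random outlink
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
# "a" ends up highest: everything links back to it.
```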
Are you trying out the technique where each time you say _pony_ in a video, it doubles the view count?
I like the "banana" screensaver going on in the background.
I've got the same cup as the "Coffee" like-button one, except mine says "Tea" :p
Couple of questions: Why wouldn't a document with nothing but the word "horse" listed a thousand times show up high in the rankings? Also, if this is how indexes work, how does Google search for strings in quotes? i.e. "my horse" wouldn't show documents with "my lovely horse".
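On the first question: raw counting would indeed reward keyword stuffing, which is one reason practical schemes damp repetition, e.g. sublinear TF scaling, where a term's weight grows like 1 + log(tf). (The quoted-phrase question is usually answered with a positional index, as sketched further up the thread.) A quick illustration of the damping, with made-up numbers:

```python
import math

def sublinear_tf_idf(tf, df, n_docs):
    """Repetition gives diminishing returns; rarity across the corpus still counts."""
    return (1 + math.log(tf)) * math.log(n_docs / df) if tf > 0 else 0.0

print(round(sublinear_tf_idf(1000, 5, 10_000), 1))  # 60.1: a thousand repeats...
print(round(sublinear_tf_idf(3, 5, 10_000), 1))     # 16.0: ...beat 3 mentions by only ~4x
```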
the oscillating chair is really interesting...
50 lines of Python are not really fast, you say? Well, that's what you get for using Python.
***** True, but the biggest problem in this case is the algorithm, not the language. Changing the language may make it run twice as fast, while changing the algorithm may make it run a billion times faster when there is a lot of data.
Dariush MJ Exactly. You can use the fastest language on the fastest machine, but if it's an inefficient algorithm, it could take a ridiculously long time compared to a faster algorithm in Python.
Dariush MJ Python in general is sort of a bane to programming, though, producing lots of very slow and unreliable code, with performance often a bit worse even than a lot of other scripting languages (such as Lua or JS).
A lot of the time with Python code, it's the combined problem of a slow language and poorly written code.
If speed is relevant, a person is probably better off using C or C++ or similar.
***** The problem is that given a long document with enough unique words, you could be running through a loop hundreds of thousands of times; that's not going to be fast in any language. Efficient algorithm design is as much an art as a science. This is why Google only has 3 or 4 competitors that are even trying anymore.
Alexandru Gheorghe I know about algorithmic complexity, but Python is often around 40x-100x slower than C if your code actually *does* anything (vs. just calling into library functions or doing database queries or similar).
If the same algorithm is used in either language, that speed difference can amount to a fairly big difference overall.
C in no way prevents using O(log n) or O(1) algorithms, and optimizing algorithms is still a pretty big deal in C land as well.
Came for the topic, stayed for the ponies :D
At 8:34 I was almost expecting him to mention google bombs. :P
I wonder how search engines deal with languages that inflect words a lot, like Finnish. There are dozens if not hundreds of ways to inflect a single noun, which can affect or be affected by how you inflect other words in the sentence.
Rented Mule With "word stemming", conjugations and mutations of words are cut off, and the "stem" or root of the word is stored: (running, ran, runs) -> run. Each language has its own stemming rules that search engines can use.
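Following up on the Finnish question: the Snowball stemmer family has hand-written rule sets for many languages, Finnish among them, and NLTK ships them. A quick sketch (case endings get stripped; full morphological analysis of a language like Finnish needs heavier tools):

```python
from nltk.stem.snowball import SnowballStemmer

fi = SnowballStemmer("finnish")
# "hevonen" (horse) in nominative, genitive, nominative plural, adessive plural
for word in ["hevonen", "hevosen", "hevoset", "hevosilla"]:
    print(word, "->", fi.stem(word))  # prints each inflected form with its stem
```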
hnnnnnghhh I guess it makes sense that search engines would have access to dictionaries.
I did a little poking on Google and found a research paper that claims that search engines like Google, Yahoo and Bing don't perform very well with non-English languages.
Granted, the paper also says that there are smaller, localized search engines that perform well on morphologically complex languages.
Dear Computerphile, can you please add English subtitles for non-native speakers. Thank you.
I take it that web crawlers will be mentioned in the next video?
He does an excellent job explaining stuff, but sometimes he just mumbles to the point where I can't understand him.
I disagree. He does tend to mumble, but he doesn't do a great job explaining, either. I know TF-IDF quite well, and even I found his explanation quite weak.
Look at my horse, my horse is amazing....
ok?
@@Triantalex I'm sure there is a point in this video where this Weebl reference makes sense. Might rewatch later. It's been 9 years.
Has Google indexed every file on the internet? How does it have space for that?
bkky9 They have a joke size hard disk on their PC.
bkky9 They have supercomputers.
My lovely horse, running through the field
Where are you going, with your fetlocks blowing in the wind?
So you always need an index...
Then what do you do if your word is not in the index?
That's when Google says there's nothing to be found.
TheNefari If the word is out there but not in the index, then the index needs updating. In the meantime, it will tell you that nothing was found.
TheNefari Look who I found here. Didn't expect to see you in the comment section.
Albert Hofmann Were you guys lovers or something?
aakksshhaayy Nah, he has a small channel and I saw some of his vids a while ago. Was just surprised.
So Google just has a MASSIVE index which shows how many times EVERY POSSIBLE WORD occurs in EVERY SINGLE WEB PAGE in their search space?
These remind me of fractals, and then of making them 3D or 4D by connecting them.
I once wanted to find out how the Japanese calendar worked, i.e. 2015 = 27 in Heisei.
Anyway, I googled "japanese dates".
Moral of the story: do not google Japanese dates.
Hi your info is really interesting🙉
Was he sat on an exercise ball?
Yeah... all this 'intelligence' that search engine providers are putting into their products is really neat, but when over 75% of the searches you make as part of your job require looking for *exact* words or *exact* phrases, and the search engines 'intelligently' turn "process halted with error" into "process stopping by mistake", *even* if you use double quotes, it starts getting **REALLY** **ANNOYING**.
I really wish Google would add an "I mean this literally" option to its search settings.
The second episode of Star Trek TNG appears horribly dated to anyone who's used Google, because Data's search for an incident in which someone had showered in his or her clothing was treated as an un-indexed paper document search, assisted only by the android's ability to read every file relatively quickly.
You should rename the video "How search engine indexing works"... so that your video gets a higher index lol
That's a lot of pre-computation. Pre-computation of which you might use only 10% in the final result.
more!
Jeez, will someone just get this guy a horse already!?
What we're searching for here is "my horse" as a block, not two separate words, but sure, let's do some maths, that will surely solve the problem x'D
what a fidget!
OK, a third of the way through and I just can't hold it in any longer... You're wearing sunglasses indoors with the blinds down and closed. Why? :-)
Quick question, Will i become each time i watch a computerphile video?
Agree oR Not
or trees
Could you go into my complexity?
He said we can do this in 50 lines of Python; please, I want that 50-line code.
And the obvious thing to do next is Google "my horse"
6:48
11.5?
No one has added anything.
I miss Sixty Symbols :(
I lost my concentration a couple of times because the presenter kept bobbing up and down. It is a subtle but effective way of making me lose my calm, because I can't reach out to steady him.
"my pony" I c what u did thar
BANANA
What's with this guy and his horses? :D
(Blazing Saddles) HORSES??!?!??!?!??!
are you riding a horse?
7:42 My Little Pony... Half Life 3 confirmed!
This guy sounds super tired lol
Very interesting subject, but as a director I was getting so annoyed by the guy bobbing up and down in his chair, as if he were nervously wiggling his feet. It really took me out of the story.
It's time I sling the baskets off this overburdened HORSE
Sink MY toes into the ground and set a different course
Cause if I were here and you were there
I'd meet you in between
And not until MY dying day, confess what I have seen
Like if you googled "my horse"
domato
He is very bouncy.
First comment! Love the vid thanks
Cute, he explains 'what libraries were' for the average millennial barbarian. lol
Great explanation, but please stop bouncing in your chair; I'm getting a bit of motion sickness.
first
arie brons 9th you idiot
arie brons Buy a mirror and then we'll see who is the idiot here.
arie brons Guys, guys, calm down, we don't need to fight.
I mean, we are all human;
in fact, we are all the same person.
arie brons What do you mean, same person?
arie brons I mean that we are literally just letters expressing the opinion of some dude with a weird hobby.