How Search Engines Treat Data - Computerphile

  • Published on 11 Aug 2015
  • Search Engines are a bit like the Public Library - You wouldn't wander around hoping to find the book you want, there's a system in place. Data is the same - Dr. Max Wilson Explains.
    / computerphile
    / computer_phile
    This video was filmed and edited by Sean Riley.
    Computer Science at the University of Nottingham: bit.ly/nottscomputer
    Computerphile is a sister project to Brady Haran's Numberphile. More at www.bradyharan.com

COMMENTS • 123

  • @wlfbck20
    @wlfbck20 9 years ago +50

    >We could do whole videos about those two topics
    And you should, IMHO. Knowing (roughly) how search engines work greatly helps in finding stuff on the internet, which I think is incredibly important regardless of profession (even for hobbyist stuff this is pretty important).

  • @Slithy
    @Slithy 9 years ago +53

    look at my horse, my horse is amazing.
    I couldn't stop thinking about this song.

    • @KhalilEstell
      @KhalilEstell 9 years ago +2

      Slithereenn You commented this before I could!

  • @UsamaNada
    @UsamaNada 9 years ago +2

    Nice Explanation for Search. Max covered Inverted Index, TF-IDF, stop word removal, stemming, Ranking, Proximity, Conceptual Search etc. in 10 minutes only. Well Done.
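
A minimal sketch of the first few ideas listed above (inverted index, stop-word removal, TF-IDF ranking), assuming a three-document toy corpus. The names `STOP_WORDS`, `build_index`, `search` and the sample documents are illustrative assumptions, not the code referred to in the video.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "the", "a", "at"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Score documents by summing tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term appears in no document
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


DOCS = {
    "d1": "look at my horse my horse is amazing",
    "d2": "my lovely pony",
    "d3": "the field is green",
}
INDEX = build_index(DOCS)
RESULTS = search(INDEX, len(DOCS), "my horse")
for doc_id, score in RESULTS:
    print(doc_id, round(score, 3))
```

Here "horse" appears in only one document, so it contributes more to the score than "my", which appears in two.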

  • @Tony2dH
    @Tony2dH 9 years ago +25

    I would personally love to see an in-depth video on the language models used for search engines, and more about e.g. what Google calls 'neural networks'.

    • @JsbWalker
      @JsbWalker 9 years ago +5

      Tony2dH It's not just what Google calls them, Computer scientists call them that too.

    • @BrettonAuerbach
      @BrettonAuerbach 8 months ago +1

      indeed time for part 2 with a machine learning vs algorithmic approaches chat

  • @M4rtingale
    @M4rtingale 9 years ago +6

    "Libraries were a big place full of books you wanted to find [...]"
    Nice ...

  • @minihjalte
    @minihjalte 9 years ago +24

    Interesting video. Thanks for creating it.

  • @syedali1217
    @syedali1217 5 years ago +1

    This guy is awesome. Simple, clear and straight to the point. Well done mate.

  • @Robertlavigne1
    @Robertlavigne1 9 years ago

    Awesome video!! I would love to have more Computerphile videos on Semantic Web related topics. I am doing a research project on ontology alignment and mapping at the moment, and the topics of this video were very relevant to what I have been looking at. Thanks for making this!!

  • @harounhajem7972
    @harounhajem7972 9 years ago +1

    Cool topic, and awesome video production.

  • @soviut
    @soviut 9 years ago

    Great video. I'd love to see some followups on the probabilistic and language approaches.

  • @Cr42yguy
    @Cr42yguy 9 years ago +42

    he seems to really like bouncing up and down THE WHOLE VIDEO!

    • @brandonthesteele
      @brandonthesteele 9 years ago +8

      I think we all have our tics. I used to literally tremble when giving explanations on things I was interested in and studying.

    • @Cr42yguy
      @Cr42yguy 9 years ago +2

      I am totally aware of that fact. Nonetheless, it was very annoying once I noticed it.

    • @JeaneAdix
      @JeaneAdix 9 years ago

      +Cr42yguy I have the same problem; it bothers a lot of people. They always get annoyed, and some people tell me to stop bouncing my legs, but I can't help it. The moment you stop forcing yourself to remain still, it starts to happen again.

    • @chameleonedm
      @chameleonedm 9 years ago +1

      +Cr42yguy You should see him in a lecture xD

    • @RandomNUser
      @RandomNUser 8 years ago +1

      +Cr42yguy Noticed there are two coffee cups? That might be a good reason for bouncing, as well as a good amount of interest in the topic.

  • @Seppes94
    @Seppes94 9 years ago +15

    Look at my horse, my horse is amazing....

  • @JimmyWirsborg
    @JimmyWirsborg 9 years ago +2

    Very basic stuff but awesomely explained =)

  • @mikejohnstonbob935
    @mikejohnstonbob935 8 years ago +2

    6:19 How do you pre-index the relative word locations for the nearness approach? If you assign bonuses based on combinations, then the number of combinations for most documents would make the index metadocument hard to read. Likewise, if the word locations are recorded in the index, the metadocument would be HUGE!
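
One common answer to the question above is a positional index, which records where each word occurs rather than precomputing word combinations. It stores one integer per word occurrence, so it grows with the size of the text rather than with the number of possible word pairs. A minimal sketch, assuming whitespace tokenization; the function names and sample documents are illustrative assumptions.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Positional index: word -> {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def min_distance(index, term_a, term_b, doc_id):
    """Smallest gap between any occurrences of the two terms, or None."""
    pos_a = index.get(term_a, {}).get(doc_id, [])
    pos_b = index.get(term_b, {}).get(doc_id, [])
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


DOCS = {
    "d1": "my horse is amazing",
    "d2": "my lovely horse",
    "d3": "my field has a horse",
}
INDEX = build_positional_index(DOCS)
for doc_id in DOCS:
    # A smaller distance could earn a larger proximity bonus at query time.
    print(doc_id, min_distance(INDEX, "my", "horse", doc_id))
```

The distance is computed at query time from the stored positions, so no pair of words has to be indexed in advance.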

  • @JonHurlock
    @JonHurlock 9 years ago +1

    Nice little intro to TF-IDF Max :)

  • @stok3si3
    @stok3si3 9 years ago

    You missed a trick here by not having the "like" mug in the background show a link when you hover over it that takes you to the computerphile facebook page.

  • @LazyMasterGamer
    @LazyMasterGamer 9 years ago +1

    I've got the same cup as the "Coffee" like button one except it's written "Tea" :p

  • @musicalsimon
    @musicalsimon 9 years ago

    More videos on this topic please! I'm curious to know in what ways google search is superior to other search engines

  • @eSZett_
    @eSZett_ 9 years ago

    I've got a question.
    How does one search through an index quickly to assign these scores?
    Do you sort it roughly somehow and just assign scores to the first few elements? It seems like when the index gets large, that would become the speed limiter.
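
A rough sketch of one way the question above is often answered: the index is keyed by term, so a query only touches the entries for its own terms, and only documents found there are scored at all. The precomputed weights in `INDEX` and the function name `top_k` are hypothetical, chosen for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical precomputed index: term -> {doc_id: weight of the term in that doc}.
INDEX = {
    "my": {"d1": 0.2, "d2": 0.1, "d4": 0.1},
    "horse": {"d1": 1.5, "d3": 0.9},
    "field": {"d5": 1.1},
}


def top_k(index, query_terms, k):
    """Score only documents listed under the query terms; keep the best k."""
    scores = defaultdict(float)
    for term in query_terms:
        # Dictionary lookup by term: cost does not depend on how many
        # other terms the index holds.
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # nlargest avoids fully sorting every candidate when k is small.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


TOP = top_k(INDEX, ["my", "horse"], 2)
print(TOP)
```

In this toy index, "d5" is never visited for the query "my horse" because it is not listed under either term.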

  • @foxdash
    @foxdash 9 years ago

    I remember that the big thing about google was how fast it was compared to others, now it seems that any increase in the speed of a search engine is pretty trivial when they all return results very fast. I guess google still had the edge on returning the most relevant results though.

  • @Reavenk
    @Reavenk 9 years ago +6

    He does an excellent job explaining stuff, but sometimes he just mumbles to where I can't understand him.

    • @peterr6205
      @peterr6205 6 years ago +1

      I disagree. He does tend to mumble stuff, but he doesn't do a great job explaining stuff. I know tfidf quite well, but even I found his explanation to be quite weak.

  • @rngwrldngnr
    @rngwrldngnr 9 years ago

    What about measuring correlation between words with predictability? Like, if you have horse, there's a 20% chance the result also contains pet, and if you have pet, there's a 6% chance the document mentions horse. I don't think you could explicitly group words, because it's non-abelian, but you could have some kind of minimum threshold of probable connection that was required for the word to be considered a real associate.
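
The asymmetry described above can be sketched by estimating conditional co-occurrence from document counts. This is only an illustration on a made-up five-document corpus; `cond_prob`, `is_associate`, and the threshold value are assumptions, not something stated in the video.

```python
def cond_prob(docs, given, target):
    """Estimate P(target in doc | given in doc) from document counts."""
    with_given = [d for d in docs if given in d]
    if not with_given:
        return 0.0
    return sum(1 for d in with_given if target in d) / len(with_given)


def is_associate(docs, given, target, threshold):
    """Treat target as associated with given only above a minimum probability."""
    return cond_prob(docs, given, target) >= threshold


# Each document is reduced to its set of distinct words.
DOCS = [set(text.split()) for text in [
    "horse pet stable",
    "horse race",
    "pet cat",
    "pet dog",
    "pet hamster",
]]

print(cond_prob(DOCS, "horse", "pet"))  # share of horse documents mentioning pet
print(cond_prob(DOCS, "pet", "horse"))  # share of pet documents mentioning horse
```

The two directions give different values on the same data, so a threshold can accept "pet" as an associate of "horse" while rejecting the reverse.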

    • @FoxDren
      @FoxDren 9 years ago +1

      rngwrldngnr if you watched the whole video you'd hear him say that it is much more complex than that and measures probability

  • @captainnintendo
    @captainnintendo 9 years ago +1

    Came for the topic, stayed for the ponies :D

  • @tubeworm339
    @tubeworm339 9 years ago

    At 5:58 Dr. Wilson mentions stemming as a way to find documents based on the root word. When programming this, do large search engines use a specific set of rules based on the English language, or do they pick these things up through sequences they see often, using machine learning? Sorry if this comment didn't really make sense, I'm just trying to figure out how that would be programmed.

  • @SyntheticFuture
    @SyntheticFuture 9 years ago

    And all that in mere seconds... it's an amazing world we take for granted.

  • @Adamantium9001
    @Adamantium9001 9 years ago

    So Google just has a MASSIVE index which shows how many times EVERY POSSIBLE WORD occurs in EVERY SINGLE WEB PAGE in their search space?

  • @woobmonkey
    @woobmonkey 9 years ago

    ***** What would, IMHO, be a pertinent side-topic to explore is: by what algorithms do search engines decide which results are relevant to you, personally.
    There seems to be a somewhat disturbing trend toward an echo-chamber effect; two people, using the exact same search terms, are likely to find variant results in what pages they're shown, as well as the order in which they appear.
    It makes it difficult, or at least more difficult than in a brick-and-mortar library, to find contrary points of view and/or conflicting information on a given topic.
    For anyone interested in elevating discussion on fora such as YouTube comments, this may well be more than a trivial matter.

  • @stensoft
    @stensoft 9 years ago +3

    When you do stemming, you need to run the words through a dictionary. In that case, you can also read whether it is an adjective or determiner and you can treat all adjectives and determiners as having the same distance from their word so that ‘my horse’ and ‘my lovely horse’ (and ‘horse of mine’) would be treated as equally relevant.

  • @ValleysOfRain
    @ValleysOfRain 9 years ago

    I take it that web crawlers will be mentioned in the next video?

  • @danidanae6905
    @danidanae6905 8 years ago

    Hi your info is really interesting🙉

  • @hrnekbezucha
    @hrnekbezucha 9 years ago +1

    Are you trying out the technique each time you say _pony_ in a video, it doubles the view count?

  • @privettotheworld
    @privettotheworld 9 years ago

    i like the "banana" screensaver going on in the background

  • @SuperdoggyMusic
    @SuperdoggyMusic 9 years ago +1

    At 8:34 I was almost expecting him to mention google bombs. :P

  • @markderosa
    @markderosa 8 years ago

    Where can I find a copy of that 50 lines of Python that he mentions at 4:20?

  • @Niki_0001
    @Niki_0001 9 years ago

    I wonder how search engines deal with languages that conjugate words a lot, like Finnish? There are dozens if not hundreds of ways to conjugate a single noun, which can affect or be affected by how you conjugate other words in the sentence.

    • @hnnnnnghhh
      @hnnnnnghhh 9 years ago

      Rented Mule By "word stemming", conjugations and mutations of words are cut off and stored as the "stem" or root of the word. (Running, Ran, Runs) -> Run. Each language has its own unique stemming rules that search engines can use.
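
A toy illustration of the rule-based approach described above. Suffix stripping handles "Running" and "Runs", but an irregular form like "Ran" needs a lookup table. The suffix list and the `IRREGULAR` table are assumptions made up for this sketch; production stemmers for English (the Porter stemmer is a well-known rule-based one) use far more rules, and every language needs its own set.

```python
# Hypothetical exceptions table for irregular forms that no suffix rule catches.
IRREGULAR = {"ran": "run"}


def stem(word):
    """Very small suffix-stripping stemmer, for illustration only."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling left behind by stripping: "runn" -> "run".
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


for w in ["Running", "Ran", "Runs", "horses", "horse"]:
    print(w, "->", stem(w))
```

Both the indexed documents and the query would be passed through the same function, so that "running" in a query matches "runs" in a page.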

    • @Niki_0001
      @Niki_0001 9 years ago

      hnnnnnghhh I guess it makes sense that search engines would have access to dictionaries.
      I did a little poking on Google and found a research paper that claims that search engines like Google, Yahoo and Bing don't perform very well with non-English languages.
      Granted, the paper also says that there are smaller, localized search engines that perform well on morphologically complex languages.

  • @veggiet2009
    @veggiet2009 9 years ago +1

    8:40 the first problem that came to mind in your, granted simplistic, explanation is that a document with "my field" 40 times would rank way higher than a page with "my pony" 6 times.
    What I am curious about is how the index still factors in. What I understand is that the index indexes all words in order to improve search speed, but to make the search better you have to look for clumps of words. Do you have to reindex every page with two word groups and three word groups? that seems inefficient. "my - 6, horse - 3, my horse - 2, horse is - 2, my horse is - 1" etc...
    In other words how do you catalog the relationship between words?

    • @veggiet2009
      @veggiet2009 9 years ago +2

      ***** yes, but I'm wondering about the speed of these methods in conjunction with basic index.
      You could pull articles through the index based on the words alone, and then use your other algorithms to sort the resulting list. but that seems inefficient similar to the pre index search. i.e. going through each document and counting which words are close together based on a third resource which has all the keywords logically organised based on concepts or what have you.

  • @RageForSeven
    @RageForSeven 9 years ago

    the oscillating chair is really interesting...

  • @samuelvidal3437
    @samuelvidal3437 9 years ago

    What about the stationary distribution of Markov chain, page rank ?
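
For readers unfamiliar with the idea mentioned above: PageRank can be viewed as the stationary distribution of a random surfer who follows links with some probability and otherwise jumps to a random page. A minimal power-iteration sketch, assuming a four-page toy link graph; the damping value 0.85 is the commonly quoted default, and the graph and function name are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate the stationary distribution by repeated redistribution."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # a page with no links spreads evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy link graph: page -> pages it links to (every target is also a key).
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(LINKS)
for page, value in sorted(RANKS.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(value, 3))
```

Page "c" ends up on top because three pages link to it, while "d", which nobody links to, keeps only the random-jump share.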

  • @michael1026h1
    @michael1026h1 8 years ago

    Couple of questions: Why wouldn't a document with nothing but the word "horse" listed a thousand times show up high in the rankings? Also, if this is how indexes work, how does Google search for strings with quotes? IE: "My horse" wouldn't show documents with "my lovely horse".
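
Both questions above have standard textbook answers, sketched here under simplifying assumptions. Term counts are often dampened (for example with a logarithm) so that repeating a word a thousand times helps only slightly, and quoted phrases can be checked with word positions. The function names are illustrative, and this is not a description of what Google actually does.

```python
import math


def dampened_tf(count):
    """Sublinear term-frequency weight: 1 + ln(count), or 0 if absent."""
    return 1.0 + math.log(count) if count > 0 else 0.0


def positions(text):
    """Map each word to the list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(pos)
    return index


def contains_phrase(text, phrase):
    """True if the phrase's words occur at consecutive positions in text."""
    index = positions(text)
    words = phrase.lower().split()
    if not words:
        return False
    return any(
        all(start + offset in index.get(word, [])
            for offset, word in enumerate(words))
        for start in index.get(words[0], [])
    )


print(dampened_tf(1), dampened_tf(1000))
print(contains_phrase("look at my horse", "my horse"))
print(contains_phrase("my lovely horse", "my horse"))
```

With this weighting, a thousand repetitions of "horse" count for less than eight single mentions' worth of weight, and "my lovely horse" fails the exact-phrase check because "horse" is not directly after "my".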

  • @DataCab1e
    @DataCab1e 9 years ago

    The second episode of Star Trek TNG appears horribly dated to anyone who's used Google, because Data's search for an incident in which someone had showered in his or her clothing was treated as an un-indexed paper document search, assisted only by the android's ability to read every file relatively quickly.

  • @rich1051414
    @rich1051414 9 years ago +1

    Google isn't really a 'secret' formula. It is a product of having a LOT of information which has been applied towards their optimizations. PageRank perhaps can be seen as the magic formula which allowed them to be good enough to get to where they are today, but that is only a small piece of the puzzle. The real star of the show is relevant equality of search terms, which requires a lot of data to achieve accurately.
    Beyond what is discussed in this video, where each word is given an importance value, each word is also put into a 'group of equality'. A group of equality for search terms combines a variety of things we would likely consider fundamentally different, but with respect to what matters to a search, that fundamental difference is worthless; what is valuable is whether it is relevant to the search. When people search for boats, for instance, the engine may find that returning results for fishing rods yields positive relevancy, so boats and fishing rods could then be seen as 80% the same thing. So when someone searches for boats, results for fishing rods could be returned, but with 80% of the importance factor given to boats, giving better results.
    This leaves the 'relevancy synonyms' left up to the engine to assign autonomously in the most statistically optimized way.
    In this, google has a search algorithm which is exponentially more efficient than anything humans could write themselves, because of how it automatically self optimizes its results without human intervention, with no care for what 'should' or 'shouldn't' be technically categorized together.
    Beyond the initial sorting of a site into its group of completely different but relevantly equal things, it will then rate the site on how good it is at being loyal to its predicted relevancy, by tracking whether people found what they needed there or immediately jumped back to their search to try again. If a site is deemed, say, 30% relevant, but in practice is actually _more_ or less, it is a sign that the site or search terms are poorly defined, and this is improved. More would improve the specificity of its relevancy (or add a relevancy synonym); less would simply be stuck further down the list until people stop wasting their time clicking on it.
    This makes spam sites hoping to exploit it a challenging, if not an impossible to maintain task, because those results will lead to a poor relevancy rating, causing them less and less likely to be anywhere near the top of any string of text you search for without some very specific searching, in which case, you were likely looking for it.
    Edit: I see at 8:00 you went over latent semantic analysis, so never mind xD I should remember to finish watching a video before commenting. Ah well.
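
The weighted "relevancy synonym" idea in this comment can be sketched as query expansion: related terms are added to the query at a reduced weight. The 0.8 figure comes from the comment's own boats and fishing rods example; the `RELATED` table, the function names, and the sample documents are hypothetical, and nothing here is claimed to be how Google works.

```python
def expand_query(terms, related):
    """Give each query term weight 1.0 and each related term its own weight."""
    weights = {}
    for term in terms:
        weights[term] = max(weights.get(term, 0.0), 1.0)
        for other, weight in related.get(term, {}).items():
            weights[other] = max(weights.get(other, 0.0), weight)
    return weights


def score(doc_counts, weights):
    """Weighted sum of how often each (expanded) query term occurs."""
    return sum(w * doc_counts.get(term, 0) for term, w in weights.items())


# Hypothetical relatedness table, e.g. learned from which results users keep.
RELATED = {"boats": {"fishing rods": 0.8}}

# Hypothetical documents reduced to term counts.
DOCS = {
    "marina": {"boats": 2},
    "tackle-shop": {"fishing rods": 2},
    "bakery": {"bread": 3},
}

WEIGHTS = expand_query(["boats"], RELATED)
SCORES = {name: score(counts, WEIGHTS) for name, counts in DOCS.items()}
print(WEIGHTS)
print(SCORES)
```

A page about fishing rods now surfaces for "boats", but below an equally strong page that mentions boats directly.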

    • @mustafaadam9697
      @mustafaadam9697 9 years ago

      Richard Smith Still, your comment was more interesting and informative than 99.9% of the YT comments. As a beginner in the world of data and machine learning, I enjoyed reading it very much ^_^

  • @Destro7000
    @Destro7000 9 years ago

    My lovely horse, running through the field
    Where are you going, with your fetlocks blowing in the wind?

  • @CraftySalvager
    @CraftySalvager 9 years ago

    That's a lot of pre-computation. Pre-computation that you might only use 10% of the final result.

  • @fadouarasmouki725
    @fadouarasmouki725 9 years ago

    Dear Computerphile, can you please add English subtitles for non-native speakers. Thank you.

  • @thetommantom
    @thetommantom 9 years ago

    these remind me of fractals, and then making it 3d or 4d connecting them

  • @Jorissoris
    @Jorissoris 9 years ago +7

    50 lines of PYTHON are not really fast, you say? Well, that's what you get for using Python.

    • @DariushMJ
      @DariushMJ 9 years ago +35

      ***** True, but the biggest problem in this case is the algorithm, not the language. Changing the language may make it run double as fast, while changing the algorithm may make it run a billion times faster when there is a lot of data.

    • @bookdream
      @bookdream 9 years ago +11

      Dariush MJ Exactly. You can use the fastest language on the fastest machine, but if it's an inefficient algorithm it could take a ridiculously long time compared with a faster algorithm written in Python.
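
The point made in this thread can be illustrated by comparing two ways of answering the same query in the same language. The scan reads every document on every query, while the index pays that cost once up front and then answers with a single lookup. The function names and the sample documents are illustrative assumptions, not the code from the video.

```python
from collections import defaultdict


def scan_search(docs, word):
    """No index: read every document on every query."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}


def build_index(docs):
    """One pass over the corpus, done once ahead of time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in text.split():
            index[w].add(doc_id)
    return index


def index_search(index, word):
    """With an index: one dictionary lookup, however large the corpus is."""
    return set(index.get(word, ()))


DOCS = {
    "d1": "my horse is amazing",
    "d2": "my lovely pony",
    "d3": "a horse in a field",
}
INDEX = build_index(DOCS)

print(scan_search(DOCS, "horse"))
print(index_search(INDEX, "horse"))
```

Both functions return the same documents; the difference is how much work each query costs as the corpus grows, which no change of language removes from the scanning version.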

    • @BGBTech
      @BGBTech 9 years ago

      Dariush MJ Python in general is sort of a bane to programming though, producing lots of very slow and unreliable code, with performance often a bit worse even vs a lot of other scripting languages (such as Lua or JS).
      a lot of times with Python code though, it is the combined problem of both a slow language and poorly written code.
      if speed is relevant, a person is probably better off using C or C++ or similar.

    • @Folopolis
      @Folopolis 9 years ago

      ***** The problem is that given a long document with enough unique words, you could be running through a loop hundreds of thousands of times, that's not going to be fast in any language. Efficient algorithm production is as much of an art as a science. This is why Google only has 3 or 4 competitors that are even trying any more.

    • @BGBTech
      @BGBTech 9 years ago

      Alexandru Gheorghe
      I know about algorithmic complexity, but Python is often around 40x-100x slower than C, if your code actually *does* anything (vs just calling into library functions or doing database queries or similar).
      if the same algorithm is used in either language, that speed difference may amount to a fairly big difference overall.
      C in no way prevents using O(log n) or O(1) algorithms, and optimizing algorithms is still a pretty big deal in C land as well.

  • @Tharkz
    @Tharkz 9 years ago +1

    OK, 1/3 through and I just can't hold it in any longer... You're wearing sunglasses indoors with the blinds down and closed, why? :-)

  • @dupirechristophe7703
    @dupirechristophe7703 5 years ago

    What we search here is "my horse" as a block, and not two separate words, but let's do some maths here it will surely resolve the problem x'D

  • @SparkysBarelyMusic
    @SparkysBarelyMusic 9 years ago +1

    I once wanted to find out how the Japanese calendar worked, i.e. 2015 = 27 in Heisei.
    Anyway i googled "japanese dates"
    Moral of the story do not google Japanese dates

  • @bkky9
    @bkky9 9 years ago

    Has Google indexed every file on the internet? How does it have space for that?

    • @Nilguiri
      @Nilguiri 9 years ago

      bkky9 They have a joke size hard disk on their PC.

    • @xponen
      @xponen 9 years ago

      bkky9 they have supercomputers

  • @ScornMuffins
    @ScornMuffins 9 years ago +1

    Jeez, will someone just get this guy a horse already!?

  • @TheNefari
    @TheNefari 9 years ago

    So you always need an index ...
    Then what do you do if your word is not in the index ?

    • @o0julek0o
      @o0julek0o 9 years ago +19

      That's when Google says there's nothing to be found.

    • @Nilguiri
      @Nilguiri 9 years ago +1

      TheNefari If the word is out there but not in the index, then the index needs updating. In the meantime it will tell you that it's not found.

    • @beeflon
      @beeflon 9 years ago

      TheNefari Look who I found here. Didn't expect to see you somewhere in the comment section.

    • @aakksshhaayy
      @aakksshhaayy 9 years ago

      Albert Hofmann were you guys lovers or something?

    • @beeflon
      @beeflon 9 years ago

      aakksshhaayy Nah, he has a small channel and I saw some vid. a while ago. Was just surprised.

  • @iyaanazeez8989
    @iyaanazeez8989 5 years ago

    Quick question, Will i become each time i watch a computerphile video?
    Agree oR Not

  • @owhs
    @owhs 9 years ago

    was he sat on an exercise ball?

  • @Flagen579
    @Flagen579 9 years ago +35

    BANANA

  • @SimbaKing7
    @SimbaKing7 9 years ago

    more!

  • @4pThorpy
    @4pThorpy 9 years ago

    what a fidget!

  • @chappie__
    @chappie__ 4 years ago

    You should rename the video "How search engine indexing works"... So that your video gets a higher index lol

  • @lafeo0077
    @lafeo0077 5 years ago

    Could you go into my complexity?

  • @thetommantom
    @thetommantom 9 years ago

    or trees

  • @Mad_Elf_0
    @Mad_Elf_0 9 years ago

    Yeah... all this 'intelligence' that search engine providers are putting into their products are really neat, but when over 75% of searches you make as part of your job require looking for *exact* words or *exact* phrases, and the search engines 'intelligently' turn "process halted with error" into "process stopping by mistake", *even* if you use double quotes, it starts getting **REALLY** **ANNOYING**.
    I really wish Google would add a "I mean this literally" option to their search options

  • @khaledtareq1472
    @khaledtareq1472 7 years ago

    he said we can do this in 50 lines of python , please I want this 50 line code

  • @Slutuppnu
    @Slutuppnu 9 years ago

    A pony, a pony! My kingdom for a pony!

  • @michaelkruger4421
    @michaelkruger4421 9 years ago

    And the obvious thing to do next is Google "my horse"

  • @ArnoldsKtm
    @ArnoldsKtm 8 years ago

    What's with this guy and his horses? :D

  • @DJDavid98
    @DJDavid98 9 years ago +1

    "my pony" I c what u did thar

  • @Nulono
    @Nulono 8 years ago

    6:48
    11.5?

  • @VladVladislav790
    @VladVladislav790 9 years ago

    I miss Sixty Symbols :(

  • @Zishy
    @Zishy 9 years ago +5

    are you riding a horse?

  • @zebraforceone
    @zebraforceone 8 years ago

    (blazin saddles) HORSES??!?!??!?!??!

  • @trefod
    @trefod 9 years ago

    I lost my concentration a couple of times because the presenter kept bobbing up and down. It is a subtle but effective way of making me lose my calm because I can't reach out to steady him.

  • @goeiecool9999
    @goeiecool9999 9 years ago +2

    This guy sounds super tired lol

  • @aliaydogdu5810
    @aliaydogdu5810 8 years ago

    domato

  • @Goodtimes4100
    @Goodtimes4100 9 years ago

    First comment! Love the vid thanks

  • @BillyBob-ik4pn
    @BillyBob-ik4pn 9 years ago

    7:42 My Little Pony... Half Life 3 confirmed!

  • @arminhrnjic8706
    @arminhrnjic8706 9 years ago

    Like if you googled "my horse"

  • @grimreefer4366
    @grimreefer4366 9 years ago

    It's time I sling the baskets off this overburdened HORSE
    Sink MY toes into the ground and set a different course
    Cause if I were here and you were there
    I'd meet you in between
    And not until MY dying day, confess what I have seen

  • @KhalilEstell
    @KhalilEstell 9 years ago

    He is very bouncy.

  • @7177YT
    @7177YT 4 years ago

    Cute, he explains 'what libraries were' for the average millennial barbarian. lol

  • @ariebrons7976
    @ariebrons7976 9 years ago

    first

    • @ariebrons7976
      @ariebrons7976 9 years ago

      arie brons 9th you idiot

    • @ariebrons7976
      @ariebrons7976 9 years ago

      arie brons buy a mirror and then we 'll see who is the idiot here

    • @ariebrons7976
      @ariebrons7976 9 years ago

      arie brons guys, guys calm down we don't need to fight
      i mean we are all human,
      in fact we are all the same person

    • @ariebrons7976
      @ariebrons7976 9 years ago

      arie brons what do you mean, same person

    • @ariebrons7976
      @ariebrons7976 9 years ago

      arie brons I mean that we are literally just letters expressing the opinion of some dude with a weird hobby

  • @rdoetjes
    @rdoetjes 9 years ago +1

    Very interesting subject, but as a director I was getting so annoyed by the guy trembling up and down in his chair as if he was nervously wiggling his feet. It really got me out of the story.

  • @poteb
    @poteb 8 years ago

    Great explanation, but please stop jumping in your chair, I'm getting a bit of motion sickness.