Word Embedding and Word2Vec, Clearly Explained!!!

  • Published 22 Dec 2024

COMMENTS • 564

  • @statquest
    @statquest  Рік тому +20

    To learn more about Lightning: lightning.ai/
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
    NOTE: A lot of people ask for the math at 13:16 to be clarified. In that example we have 3,000,000 inputs, each connected to 100 activation functions, for a total of 300,000,000 weights on the connections from the inputs to the activation functions. We then have another 300,000,000 weights on the connections from the activation functions to the outputs. 300,000,000 + 300,000,000 = 2 * 300,000,000 = 600,000,000.
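
    A minimal Python sketch of that arithmetic (using the vocabulary size and number of activation functions from the note above):

        vocab_size = 3_000_000    # words and phrases in the vocabulary
        n_activations = 100       # activation functions in the hidden layer

        input_side = vocab_size * n_activations     # weights: inputs -> activation functions
        output_side = n_activations * vocab_size    # weights: activation functions -> outputs

        total = input_side + output_side
        print(total)                                    # 600,000,000
        print(total == 2 * vocab_size * n_activations)  # True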

  • @karanacharya18
    @karanacharya18 7 місяців тому +70

    In simple words, word embeddings are the by-product of training a neural network to predict the next word. By focusing on that single objective, the weights themselves (the embeddings) can be used to understand the relationships between the words (see the sketch after this thread). This is actually quite fantastic! As always, great video @statquest!

    • @statquest
      @statquest  7 місяців тому +8

      bam! :)

    • @joeybasile545
      @joeybasile545 7 місяців тому +4

      Not necessarily just the next word. Your statement is specific.
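
    A minimal PyTorch-style sketch of the "by-product" idea above (hypothetical, much smaller sizes): the network is trained only to predict the next word, and the rows of the learned input-weight matrix are the word embeddings.

        import torch
        import torch.nn as nn

        vocab_size, n_activations = 10_000, 100     # made-up sizes

        # input weights: one row of 100 numbers per word in the vocabulary
        embedding = nn.Embedding(vocab_size, n_activations)
        # output weights: turn the 100 values into a score for every possible next word
        to_next_word = nn.Linear(n_activations, vocab_size, bias=False)

        word_id = torch.tensor([42])                 # some current word
        next_word_scores = to_next_word(embedding(word_id))

        # after training with cross entropy on (current word, next word) pairs,
        # embedding.weight[42] holds the learned embedding for word 42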

  • @NoNonsense_01
    @NoNonsense_01 Рік тому +114

    Probably the most important concept in NLP. Thank you for explaining it so simply and rigorously. Your videos are a thing of beauty!

  • @exxzxxe
    @exxzxxe 9 місяців тому +27

    Josh, this is absolutely the clearest and most concise explanation of embeddings on YouTube!

    • @statquest
      @statquest  9 місяців тому +2

      Thank you very much!

    • @davins90
      @davins90 8 місяців тому +1

      totally agree

  • @chad5615
    @chad5615 Рік тому +4

    Keep up the amazing work (especially the songs), Josh, you're making life easy for thousands of people!

    • @statquest
      @statquest  Рік тому

      Wow! Thank you so much for supporting StatQuest! TRIPLE BAM!!!! :)

  • @SergioPolimante
    @SergioPolimante 11 місяців тому +7

    StatQuest is by far the best machine learning channel on YouTube for learning the basic concepts. Nice job!

  • @rachit7185
    @rachit7185 Рік тому +97

    This channel is literally the best thing that has happened to me on YouTube! Way too excited for your upcoming video on transformers, attention, and LLMs. You're the best, Josh ❤

    • @statquest
      @statquest  Рік тому +6

      Wow, thanks!

    • @MiloLabradoodle
      @MiloLabradoodle Рік тому +4

      Yes, please do a video on transformers. Great channel.

    • @statquest
      @statquest  Рік тому +17

      @@MiloLabradoodle I'm working on the transformers video right now.

    • @liuzeyu3125
      @liuzeyu3125 Рік тому +1

      @@statquest Can't wait to see it!

  • @harin01737
    @harin01737 Рік тому +5

    I was struggling to understand NLP and DL concepts, thinking of dropping my classes, and BAM!!! I found you, and now I'm writing a paper on neural program repair using DL techniques.

  • @JawadAhmadCodes
    @JawadAhmadCodes 3 місяці тому +1

    Oh my gosh, StatQuest is surely the greatest channel I have found for learning the whole universe in a simple way. WOW!

  • @myyoutubechannel2858
    @myyoutubechannel2858 4 місяці тому +1

    In the first 19 seconds my mans explains Word Embedding more simply and elegantly than anything else out there on the internet.

  • @ashmitgupta8039
    @ashmitgupta8039 5 місяців тому +2

    Was literally struggling to understand this concept, and then I found this goldmine.

  • @mannemsaisivadurgaprasad8987
    @mannemsaisivadurgaprasad8987 Рік тому +2

    One of the best videos I've seen so far on embeddings.

  • @haj5776
    @haj5776 Рік тому +2

    The phrase "similar words will have similar numbers" in the song will stick with me for a long time, thank you!

  • @tanbui7569
    @tanbui7569 Рік тому +3

    Damn, when I first learned about this 4 years ago, it took me two days to wrap my head around these weights and embeddings well enough to implement them in code. Just now I needed to refresh the concepts, since I haven't worked with them in a while, and your video illustrated what took me two whole days back then in just 16 minutes!! I wish this video had existed earlier!!

  • @yuxiangzhang2343
    @yuxiangzhang2343 Рік тому +7

    So good!!! This is literally the best deep learning tutorial series I have found… after a very long search on the web!

  • @mycotina6438
    @mycotina6438 Рік тому +4

    BAM!! StatQuest never lies; it is indeed super clear!

  • @manuelamankwatia6556
    @manuelamankwatia6556 8 місяців тому +2

    This is by far the best video on embeddings. A whole university course is broken down into 15 minutes.

  • @TropicalCoder
    @TropicalCoder Рік тому +2

    That was the first time I actually understood embeddings - thanks!

  • @pichazai
    @pichazai 7 місяців тому +2

    this channel is the best ML resource on the entire internet

  • @acandmishra
    @acandmishra 8 місяців тому +1

    Your work is extremely amazing and so helpful for new learners who want to go into the details of how deep learning models work, instead of just knowing what they do!!
    Keep it up!

  • @noadsensehere9195
    @noadsensehere9195 2 місяці тому +1

    This is the only video I could find that helped me understand this basic NLP concept! Thanks!

  • @pushkar260
    @pushkar260 Рік тому +3

    That was quite informative

    • @statquest
      @statquest  Рік тому

      BAM! Thank you so much for supporting StatQuest!!! :)

  • @wizenith
    @wizenith Рік тому +10

    Haha, I love your opening and your teaching style! When we think something is extremely difficult to learn, everything should begin with singing a song; that makes the day more beautiful to begin with (heheh, I'm not just teasing, lol, I really like it). Thanks for sharing your thoughts with us.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w Рік тому +1

    This is the best explanation of word embedding I have come across.

  • @awaredz007
    @awaredz007 7 місяців тому +1

    Wow!! This is the best definition of word embedding I have ever heard or seen, right at 09:35. Thanks for the clear and awesome video. You rock!!

  • @channel_SV
    @channel_SV Рік тому +1

    It's so nice to google something and realize that there is a StatQuest about your question, when you were certain there hadn't been one not long ago.

  • @DanielDias-vl2js
    @DanielDias-vl2js 4 місяці тому +1

    Thank goodness I found this channel! You've got great content and an excellent teaching methodology here!

  • @mazensaaed8635
    @mazensaaed8635 5 місяців тому +2

    I promise I'll become a member of your channel when I get my first data science job.

    • @statquest
      @statquest  5 місяців тому

      BAM! Thank you very much! :)

  • @flow-saf
    @flow-saf Рік тому +2

    This video explains the source of the multiple dimensions in a word embedding, in the most simple way. Awesome. :)

  • @exxzxxe
    @exxzxxe 8 місяців тому +1

    Hopefully everyone following this channel has Josh's book. It is quite excellent!

    • @statquest
      @statquest  8 місяців тому

      Thanks for that!

  • @rathinarajajeyaraj1502
    @rathinarajajeyaraj1502 Рік тому +1

    This is one of the best sources of information.... I always find videos a great source of visual stimulation... thank you.... infinite baaaam

  • @dreamdrifter
    @dreamdrifter Рік тому +2

    Thank you Josh, this is something I've been meaning to wrap my head around for a while and you explained it so clearly!

  • @ananpinya835
    @ananpinya835 Рік тому +3

    StatQuest is great! I learn a lot from your channel. Thank you very much!

  • @gustavow5746
    @gustavow5746 Рік тому +1

    the best video I saw about this topic so far. Great Content! Congrats!!

  • @FullStackAmigo
    @FullStackAmigo Рік тому +4

    Absolutely the best explanation that I've found so far! Thanks!

  • @muthuaiswaryaaswaminathan4079
    @muthuaiswaryaaswaminathan4079 Рік тому +2

    Thank you so much for this playlist! Got to learn a lot of things in a very clear manner. TRIPLE BAM!!!

  • @EZZAHIRREDOUANE
    @EZZAHIRREDOUANE 7 місяців тому +1

    Great presentation, You saved my day after watching several videos, thank you!

  • @MarvinMendesCabral
    @MarvinMendesCabral Рік тому +1

    Hey Josh, I'm a Brazilian student and I love watching your videos; each one is such a good and fun-to-watch explanation of the concepts. I just wanted to say thank you, because in the last few months you've made me smile in the middle of studying. So, thank you!!! (sorry for the bad English hahaha)

  • @wellwell8025
    @wellwell8025 Рік тому +3

    Way better than my University slides. Thanks

  • @lfalfa8460
    @lfalfa8460 Рік тому +1

    I love all of your songs. You should record a CD!!! 🤣
    Thank you very much again and again for the elucidating videos.

  • @familywu3869
    @familywu3869 Рік тому +6

    Thank you very much for your excellent tutorials, Josh! I have a question: at around 13:30 of this video, you mentioned multiplying by 2. I am not sure why 2. I mean, if there are more than 2 outputs, would we multiply by the number of output nodes instead of 2? Thank you for your clarification in advance.

    • @statquest
      @statquest  Рік тому +4

      If we have 3,000,000 words and phrases as inputs, and each input is connected to 100 activation functions, then we have 300,000,000 weights going from the inputs to the activation functions. Then from those 100 activation functions, we have 3,000,000 outputs (one per word or phrase), each with a weight. So we have 300,000,000 weights on the input side, and 300,000,000 weights on the output side, or a total of 600,000,000 weights. However, since we always have the same number of weights on the input and output sides, we only need to calculate the number of weights on one side and then just multiply that number by 2.

    • @surojit9625
      @surojit9625 Рік тому +3

      @@statquest Thanks for explaining! I also had the same question.

    • @jwilliams8210
      @jwilliams8210 Рік тому +1

      Ohhhhhhhhh! I missed that the first time around! BTW: (Stat)Squatch and Norm are right: StatQuest is awesome!!

  • @mykolalebid6279
    @mykolalebid6279 Місяць тому

    Thank you for your excellent work. A video on negative sampling would be a valuable addition.

    • @statquest
      @statquest  Місяць тому

      I'll keep that in mind.

  • @aoliveiragomes
    @aoliveiragomes Рік тому +1

    Thanks!

    • @statquest
      @statquest  Рік тому

      BAM!!! Thank you so much for supporting StatQuest!!! :)

  • @fouadboutaleb4157
    @fouadboutaleb4157 Рік тому +2

    Bro, I have a master's degree in ML, but trust me, you explain it better than my teachers ❤❤❤
    Big thanks

  • @LakshyaGupta-ge3wj
    @LakshyaGupta-ge3wj Рік тому +2

    Absolutely mind-blowing and amazing presentation! For Word2Vec's strategy for increasing context, does it employ the 2 strategies in addition to the 1-output-for-1-input basic method we talked about in the whole video, or are they replacements? Basically, are we still training the model on predicting "is" for "Gymkata" in the same neural network along with predicting "is" for a combination of "Gymkata" and "great"?

    • @statquest
      @statquest  Рік тому

      Word2Vec uses one of the two strategies presented at the end of the video.

  • @michaelcheung6290
    @michaelcheung6290 Рік тому +2

    Thank you statquest!!! Finally I started to understand LSTM

  • @RaynerGS
    @RaynerGS Рік тому +1

    I admire your work a lot. Salute from Brazil.

  • @mamdouhdabjan9292
    @mamdouhdabjan9292 Рік тому +6

    Hey Josh. A great new series that I, and many others, would be excited to see is Bayesian statistics. Would love to watch you explain the intricacies of that branch of stats. Thanks as always for the great content, and keep up with the neural-network-related videos. They are especially helpful.

  • @周子懿-y5r
    @周子懿-y5r Рік тому +3

    Thank you Josh for this great video. I have a quick question about the Negative Sampling: If we only want to predict A, why do we need to keep the weights for "abandon" instead of just ignoring all the weights except for "A"?

    • @statquest
      @statquest  Рік тому +3

      If we only focused on the weights for "A" and nothing else, then training would cause all of the weights to make every output = 1. In contrast, by adding some outputs that we want to be 0, training is forced to make sure that not every single output gets a 1.

  • @alexdamado
    @alexdamado 5 місяців тому +1

    Thanks for posting. It is indeed a clear explanation and helped me move forward with my studies.

    • @statquest
      @statquest  5 місяців тому

      Glad it was helpful!

  • @mahdi132
    @mahdi132 Рік тому +1

    Thank you sir. Your explanation is great and your work is much appreciated.

  • @ajd3fjf4hsjd3
    @ajd3fjf4hsjd3 4 місяці тому +1

    Fantastically simple, and complete!

  • @ramzirebai3661
    @ramzirebai3661 Рік тому +1

    Thank you so much, Mr. Josh Starmer; you are the only one who makes ML concepts easy to understand.
    Can you please explain GloVe?

  • @ah89971
    @ah89971 Рік тому +50

    When I watched this, I had only one question: why did all the others fail to explain it like this, if they fully understood the concept?

    • @statquest
      @statquest  Рік тому +18

      bam!

    • @rudrOwO
      @rudrOwO Рік тому +6

      @@statquest Double Bam!

    • @meow-mi333
      @meow-mi333 11 місяців тому +3

      Bam the bam!

    • @eqe-kui-nei
      @eqe-kui-nei 3 місяці тому +1

      @@ah89971 A lot of people in this industry (even with a PhD) actually don't.

  • @m3ow21
    @m3ow21 Рік тому +1

    I love the way you teach!

  • @akashbarik5806
    @akashbarik5806 21 день тому

    @statquest "5:30" I'm not sure if I'm right, but after researching a bit I found out that the number of activation functions have nothing to do with the number of associations with each word, The number of activation functions depend upon the structure of your neural network, and the number of vector representations solely depend upon how you want to embed the words. In simple terms, you can have a 3 vector representation of a word and use only 2 activation functions. I may be wrong but thats what I found out.

    • @statquest
      @statquest  20 днів тому +1

      To create 3 embedding values per input with only 2 activation functions, you could connect all of the inputs to 1 activation function and put 1 weight on each input, but then you'd need to connect all of the inputs to the other activation function and use 2 weights for each input. The problem with that second activation function is that input * w1 * w2 = input * (w1 * w2) = input * w3, so I believe you'd end up with the equivalent of just 2 embedding values per input in the end (a short numeric sketch follows this thread). I believe this is why neural networks are always designed to have one weight per input per activation function.

    • @akashbarik5806
      @akashbarik5806 17 днів тому

      @@statquest Thanks a lot for the clarification!
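
    A minimal numeric sketch of the point above (made-up numbers): two weights applied in sequence on the same connection collapse into a single weight, so no extra embedding value is gained.

        x = 4.0               # an input value
        w1, w2 = 0.3, -0.5    # two weights chained on one connection

        w3 = w1 * w2          # the equivalent single weight
        print(x * w1 * w2 == x * w3)   # True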

  • @alfredoderodt6519
    @alfredoderodt6519 Рік тому +1

    You are a beautiful human! Thank you so much for this video! I was finally able to understand this concept! Thanks so much again!!!!!!!!!!!!! :)

  • @saisrisai9649
    @saisrisai9649 11 місяців тому +1

    Thank you Statquest!!!!

  • @vpnserver407
    @vpnserver407 Рік тому +1

    Highly valuable video and book tutorial; thanks for putting these kinds of special tutorials out here.

  • @AliShafiei-ui8tn
    @AliShafiei-ui8tn Рік тому +1

    the best channel ever.

  • @avishkaravishkar1451
    @avishkaravishkar1451 Рік тому +2

    For those of you who find it hard to understand this video, my recommendation is to watch it at a slower pace and take notes. It will really make things much clearer.

  • @eamonnik
    @eamonnik Рік тому +1

    Hey Josh! Loved seeing your talk at BU! Appreciate your videos :)

  • @ParthPandey-j2h
    @ParthPandey-j2h Рік тому +1

    At 13:30, why did we multiply by 2 while calculating the number of weights required: 3 million (words) * 100 (activations/word) * 2?

    • @statquest
      @statquest  Рік тому

      Because we have the same number of weights on connections going to the activation functions as we have weights going from the activation functions.

    • @ParthPandey-j2h
      @ParthPandey-j2h Рік тому

      @@statquest Is it like wX+b, so taking b

    • @statquest
      @statquest  Рік тому +1

      @@ParthPandey-j2h If we have 3,000,000 words and phrases as inputs, and each input is connected to 100 activation functions, then we have 300,000,000 weights going from the inputs to the activation functions. Then from those 100 activation functions, we have 3,000,000 outputs (one per word or phrase), each with a weight. So we have 300,000,000 weights on the input side, and 300,000,000 weights on the output side, or a total of 600,000,000 weights. However, since we always have the same number of weights on the input and output sides, we only need to calculate the number of weights on one side and then just multiply that number by 2.

    • @ParthPandey-j2h
      @ParthPandey-j2h Рік тому +2

      Wow, thanks for the reply @@statquest Double Bam!!

  • @bancolin1005
    @bancolin1005 Рік тому +1

    BAM! Thanks for your video, I finally realize what the negative sampling means ~

  • @natuchips98
    @natuchips98 4 місяці тому +1

    You literally saved my life

  • @MadeyeMoody492
    @MadeyeMoody492 Рік тому +1

    Great video! Was just wondering why the outputs of the softmax activation at 10:10 are just 1s and 0s. Wouldn't that only be the case if we applied ArgMax here, not SoftMax?

    • @statquest
      @statquest  Рік тому +3

      In this example the dataset is very small and, for example, the word "is" is always followed by "great", every single time. In contrast, if we had a much larger dataset, then the word "is" would be followed by a bunch of words (like "great", or "awesome", or "horrible", etc.) and not followed by a bunch of other words (like "ate", or "stand", etc.). In that case, the softmax would tell us which words had the highest probability of following "is", and we wouldn't just get 1.0 for a single word that could follow the word "is" (a small softmax sketch follows this thread).

    • @MadeyeMoody492
      @MadeyeMoody492 Рік тому +1

      @@statquest Ohh ok, that clears it up. Thanks!!
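
    A minimal softmax sketch (made-up scores for four words that might follow "is"): with a larger dataset the probabilities spread out instead of being a hard 1 and 0s.

        import torch

        # hypothetical raw scores (logits) for "great", "awesome", "horrible", "ate"
        logits = torch.tensor([2.0, 1.5, 0.2, -3.0])

        probs = torch.softmax(logits, dim=0)
        print(probs)        # roughly [0.56, 0.34, 0.09, 0.004]
        print(probs.sum())  # 1.0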

  • @wenqiangli7544
    @wenqiangli7544 Рік тому +1

    Great video for explaining word2vec!

  • @ColinTimmins
    @ColinTimmins Рік тому +1

    Thank you so much for these videos. It really helps with the visuals because I am dyslexic… Quadruple BAM!!!! lol 😊

  • @danish5326
    @danish5326 Рік тому +1

    Thanks for enlightening us Master.

  • @ericvaish8841
    @ericvaish8841 4 місяці тому +1

    Great explanation my man!!

  • @neemo8089
    @neemo8089 Рік тому

    Thank you so much for the video! I have one question: at 15:09, why do we only need to optimize 300 weights per step? For one word with 100 * 2 weights? I'm not sure how to understand the '2' either.

    • @statquest
      @statquest  Рік тому +1

      At 15:09 there are 100 weights going from the word "aardvark" to the 100 activation functions in the hidden layer. There are then 100 weights going from the activation functions to the sum for the word "A" and 100 weights going from the activation functions to the sum for the word "abandon". Thus, 100 + 100 + 100 = 300.

    • @neemo8089
      @neemo8089 Рік тому +1

      Thank you!@@statquest

  • @auslei
    @auslei Рік тому +1

    Love this channel.

  • @shamshersingh9680
    @shamshersingh9680 8 місяців тому +1

    Hi Josh, again the best explanation of the concept. However, I have a doubt. As per the explanation, word embeddings are the weights associated with each word between the input layer and the activation function layer. These weights are obtained after training on a large text corpus like Wikipedia. When I train another model using these embeddings on another set of data, the weights (embeddings) will change during back-propagation. So the embeddings will not remain the same and will change with every model we train. Is that the correct interpretation, or am I missing something here?

    • @statquest
      @statquest  8 місяців тому +1

      When you build a neural network, you can specify which weights are trainable and which should be left as is. This is the basis of "fine-tuning" a model - just training specific weights rather than all of them. So, you can do that. Or you can just start from scratch - don't pre-train the word embeddings, but train them when you train everything else. This is what most large language models, like ChatGPT, do.
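
    A minimal PyTorch sketch of that choice (hypothetical sizes): load pre-trained embeddings and mark them as non-trainable, or leave them trainable to fine-tune them.

        import torch
        import torch.nn as nn

        vocab_size, n_activations = 10_000, 100
        pretrained = torch.randn(vocab_size, n_activations)  # stand-in for word2vec weights

        # freeze=True sets requires_grad=False, so back-propagation in the new
        # model leaves these embedding values unchanged
        embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

        # to fine-tune the embeddings instead, make them trainable again:
        # embedding.weight.requires_grad = True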

  • @tupaiadhikari
    @tupaiadhikari Рік тому

    Great explanation. Please make a video on how we connect the output of an embedding layer to an LSTM/GRU for doing classification, say for sentiment analysis.

    • @statquest
      @statquest  Рік тому

      I show how to connect it to an LSTM for language translation here: ua-cam.com/video/L8HKweZIOmg/v-deo.html

    • @tupaiadhikari
      @tupaiadhikari Рік тому +1

      @@statquest Thank You Professor Josh !

  • @MaskedEngineerYH
    @MaskedEngineerYH Рік тому +1

    Keep going statquest!!

  • @lexxynubbers
    @lexxynubbers Рік тому +1

    Machine learning explained like Sesame Street is exactly what I need right now.

  • @pedropaixaob
    @pedropaixaob 11 місяців тому +1

    This is an amazing video. Thank you!

  • @gabrielrochasantana
    @gabrielrochasantana 9 місяців тому +1

    Amazing lecture, congrats. The audio was also made with NLP (Natural Language Processing), right?

    • @statquest
      @statquest  9 місяців тому

      The translated overdubs were.

  • @ar_frz
    @ar_frz Місяць тому +1

    This was lovely! thank you.

  • @nimitnag6497
    @nimitnag6497 4 місяці тому +1

    Hey Josh, thanks for this amazing video. It was an amazing explanation of a cool concept. However, I have a question. If the corpus also contains a document that states "Troll 2 is bad!", will the words "bad" and "awesome" share similar embedding vectors? If not, can you please give an explanation? Thank you so much for helping out.

    • @statquest
      @statquest  4 місяці тому

      It's possible that they would, since it occurs in the exact same context. However, if you have a larger dataset, you'll get "bad" in other, more negative contexts, and you'll get "awesome" in other, more positive contexts, and that will, ultimately, affect the embeddings for each word.

    • @nimitnag6497
      @nimitnag6497 4 місяці тому +1

      @@statquest Thank you so much Josh for your quick reply

    • @nimitnag6497
      @nimitnag6497 4 місяці тому

      Do you have any Discord groups or any other forum where we can ask questions?

    • @statquest
      @statquest  4 місяці тому

      @@nimitnag6497 Unfortunately not.

  • @kimsobota1324
    @kimsobota1324 Рік тому

    I appreciate the knowledge you've just shared. It explains many things to me about neural networks. I have a question though: if you are randomly assigning a value to a word, why not try something easier?
    For example, in Hebrew, each letter of the Alef-Bet is assigned a value, and these values are added together to form the sum of a word. It is the context of the word in a sentence that forms the block. Sabe? Take a look at Gematria; Hebrew has been doing this for thousands of years. Just a thought.

    • @statquest
      @statquest  Рік тому

      Would that method result in words used in similar contexts to have similar numbers? Does it apply to other languages? Other symbols? And can we end up with multiple numbers per symbol to reflect how it can be used or modified in different contexts?

    • @kimsobota1324
      @kimsobota1324 Рік тому

      I wish I could answer that question better than to tell you context is EVERYTHING in Hebrew, a language that has but doesn't use vowels, since all who use the language understand the consonant-based word structures.
      Not only that, but in the late 1890s Rabbis from Ukraine and Azerbaijan developed a mathematical code that was used to predict word structures from the Torah that were accurate to a value of 0.001%.
      Others have tried to apply it to other books like Alice in Wonderland and could not duplicate the result.
      You can find more information on the subject through a book called The Bible Code, which gives much more information as well as the formulae the Jewish mathematicians created.
      While it is a poor citation, I have included this Wikipedia link: en.wikipedia.org/wiki/Bible_code#:~:text=The%20Bible%20code%20(Hebrew%3A%20%D7%94%D7%A6%D7%95%D7%A4%D7%9F,has%20predicted%20significant%20historical%20events.
      The book is available on Amazon if you find it piques your interest. Please let me know if this helps.
      @@statquest

    • @kimsobota1324
      @kimsobota1324 Рік тому

      @starquest,
      I had not heard from you about the Wiki?

  • @jingzhouzhao8609
    @jingzhouzhao8609 6 місяців тому

    Great video, high quality!! Just wondering about the "times 2" at 13:27: I saw 4 neurons in the output layer, so why not "times 4"?

    • @statquest
      @statquest  6 місяців тому

      A lot of people ask for the math at 13:16 to be clarified. In that example we have 3,000,000 inputs (only the first 4 are shown...), each connected to 100 activation functions, for a total of 300,000,000 weights on the connections from the inputs to the activation functions. We then have another 300,000,000 weights on the connections from the activation functions to the outputs (only 4 outputs are shown, but there are 3,000,000). 300,000,000 + 300,000,000 = 2 * 300,000,000

  • @study-tp4ts
    @study-tp4ts Рік тому +1

    Great video as always!

  • @yasminemohamed5157
    @yasminemohamed5157 Рік тому +1

    Awesome as always. Thank you!!

  • @manpower9641
    @manpower9641 Місяць тому

    Hmm, where did the x2 on the weights come from (at 13:30)? Thank you :)

    • @statquest
      @statquest  Місяць тому

      We have 3,000,000 inputs, and each input has 100 weights going to the hidden layer. We then have 100 weights going from the hidden layer to the 3,000,000 outputs. The total number of weights that we need to train, is thus the sum of the weights from the input to the hidden layer (3,000,000 * 100), plus the weights from the hidden layer to the outputs (3,000,000 * 100). Thus, we can write it as 3,000,000 * 100 * 2.

  • @c.nbhaskar4718
    @c.nbhaskar4718 Рік тому +1

    great stuff as usual ..BAM * 600 million

  • @sandeepgiri2374
    @sandeepgiri2374 Рік тому +1

    Could you please explain the final calculation to derive 300 weights to be optimized? Shouldn't it be 1*100*2 = 200 weights, not 300?

    • @statquest
      @statquest  Рік тому +1

      I'm not sure I fully understand your math. However, if we have 3,000,000 words in the vocabulary, and each one has 100 weights going to the activation functions and 100 weights going away from the activation functions, then we have 3,000,000 * 100 * 2 = 600,000,000.

    • @sandeepgiri2374
      @sandeepgiri2374 Рік тому

      @@statquest Hi Josh, I'm talking about 15:10, when you're saying you need to optimize 300 per step. Looking at the network it looks like a 1*100*2 calculation. Could you please explain how you arrived at 300 weights ? Thanks

    • @statquest
      @statquest  Рік тому +2

      @@sandeepgiri2374 We have 100 weights leading to the activation functions and 200 weights leaving the activation functions. 100 + 200 = 300 weights total. I think the problem with your math is that instead of 1*100*2, you need 100 * (1 + 2): 1 for the weights leading to the activation functions and 2 for the weights leaving the activation functions.

  • @yuhanzhou6963
    @yuhanzhou6963 Рік тому

    Hi Josh! Thank you so much for the clear explanation! I'm just having trouble understanding why it is that we DON'T want to predict "abandon" but we are still optimizing the weights that lead to it. Shouldn't it be that we WANT to predict "abandon", and Negative Sampling selects a subset of words that we WANT TO PREDICT?

    • @statquest
      @statquest  Рік тому

      What time point, minutes and seconds, are you asking about?

    • @add-mt5xc
      @add-mt5xc Рік тому

      @@statquest ua-cam.com/video/viZrOnJclY0/v-deo.html

    • @yuhanzhou6963
      @yuhanzhou6963 Рік тому

      It's from 14:35 to 15:07, thank you!@@statquest

    • @statquest
      @statquest  Рік тому +1

      @@yuhanzhou6963 I see your confusion. I should have chosen my words better. By "words we don't want to predict", I mean "words that we want to have 0s as output". So we want to predict that "A" gets a 1 and "Abandon" gets a 0. So we need to optimize the weights leading to "A" and the weights leading to "Abandon".
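
    A minimal PyTorch sketch of that negative-sampling step (hypothetical, much smaller sizes, using a binary cross entropy on just the selected outputs): only the output for "A" (target 1) and the sampled output for "abandon" (target 0) are computed, so one step only touches 100 + 100 + 100 = 300 weights.

        import torch
        import torch.nn as nn

        vocab_size, n_activations = 10_000, 100                 # made-up sizes
        in_weights = nn.Embedding(vocab_size, n_activations)    # inputs -> hidden layer
        out_weights = nn.Embedding(vocab_size, n_activations)   # hidden layer -> outputs

        aardvark, a_word, abandon = 7, 0, 1                     # made-up word ids

        hidden = in_weights(torch.tensor(aardvark))              # 100 weights used
        selected = out_weights(torch.tensor([a_word, abandon]))  # 2 x 100 weights used
        logits = selected @ hidden                               # 2 output scores

        targets = torch.tensor([1.0, 0.0])   # "A" should be 1, "abandon" should be 0
        loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
        loss.backward()   # only the 300 weights used above get non-zero gradients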

  • @pakaponwiwat2405
    @pakaponwiwat2405 Рік тому +1

    Wow, Awesome. Thank you so much!

  • @nouraboub4805
    @nouraboub4805 5 місяців тому +1

    Good, thank you so much; this playlist is the best ❤️😍

    • @statquest
      @statquest  5 місяців тому +1

      Glad you enjoy it!

  • @BalintHorvath-mz7rr
    @BalintHorvath-mz7rr 9 місяців тому

    Awesome video! This time, I feel I'm missing one step, though. Namely, how do you train this network? I mean, I get that we want the network to be such that similar words have similar embeddings. But what is the 'actual' value we use in our loss function to measure the difference from and use backpropagation with?

    • @statquest
      @statquest  9 місяців тому

      Yes

    • @balintnk
      @balintnk 9 місяців тому

      @@statquest haha I feel like I didn't ask the question well :D How would the network know, without human input, that Troll 2 and Gymkata is very similar and so it should optimize itself so that ultimately they have similar embeddings? (What "Actual" value do we use in the loss function to calculate the residual?)

    • @statquest
      @statquest  9 місяців тому

      @@balintnk We just use the context that the words are used in. Normal backpropagation plus the cross entropy loss function where we use neighboring words to predict "troll 2" and "gymkata" is all you need to use to get similar embedding values for those. That's what I used to create this video.
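
    A minimal sketch of where the "actual" values come from (assuming a window of one neighboring word on each side): the neighbors are the input, and the word itself is the label that the cross entropy loss compares the predictions against.

        corpus = [["Troll 2", "is", "great"], ["Gymkata", "is", "great"]]  # tiny made-up corpus

        training_pairs = []
        for tokens in corpus:
            for i, target in enumerate(tokens):
                # neighboring words are the input; the word itself is the label
                context = tokens[max(0, i - 1):i] + tokens[i + 1:i + 2]
                training_pairs.append((context, target))

        print(training_pairs)
        # [(['is'], 'Troll 2'), (['Troll 2', 'great'], 'is'), (['is'], 'great'), ...]
        # "Troll 2" and "Gymkata" appear in the same contexts, so training pushes
        # their embeddings toward similar values.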

  • @denismarcio
    @denismarcio 9 місяців тому +1

    Extremely instructive! Congratulations.

    • @statquest
      @statquest  9 місяців тому +1

      Thank you very much! :)

  • @КристиКристи-т7к
    @КристиКристи-т7к Місяць тому

    Could you please explain to me why we get 300 weights for optimization in the negative sampling part? I thought the math should be as follows:
    1 word (aardvark) x 100 weights per word and phrase leading to the hidden layer x 2 weights that get us from the activation functions = 200 weights

    • @statquest
      @statquest  Місяць тому

      In this example we want the output for "A" to be 1 and the output for "abandon" to be 0. So we have 100 weights from "aardvark" to the hidden layer. And then 100 weights going from the hidden layer to the output for "A" and 100 weights going from the hidden layer to the output for "abandon". 100 + 100 + 100 = 300.

  • @lancezhang892
    @lancezhang892 Рік тому

    If we use the softmax function as the activation function in the last step, should we use the cross entropy loss function with the prediction value y_hat and the label value y = 1 to get the loss value, and then use backpropagation to optimize the weights?

    • @statquest
      @statquest  Рік тому +1

      We use the cross entropy loss with the softmax.
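
    A minimal PyTorch sketch of that pairing (made-up numbers): nn.CrossEntropyLoss applies the softmax (as a log-softmax) internally, so the network's raw output scores are passed in directly along with the index of the word that actually comes next.

        import torch
        import torch.nn as nn

        logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])  # raw scores for 4 candidate words
        target = torch.tensor([0])                      # index of the correct next word

        loss = nn.CrossEntropyLoss()(logits, target)
        print(loss)   # equals -log(softmax(logits)[0, 0]); backpropagation minimizes this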

  • @MannyBernabe
    @MannyBernabe 2 місяці тому +1

    Great work. Thank you.

  • @张超-o2z
    @张超-o2z 10 місяців тому

    Hey, Josh! Absolutely amazing series!!!
    If I understand correctly, the input weights of a specific word (e.g., Gymkata) are its coordinates in multi-dimensional space? And the coordinates can be used to calculate cosine similarity to find similar meanings as well (e.g., girl/queen, guy/king)? (A cosine-similarity sketch follows this thread.)
    And is it true that the same philosophy applies to LLMs such as GPT embeddings? GPT Text-embeddings-ada-002 has 1536 dimensions, which means there are 1536 nodes in the 1st hidden layer?

    • @statquest
      @statquest  10 місяців тому

      In theory it applies to LLMs, but those networks are so complex that I'm not 100% sure they do. And a model with 1536 dimensions has 1536 nodes in the first layer.

    • @张超-o2z
      @张超-o2z 10 місяців тому

      You mean 1536 dimensions, not 1546, right? @@statquest

    • @statquest
      @statquest  10 місяців тому

      @@张超-o2z yep
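
    A minimal sketch of the cosine-similarity idea from this thread (made-up 3-dimensional embeddings for illustration):

        import torch

        # hypothetical learned embedding vectors (rows of the input-weight matrix)
        king = torch.tensor([0.9, 0.8, 0.1])
        queen = torch.tensor([0.8, 0.9, 0.2])
        pizza = torch.tensor([-0.7, 0.1, 0.9])

        cos = torch.nn.functional.cosine_similarity
        print(cos(king, queen, dim=0))  # close to 1: used in similar contexts
        print(cos(king, pizza, dim=0))  # much lower: used in different contexts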

  • @jayachandrarameshkalakutag7329

    Hi Josh, firstly, thank you for all your videos. I had one doubt: in skip-gram, what is the loss function on which the network is optimized? In CBOW I can see that cross entropy is enough.

    • @statquest
      @statquest  Рік тому

      I believe it's cross entropy in both.

  • @ang3dang2
    @ang3dang2 3 місяці тому +1

    Can you do one for wav2vec? It seemingly taps into the same concept as word2vec, but the equations are so much more complex.

    • @statquest
      @statquest  3 місяці тому

      I'll keep that in mind.

  • @JohnDoe-r3m
    @JohnDoe-r3m Рік тому +1

    That's awesome! But how would a multilingual word2vec be trained? Would the training dataset simply include corpora of two (or more) languages, or would additional NN infrastructure be required?

    • @statquest
      @statquest  Рік тому

      Are you asking about something that can translate one language to another? If so, then, yes, additional infrastructure is needed and I'll describe it in my next video in this series (it's called "sequence2sequence").

    • @JohnDoe-r3m
      @JohnDoe-r3m Рік тому

      @@statquest Not exactly; it's more like having similar words from multiple languages mapped into the same vector space. So, for example, "king" and its equivalents in French, German, and Spanish would appear to be the same.

    • @statquest
      @statquest  Рік тому +1

      @@JohnDoe-r3m Hmmm... I'm not sure how that would work, because the English word "king" and the Spanish translation, "rey", would be in different contexts. For example, the English "king" would appear in a phrase like "all hail the king", and the Spanish version would be in a sentence with completely different words (even if they meant the same thing).

  • @exxzxxe
    @exxzxxe 9 місяців тому +2

    You ARE the Batman and Superman of machine learning!