Q&A - Hierarchical Softmax in word2vec

  • Published 22 Aug 2024
  • What is the "Hierarchical Softmax" option of a word2vec model? What problems does it address, and how does it differ from Negative Sampling? How is Hierarchical Softmax implemented?
    For more insights into word2vec, check out my full online course on word2vec here:
    www.chrismccor...

COMMENTS • 35

  • @vgkk5637
    @vgkk5637 4 years ago +10

    Thank you Chris. Well explained, and just perfect for people like me who are interested in understanding the concepts and usage rather than the academic maths behind it.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thanks VeniVig! That's nice to hear, and I'm glad it provided a practical understanding.

  • @abhijeetsharma5715
    @abhijeetsharma5715 3 years ago +2

    This was the best explanation of HS that I've seen! Very clearly explained.
    In my opinion, the most essential part is that even with HS we still have |V|-1 output units, but only log|V| of them need to be computed during training, since the remaining units are "don't-cares" and we can compute the loss from those log|V| outputs alone.
    However, at test time we would certainly have needed to compute all |V| softmax probabilities to make a prediction, but we don't really care about testing/predicting, since our aim is just to train the embeddings.

    • @gemini_537
      @gemini_537 7 months ago

      I like your comments, but I don't quite understand why only log|V| units need to be computed during training. Could you give an example?
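
A toy numpy sketch of the point above (the path, node indices, and sizes here are all made up for illustration): under hierarchical softmax, the probability of one target word is a product of per-node sigmoid decisions along its path from the root, so a training step only ever evaluates len(path) ≈ log2|V| of the |V|-1 internal-node vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d = 8, 4                             # toy vocab of 8 words, 4-dim vectors
inner = rng.normal(size=(V - 1, d))     # one vector per internal node: |V|-1 rows
h = rng.normal(size=d)                  # hidden-layer vector of the input word

# Hypothetical path of one target word: (internal-node index, branch label)
path = [(0, 1), (2, 0), (5, 1)]         # len(path) = 3 = log2(8)

p = 1.0
for node, label in path:
    s = sigmoid(inner[node] @ h)        # probability of branching "1" here
    p *= s if label == 1 else 1.0 - s   # take the branch the target word uses

print(f"P(target) used {len(path)} of the {V - 1} node vectors: p = {p:.4f}")
```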

  • @doctorshadow2482
    @doctorshadow2482 4 months ago

    Thank you. Good explanation.
    Some questions:
    1. At 2:40: why are we interested in making the outputs sum to 1, which is what softmax provides? What's wrong with using the raw output values? We already have the weight for 8 higher than the others, so we have the answer. Why do we need the extra work at all? (See the sketch after this comment.)
    2. At 9:49: what is this "word vector"? Is it still the one-hot vector for the word from the dictionary, or something else? How is this vector represented in this case?
    3. At 15:00: that's fine, but if we trained for "chupacabra", what happens to the weights when we train on the other words? Wouldn't it just blend or "blur" the coefficients, making them closer to white noise?
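
On question 1, a minimal sketch of what softmax buys you (toy scores): prediction alone could indeed just take the argmax of the raw outputs, but training needs a probability distribution to feed the cross-entropy loss and its gradient, and softmax provides exactly that.

```python
import numpy as np

scores = np.array([1.2, 0.3, 4.0, -0.5])   # raw, unnormalized output values
probs = np.exp(scores - scores.max())      # subtract max for numerical stability
probs /= probs.sum()                       # now positive and summing to 1

print(probs.argmax() == scores.argmax())   # True: the argmax is unchanged...
print(-np.log(probs[2]))                   # ...but only probs gives a usable loss
```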

  • @stasbekman8852
    @stasbekman8852 4 years ago +14

    There is a small typo at 13:50 - it should be .72 instead of .62, so that it adds up to 1.
    And thank you!

  • @user-re1bi2bc8b
    @user-re1bi2bc8b 3 years ago +1

    Incredibly easy to understand thanks to your explanation. Thank you Chris!

  • @joyli9106
    @joyli9106 3 years ago

    Thank you Chris! I would say it's the best explanation I've ever seen about HS.

  • @user-hp7ut9gp5e
    @user-hp7ut9gp5e 1 year ago

    This is amazing. Although I am Korean and bad at English, your lecture made me smart.

  • @hamzaleb9215
    @hamzaleb9215 4 years ago

    Always clear explanations, right to the point. Thanks Chris. Waiting for the next videos. Your two articles explaining Word2Vec were just perfect.

  • @ariverhorse
    @ariverhorse 7 months ago

    Best explanation of HS I have seen!

  • @j045ua
    @j045ua 4 years ago

    These videos have been a great help for my thesis! Thank you Chris!

  • @amratanshu99
    @amratanshu99 3 years ago +1

    Nicely explained. Thanks man!

  • @guaguago2583
    @guaguago2583 4 years ago

    Your fan, second comment :) I am a Chinese PhD student, very much looking forward to the next videos :D

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thank you! I'm hoping to upload a new video about every week or so.

  • @Dao007forever
    @Dao007forever 2 years ago

    Great explanation!

  • @nikolaoskaragkiozis5330
    @nikolaoskaragkiozis5330 3 years ago

    Hi Chris, thank you for the video. So, if I understand correctly, there are 2 things being learned here: 1) the word embeddings, and 2) the output matrix, which contains the weights associated with the output layer?

  • @samba789
    @samba789 4 years ago +1

    Great videos Chris! I absolutely love your content!
    Just a quick clarification: is the output matrix also going to be weights that we need to learn?
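
A shape-only sketch of the two parameter sets asked about in the last two comments (toy sizes): yes, both matrices are learned during training, though typically only the input embeddings are kept afterwards.

```python
import numpy as np

V, d = 10000, 300                   # toy vocabulary and embedding sizes
embeddings = np.zeros((V, d))       # 1) one input vector per vocabulary word
hs_output = np.zeros((V - 1, d))    # 2) output matrix: with hierarchical
                                    #    softmax, one row per internal
                                    #    Huffman-tree node (|V|-1 of them)
```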

  • @mahdiamrollahi8456
    @mahdiamrollahi8456 3 years ago

    Hello dear Chris,
    Hope all is well,
    Thanks for your lecture, that was fabulous.
    I have some tiny questions:
    - For negative sampling, it is said that the negative samples are selected randomly. So, does that mean we only need to update the params for those samples, instead of for all possible words? (And in softmax we need to update the params for both the correct and incorrect classes, true?) See the sketch after this comment.
    - How do we calculate the output matrix? How do we obtain it?
    - If we want to calculate the probabilities of all context words, we need to traverse the whole tree, right?
    Best wishes, Mahdi
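
On the first question, a rough sketch of the negative-sampling update (toy sizes, and uniform sampling here, where word2vec actually draws from a unigram^0.75 table): per training pair, only the positive word's output row and the k sampled negative rows are updated, rather than all |V| rows as under full softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k, lr = 10000, 100, 5, 0.025
W_out = rng.normal(scale=0.01, size=(V, d))   # output-side word vectors
h = rng.normal(size=d)                        # hidden vector of the input word

pos = 42                                 # index of the true context word
negs = rng.integers(0, V, size=k)        # k randomly drawn negative samples
for idx, label in [(pos, 1)] + [(n, 0) for n in negs]:
    g = sigmoid(W_out[idx] @ h) - label  # gradient of the per-sample logistic loss
    W_out[idx] -= lr * g * h             # only these k+1 rows are touched
```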

  • @utubemaloy
    @utubemaloy 4 years ago

    Thank you!!! I loved this.

  • @8g8819
    @8g8819 4 years ago

    Great video series, keep it going !!!!

  • @souvikjana2048
    @souvikjana2048 4 years ago +1

    Hi Chris, great video. Could you explain how the binary tree is trained? I can't seem to understand, for the input-context pair (chupacabra, active), how we select 0/1 at the root node or the subsequent nodes below.

    • @haardshah1676
      @haardshah1676 4 years ago

      I think you already know the sequence of 0/1s for the context word. So for each node you have a logistic regression model that takes as input the embedding of the input word and outputs the probability of 0/1 for that node. So for the example you describe, we know the true label for the root node should be "1", for the 4th node "0", and for the 3rd node "1".
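
A minimal numpy sketch of the training step described above (sizes are made up; the path's node indices and labels follow the example in the reply): each node on the context word's known path acts as a small logistic-regression classifier, and both the node vectors and the input word's vector are updated.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, lr = 100, 0.025
h = rng.normal(size=d)                        # input word's hidden-layer vector
inner = rng.normal(scale=0.01, size=(7, d))   # one vector per internal node

# Known 0/1 path for the context word: root = "1", 4th node = "0", 3rd node = "1"
path = [(0, 1), (4, 0), (3, 1)]               # (node index, true branch label)
h_grad = np.zeros(d)                          # accumulated gradient for h
for node, label in path:
    pred = sigmoid(inner[node] @ h)           # this node's branch probability
    g = pred - label                          # per-node logistic-loss gradient
    h_grad += g * inner[node]
    inner[node] -= lr * g * h                 # update the node's classifier weights
h -= lr * h_grad                              # then update the input word's vector
```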

  • @libzz6133
    @libzz6133 1 year ago

    At 10:53 we got that the label for node 4 is 0 - what about the other labels, like the label for node 1?

  • @gavin8535
    @gavin8535 3 years ago

    Nice. What does vector 6 look like, and where is it? Is it in the output layer?

  • @yuantao563
    @yuantao563 4 years ago

    The video is great! Are there any rules for why each blue node corresponds to a given row in the output matrix? Like, why is the first blue node row 6? How is that determined?

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Hi Yuan,
      It's just a byproduct of the Huffman tree building algorithm. If I recall correctly, I think it does result in the rows being sorted relative to the tree depth (the frequency of the word). This isn't important to the implementation, though.

    • @abhijeetsharma5715
      @abhijeetsharma5715 3 years ago

      You can assign each blue node to any row of the output matrix. The order of assigning rows is unimportant, since this is not like an RNN output (i.e., it isn't a sequential output) - just like input units can be in any order in a vanilla neural net.
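
A minimal Huffman-build sketch with made-up word counts, illustrating both replies: internal nodes are simply numbered in the order the algorithm creates them (the rarest words merge first and sit deepest in the tree), and each node's id is the output-matrix row that blue node reads.

```python
import heapq
import itertools

counts = {"the": 50, "of": 30, "active": 10, "cat": 8, "chupacabra": 2}
tie = itertools.count()                     # tie-breaker so the heap never compares words
heap = [(c, next(tie), w) for w, c in counts.items()]
heapq.heapify(heap)

node_id = itertools.count()                 # internal-node ids = output-matrix rows
while len(heap) > 1:
    c1, _, left = heapq.heappop(heap)       # repeatedly merge the two rarest items
    c2, _, right = heapq.heappop(heap)
    nid = next(node_id)
    print(f"internal node {nid} merges {left!r} and {right!r}")
    heapq.heappush(heap, (c1 + c2, next(tie), f"node{nid}"))
```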

  • @maliozers
    @maliozers 4 years ago

    First like, first comment :) Thanks for sharing, Chris.

  • @ANSHULJAINist
    @ANSHULJAINist 4 years ago

    How do you implement hierarchical softmax for an arbitrary model? Do frameworks like PyTorch or TensorFlow have built-in implementations? If not, how can it be built to work with any model?
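
As far as I know, PyTorch and TensorFlow don't ship a general hierarchical-softmax layer (PyTorch's closest built-in is torch.nn.AdaptiveLogSoftmaxWithLoss, which implements the related adaptive-softmax technique), so for arbitrary models it is usually hand-rolled. For word2vec itself, gensim exposes it as a flag - a minimal example, assuming gensim 4.x (older versions use size= instead of vector_size=):

```python
from gensim.models import Word2Vec

# Toy corpus; hs=1 enables hierarchical softmax, negative=0 disables
# negative sampling (the two options are alternatives in word2vec).
sentences = [["the", "quick", "chupacabra"], ["the", "active", "cat"]]
model = Word2Vec(sentences, vector_size=100, min_count=1, hs=1, negative=0)
print(model.wv["chupacabra"].shape)   # (100,)
```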

  • @mikemihay
    @mikemihay 4 years ago

    Waiting for more videos

  • @anoop5611
    @anoop5611 3 years ago

    What does the output vector list in blue contain?
    Something from the hidden-to-output weights?

    • @anoop5611
      @anoop5611 3 years ago

      Okay, I missed the part that answers it. So a particular row of the output matrix corresponds to one of those non-leaf nodes, and the size of the row equals the number of units in the hidden layer?
      Thank you, Chris!