Best lecture on word2vec. It covers everything the papers are ambiguous about: the notation, and the explanation of what to optimize and why.
The Skip-gram model discussion starts at 17:20 (we transition away from the "intractable" continuous bag of words model).
The Skip-gram training objective is to learn word vector representations that are good at predicting nearby words (context); the objective is written out below for reference.
The GloVe (Global Vectors for Word Representation) model starts at 54:36.
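For reference (this is the standard formulation from the Mikolov et al. skip-gram paper, not copied from the slides, so the notation may differ slightly from the lecture's): given a training corpus w_1, ..., w_T and a window of size m, maximize

\[
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w \mid c) = \frac{\exp\!\left(v_w^{\top} v_c\right)}{\sum_{w'=1}^{V} \exp\!\left(v_{w'}^{\top} v_c\right)},
\]

i.e., each word is used as the context c to predict every word w within the window around it.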
Thanks, Professor, for such a wonderful lecture on word2vec.
Thanks for the informative lecture. At 7:26, shouldn't it be gradient ascent? After all, we are trying to maximize the likelihood function.
I think 7:26 is just gradient descent, and the commenter who pointed out that the sigma sign shouldn't exist actually got it wrong, because Prof. Ghodsi may have used confusing notation.
In the log-likelihood, the summation over "w" runs over every word in the training set (the word to be predicted given context "c"); however, when taking the derivative with respect to v_w, this "w" can actually be any word in the vocabulary, and v_w any column of the weight matrix W' to be learned. So we should use a different notation, e.g., w*, and take the partial derivative with respect to v_{w*}.
Accordingly, the summation over w should exist in the first place, because w* and w are not the same thing. The later removal of the summation in the adjustment rule,
v_{w^*} \leftarrow v_{w^*} - r\,(1 - p(w \mid c))\,\frac{\partial \left(v_c^{\top} v_w\right)}{\partial v_{w^*}},
can be seen as switching from GD to SGD.
The only reason the final result didn't go wrong is that the partial derivative with respect to v_{w^*} when w^* \neq w is just zero. That is, during the SGD step, only v_w is updated.
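Spelled out (my notation, in case it helps): for the dot-product term that appears in the adjustment rule,

\[
\frac{\partial \left(v_c^{\top} v_w\right)}{\partial v_{w^*}} =
\begin{cases}
v_c, & w^* = w,\\
0, & w^* \neq w,
\end{cases}
\]

so the SGD step above can only ever move v_w, the vector of the word actually observed with context c.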
If we have 4 words in the dictionary, we will have 4 v_w vectors, and in the gradient descent update we will update each v_w separately, right?
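With the full softmax gradient, yes, every output vector gets its own update. A tiny numpy sketch of one step over a 4-word vocabulary (my own toy illustration, not the lecture's code; the embedding dimension, random seed, learning rate, and the (c, w) pair are all made up):

```python
import numpy as np

# Toy full-softmax skip-gram step over a 4-word vocabulary (illustration only).
V, d = 4, 3                        # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))     # input (context) vectors v_c, one row per word
W_out = rng.normal(size=(V, d))    # output (prediction) vectors v_w (columns of W' in the lecture)

c, w = 1, 3                        # one observed (context, target) pair
r = 0.1                            # learning rate

v_c = W_in[c]
scores = W_out @ v_c               # v_{w'}^T v_c for every word w' in the vocabulary
p = np.exp(scores - scores.max())
p /= p.sum()                       # softmax p(w' | c)

# Gradients of -log p(w | c):
#   w.r.t. each output vector v_{w*}:  (p(w* | c) - 1[w* == w]) * v_c
#   w.r.t. the context vector v_c:     sum_{w*} p(w* | c) v_{w*} - v_w
y = np.zeros(V)
y[w] = 1.0
grad_out = np.outer(p - y, v_c)    # shape (V, d): a separate row for each of the 4 v_w
grad_in = W_out.T @ p - W_out[w]   # shape (d,)

W_out -= r * grad_out              # every output vector moves, each by its own amount
W_in[c] -= r * grad_in
```

Under the simplified rule in the thread above, by contrast, only the observed word's v_w would move for this (c, w) pair.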
@yangli7741 Thanks, I had the same confusion in mind.
@yangli7741 Aren't the labels (i.e., the empirical probabilities) also missing from the formulation?
Awesome lecture!
Thank you so much. Much easier to understand.
Thank you for this lecture! It is very understandable. I wonder if the slides from this lecture are available somewhere?
+Alexey Grigorev with a bit of googling, I found them here: uwaterloo.ca/data-science/deep-learning
says "page not found"
@Alexey Grigorev The page is still not found.
@7:20 - How can we factor out v_c? There is no summation in the right-hand term.
@16:40
@19:00 - Negative sampling