Ali Ghodsi, Lec [3,2]: Deep Learning, Word2vec

  • Published 16 Jan 2025

COMMENTS • 17

  • @jerry11111
    @jerry11111 9 years ago +11

    Best lecture on word2vec. It covers everything the papers are ambiguous about: the notation, and the explanation of what to optimize and why.

  • @autripat
    @autripat 9 years ago +6

    The Skip-gram model discussion starts at 17:20 (we transition away from the "intractable" continuous bag of words model).
    The Skip-gram training objective is to learn word vector representations that are good at predicting nearby words (context).
    The GloVe (Global Vectors for Word Representation) model starts at 54:36.
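    For reference, a sketch of the skip-gram objective in the usual notation (v_w for the predicted word, v_c for the centre word, matching the discussion in the comments below): maximize the average log-probability of the surrounding words, with the conditional probability given by a softmax over the vocabulary V:

      \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\; j \ne 0} \log p(w_{t+j} \mid w_t),
      \qquad
      p(w \mid c) = \frac{\exp(v_w^\top v_c)}{\sum_{w' \in V} \exp(v_{w'}^\top v_c)}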

  • @niteshroyal30
    @niteshroyal30 8 years ago

    Thanks, Professor, for such a wonderful lecture on word2vec.

  • @m.farahmand7440
    @m.farahmand7440 8 years ago +1

    Thanks for the informative lecture. At 7:26, shouldn't it be gradient ascent? After all, we are trying to maximize the likelihood function.

    • @yangli7741
      @yangli7741 7 years ago +1

      I think 7:26 really is gradient descent, and the person who suggested that the sigma sign shouldn't be there actually got it wrong, because Prof. Ghodsi may have used confusing notation.
      In the log-likelihood, the summation over "w" runs over every word in the training set (the word being predicted given context "c"); however, when taking the derivative with respect to v_w, the "w" there can be any word in the vocabulary, and v_w any column of the weight matrix W' to be learned. So we should use a different notation, e.g., w*, and take the partial derivative with respect to v_{w*}.
      Accordingly, the summation over w should exist in the first place, because w* and w are not the same thing. The later removal of the summation in the update rule,
      v_{w*} = v_{w*} - r(1 - p(w)) \frac{\partial v_c^\top v_w}{\partial v_{w*}},
      can be seen as switching from GD to SGD.
      The only reason the final result does not go wrong is that the partial derivative with respect to v_{w*} is zero whenever w* ≠ w. That is, during SGD, only v_w is updated.
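      A quick sketch of that last step, in the same notation: since v_c^\top v_w does not involve v_{w*} unless w* = w,

        \frac{\partial\, v_c^\top v_w}{\partial v_{w*}} = \mathbf{1}[w* = w]\, v_c,

      so the derivative is zero for every w* ≠ w, which is why dropping the summation still gives the correct single-pair update.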

    • @cem9927
      @cem9927 6 years ago

      If we have 4 words in the dictionary, we will have 4 v_w values, and in the gradient descent update we will update each v_w separately, right?
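      A minimal numpy sketch of that update for one observed (c, w) pair (the 4-word setup, variable names, and learning rate are illustrative, not taken from the lecture): each v_w is a separate row of an output matrix and receives its own gradient row.

      import numpy as np

      V, d = 4, 3                       # toy setup: 4-word vocabulary, 3-dim vectors
      rng = np.random.default_rng(0)
      W_in  = rng.normal(size=(V, d))   # context vectors v_c, one row per word
      W_out = rng.normal(size=(V, d))   # output vectors v_w, one row per word

      c, w, lr = 1, 3, 0.1              # one observed (context, word) pair

      scores = W_out @ W_in[c]                    # v_w^T v_c for every word w
      p = np.exp(scores) / np.exp(scores).sum()   # softmax p(w | c)

      # gradients of -log p(w | c)
      grad_out = np.outer(p, W_in[c])             # p(w*|c) * v_c for every row w*
      grad_out[w] -= W_in[c]                      # extra -v_c only for the observed w
      grad_in = W_out.T @ p - W_out[w]            # gradient with respect to v_c

      W_out -= lr * grad_out                      # every v_w row gets its own update
      W_in[c] -= lr * grad_in                     # the context vector is updated too

      In this full-softmax form every row of W_out picks up a p(w*|c)·v_c term, while the observed word's row additionally gets the -v_c term; with negative sampling only a few rows would be touched per pair.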

    • @tejasduseja
      @tejasduseja 4 years ago

      @@yangli7741 Thanks, I had the same confusion.

    • @imanshojaei7784
      @imanshojaei7784 4 years ago

      @@yangli7741 Aren't the labels (i.e., the empirical probabilities) also missing from the formulation?

  • @paolofreuli1686
    @paolofreuli1686 7 years ago

    Awesome lecture!

  • @wanminghuang1722
    @wanminghuang1722 8 years ago

    Thank you so much. Much easier to understand.

  • @stolzenable
    @stolzenable 8 years ago

    Thank you for this lecture! It is very understandable. I wonder if the slides from this lecture are available somewhere?

    • @stolzenable
      @stolzenable 8 years ago +3

      +Alexey Grigorev with a bit of googling, I found them here: uwaterloo.ca/data-science/deep-learning

    • @aseefzahir3977
      @aseefzahir3977 6 years ago

      says "page not found"

    • @srujohn652
      @srujohn652 3 years ago +1

      @Alexey Grigorev The page is still not found.

  • @rajupowers
    @rajupowers 8 years ago

    @7:20 - how can we factor out v_c? There is no summation in the right term.
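    A sketch of that step, assuming the derivative at 7:20 is of log p(w | c) with respect to v_w: the second term has no explicit summation left after differentiating, because the log of the normalizer pulls out another copy of v_c, so v_c factors out of both terms:

      \frac{\partial}{\partial v_w} \Bigl[ v_w^\top v_c - \log \sum_{w' \in V} \exp(v_{w'}^\top v_c) \Bigr]
      = v_c - \frac{\exp(v_w^\top v_c)}{\sum_{w' \in V} \exp(v_{w'}^\top v_c)}\, v_c
      = \bigl(1 - p(w \mid c)\bigr)\, v_c.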