Very nice lecture, very clear, not too hasty, not too slow.
Best explanation, and a lot of patience...
For discussion at 14:00 - my 2 cents -
In LSI, the context is captured in the numbers through a term-document matrix. In the PCA you proposed, the context is captured in the numbers through a covariance matrix. PCA can be used for any high-dimensional data. It's a more general class of analysis for finding a better feature space to represent your data. LSI, on the other hand, is specific to text corpora: it analyzes which terms are more similar and what the latent classes of the words in a corpus are.
Love this lecture! So clear and well explained
Thanks Professor for such wonderful lecture on word2vec
Boss Professor!
Clear explanation. Thanks Professor.
Nice Video! Gives the basic understanding of word2vec. Optimization in next lecture.
Every second is worth it....awesome
It was great, especially the notation of softmax function.
I think there is a mistake on the slide shown at 8:35. The diagonal values of \Sigma should be the square roots of the eigenvalues of XX^T (or, equivalently, of X^T X).
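A short derivation that supports the point above, assuming the slide uses the standard SVD X = U \Sigma V^T with orthonormal U and V (the slide's exact notation is an assumption here):

```latex
X = U \Sigma V^{\top}
\;\Longrightarrow\;
X X^{\top} = U \,(\Sigma \Sigma^{\top})\, U^{\top},
\qquad
X^{\top} X = V \,(\Sigma^{\top} \Sigma)\, V^{\top}
\\[4pt]
\sigma_i = \sqrt{\lambda_i},
\quad
\lambda_i \text{ the } i\text{-th nonzero eigenvalue of } X X^{\top}
\text{ (equivalently of } X^{\top} X \text{)}
```

So the columns of U are eigenvectors of XX^T, the columns of V are eigenvectors of X^T X, and the singular values on the diagonal of \Sigma are the square roots of the shared nonzero eigenvalues.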
At some point, the prof should say that we will take questions at the end of the class :-) Great lecture
such a good explanation, Thank you, Professor :)
the most detailed and clear explanation! thank you!
Thank you. You saved my life
writing notes out of this lecture , thanks professor :-)
Please explain why there is no activation function on the hidden layer neurons.
Awesome lesson
very clear and informative thanks so much!
Great, Thanks
Good lecture
I get confused between the vocabulary set, big "V", with cardinality d (about 1.5M would be very rich), and the reference corpus, big "T", which has a much higher cardinality (Wikipedia as a corpus would be billions of words).
When we calculate p(w|c) in the last 10 minutes of this video, I would think that the denominator of the softmax function is a sum computed over "V", and not over "T".
Am I correct?
Thanks!
PS: Notation is indeed a nightmare in this chapter!
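A minimal sketch of the distinction asked about above, under the usual word2vec formulation (which may differ from the lecture's notation): the softmax denominator sums over the vocabulary V, while the corpus T only enters through the outer sum over observed word pairs. All vectors and the toy corpus below are made up for illustration.

```python
import math

# Hypothetical toy setup: a 4-word vocabulary V with made-up 2-d vectors.
V = ["the", "cat", "jumped", "dog"]
context_vec = {"the": [0.1, 0.3], "cat": [0.5, -0.2],
               "jumped": [-0.4, 0.6], "dog": [0.4, -0.1]}
word_vec = {"the": [0.2, 0.1], "cat": [0.3, 0.7],
            "jumped": [-0.5, 0.2], "dog": [0.6, 0.6]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p(w, c):
    """Softmax p(w | c); the denominator runs over the vocabulary V."""
    scores = [dot(word_vec[cand], context_vec[c]) for cand in V]
    top = max(scores)  # subtract the max for numerical stability
    denom = sum(math.exp(s - top) for s in scores)
    return math.exp(dot(word_vec[w], context_vec[c]) - top) / denom

# The corpus T is a sequence of tokens, usually far longer than V.
# It only appears in the outer sum over observed (context, word) pairs.
T = "the cat jumped the dog jumped".split()
log_likelihood = sum(math.log(p(w, c)) for c, w in zip(T, T[1:]))
```

Each call to p costs one pass over V, independent of how long T is, which is why a large vocabulary makes the full softmax expensive.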
W2V starting @41:00
topic modeling 14:10 >> non negative matrix factorization
How did you calculate W ?
Why is the input assumed to be a one-hot vector? I know it is sparse, with mostly zeros, but shouldn't the actual input be a k-hot vector (k > 0)?
One-hot encoding assigns each word a unique vector that identifies it, e.g. [0 0 0 0 1]. So for "the cat jumped", it's [1 0 0], [0 1 0], [0 0 1], or, concatenated, [1 0 0 0 1 0 0 0 1] to represent the sentence. But the one-hot encoding of the word "the" on its own is [1 0 0].
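The example in the reply above can be sketched in a few lines of plain Python. The helper names `one_hot` and `k_hot` are made up for illustration; the k-hot variant is only there to show what the question is contrasting against, not something the lecture is confirmed to use.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def k_hot(words, vocab):
    """Mark several words at once in a single vector (k ones)."""
    vec = [0] * len(vocab)
    for word in words:
        vec[vocab[word]] = 1
    return vec

sentence = "the cat jumped".split()

# Assign each distinct word the next free index, in order of appearance.
vocab = {}
for word in sentence:
    vocab.setdefault(word, len(vocab))

encoded = [one_hot(vocab[w], len(vocab)) for w in sentence]
concatenated = [bit for vec in encoded for bit in vec]

print(encoded)        # one vector per word
print(concatenated)   # the whole sentence as one flat vector
print(k_hot(["the", "jumped"], vocab))
```

With a single input word the vector has exactly one 1, so one-hot is just the k = 1 case of k-hot.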
21:49
Can you please upload the slides?
I still cannot understand how SVD relates to the relationships between words...
As I understand it, the U matrix is basically the reduced-dimension representation of the columns, i.e. the articles, and the V matrix is the one for the words. If the representation is good, we should see word clusters when applying a visualization technique, e.g. t-SNE, to the V matrix. The rows of V are the bases of the word space; projecting them to a lower dimension in order to visualize them is a kind of recombination and transformation. Just my 2 cents, not sure if it's 100% solid.
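Which factor belongs to the words depends on how the matrix is laid out. A sketch in the common convention where X has one row per term and one column per document (an assumption; the lecture may use the transpose), with k the number of retained singular values:

```latex
X \in \mathbb{R}^{m \times n}
\quad (\text{rows: } m \text{ terms}, \ \text{columns: } n \text{ documents}),
\qquad
X \approx U_k \Sigma_k V_k^{\top}
\\[4pt]
\text{term } i \;\mapsto\; \text{row } i \text{ of } U_k \Sigma_k \in \mathbb{R}^{m \times k},
\qquad
\text{document } j \;\mapsto\; \text{row } j \text{ of } V_k \Sigma_k \in \mathbb{R}^{n \times k}
```

If X is stored as documents by terms instead, the roles of U and V swap, which gives the reading in the comment above. Either way, words that occur in similar sets of documents end up with nearby k-dimensional rows, and that is the sense in which SVD exposes relationships between words.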
I think he made a mistake while writing the equation for the hidden layer of the neural net. He forgot to apply a non-linear function to W_T * x. So h should be Phi(W_T * x) and not simply W_T * x (where Phi is a non-linear function). Correct me if I'm wrong.
I guess the professor doesn't really mean that it is an actual hidden layer; he just wants to illustrate how to convert the one-hot vector into the input vector and, in the last step, how to generate the output vector (introducing the matrices W and W prime). In an actual neural network, there should be an activation function. I am just guessing...
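For what it's worth, in the standard word2vec formulation the hidden (projection) layer is linear, so h = W_T * x has no activation by design; whether the lecture intends exactly this is an assumption. With a one-hot x, the product simply selects one row of W, as this small sketch with made-up weights shows:

```python
# Hypothetical weight matrix W with one row per vocabulary word
# (3 words, 2 hidden units). The values are made up for illustration.
W = [
    [0.1, 0.2],  # row for word 0
    [0.3, 0.4],  # row for word 1
    [0.5, 0.6],  # row for word 2
]

def hidden(W, x):
    """Compute h = W^T x with no activation function applied."""
    n_hidden = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(W)))
            for j in range(n_hidden)]

x = [0, 1, 0]      # one-hot vector for word 1
h = hidden(W, x)
print(h)           # same numbers as W[1]
```

Because the layer is a plain lookup, the rows of W can be read directly as the word vectors; the only non-linearity in the model is the softmax at the output.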
Nice sentence you used: "Silence is the language of God, all else is poor translation"!