🎯 Key Takeaways for quick navigation:
00:29 Today's *lecture focuses on word vectors, touching on word senses, and introduces neural network classifiers, aiming to enhance understanding of word embedding papers like word2vec or GloVe.*
01:52 The *word2vec model, using a simple algorithm, learns word vectors by predicting surrounding words based on dot products between word vectors, achieving word similarity in a high-dimensional space.*
03:15 Word2vec *is a "bag of words" model, ignoring word order, but still captures significant properties of words. Probabilities are often low (e.g., 0.01), and word similarity is achieved by placing similar words close in a high-dimensional vector space.*
06:31 Learning *good word vectors involves gradient descent, updating parameters based on the gradient of the loss function. Stochastic gradient descent is preferred due to its efficiency, especially in large corpora.*
10:18 Stochastic *gradient descent in word2vec involves estimating gradients based on small batches of center words, enabling faster learning. The sparsity of gradient information is addressed, and word vectors are often represented as row vectors.*
15:21 Word2vec *encompasses the skip-gram and continuous bag of words (CBOW) models. Negative sampling is introduced as a more efficient training method, using logistic regression to predict context words and reducing the computational load of softmax.*
20:57 Negative *sampling involves creating noise pairs to train binary logistic regression models efficiently. The unigram distribution raised to the 3/4 power is used to sample negative words, dampening the gap between common and rare words (a runnable sketch follows below this summary).*
23:40 Co-occurrence *matrices, an alternative to word2vec, represent word relationships based on word counts in context windows. The matrix can serve as a word vector representation, capturing word similarity and usage patterns.*
28:23 When *sampling negatives, using 10-15 negative words per positive pair provides more stable results than just one. This helps capture different parts of the space and improves learning.*
30:46 Co-occurrence *matrices can be created using a window around the word (similar to word2vec) or by considering entire documents. However, these matrices are large and sparse, leading to noisier results. To address this, low-dimensional vectors (25-1,000 dimensions) are preferred.*
32:42 Singular *Value Decomposition (SVD) is used to reduce the dimensionality of count co-occurrence vectors. By keeping only the largest singular values and deleting the rest, lower-dimensional representations of words are obtained, capturing the important information efficiently (a toy example also follows below this summary).*
35:54 Scaling *counts in the cells of the co-occurrence matrix addresses issues with extremely frequent words. Techniques like taking the log of counts or capping maximum counts can improve word vectors obtained through SVD.*
37:52 The *GloVe algorithm, developed in 2014, unifies linear algebra-based methods (like LSA and COALS) with neural models (like skip-gram and CBOW). GloVe uses a log-bilinear model to approximate the log of co-occurrence probabilities, aiming for efficient training and meaningful word vectors.*
43:29 GloVe *introduces an explicit loss function, ensuring the dot product of word vectors approximates the log of co-occurrence probabilities. This model helps prevent very common words from dominating and demonstrates efficient training scalable to large corpora.*
51:50 Intrinsic *evaluation of word vectors, such as word analogies, demonstrates the effectiveness of models. GloVe's linear vector-space structure aids in solving analogies (e.g., king - man + woman ≈ queen), and its performance benefits from diverse data sources, like Wikipedia.*
56:34 Another *intrinsic evaluation involves measuring how well models match human judgments of word similarity. GloVe, trained on diverse data, outperforms plain SVD but shows similar performance to word2vec on word similarity tasks.*
58:00 The *objective function aims for the dot product to represent the log probability of co-occurrence, leading to the log-bilinear model with the dot product w_i · w_j and bias terms.*
59:24 In *model building, a bias term is added for each word to account for general word probabilities, enhancing the representation.*
01:00:23 Weighting *each term by a function of the co-occurrence count gives more importance to word pairs seen together more often, while capping the weight so extremely common words do not dominate.*
01:02:40 Word *vectors can be applied to end-user tasks like named entity recognition, significantly improving performance by capturing word meanings.*
01:06:23 Exploring *word senses: separate vectors for each meaning have been tried, but the majority practice is a single vector per word type.*
01:11:05 Word *vectors for a word type can be seen as a superposition of sense vectors, a weighted average where weighting corresponds to sense frequencies.*
Made with HARPA AI
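To make the negative-sampling update from the summary concrete, here is a minimal NumPy sketch of one SGD step for a single (center, outside) pair. The hyperparameters, the toy unigram counts, and names like `sgd_step` are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, K = 10_000, 100, 15                            # vocab size, vector dim, negatives per pair
center_vecs = 0.01 * rng.standard_normal((V, d))     # "v" (center) vectors
context_vecs = 0.01 * rng.standard_normal((V, d))    # "u" (outside) vectors

# Unigram counts raised to the 3/4 power, as in the word2vec paper, so rare
# words are sampled somewhat more often than their raw frequency suggests.
unigram_counts = rng.integers(1, 1_000, size=V).astype(float)   # toy counts
noise_dist = unigram_counts ** 0.75
noise_dist /= noise_dist.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context, lr=0.025):
    """One negative-sampling update for a single (center, outside) pair."""
    negatives = rng.choice(V, size=K, p=noise_dist)
    v_c, u_o, u_neg = center_vecs[center], context_vecs[context], context_vecs[negatives]

    # Per-pair loss: J = -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
    pos = sigmoid(u_o @ v_c)          # want this close to 1
    neg = sigmoid(-(u_neg @ v_c))     # want these close to 1 as well

    grad_v = (pos - 1.0) * u_o + ((1.0 - neg)[:, None] * u_neg).sum(axis=0)
    grad_u_o = (pos - 1.0) * v_c
    grad_u_neg = (1.0 - neg)[:, None] * v_c[None, :]

    # Only these few rows of the two embedding matrices change (sparse update).
    center_vecs[center] -= lr * grad_v
    context_vecs[context] -= lr * grad_u_o
    np.add.at(context_vecs, negatives, -lr * grad_u_neg)   # handles repeated negatives

sgd_step(center=42, context=7)
```

Only the rows for the center word, the observed outside word, and the K sampled negatives are touched, which is the sparse-gradient point made around 11:36 in the lecture.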
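For the count-based route in the summary (co-occurrence matrix plus truncated SVD), here is a toy NumPy sketch. The three-sentence corpus and window size are illustrative choices, not the lecture's exact example, and the log scaling stands in for the count tricks mentioned at 35:54:

```python
import numpy as np

# Toy corpus; window of one word on each side.
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Window-based co-occurrence counts.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# Scale counts so very frequent words don't dominate, e.g. log(1 + count).
X_scaled = np.log1p(X)

# Truncated SVD: keep only the top-k singular values/vectors as word vectors.
U, S, Vt = np.linalg.svd(X_scaled)
k = 2
word_vectors = U[:, :k] * S[:k]
print(word_vectors[idx["nlp"]])
```

With a real corpus, the same recipe yields the 25-1,000-dimensional vectors the summary mentions.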
how about the prompt?
I guess the second question section ends at 45:55 and you might want to add a timestamp there
I would add these:
45:55 Word vector evaluation
48:30 Intrinsic evaluation
57:42 Question
1:01:45 Extrinsic evaluation
1:03:25 Word sense & ambiguity
bless you @@Xufana
Thank you so much for this great course!
Hi Chrisoloni! Thanks for your comment, we're glad to hear you're enjoying the content - happy learning!
@@stanfordonline Can I get the lecture slides somewhere?
Thank you so much for these lectures Stanford!
Once the first two videos are understood, I will be on rung number two of the ladder.
At 37:00, marry -> bride might be more appropriate than marry -> priest.
Wonderful course!
Clarification: @11:29: The sparseness of the affected/updated elements of ∇J(θ) depends only on the window size, not on whether simple gradient descent or stochastic gradient descent is used, right? Since within a window, the computation doesn't change across the two methods.
Have you figured it out?
39:03 How are we making less use of statistics compared to LSA-based algorithms here? Co-occurrence also uses windows, doesn't it?
Great again!
Just found out he co-wrote the GloVe paper.
At 2:45 I think what you said about the word2vec model being a bag-of-words model is not strictly correct. Word2vec does gain some understanding of local word ordering. If I am incorrect, could you please explain?
If you look at the probability formula, it only contains dot products and doesn't have any specific position information.
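For anyone following along, the formula being referred to is the skip-gram softmax (standard notation, with u for outside-word vectors and v for center-word vectors):

```latex
P(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w \in V} \exp\left(u_w^{\top} v_c\right)}
```

Every word inside the window is scored this same way regardless of its distance from the center word, which is why it is described as a bag-of-words model; as far as I know, the released word2vec code does weight nearby words slightly more by randomly shrinking the window per center word, but the score itself carries no positional information.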
I think... marry should be matched to bride and pray to priest on page 21.
Good point. It is not clear if the lecturer drew the vectors or if they were taken as-is from the paper, and the mismatch may indicate that the system is not perfect.
@@jeromeeusebius it looks like the lecturer drew the vectors as the endpoints are varying distances from the words
Could be that the corpus the embedding model was trained on had more sentences with marry and priest in the same context.
@@carlloseduardofl Can you explain how the log-bilinear model 'with vector differences' formula came about? Which property of conditional probability was used? Any useful links? Timestamp 43:03
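Not the original commenter, but here is a rough reconstruction of that step as I understand it from the GloVe paper (a sketch, not the lecture's exact derivation). The vector-difference form is a modelling choice about which function to fit, not a property of conditional probability:

```latex
% Want ratios of co-occurrence probabilities to be captured,
% with P_{ik} = X_{ik} / X_i and X the co-occurrence count matrix:
F\big((w_i - w_j)^{\top} \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}

% Choosing F = \exp turns the ratio into a difference of dot products:
w_i^{\top} \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i

% Absorbing \log X_i into a bias b_i (plus \tilde{b}_k for symmetry)
% gives the log-bilinear model from the lecture:
w_i^{\top} \tilde{w}_k + b_i + \tilde{b}_k \approx \log X_{ik}

% which GloVe fits as a weighted least-squares objective:
J = \sum_{i,k} f(X_{ik}) \left( w_i^{\top} \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \right)^2
```

The full argument is in section 3 of "GloVe: Global Vectors for Word Representation".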
I agree with you.
22:36 word2vec ends
3:40 reasonably hahahahah
I don’t think the loss function for GloVe is well explained. I’ve spent over an hour trying to understand it, but I still don’t get it.
Can we get the lecture slides somewhere ?
this lecture is about neural classifiers
I feel like this course is giving me a tough time with the mathematics. Sad : _ _ (
Have you understood deep learning standards yet?
While using stochastic gradient descent, if we choose a minibatch of 32 center words, how do we make updates to the outside (context) words that surround them? These words will show up when we compute the likelihood, and if our minibatch doesn't include them, then how does their probability of occurring get updated?
Thanks!
I think your query is answered at 11:36: we only calculate gradients for those 32 windows (their center words plus the context and sampled negative words appearing in them), hence we get a sparse gradient update.
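A back-of-the-envelope way to see how sparse that update is; all numbers below are made up for illustration, not taken from the lecture:

```python
# For a minibatch of 32 windows, count how many embedding rows can receive
# a gradient, versus the total number of rows in the two embedding matrices.
V, window, K = 400_000, 5, 15              # vocab size, window radius, negatives per pair
touched_per_window = 1 + 2 * window + K    # center word + outside words + negatives
touched = 32 * touched_per_window          # upper bound; overlaps make it smaller
print(f"at most {touched} of {2 * V} rows updated")   # at most 832 of 800000 rows updated
```

So even a full minibatch only touches a few hundred rows out of hundreds of thousands, which is why the gradient is described as sparse.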
What are neural classifiers?
some parts are damn confusing
1:10:37
The lectures are not so good by Stanford standards; it's just recitation.
38:00
Any Chinese students here?
Great material, bad explanation
I think you should read the papers "Efficient Estimation of Word Representations in Vector Space", "Distributed Representations of Words and Phrases and their Compositionality", and "GloVe: Global Vectors for Word Representation" to better understand this lecture. I don't think he can cover all the concepts in detail in just 1 hour and 15 minutes.