Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers

  • Published Feb 10, 2025

COMMENTS • 42

  • @AramR-m2w
    @AramR-m2w A year ago +13

    🎯 Key Takeaways for quick navigation:
    00:29 Today's lecture focuses on word vectors, touching on word senses, and introduces neural network classifiers, aiming to enhance understanding of word embedding papers like word2vec or GloVe.
    01:52 The word2vec model, using a simple algorithm, learns word vectors by predicting surrounding words based on dot products between word vectors, achieving word similarity in a high-dimensional space.
    03:15 Word2vec is a "bag of words" model, ignoring word order, but still captures significant properties of words. Probabilities are often low (e.g., 0.01), and word similarity is achieved by placing similar words close together in a high-dimensional vector space.
    06:31 Learning good word vectors involves gradient descent, updating parameters based on the gradient of the loss function. Stochastic gradient descent is preferred due to its efficiency, especially on large corpora.
    10:18 Stochastic gradient descent in word2vec involves estimating gradients based on small batches of center words, enabling faster learning. The sparsity of gradient information is addressed, and word vectors are often represented as row vectors.
    15:21 Word2vec encompasses the skip-gram and continuous bag of words (CBOW) models. Negative sampling is introduced as a more efficient training method, using logistic regression to predict context words and reducing the computational load of softmax (a toy numpy sketch follows this list).
    20:57 Negative sampling involves creating noise pairs to train binary logistic regression models efficiently. The unigram distribution with a 3/4 power transformation is used to sample words, mitigating the difference between common and rare words.
    23:40 Co-occurrence matrices, an alternative to word2vec, represent word relationships based on word counts in context windows. The matrix can serve as a word vector representation, capturing word similarity and usage patterns.
    28:23 When working with negative words in word vectors, sampling 10-15 negative words gives more stable results than just one. This helps capture different parts of the space and improves learning.
    30:46 Co-occurrence matrices can be created using a window around the word (similar to word2vec) or by considering entire documents. However, these matrices are large and sparse, leading to noisier results. To address this, low-dimensional vectors (25-1,000 dimensions) are preferred.
    32:42 Singular Value Decomposition (SVD) is used to reduce the dimensionality of count co-occurrence vectors. By deleting some singular values, lower-dimensional representations of words are obtained, capturing the important information efficiently (a second sketch follows the list).
    35:54 Scaling the counts in the cells of the co-occurrence matrix addresses issues with extremely frequent words. Techniques like taking the log of counts or capping maximum counts can improve word vectors obtained through SVD.
    37:52 The GloVe algorithm, developed in 2014, unifies linear algebra-based methods (like LSA and COALS) with neural models (like skip-gram and CBOW). GloVe uses a log-bilinear model to approximate the log of co-occurrence probabilities, aiming for efficient training and meaningful word vectors.
    43:29 GloVe introduces an explicit loss function, ensuring the dot product of word vectors approximates the log of co-occurrence probabilities. This model helps prevent very common words from dominating and demonstrates efficient training scalable to large corpora.
    51:50 Intrinsic evaluation of word vectors, such as word analogies, demonstrates the effectiveness of models. GloVe's linear component property aids in solving analogies, and its performance benefits from diverse data sources, like Wikipedia.
    56:34 Another intrinsic evaluation involves measuring how well models match human judgments of word similarity. GloVe, trained on diverse data, outperforms plain SVD but shows similar performance to word2vec on word similarity tasks.
    58:00 The objective function aims for the dot product to represent the log probability of co-occurrence, leading to the log-bilinear model with w_i, w_j, and bias terms.
    59:24 In model building, a bias term is added for each word to account for general word probabilities, enhancing the representation.
    01:00:23 Multiplying by the frequency of a word adjusts for common words, giving more importance to pairs with higher co-occurrence counts.
    01:02:40 Word vectors can be applied to end-user tasks like named entity recognition, significantly improving performance by capturing word meanings.
    01:06:23 Exploring word senses, having separate vectors for each meaning has been experimented with, but the majority practice is a single vector per word type.
    01:11:05 Word vectors for a word type can be seen as a superposition of sense vectors, a weighted average where the weighting corresponds to sense frequencies.
    Made with HARPA AI
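
    A minimal numpy sketch of the skip-gram negative-sampling step summarized at 15:21-20:57 above. Dimensions, word indices, and the counts behind the noise distribution are all made up; this illustrates the idea, not the lecture's or any library's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100                 # toy vocabulary size and embedding dimension
U = rng.normal(0, 0.1, (V, d))     # "outside" (context) vectors
W = rng.normal(0, 0.1, (V, d))     # "center" vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.05):
    """One negative-sampling update for a (center, context) pair plus K noise words."""
    v_c, u_o, u_neg = W[center], U[context], U[negatives]
    pos = sigmoid(u_o @ v_c)       # pushed toward 1 for the real pair
    neg = sigmoid(-u_neg @ v_c)    # pushed toward 1 for each noise pair
    loss = -np.log(pos) - np.log(neg).sum()

    # Gradients only involve the rows that appear, so the update is sparse.
    grad_vc = (pos - 1.0) * u_o + ((1.0 - neg)[:, None] * u_neg).sum(axis=0)
    grad_uo = (pos - 1.0) * v_c
    grad_uneg = (1.0 - neg)[:, None] * v_c
    W[center] -= lr * grad_vc
    U[context] -= lr * grad_uo
    U[negatives] -= lr * grad_uneg
    return loss

# Noise distribution: unigram counts raised to the 3/4 power (20:57).
counts = rng.integers(1, 1_000, V).astype(float)   # stand-in corpus counts
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

center_word, context_word = 42, 87                 # hypothetical word indices
noise_words = rng.choice(V, size=10, p=p_noise, replace=False)
print(sgns_step(center_word, context_word, noise_words))
```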

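    And a second sketch for the count-based route (23:40-35:54): build a co-occurrence matrix, rescale the counts, and keep only the top singular vectors. The 6x6 count matrix below is invented purely for illustration.

```python
import numpy as np

# Hypothetical co-occurrence counts for a tiny 6-word vocabulary; in practice
# these come from counting which words appear in a window around each word.
X = np.array([
    [0, 2, 1, 0, 0, 0],
    [2, 0, 3, 1, 0, 0],
    [1, 3, 0, 2, 1, 0],
    [0, 1, 2, 0, 4, 1],
    [0, 0, 1, 4, 0, 2],
    [0, 0, 0, 1, 2, 0],
], dtype=float)

X_scaled = np.log1p(X)           # log-scaling tames very frequent words (35:54)

# Truncated SVD: keep only the top-k singular values/vectors (32:42).
k = 2
U, S, Vt = np.linalg.svd(X_scaled)
word_vectors = U[:, :k] * S[:k]  # one k-dimensional vector per word

# Cosine similarity between two of the (hypothetical) words.
a, b = word_vectors[1], word_vectors[2]
print(word_vectors.shape, float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```
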
  • @Xufana
    @Xufana 2 years ago +19

    I guess the second question section ends on 45:55 and you might want to add a timestamp there

    • @Xufana
      @Xufana 2 years ago +16

      I would add these:
      45:55 Word vector evaluation
      48:30 Intrinsic evaluation
      57:42 Question
      1:01:45 Extrinsic evaluation
      1:03:25 Word sense & ambiguity

    • @sumekenov
      @sumekenov A year ago

      Bless you @Xufana

  • @Chrisoloni
    @Chrisoloni 2 years ago +30

    Thank you so much for this great course!

    • @stanfordonline
      @stanfordonline 2 years ago +12

      Hi Chrisoloni! Thanks for your comment, we're glad to hear you're enjoying the content - happy learning!

    • @tseringjorgais2811
      @tseringjorgais2811 2 years ago +2

      @stanfordonline Can I get the lecture slides somewhere?

  • @Salverse05
    @Salverse05 2 months ago

    Thank you so much for these lectures Stanford!

  • @goanshubansal8035
    @goanshubansal8035 A year ago +2

    Once I've understood the first two videos, I'll be on rung number two of the ladder.

  • @whatsupLoading
    @whatsupLoading 4 months ago +1

    At 37:00, marry -> bride might be more appropriate than marry -> priest.

  • @jded1346
    @jded1346 8 months ago +1

    Wonderful course!
    Clarification: @11:29: the sparseness of affected/updated J(θ) elements depends only on the window size, not on whether simple gradient descent or stochastic gradient descent is used, right? Within a window, the computation doesn't change between the two methods.

  • @yagneshbhadiyadra7938
    @yagneshbhadiyadra7938 2 months ago

    39:03 How are we making less use of statistics here compared to LSA-based algorithms? Doesn't the co-occurrence approach also use windows?

  • @nanunsaram
    @nanunsaram A year ago +2

    Great again!

  • @darkmember727
    @darkmember727 8 months ago +1

    Just found out he wrote the GloVe paper.

  • @ryancodrai487
    @ryancodrai487 A year ago

    At 2:45 I think what you said about the word2vec model being a bag-of-words model is not strictly correct. Word2vec does gain some understanding of local word ordering. If I am incorrect, could you please explain?

    • @mrfli24
      @mrfli24 A year ago +1

      If you look at the probability formula, it only contains dot products and doesn't have any specific position information.
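
      To make that concrete: the skip-gram probability is a softmax over dot products only, so the model assigns the same probability to a context word whether it sits one position or three positions from the center word. A toy sketch with random vectors (everything here is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 8, 4                     # toy vocabulary size and dimension
U = rng.normal(size=(V, d))     # outside-word vectors
v_c = rng.normal(size=d)        # center-word vector

# P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c): there is no position term,
# which is exactly the "bag of words" property discussed above.
scores = U @ v_c
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.round(3), probs.sum())
```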

  • @jongsong5370
    @jongsong5370 2 years ago +13

    I think... marry should be matched to bride and pray to priest on page 21.

    • @jeromeeusebius
      @jeromeeusebius 2 years ago +2

      Good point. It is not clear whether the lecturer drew the vectors or took them as-is from the paper, and the mismatch may indicate that the system is not perfect.

    • @jakanader
      @jakanader 2 years ago

      @jeromeeusebius It looks like the lecturer drew the vectors, as the endpoints are at varying distances from the words.

    • @carlloseduardofl
      @carlloseduardofl A year ago

      It could be that the corpus the embedding model was trained on had more sentences with marry and priest in the same context.

    • @kiran.pradeep
      @kiran.pradeep A year ago

      @carlloseduardofl Can you explain how the log-bilinear model 'with vector differences' formula came about? Which property of conditional probability was used? Any useful links? Timestamp 43:03

    • @vohiepthanh9692
      @vohiepthanh9692 A year ago

      I agree with you.

  • @ronitmndl
    @ronitmndl A year ago +3

    22:36 word2vec ends

  • @yukisuki5380
    @yukisuki5380 A year ago +2

    3:40 reasonably hahahahah

  • @muyuanliu3175
    @muyuanliu3175 A month ago

    I don’t think the loss function for GloVe is well explained. I’ve spent over an hour trying to understand it, but I still don’t get it.
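
    For anyone stuck at the same point: the objective sketched around 43:29 is the weighted least-squares loss from the GloVe paper (listed in the last comment of this thread). A literal, untrained, toy-sized numpy transcription of that formula, with made-up counts and random parameters:

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """J = sum over co-occurring pairs ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)                                # only pairs that actually co-occur
    x = X[i, j]
    f = np.where(x < x_max, (x / x_max) ** alpha, 1.0)  # caps the weight of very common pairs
    err = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(x)
    return float((f * err ** 2).sum())

# Toy shapes only: 5 words, 3 dimensions, random everything.
rng = np.random.default_rng(0)
V, d = 5, 3
X = rng.integers(0, 10, size=(V, V)).astype(float)
print(glove_loss(rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                 rng.normal(size=V), rng.normal(size=V), X))
```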

  • @AdityaAVG
    @AdityaAVG 8 months ago +1

    Can we get the lecture slides somewhere?

  • @goanshubansal8035
    @goanshubansal8035 A year ago

    This lecture is about neural classifiers.

  • @raghavkansal9701
    @raghavkansal9701 4 months ago

    I feel this course is giving me a tough time with the mathematics. Sad : _ _ (

  • @goanshubansal8035
    @goanshubansal8035 A year ago

    Have you understood deep learning standards yet?

  • @RomilVikramSonigra
    @RomilVikramSonigra A year ago +1

    While using stochastic gradient descent, if we choose a batch of 32 center words, how do we make updates to the outside (context) words that surround them? These words show up when we compute the likelihood, and if our batch doesn't include them, then how do their probabilities of occurring get updated?
    Thanks!

    • @AshishBangwal
      @AshishBangwal A year ago

      I think your query is answered at 11:36: we only calculate gradients for the words in those 32 windows, hence we get a sparse gradient update.
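
      Put differently: in one SGD step over a batch of 32 center words, only the vectors of words that actually appear (the centers, their context words, and any sampled negatives) receive a non-zero gradient; every other word's vector is untouched until a later batch happens to sample it. A tiny illustration with invented indices:

```python
import numpy as np

V, d = 50_000, 100
grad_U = np.zeros((V, d))    # dense view of one batch's gradient, purely for illustration

# Hypothetical rows touched by this batch: centers, their contexts, sampled negatives.
touched = np.array([7, 19, 10_003, 42_917])
grad_U[touched] = np.random.default_rng(0).normal(size=(len(touched), d))

rows_updated = int(np.count_nonzero(np.any(grad_U != 0, axis=1)))
print(rows_updated, "of", V, "rows change; in practice only those rows are updated at all.")
```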

  • @goanshubansal8035
    @goanshubansal8035 A year ago

    What are neural classifiers?

  • @shawnyang2851
    @shawnyang2851 5 months ago +1

    Some parts are damn confusing.

  • @annawilson3824
    @annawilson3824 A year ago

    1:10:37

  • @amitabhachakraborty497
    @amitabhachakraborty497 A year ago +7

    The lectures are not as good as you'd expect from Stanford; it's just recitation.

  • @annawilson3824
    @annawilson3824 A year ago

    38:00

  • @葛浩宇
    @葛浩宇 A year ago +4

    Any Chinese students here?

  • @happylife4775
    @happylife4775 A year ago +6

    Great material, bad explanation.

    • @vohiepthanh9692
      @vohiepthanh9692 A year ago +12

      I think you should read the papers "Efficient Estimation of Word Representations in Vector Space", "Distributed Representations of Words and Phrases and their Compositionality", and "GloVe: Global Vectors for Word Representation" to better understand this lecture. I don't think he can cover all the concepts in detail in just 1 hour and 15 minutes.