Transformers, Simply Explained | Deep Learning

  • Published 26 Nov 2024

COMMENTS • 17

  • @JuergenAschenbrenner · 4 months ago

    Excellent stuff, who needs Netflix with this ;-)

  • @TJVideoChannelUTube · 1 year ago

    When word2vec is used in a Transformer encoder, as in ChatGPT, does word2vec work with a tokenizer? In other words, is word2vec exactly the Embedder shown at 9:15, apart from the Tokenizer?

    • @deepbean · 1 year ago

      Hi TJ,
      Transformer models use a scheme where each word is tokenized and then each token is converted to an embedding vector, which is learnt during training. This is similar to, though different from, Word2Vec, which uses a so-called "static" embedding that is pre-trained rather than learnt during model training. So in summary, the scheme is similar to Word2Vec (tokenization + embedding), but Word2Vec itself is not used.
      Hope this answers your question!
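
      A minimal sketch of this tokenize-then-embed scheme, assuming PyTorch (the toy vocabulary and sizes below are made up for illustration). The embedding table is an ordinary trainable parameter updated by backpropagation, unlike a frozen, pre-trained word2vec matrix:

        import torch
        import torch.nn as nn

        # Toy stand-in for a real (subword) tokenizer: words -> integer ids.
        vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
        token_ids = torch.tensor([[vocab.get(w, vocab["<unk>"]) for w in "the cat sat".split()]])

        # Trainable embedding table: one learnt vector per token id.
        embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
        x = embed(token_ids)                        # shape: (1, 3, 8)
        print(x.shape, embed.weight.requires_grad)  # torch.Size([1, 3, 8]) True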

    • @TJVideoChannelUTube · 1 year ago

      @@deepbean So word2vec has not been used in ChatGPT? I read somewhere that word2vec was used in early ChatGPT and that GPT-3 used singular value decomposition (SVD).
      If word2vec were used in a Transformer, would it cover both the Tokenizer and the Embedder shown at 9:15? Correct?

    • @deepbean · 1 year ago

      @TJ yeah, word2vec is not used in ChatGPT; the embeddings are learnt during training, rather than taken from a pre-trained embedding like word2vec.

  • @TJVideoChannelUTube · 1 year ago

    Where are the deep learning layers where activation functions and backpropagation are involved? In other words, the training parts?

    • @deepbean · 1 year ago

      In general, all the fully-connected layers that form Q, K and V from X are the trained parts, as well as the embedding vectors. I hope all my answers are clear! Please let me know if not.

    • @TJVideoChannelUTube · 1 year ago

      @@deepbean So all the deep learning parts in a Transformer (activation functions and backpropagation) are in Q, K and V between the layers?
      How about the Feed-Forward layers? They are fully connected MLPs, so they should be part of training too, correct?
      I was not aware that the word-embedding part of a Transformer is also trained. I thought some pre-trained word-embedding mechanism, like word2vec, was used instead; in other words, that the word embeddings are not trained within the Transformer.

    • @TJVideoChannelUTube · 1 year ago

      @@deepbean So in the Transformer model, only these layer types contain trainable parameters involved in deep learning, with (3) also using activation functions and backpropagation:
      (1) the word embedding layer;
      (2) the weight matrices for K, V, Q;
      (3) the feed-forward (fully connected) layers.
      Correct?
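
      For reference, a minimal sketch that lists the trainable parameter groups of one standard encoder block plus an embedding table, assuming PyTorch (the stack below is hypothetical and the sizes are illustrative). Besides the groups listed above, a stock implementation also trains the attention output projection and the layer-norm parameters:

        import torch.nn as nn

        d_model, n_heads, vocab_size = 64, 4, 1000

        # Hypothetical minimal stack: embedding table + one standard encoder layer.
        model = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                       dim_feedforward=128, batch_first=True),
        )

        # Every parameter printed here receives gradients during training:
        # the embedding weight, the attention in/out projections (containing
        # W_Q, W_K, W_V), the two feed-forward linear layers, and layer norms.
        for name, p in model.named_parameters():
            print(name, tuple(p.shape), p.requires_grad)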

  • @TJVideoChannelUTube · 1 year ago

    Still not very clear on how Q, K, V are obtained at 18:55. In the decoder's cross-attention layer at 23:46, K and V come from the encoder? And Q from the masked multi-head attention? How are they formed?

    • @deepbean · 1 year ago

      So, Q, K and V are formed from the output of the previous layers (denoted as X at 18:55) using linear layers; in other words, each is produced by a fully-connected layer whose weights are adapted during training. The information flows I've shown are abstract (and I apologize if they're not clear!), but in practice all of these are formed via fully-connected layers.
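
      Concretely, a minimal single-head self-attention sketch, assuming PyTorch (dimensions are illustrative): three trainable linear layers turn X into Q, K and V, then scaled dot-product attention combines them.

        import torch
        import torch.nn as nn

        d_model = 64
        X = torch.randn(1, 10, d_model)   # previous layer's output: (batch, seq, d_model)

        # Trainable fully-connected projections producing Q, K, V from X.
        W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
        Q, K, V = W_q(X), W_k(X), W_v(X)

        # Scaled dot-product attention: scores, softmax weights, weighted sum of V.
        scores = Q @ K.transpose(-2, -1) / d_model ** 0.5   # (1, 10, 10)
        weights = torch.softmax(scores, dim=-1)
        out = weights @ V                                   # (1, 10, 64)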

    • @TJVideoChannelUTube · 1 year ago

      @@deepbean In the cross-attention layer at 23:46, K and V come from the encoder and Q from the masked multi-head attention? Why does it have to be K and V from the encoder, rather than, say, K and Q, or even V and Q?

    • @deepbean · 1 year ago

      @@TJVideoChannelUTube Hi,
      This is because the encoder output provides the "context" (K and V) under which the query Q is to be processed, and Q comes from the decoder since this is part of the decoder processing.
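
      To illustrate, a minimal cross-attention sketch, assuming PyTorch: the decoder state supplies the query, while the encoder output supplies the keys and values that act as the context.

        import torch
        import torch.nn as nn

        d_model = 64
        enc_out = torch.randn(1, 12, d_model)  # encoder output (source-side context)
        dec_x = torch.randn(1, 7, d_model)     # output of the decoder's masked self-attention

        W_q = nn.Linear(d_model, d_model, bias=False)  # applied to the decoder side -> Q
        W_k = nn.Linear(d_model, d_model, bias=False)  # applied to the encoder side -> K
        W_v = nn.Linear(d_model, d_model, bias=False)  # applied to the encoder side -> V

        Q, K, V = W_q(dec_x), W_k(enc_out), W_v(enc_out)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / d_model ** 0.5, dim=-1)  # (1, 7, 12)
        out = attn @ V   # each decoder position attends over the encoder context: (1, 7, 64)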

  • @TJVideoChannelUTube · 1 year ago

    When word2vec is used in a Transformer decoder, as in ChatGPT, how does word2vec translate the score matrix S at 25:20 into words for the final output?

    • @deepbean · 1 year ago

      In this stage, words (or more accurately, tokens) are chosen based on the maximum score in each row of the score matrix S. Each row represents one token slot in the sequence, and each column holds the score of one vocabulary token for that slot. The model then selects, for each slot, whichever token has the maximum score.
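
      A minimal sketch of that selection step, assuming PyTorch (the vocabulary size and scores are made up): each row of S scores every vocabulary token for one output slot, and argmax picks the winner per row.

        import torch

        seq_len, vocab_size = 5, 1000
        S = torch.randn(seq_len, vocab_size)   # one row of scores per output slot

        token_ids = S.argmax(dim=-1)           # highest-scoring token id for each slot
        print(token_ids)                       # e.g. tensor([ 17, 942,   3, 511,  88])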

    • @TJVideoChannelUTube · 1 year ago

      @@deepbean So the final output words are not generated by some pre-trained word-embedding system like word2vec? I thought there was some reverse-mapping mechanism, provided by a pre-trained word-embedding system such as word2vec, for generating the final output. How is the maximum-score token mapped to the word it represents?

    • @deepbean · 1 year ago

      @TJ yes, that's correct: it learns the vectorization rather than using a pre-trained vectorizer like word2vec (though I suppose it could theoretically be done that way). At the end, each token is mapped back to a word using the same token-word vocabulary set up at the tokenization/embedding stage.
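
      For completeness, a small sketch of that reverse mapping in plain Python (the toy vocabulary is hypothetical): the token-to-id vocabulary built at the tokenization stage is simply inverted to turn predicted ids back into words.

        # Vocabulary used on the way in (token -> id) ...
        stoi = {"the": 0, "cat": 1, "sat": 2, "down": 3}
        # ... and its inverse used on the way out (id -> token).
        itos = {i: w for w, i in stoi.items()}

        predicted_ids = [0, 1, 2, 3]  # e.g. per-slot argmax results
        print(" ".join(itos[i] for i in predicted_ids))  # "the cat sat down"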