Transformer Encoder in 100 lines of code!

  • Published 11 Jun 2024
  • ABOUT ME
    ⭕ Subscribe: ua-cam.com/users/CodeEmporiu...
    📚 Medium Blog: / dataemporium
    💻 Github: github.com/ajhalthor
    👔 LinkedIn: / ajay-halthor-477974bb
    RESOURCES
    [ 1 🔎] Code for Video: github.com/ajhalthor/Transfor...
    PLAYLISTS FROM MY CHANNEL
    ⭕ Transformers from scratch playlist: • Self Attention in Tran...
    ⭕ ChatGPT Playlist of all other videos: • ChatGPT
    ⭕ Transformer Neural Networks: • Natural Language Proce...
    ⭕ Convolutional Neural Networks: • Convolution Neural Net...
    ⭕ The Math You Should Know : • The Math You Should Know
    ⭕ Probability Theory for Machine Learning: • Probability Theory for...
    ⭕ Coding Machine Learning: • Code Machine Learning
    MATH COURSES (7 day free trial)
    📕 Mathematics for Machine Learning: imp.i384100.net/MathML
    📕 Calculus: imp.i384100.net/Calculus
    📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
    📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
    📕 Linear Algebra: imp.i384100.net/LinearAlgebra
    📕 Probability: imp.i384100.net/Probability
    OTHER RELATED COURSES (7 day free trial)
    📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
    📕 Python for Everybody: imp.i384100.net/python
    📕 MLOps Course: imp.i384100.net/MLOps
    📕 Natural Language Processing (NLP): imp.i384100.net/NLP
    📕 Machine Learning in Production: imp.i384100.net/MLProduction
    📕 Data Science Specialization: imp.i384100.net/DataScience
    📕 Tensorflow: imp.i384100.net/Tensorflow
    TIMESTAMPS
    0:00 What we will cover
    0:53 Introducing Colab
    1:24 Word Embeddings and d_model
    3:00 What are Attention heads?
    3:59 What is Dropout?
    4:59 Why batch data?
    7:46 How to feed sentences into the transformer?
    9:03 Why feed forward layers in transformer?
    9:44 Why Repeating Encoder layers?
    11:00 The “Encoder” Class, nn.Module, nn.Sequential
    14:38 The “EncoderLayer” Class
    17:45 What is Attention: Query, Key, Value vectors
    20:03 What is Attention: Matrix Transpose in PyTorch
    21:17 What is Attention: Scaling
    23:09 What is Attention: Masking
    24:53 What is Attention: Softmax
    25:42 What is Attention: Value Tensors
    26:22 CRUX OF VIDEO: “MultiHeadAttention” Class
    36:27 Returning the flow back to “EncoderLayer” Class
    37:12 Layer Normalization
    43:17 Returning the flow back to “EncoderLayer” Class
    43:44 Feed Forward Layers
    44:24 Why Activation Functions?
    46:03 Finish the Flow of Encoder
    48:03 Conclusion & Decoder for next video

COMMENTS • 67

  • @CodeEmporium
    @CodeEmporium  1 year ago +23

    If you think I deserve it, please consider hitting the like button and subscribing for more content like this :)

  • @surajgorai618
    @surajgorai618 1 year ago +3

    This is the best explanation I have gone through

  • @user-yk2bh8ns5y
    @user-yk2bh8ns5y 1 year ago +8

    This is the most detailed Transformer video, THANK YOU!
    I have one question: the values tensor is [30, 8, 200, 64]; before we reshape it, shouldn't we permute it first? Like:
    values = values.permute(0, 2, 1, 3).reshape(batch_size, max_sequence_length, self.num_heads * self.head_dim)
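
A minimal sketch of the permute-then-reshape the commenter is asking about, using the shapes mentioned in the video ([batch=30, heads=8, seq=200, head_dim=64]); variable names are illustrative, not the video's exact code:

```python
import torch

batch_size, num_heads, max_sequence_length, head_dim = 30, 8, 200, 64

# attention output laid out as [batch, heads, seq_len, head_dim]
values = torch.randn(batch_size, num_heads, max_sequence_length, head_dim)

# Permute so each token's head outputs sit next to each other, then merge them;
# reshaping directly would interleave different positions instead.
values = values.permute(0, 2, 1, 3).reshape(
    batch_size, max_sequence_length, num_heads * head_dim
)
print(values.shape)  # torch.Size([30, 200, 512])
```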

  • @jingcheng2602
    @jingcheng2602 3 months ago +1

    Superb and so love these classes! Will watch all of them one by one

  • @sushantmehta7789
    @sushantmehta7789 1 year ago +4

    Next level video *especially* because of the dimensions laid out and giving intuition for things like k.transpose(-1, -2). Likely the best resource out right now!! Thanks for all your work!

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      Super glad you find this all useful!
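
For readers following along, here is a minimal, self-contained sketch of the scaled dot-product attention and the k.transpose(-1, -2) step mentioned above (shapes assumed from the video; not the exact on-screen code):

```python
import math
import torch
import torch.nn.functional as F

# [batch, heads, seq_len, head_dim]
q = torch.randn(30, 8, 200, 64)
k = torch.randn(30, 8, 200, 64)
v = torch.randn(30, 8, 200, 64)

d_k = q.size(-1)  # 64

# k.transpose(-1, -2) swaps only the last two dims: [30, 8, 200, 64] -> [30, 8, 64, 200],
# so q @ k^T gives pairwise token scores of shape [30, 8, 200, 200].
scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
attention = F.softmax(scores, dim=-1)    # each row sums to 1 over the key positions
new_values = torch.matmul(attention, v)  # back to [30, 8, 200, 64]
print(scores.shape, new_values.shape)
```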

  • @user-ul2mw6fu2e
    @user-ul2mw6fu2e 5 months ago +1

    You are awesome. The way you teach is incredible.

    • @CodeEmporium
      @CodeEmporium  5 months ago +1

      Thanks so much for this compliment. Super glad you enjoyed this

  • @gigabytechanz9646
    @gigabytechanz9646 1 year ago +1

    Very clear, useful and helpful explanation! Thank you!

  • @xingfenyizhen
    @xingfenyizhen 10 months ago +3

    Really friendly for the beginners!😁

    • @CodeEmporium
      @CodeEmporium  10 months ago +1

      Thanks a lot! Glad you found it useful

  • @seyedmatintavakoliafshari8272
    @seyedmatintavakoliafshari8272 3 months ago +1

    This video was really informative. Thank you for all the detailed explanations!

  • @danielbrooks6246
    @danielbrooks6246 1 year ago +1

    I watched the entire series and it gave me a deeper understanding of how all of this works. Very well done!!!! Takes a real master to take a complex topic and break it down in such a consumable way. I do have one question: What is the point of the permute? Can we not specify the shape we want in the reshape call?
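
On the permute question above, a tiny illustrative example (toy shapes, not the video's code): reshape alone keeps memory order and therefore mixes positions within a head, while permuting first concatenates each token's heads.

```python
import torch

# [batch=1, heads=2, seq_len=3, head_dim=4]
x = torch.arange(2 * 3 * 4).reshape(1, 2, 3, 4)

wrong = x.reshape(1, 3, 8)                      # keeps memory order: rows mix positions
right = x.permute(0, 2, 1, 3).reshape(1, 3, 8)  # rows are one token across both heads

print(wrong[0, 0])  # tensor([0, 1, 2, 3, 4, 5, 6, 7])        -> head 0, positions 0 and 1
print(right[0, 0])  # tensor([ 0,  1,  2,  3, 12, 13, 14, 15]) -> position 0, heads 0 and 1
```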

  • @pierrelebreton7634
    @pierrelebreton7634 1 year ago +1

    Thank you, I'm going through all your videos. Great work!

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +2

    It's really helpful that you are going through all the sizes of the various vectors and matrices.

  • @moseslee8761
    @moseslee8761 9 months ago +1

    bro... i love how u dive deep into explanations. You're a very good teacher holy shit

  • @Zero-ss6pn
    @Zero-ss6pn 3 months ago +1

    Just amazing!!!

  • @DeanLa
    @DeanLa 1 year ago +1

    This is the best content on youtube

  • @FAHMIAYARI
    @FAHMIAYARI 1 year ago +1

    bro you're a legend!

  • @KurtGr
    @KurtGr 1 year ago +1

    Appreciate your work! As someone else mentioned, hope you can do an implementation of training the network for a few iterations.

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      Yea. That’s the plan. I am currently working on setting the full thing up.

  • @salemibrahim2933
    @salemibrahim2933 1 year ago +1

    @CodeEmporium
    The transformer series is awesome!
    It is very informative.
    I have one comment: it is usually recommended to perform dropout before normalization layers. This is because normalization layers may undo dropout effects by re-scaling the input. By performing dropout before normalization, we ensure that the inputs to the normalization layer are still diverse and have different scales.
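
A minimal sketch of the ordering this comment suggests, i.e. dropout applied to the sub-layer output before the residual add and layer normalization (a hypothetical wrapper, not the video's exact class):

```python
import torch.nn as nn

class ResidualSubLayer(nn.Module):
    """Post-norm residual wrapper: sub-layer output -> dropout -> add residual -> LayerNorm."""
    def __init__(self, d_model: int = 512, drop_prob: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=drop_prob)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # dropout happens before the normalization, as the comment recommends
        return self.norm(x + self.dropout(sublayer(x)))
```

Usage would look like `out = block(x, lambda t: feed_forward(t))`, where `feed_forward` is whatever sub-layer the block wraps.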

  • @chenmargalit7375
    @chenmargalit7375 11 months ago

    Thanks for the great series. Would be very helpful if you'd attach the Colab.

  • @nallarajeshkumar9036
    @nallarajeshkumar9036 9 months ago

    Wonderful explanation

  • @TransalpDave
    @TransalpDave 1 year ago +1

    Awesome content as always! Are you planning to demonstrate a training example for the encoder in the next video? For example on a Wikipedia data sample or something like that?

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      Hoping to get to that stage. I currently have the code ready but it's a lil strange during inference. For more context: I am running into a situation where it's predicting the End of Sentence token only. Planning to fix this soon and have a full overview of the transformer. But in the meantime there are so many more videos I can make on the decoder

    • @TransalpDave
      @TransalpDave 1 year ago

      @@CodeEmporium Oh ok i see, i'm also close to that step, i'll let you know if i find something

  • @li-pingho1441
    @li-pingho1441 1 year ago

    awesome content! thanks a lot!!

  • @user-ut2xu8eb7c
    @user-ut2xu8eb7c 8 months ago

    thank you!

  • @ramanshariati5738
    @ramanshariati5738 11 months ago

    you are awesome bro

  • @qingjieqi3379
    @qingjieqi3379 1 year ago

    Amazing video series! At 39:07, why does the layer normalization only consider one dimension (the parameter shape) and not the batch size? Your previous video on layer normalization mentioned it should consider both. Am I missing something?
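
Regarding the question above: in PyTorch, nn.LayerNorm takes the shape of its learnable parameters (the trailing dims to normalize over), and the statistics are computed per position, so the batch size never needs to be specified. A small sketch with assumed shapes:

```python
import torch
import torch.nn as nn

x = torch.randn(30, 200, 512)      # [batch, seq_len, d_model]

layer_norm = nn.LayerNorm(512)     # parameter shape = last dim only
out = layer_norm(x)                # mean/std computed over d_model for every (batch, position)

print(out.shape)                   # torch.Size([30, 200, 512])
print(layer_norm.weight.shape)     # torch.Size([512])  - learnable gamma
print(layer_norm.bias.shape)       # torch.Size([512])  - learnable beta
```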

  • @RanDuan-dp6oz
    @RanDuan-dp6oz 1 year ago +1

    Thanks!

    • @CodeEmporium
      @CodeEmporium  1 year ago

      Thanks for the donation and for watching!

  • @chrisillas3010
    @chrisillas3010 1 year ago

    Great video!!!! Best content for transformers... Can you suggest ways to implement a transformer encoder for time series data?

  • @user-im5qb7ix6e
    @user-im5qb7ix6e 1 year ago

    thank u a lot

  • @cmacompilation4649
    @cmacompilation4649 1 year ago

    Please blow up the decoder as well hahaa!!
    Thanks Ajay, these videos were very helpful for me.

  • @-mwolf
    @-mwolf 1 year ago

    Thanks!
    Please do Cross Attention and maybe Attention visualizations next!

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      Yep! I plan to do some more videos on the decoder part too

  • @prashantlawhatre7007
    @prashantlawhatre7007 1 year ago +2

    Hi Ajay. I think we need to make a small change in the forward() function of the encoder class. We should be doing `x_residual = x.clone() # or x_residual = x[:]` instead of `x_residual = x`. This will ensure that x_residual contains a copy of the original x and is not affected by any changes made to x.

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      Oh interesting. I have been running into issues during training. I’ll make this change and check. Thanks a ton for surfacing!
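
A small illustration of the point raised in this thread: plain assignment only aliases the tensor, so in-place ops would also change the saved residual, while clone() makes an independent copy (note that tensor slicing like x[:] still returns a view, so clone() is the safer choice; and if the forward pass only rebinds x = layer(x) without in-place mutation, the alias behaves the same as a copy):

```python
import torch

x = torch.ones(3)
alias = x              # just another reference to the same storage
x.add_(1)              # in-place op
print(alias)           # tensor([2., 2., 2.])  - the saved "residual" changed too

x = torch.ones(3)
residual = x.clone()   # independent copy
x.add_(1)
print(residual)        # tensor([1., 1., 1.])  - unaffected
```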

  • @dwarakanathchandra7611
    @dwarakanathchandra7611 9 months ago

    Hats off to you for explaining such a complex topic with simplicity and understanding. Thanks a lot. Is there any course you're offering besides these awesome videos on YouTube? Want to learn more concepts from you.

    • @CodeEmporium
      @CodeEmporium  9 months ago

      Thanks so much for the compliments. At the moment, my best teaching resources are on YouTube. Luckily, there are hundreds of videos on the channel haha

    • @dwarakanathchandra7611
      @dwarakanathchandra7611 9 months ago

      @@CodeEmporium Thanks for the info, sir. I am a student of AI and ML, very much interested in NLP. If you have any suggestions for research projects that I could pursue for my academic research, kindly suggest them. I am reading the papers one by one. If you have any interesting ideas, it would help me a lot.

  • @creativeuser9086
    @creativeuser9086 1 year ago +1

    I know it's a lazy question, but can someone tell me why multi-head attention is better than a single head for performing attention?

  • @GIChow
    @GIChow 1 year ago +1

    I am looking forward to seeing whether you will try to put all the bits of the transformer together i.e. the positional encoder before this "encoder" and then the decoder after. I wonder whether/how it will respond to the input text "My name is Ajay". Would it respond as though in a conversation "Hi, how are you" / "My name is Bot", or generate more text in the same vein e.g. "I am 28 years old", or translate it to another language, or something else. To achieve an end-to-end use case I guess we will also need appropriate data to be able to train the models and then actually train the models, save the model weights somehow, etc. Am new to all this but your videos are gradually helping me understand more, e.g. the encoder input and output matrix being of the same size to permit stacking. Thanks 👍

    • @CodeEmporium
      @CodeEmporium  1 year ago +2

      This is the goal. I am constructing this transformer bit by bit and showing my findings. We will eventually have the full thing

  • @eekinchan6620
    @eekinchan6620 8 months ago

    Hi. Great video but I have a question. Referring to 19:31, why is the dimension of k found using the code q.size()[-1]? Shouldn't it be k.size()[-1] instead? Thanks in advance :)
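
On the question above: in this attention computation q and k share the same last dimension (head_dim), so q.size()[-1] and k.size()[-1] are equal; the paper's notation scales by sqrt(d_k), so k.size(-1) arguably reads more literally. A quick check with assumed shapes:

```python
import torch

q = torch.randn(30, 8, 200, 64)
k = torch.randn(30, 8, 200, 64)

# Both give the same head_dim, so either works as d_k in the 1/sqrt(d_k) scaling.
assert q.size(-1) == k.size(-1) == 64
```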

  • @convolutionalnn2582
    @convolutionalnn2582 1 year ago +2

    What would be the best book to learn probability and statistics for Machine Learning?

    • @linkinlinkinlinkin654
      @linkinlinkinlinkin654 1 year ago

      Before any book, just take a 500-level course each on probability and linear algebra from any university's free online classes. These two topics are not truly understood through even the best explanations, only by solving problems.

  • @hermannangstl1904
    @hermannangstl1904 1 year ago

    I understand how the forward pass works, but not how the learning works. Basically all videos I have seen so far covering Transformers "only" explain the forward pass, but not the training. For example I'd like to know what the loss function is.
    Question 2: afaik an Encoder can work on its own and doesn't (necessarily) need a Decoder (for example for non-translation use cases). How does the training work in this case? What is the loss function here? (-> we don't have a target sentence)

    • @CodeEmporium
      @CodeEmporium  1 year ago

      If you go further into the playlist (I just uploaded the code for this in my most recent video in the playlist), it is a cross entropy loss. We compare every character generated to the label; take the average loss; and perform backpropagation to update all weights in the network once after seeing all sentences in the batch
      For your Question 2, I am not exactly sure what you are alluding to. Yes, you can just use the encoder but depending on the task you want to solve, you’ll need to define an appropriate loss. For example, BERT architectures are encoder only architectures that may append additional feed forward networks to solve a specific task. These architectures will also learn via back propagation once we are able to quantify a loss.

    • @hermannangstl1904
      @hermannangstl1904 1 year ago

      @@CodeEmporium Thank you for your reply. For Q2: My plan is to deal/code/understand the Encoder and the Decoder part separately, starting with the Encoder. Especially how these attention vectors develop over time. How they actually look for a small example, trained with a couple of sentences. Visualize them. See how, for example, "dog" is closer to "cat" than to, for example, "screwdriver".
      But I don't know what the loss function would be to train this model. Could I maybe feed the network with parts of a sentence so that it can learn how to predict the next word?
      E.g. the full sentence could be: "my dog likes to chase the cat of my neighbor".
      X: "my" Y: "dog"
      X: "my dog" Y: "likes"
      X: "my dog likes" Y: "to"
      X: "my dog likes to" Y: "chase"
      ... and so on ...
      Would this kind of training be sufficient for the network to calculate the Attention vectors?
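
A minimal sketch of the per-token cross-entropy objective described in the reply above (vocabulary size and shapes are made up; a real model would produce the logits). The (X, Y) next-word pairs in the follow-up are one way such targets could be constructed:

```python
import torch
import torch.nn as nn

vocab_size, batch_size, seq_len = 100, 30, 200

# Stand-in for model output: one score per vocabulary entry at every position.
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))   # label token ids

# Every predicted token is compared to its label and the losses are averaged;
# backward() then propagates gradients so all weights can be updated.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
print(loss.item())
```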

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago

    Where did you get the 3 in 3 times 512 = 1536? Is it 3 because you have query, key, and value?

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      For every token (word or character), we have 3 vectors: query, key and value. Each token is represented by a 512-dimensional vector. This is encoded into the query, key and value vectors that are also 512 dimensions each. Hence 3 * 512
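
To make the reply concrete, a small sketch of the combined projection (shapes as in the video; the layer name is illustrative):

```python
import torch
import torch.nn as nn

d_model = 512
x = torch.randn(30, 200, d_model)         # [batch, seq_len, d_model]

# One linear layer produces query, key and value together: 3 * 512 = 1536 outputs,
# which are then split back into three 512-dimensional vectors per token.
qkv_layer = nn.Linear(d_model, 3 * d_model)
qkv = qkv_layer(x)                        # [30, 200, 1536]
q, k, v = qkv.chunk(3, dim=-1)            # each [30, 200, 512]
print(q.shape, k.shape, v.shape)
```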

  • @amiralioghli8622
    @amiralioghli8622 9 months ago

    Overall your explanation is great, but I'm a little confused. Actually, I could not understand the difference between positional encoding and the position-wise feed-forward network. Can anyone explain it to me?
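
On the confusion above: positional encoding adds position information to the token embeddings before the encoder, while the position-wise feed-forward network is a small MLP applied identically at every position inside each encoder layer. A sketch of the latter (dimensions assumed, not the video's exact class):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Two linear layers applied independently to every position's vector."""
    def __init__(self, d_model: int = 512, hidden: int = 2048, drop_prob: float = 0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, hidden)
        self.linear2 = nn.Linear(hidden, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)

    def forward(self, x):                  # x: [batch, seq_len, d_model]
        return self.linear2(self.dropout(self.relu(self.linear1(x))))

ffn = PositionwiseFeedForward()
print(ffn(torch.randn(30, 200, 512)).shape)   # torch.Size([30, 200, 512])
```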

  • @froozynoobfan
    @froozynoobfan 1 year ago

    Your code is pretty clean, though I'd prefer "black" code formatting

  • @godly_wisdom777
    @godly_wisdom777 1 year ago

    A video about how to code ChatGPT in which the code is generated by ChatGPT 😁

  • @-mwolf
    @-mwolf 1 year ago

    I think you forgot to address passing the mask value in your MHA code.. I think here you need a ModuleList and can't use nn.Sequential

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      I definitely need this for the decoder and I get around this by implementing my custom “Sequential” class. I was able to run this code tho just fine as is (sorry if I missed exactly what you are alluding to)

    • @-mwolf
      @-mwolf 1 year ago

      @@CodeEmporium Ah of course - I missed that we don't need it for the encoder (and that you could implement a custom nn.Sequential as opposed to a ModuleList of the layers. Although I'm not sure which of the approaches would be nicer).
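
A minimal sketch of the two approaches discussed in this thread: a custom Sequential subclass that forwards the mask, versus an explicit loop over an nn.ModuleList (EncoderLayer here stands in for the video's layer class and is assumed to accept (x, mask)):

```python
import torch.nn as nn

class SequentialEncoder(nn.Sequential):
    """nn.Sequential only passes a single argument between layers;
    this subclass forwards the mask to every layer as well."""
    def forward(self, x, mask=None):
        for layer in self:
            x = layer(x, mask)
        return x

# Equivalent ModuleList version (explicit loop instead of a subclass):
# self.layers = nn.ModuleList([EncoderLayer(...) for _ in range(num_layers)])
# for layer in self.layers:
#     x = layer(x, mask)
```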

  • @vigneshvicky6720
    @vigneshvicky6720 9 months ago

    Yolov8

  • @user-kz2es8sg3f
    @user-kz2es8sg3f 9 months ago

    Did he just mimic what Andrej Karpathy was doing? The explanation is not even 10% as clear as what Andrej did. So bad.