Coding a Transformer from scratch on PyTorch, with full explanation, training and inference.

  • Published 7 Nov 2024

COMMENTS • 358

  • @comedyman4896
    @comedyman4896 Рік тому +133

    personally, I find that seeing someone actually code something from scratch is the best way to get a basic understanding

    • @zhilinwang6303
      @zhilinwang6303 9 місяців тому

      indeed

    • @馬桂群
      @馬桂群 9 місяців тому

      indeed

    • @CM-mo7mv
      @CM-mo7mv 8 місяців тому

      I don't need to see someone typing... but you might also enjoy watching the grass grow or paint dry

    • @FireFly969
      @FireFly969 6 місяців тому +2

      Yeah, and you see how these technologies work.
      It's insane that, in the end, it looks easy enough that you can do something that companies worth millions and billions of dollars do. On a smaller scale, but the same idea in the end.

    • @AtomicPixels
      @AtomicPixels 6 місяців тому

      Yeah kinda ironic how that works. The simplest stuff required the most complex explanations

  • @umarjamilai
    @umarjamilai  Рік тому +155

    The full code is available on GitHub: github.com/hkproj/pytorch-transformer
    It also includes a Colab Notebook so you can train the model directly on Colab.
    Of course nobody reinvents the wheel, so I have watched many resources about the transformer to learn how to code it. All of the code is written by me from zero except for the code to visualize the attention, which I have taken from the Harvard NLP group article about the Transformer.
    I highly recommend that all of you do the same: watch my video and try to code your own version of the Transformer... that's the best way to learn it.
    Another suggestion I can give is to download my git repo and run it on your computer, debugging the training and inference line by line while trying to guess the tensor size at each step. This will make sure you understand all the operations. Plus, if some operation is not clear to you, you can just watch the variables in real time to understand the shapes involved.
    Have a wonderful day!
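
    As a concrete illustration of the "guess the tensor size at each step" advice, here is a minimal, self-contained sketch (illustrative sizes only, not code from the repository) of checking shapes with asserts while stepping through a forward pass:

      import torch

      batch_size, seq_len, d_model, h = 8, 350, 512, 8
      d_k = d_model // h

      x = torch.randn(batch_size, seq_len, d_model)              # embedding output
      assert x.shape == (batch_size, seq_len, d_model)

      # inside multi-head attention, after splitting into heads:
      x_heads = x.view(batch_size, seq_len, h, d_k).transpose(1, 2)
      assert x_heads.shape == (batch_size, h, seq_len, d_k)

      scores = x_heads @ x_heads.transpose(-2, -1)               # attention scores
      assert scores.shape == (batch_size, h, seq_len, seq_len)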

    • @AiEdgar
      @AiEdgar Рік тому +1

      The best video ever

    • @odyssey0167
      @odyssey0167 Рік тому

      Can you provide with the pretrained models?

    • @wilfredomartel7781
      @wilfredomartel7781 7 місяців тому

      🎉 Is this the BERT architecture?

    • @sachinmohanty4577
      @sachinmohanty4577 2 місяці тому

      @@wilfredomartel7781 It's a complete encoder-decoder model; BERT is the encoder part of this encoder-decoder architecture

  • @ArslanmZahid
    @ArslanmZahid 11 місяців тому +28

    I have browsed YouTube for the perfect set of videos on the transformer, but your set of videos (the explanation you did of the transformer architecture) and this one are by far the best!! Take a bow, brother, you have really contributed to viewers in an amount you can't even imagine. Really appreciate this!!!

  • @linyang9536
    @linyang9536 10 місяців тому +9

    This is the most detailed video on building a Transformer model from scratch that I have ever seen, from code implementation to data processing to visualization. The creator really breaks everything down in fine detail. Thank you!

    • @decarteao
      @decarteao 9 місяців тому

      I didn't understand a thing! But I left my like.

    • @astrolillo
      @astrolillo 7 місяців тому +1

      @@decarteao The guy from China is very funny in the video

  • @physicswithbilalasmatullah
    @physicswithbilalasmatullah 7 місяців тому +33

    Hi Umar. I am a first year student at MIT who wants to do AI startups. Your explanation and comments during coding were really helpful. After spending about 10 hours on the video, I walk away with great learnings and great inspiration. Thank you so much, you are an amazing teacher!

    • @umarjamilai
      @umarjamilai  7 місяців тому +2

      Best of luck with your studies and thank you for your support!

    • @shauryaomar5090
      @shauryaomar5090 Місяць тому

      I am a 3rd semester student at IIT Roorkee. I am also interested in AI startups.

  • @yangrichard7874
    @yangrichard7874 11 місяців тому +41

    Greetings from China! I am a PhD student focused on AI research. Your video really helped me a lot. Thank you so much, and I hope you enjoy your life in China.

    • @umarjamilai
      @umarjamilai  11 місяців тому +2

      Thank you! Let's connect on LinkedIn.

    • @germangonzalez3063
      @germangonzalez3063 2 місяці тому

      I am also a Ph.D. student. This video is valuable. Many thanks!

  • @kozer1986
    @kozer1986 Рік тому +7

    I'm not sure if it's because I have studied this content 1,000,000 times, but this is the first time that I've understood the code and felt confident about it. Thanks!

  • @MuhammadArshad
    @MuhammadArshad Рік тому +14

    Thank God it's not one of those 'ML in 5 lines of Python code' or 'learn AI in 5 minutes' videos. I cannot imagine how much time you must have spent making this tutorial. Thank you so much. I have watched it three times already and wrote the code while watching the second time (with a lot of typos :D).

  • @mittcooper
    @mittcooper Місяць тому +2

    Hi Umar. Absolutely amazing 🤯. Your clear breakdown and explanation of the concepts and code is just next level. Until I watched your video I had a very tentative handle on transformers. After watching I have a much better fundamental grasp of EVERY component. I can't say thank you enough. Please keep doing what you are doing.

  • @faiyazahmad2869
    @faiyazahmad2869 4 місяці тому +5

    One of the best tutorials for understanding and implementing the Transformer model... Thank you for making such a wonderful video

  • @abdullahahsan3859
    @abdullahahsan3859 Рік тому +27

    Keep doing what you are doing. I really appreciate you taking so much time to spread such knowledge for free. I've been studying transformers for a long time, but never have I understood them so well. The theoretical explanation in the other video combined with this practical implementation: just splendid. I will be going through your other tutorials as well. I know how time-consuming it is to produce such high-level content, and all I can really say is that I am grateful for what you are doing and hope that you continue doing it. Wish you a great day!

    • @umarjamilai
      @umarjamilai  Рік тому +3

      Thank you for your kind words. I wish you a wonderful day and success for your journey in deep learning!

  • @abdulkarimasif6457
    @abdulkarimasif6457 Рік тому +6

    Dear Umar, your video is full of knowledge; thanks for sharing.

  • @SaltyYagi
    @SaltyYagi 2 місяці тому +2

    I really appreciate your efforts. The explanations are very clear. This is a great service for people that wish to learn the future of AI! All the best from Spain!

  • @JohnSmith-he5xg
    @JohnSmith-he5xg Рік тому +7

    Loving this video (only 13 minutes in), really like you using type hints, commenting, descriptive variable names, etc. Way better coding practices than most of the ML code I've looked at.
    At 13:00, for the 2nd arg of the array indexing, you could just do ":" and it would be identical.

    • @tonyt1343
      @tonyt1343 10 місяців тому +2

      Thank you for this comment! I'm coding along with this video and I wasn't sure if my understanding was correct. I'm glad someone else was thinking the same thing. Just to be clear, I am VERY THANKFUL for this video and am in no way complaining. I just wanted to make sure I understand because I want to fully internalize this information.
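
      A tiny sketch of the slicing equivalence discussed above (an illustrative tensor, not the repository's buffer): when the trailing dimension is taken in full, writing ":" explicitly or omitting it gives the same result.

        import torch

        pe = torch.randn(1, 350, 512)    # stand-in for the positional-encoding buffer
        seq_len = 10

        a = pe[:, :seq_len, :]           # explicit ":" for the last dimension
        b = pe[:, :seq_len]              # last dimension omitted
        assert torch.equal(a, b)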

  • @jerrysmith3593
    @jerrysmith3593 10 днів тому +1

    Bro, you saved me. I'm a graduate student at USTC (University of Science and Technology of China). Watching your videos, I'm not only learning deep learning but also practicing my English listening 😁

    • @umarjamilai
      @umarjamilai  9 днів тому +1

      You're welcome! I'll be releasing new videos soon, stay tuned!

  • @aiden3085
    @aiden3085 11 місяців тому +4

    Thank you Umar for your extraordinarily excellent work! The best transformer tutorial I have ever seen!

  • @manishsharma2211
    @manishsharma2211 11 місяців тому +2

    WOW WOW WOW, though it was a bit tough for me, I was able to understand around 80% of the code. Beautiful. Thank you so much

  • @zhengwang1402
    @zhengwang1402 11 місяців тому +2

    It feels really fantastic watching someone write a program from the bottom up

  • @mikehoops
    @mikehoops Рік тому +2

    Just to repeat what everyone else is saying here - many thanks for an amazing explanation! Looking forward to more of your videos.

  • @raviparihar3298
    @raviparihar3298 5 місяців тому +2

    Best video I have ever seen on the whole of YouTube on the transformer model. Thank you so much, sir!

  • @balajip5030
    @balajip5030 Рік тому +2

    Thanks, bro. With your explanation, I was able to build the transformer model for my application. You explained it so well. Please keep doing what you are doing.

  • @saziedhassan3976
    @saziedhassan3976 Рік тому +2

    Thank you so much for taking the time to code and explain the transformer model in such detail. You are amazing and please do a series on how transformers can be used for time series anomaly detection and forecasting!

  • @dengbuqi
    @dengbuqi 6 місяців тому +1

    What a WONDERFUL example of transformer! I am Chinese and I am doing my PhD program in Korea. My research is also about AI. This video helps me a lot. Thank you!
    BTW, your Chinese is very good!😁😁

  • @mohamednabil374
    @mohamednabil374 Рік тому +5

    Thanks Umar for this comprehensive tutorial; after watching many videos, I would say this is AWESOME! It would be really nice if you could provide us with more tutorials on Transformers, especially on training them for longer sequences. :)

    • @umarjamilai
      @umarjamilai  Рік тому +1

      Hi mohamednabil374, stay tuned for my next video on the LongNet, a new transformer architecture that can scale up to 1 billion tokens.

  • @ghabcdef
    @ghabcdef 9 місяців тому +3

    Thanks a ton for making this video and all your other videos. Incredibly useful.

    • @umarjamilai
      @umarjamilai  9 місяців тому

      Thanks for your support!

  • @terryliu3635
    @terryliu3635 5 місяців тому +1

    I learnt a lot from following the steps in this video and creating a transformer myself step by step!! Thank you!!

  • @shresthsomya7419
    @shresthsomya7419 9 місяців тому +2

    Thanks a lot for such a detailed video. Your videos on the transformer are the best.

  • @maxmustermann1066
    @maxmustermann1066 Рік тому +4

    This video is incredible, never understood it like this before. I will watch your next videos for sure, thank you so much!

  • @SaiManojPrakhya-mp4oe
    @SaiManojPrakhya-mp4oe 3 місяці тому

    Dear Umar - thank you so much for this amazing and very clear explanation. It has deeply helped me and many others in understanding the theoretical and practical implementation of transformers! Take a bow!

  • @123Handbuch
    @123Handbuch 16 днів тому +1

    ATTENTION: You don't need torchtext anymore; it's deprecated. Just remove the line "import torchtext.datasets as datasets" and install the datasets package instead ("pip install datasets")
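
    For readers updating the imports, a minimal sketch of loading the bilingual data with the Hugging Face datasets package (the "opus_books" dataset with an "en-it" pair is the one used in the video; adjust the pair name to your languages):

      from datasets import load_dataset

      # Downloads the English-Italian split of OPUS Books from the Hugging Face Hub.
      ds_raw = load_dataset("opus_books", "en-it", split="train")

      # Each item should look like {'id': ..., 'translation': {'en': '...', 'it': '...'}}
      print(ds_raw[0])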

  • @si0n4ra
    @si0n4ra Рік тому +1

    Umar, thank you for the amazing example and clear explanation of all your steps and actions.

    • @umarjamilai
      @umarjamilai  Рік тому

      Thank you for watching my video and your kind words! Subscribe for more videos coming soon!

    • @si0n4ra
      @si0n4ra Рік тому

      @@umarjamilai , mission completed 😎.
      Already subscribed.
      All the best, Umar

  • @phanindraparashar8930
    @phanindraparashar8930 Рік тому +10

    It is a really amazing video. I tried understanding the code from various other YouTube channels but was always getting confused. Thanks a lot :). Can you make a series on BERT & GPT as well, where you build these models and train them on custom data?

    • @umarjamilai
      @umarjamilai  Рік тому +21

      Hi Phanindra! I'll definitely continue making more videos. It takes a lot of time and patience to make just one video, not counting the preparation time to study the model, write the code and test it. Please share the channel and subscribe; that's the biggest motivation to continue providing high-quality content to you all.

    • @rubelahmed5458
      @rubelahmed5458 11 місяців тому

      A coding example for BERT would be great!@@umarjamilai

  • @VishnuVardhan-sx6bq
    @VishnuVardhan-sx6bq 10 місяців тому +1

    This is such great work. I don't really know how to thank you, but this is an amazing explanation of an advanced topic such as the transformer.

  • @shakewingo3216
    @shakewingo3216 Рік тому +2

    Thanks for making it so easy to understand. I definitely learn a lot and gain much more confidence from this!

  • @sagarpadhiyar3666
    @sagarpadhiyar3666 6 місяців тому +1

    Best video I came across for transformer from scratch.

  • @codevacaphe3763
    @codevacaphe3763 5 місяців тому +1

    Hi, I just happened to see your video. It's really amazing; your channel is so good, with valuable information. I hope you keep this up, because I really love your content.

  • @Patrick-wn6uj
    @Patrick-wn6uj 7 місяців тому +1

    Hi Umar, thank you for all the work you are doing. Please consider making a video like this on vision transformers.

  • @Hdjandbkwk
    @Hdjandbkwk Рік тому +2

    Just want to say thank you!! This is easily one of my favorite videos on YouTube! I have watched a few videos on transformers, but none explained it as clearly as you. At first I was scared by the length of the video, but you managed to hold my attention for the full 3 hours! Following your instructions, I am now able to train my very first transformer!
    Btw, I am using the tokenizer the way you are, but looking at the tokenizer file, it looks like my tokenizer didn't split the sentences into words and is using the whole sentence as a token. Do you have any idea why? I am using a Mac, if that matters.

    • @umarjamilai
      @umarjamilai  Рік тому +1

      Hi! Thanks for your kind words! Make sure your PreTokenizer is the "Whitespace" one and that the Tokenizer is the "WordLevel" tokenizer. As a last resort, you can clone the repository from my GitHub and compare my code with yours. Have a wonderful rest of the day!

    • @Hdjandbkwk
      @Hdjandbkwk Рік тому

      I have the PreTokenizer set to Whitespace and am using the WordLevel tokenizer and trainer, but it still encodes the sentence as a whole. I did a direct swap to the BPE tokenizer and that encodes the sentences correctly; maybe there is a bug in the WordLevel tokenizer on macOS.
      Another question I have is: what determines the max context size for LLMs? Is it the d_model size? @@umarjamilai
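
      For comparison, a minimal standalone sketch (assuming the Hugging Face tokenizers library) of the WordLevel + Whitespace setup discussed in this thread; if the pre-tokenizer line is missing, each whole sentence is kept as a single token, which matches the symptom described above:

        from tokenizers import Tokenizer
        from tokenizers.models import WordLevel
        from tokenizers.pre_tokenizers import Whitespace
        from tokenizers.trainers import WordLevelTrainer

        tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = Whitespace()   # without this line, whole sentences become single tokens
        trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])

        tokenizer.train_from_iterator(["the cat sat on the mat", "another toy sentence"], trainer=trainer)
        print(tokenizer.encode("the cat sat").tokens)   # ['the', 'cat', 'sat']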

  • @angelinakoval8360
    @angelinakoval8360 11 місяців тому +1

    Dear Umar, thank you so, so much for the video! I don't have much experience in deep learning, but your explanations are so clear and detailed that I understood almost everything 😄. It will be a great help for me at my work. Wish you all the best! ❤

    • @umarjamilai
      @umarjamilai  11 місяців тому

      Thank you for your kind words, @angelinakoval8360!

  • @CathyLiu-d4k
    @CathyLiu-d4k Рік тому +1

    A really great explanation for understanding the Transformer; many thanks to you.

  • @ansonlau7040
    @ansonlau7040 7 місяців тому +1

    A big thank you for the video; it makes the transformer so easy to learn (also the explanation video) 👍👍

  • @lyte69
    @lyte69 Рік тому +1

    Hey there! I enjoyed watching that video, you did a wonderful job explaining everything, and I found it super easy to follow along. Overall, it was a really great experience!

  • @vigenisayan2343
    @vigenisayan2343 8 місяців тому +1

    It was very useful to watch. Question: what books or learning resources would you suggest to learn PyTorch deeply? Thanks

  • @michaelscheinfeild9768
    @michaelscheinfeild9768 Рік тому

    I'm enjoying the clear explanation of the Transformer coding!

  • @MrSupron00
    @MrSupron00 Рік тому

    This is excellent! Thank you for putting this together. I do have one point of confusion with how the final multi-head attention concatenation takes place. I believe the concatenation takes place on line 110, where V' = (V1, V2, ..., Vh) has shape (sequence_length, h*dk). This is intended to be multiplied by the matrix W0 of shape (h*dk, d_model) to give something of shape (sequence_length, d_model), as required. However, here you implement a linear layer, which takes the concatenated V' of shape (sequence_length, d_model) and computes W*V' + b, where the dimensions of W and b are chosen to satisfy the output dimension. This is different from multiplying directly by a predefined trainable matrix W0. Now, I can see how these are nearly the same thing, and in practice it may not matter, but it would be helpful to point out these tricks of the trade so folks like myself don't get bogged down in the subtleties. Thanks
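
    A small sketch of the point raised above (illustrative shapes, not the repository's class): an nn.Linear(d_model, d_model) applied to the concatenated heads is a learned (d_model, d_model) matrix plus a bias, so it plays exactly the role of the paper's W0; the bias term is the only extra ingredient.

      import torch
      import torch.nn as nn

      batch, seq_len, h, d_k = 2, 5, 8, 64
      d_model = h * d_k

      heads = torch.randn(batch, h, seq_len, d_k)                                 # per-head attention outputs
      concat = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)   # (batch, seq_len, h*d_k)

      w_o = nn.Linear(d_model, d_model)   # learned weight (d_model, d_model) plus a bias (d_model,)
      out = w_o(concat)                   # the concatenation multiplied by a learned W0, plus the bias
      print(out.shape)                    # torch.Size([2, 5, 512])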

  • @dapostop7384
    @dapostop7384 5 місяців тому +1

    Wow, super useful! Coding really helps me understand the process better than visuals.

  • @californiaBala
    @californiaBala Місяць тому

    This is the best one. We need to train a model, let the model observe your actions, and have it learn from you. With a physical form, a Tesla robot could take classes based on your training.

  • @goldentime11
    @goldentime11 6 місяців тому +1

    Thanks for your detailed tutorial. Learned a lot!

  • @albert4392
    @albert4392 Рік тому +1

    This is an excellent video; your explanation is so clear, and the live coding helps with understanding!
    Can you give us tips on debugging such a huge model? It is really hard to make sure the model works well.
    My tip for debugging is to print out the shape of the tensor at each step, but this only makes sure the shape is correct; there may be some logical error I miss. Thank you!

    • @umarjamilai
      @umarjamilai  Рік тому +2

      Hi! I'd love to give a golden rule for debugging models, but unfortunately, it depends highly on the architecture/loss/data itself.
      One thing you can do, before training the model on a big dataset, is to train it on a very small dataset to make sure everything is working; the model should overfit on the small dataset. For example, if instead of training on many books you train an LLM on a single book, it should hopefully be able to write sentences from that book, given a prompt.
      The second most important thing is to validate the model as the training is proceeding to verify that the quality is improving over time.
      Last but not least, use metrics to decide if the model is going in the right direction and make experiments on hyper parameters to verify assumptions, do not just make assumptions without validating them. When you have a model with billions of parameters, it is difficult to predict patterns, so every assumption must be verified experimentally.
      Have a nice day!
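
      To make the "overfit a tiny dataset first" advice concrete, a minimal self-contained sketch of the pattern (a throwaway toy model, not the transformer from the video; in practice you would run the same loop with the transformer and a handful of sentence pairs):

        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        x = torch.randn(4, 16)                 # 4 fixed samples (placeholder data)
        y = torch.randint(0, 3, (4,))          # 3 classes

        model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        loss_fn = nn.CrossEntropyLoss()

        for step in range(300):
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

        print(loss.item())   # should be close to zero; if not, something in the pipeline is broken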

  • @DatabaseAdministration
    @DatabaseAdministration 9 місяців тому

    You are one of the coolest dudes in this area. It'd be helpful if you provided a roadmap to reach your level of expertise. I'd really love to learn from you, but I can't follow it all yet. A roadmap would help so many of your subscribers.

  • @prajolshrestha9686
    @prajolshrestha9686 Рік тому +1

    I appreciate you for this explanation. Great video!

  • @OleksandrAkimenko
    @OleksandrAkimenko Рік тому +1

    You are a great professional, thanks a ton for this

  • @salmagamal5676
    @salmagamal5676 10 місяців тому +1

    I can't possibly thank you enough for this incredibly informative video

  • @skirazai7591
    @skirazai7591 10 місяців тому +2

    Great video, you are insanely talented btw.

  • @TheAwedExplorer
    @TheAwedExplorer 19 днів тому +1

    Great Explanation. Thanks👍

  • @jihyunkim4315
    @jihyunkim4315 Рік тому +1

    Perfect video!! Thank you so much. I always wondered about the detailed code and its explanation, and now I understand almost all of it. Thanks :) You are the best!

  • @JohnSmith-he5xg
    @JohnSmith-he5xg Рік тому +1

    OMG. And you also note matrix shapes in comments! Beautiful. I actually know the shapes without having to trace some variable backwards.

  • @solomonhan2235
    @solomonhan2235 26 днів тому +1

    Note: this implementation follows the 'pre-LN' version of the transformer, which differs slightly from the original transformer in the residual connection. In the original block diagram, layer normalization (LN) is applied AFTER the multi-head attention / feed-forward network, whereas this code applies the LN BEFORE the multi-head attention and feed-forward network. You can see the difference by comparing the ResidualConnection forward() code with section 3.2 of the original "Attention Is All You Need" paper. This is a valid architecture too (proposed in later papers), but it is not exactly as proposed in the original one.
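
    For readers comparing the two variants, a minimal sketch of the difference (generic modules, not the repository's exact classes): post-LN is what section 3.2 of the paper describes, pre-LN is what the ResidualConnection in the video computes.

      import torch.nn as nn

      class PostLNResidual(nn.Module):
          """Original paper: LayerNorm(x + Sublayer(x))."""
          def __init__(self, features: int, dropout: float):
              super().__init__()
              self.norm = nn.LayerNorm(features)
              self.dropout = nn.Dropout(dropout)

          def forward(self, x, sublayer):
              return self.norm(x + self.dropout(sublayer(x)))

      class PreLNResidual(nn.Module):
          """Pre-LN variant: x + Sublayer(LayerNorm(x))."""
          def __init__(self, features: int, dropout: float):
              super().__init__()
              self.norm = nn.LayerNorm(features)
              self.dropout = nn.Dropout(dropout)

          def forward(self, x, sublayer):
              return x + self.dropout(sublayer(self.norm(x)))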

  • @ChathikaGunaratne
    @ChathikaGunaratne 3 місяці тому +1

    Amazingly useful video. Thank you.

  • @gunnvant
    @gunnvant Рік тому +1

    This was really good. I understood multihead attention better with the code explanation.

  • @forresthu6204
    @forresthu6204 Рік тому +1

    At 22:39, it describes the essentials of the self-attention computation in a very clear and easy-to-understand way.

  • @aspboss1973
    @aspboss1973 Рік тому +1

    It's a really awesome video with clear explanations, and the flow of the code is very easy to understand. One question: how would you implement this transformer architecture for a question-answering model (Q/A on a very specific topic, let's say an instrument manual)?
    Thank you so much for this video!!!

  • @AyushRaj-nt3ot
    @AyushRaj-nt3ot 5 місяців тому +1

    Sir, your explanation is just beyond awesome!!! Thank you so much for creating such content. Sir, I didn't get the residual connections part. As I am from India and was working on Indic languages, I had to write more code, but that's okay. I just wish you could help me understand the beam search code, the one you also gave in the GitHub file. Also, it would be great if you could give the code for evaluating the BLEU score. I'll be really grateful to you.
    And again, thank you so much for such comprehensive content. We'd love to see more of your videos, especially on generative AI!
    P.S.: I didn't understand how you wrote it; what I've understood is that we have to take the input of the previous layer, add it to the output of the same layer, and then apply layer norm on that. Basically Add and then LayerNorm. Please help me correct myself!

  • @AdityaSharma-hk8iy
    @AdityaSharma-hk8iy 5 днів тому

    Hi, thanks for the video, it was really helpful. I do have one question, though.
    At 57:01, shouldn't the Q, K, V be encoder_output, encoder_output, x instead of x, encoder_output, encoder_output?
    If we're calculating Q@K.T, I think of that as somewhat similar to "capturing the essence of the input sentence", which would be the encoder output for both Q and K, and the V would build on top of the input sentence's "essence". Can you please elaborate on why the order of inputs is what it is? Even in the attention paper, the first two inputs (Q and K) come from the encoder output, and the last input V comes from the decoder's self-attention output.
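
    For context, a minimal sketch of the conventional cross-attention call, using PyTorch's built-in nn.MultiheadAttention rather than the video's class: in "Attention Is All You Need" (section 3.2.3), the queries come from the previous decoder layer and the keys and values come from the encoder output, which matches the (x, encoder_output, encoder_output) ordering.

      import torch
      import torch.nn as nn

      d_model, n_heads = 512, 8
      cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

      decoder_x = torch.randn(2, 7, d_model)        # decoder-side sequence (queries)
      encoder_output = torch.randn(2, 11, d_model)  # encoder memory (keys and values)

      out, _ = cross_attn(decoder_x, encoder_output, encoder_output)  # query, key, value
      print(out.shape)   # torch.Size([2, 7, 512]), one output per decoder position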

  • @texwiller7577
    @texwiller7577 7 місяців тому +1

    Doctor... you're the best!

  • @ageofkz
    @ageofkz 8 місяців тому +2

    At 29:14, in the part on multi-head attention, we multiply Q, K, V by Wq, Wk, Wv, then split them into h heads, do the scaled dot-product attention, and concatenate them again. But should we not split them first and then apply Wq_h, where Wq_h is the weight matrix for the h-th query head (and the same for K and V)? Because it seems like we just split them, apply attention, then concatenate.
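
    A minimal sketch of why "one big W_q, then split" behaves like per-head projections (illustrative shapes, not the repository's code): each head's slice of the big projection depends only on its own column block of W_q, so the heads are already independent linear maps of the full input.

      import torch

      torch.manual_seed(0)
      batch, seq_len, d_model, h = 2, 5, 512, 8
      d_k = d_model // h

      x = torch.randn(batch, seq_len, d_model)
      W_q = torch.randn(d_model, d_model)        # one big projection, as in the video

      q = x @ W_q                                # (batch, seq_len, d_model)
      q_heads = q.view(batch, seq_len, h, d_k).transpose(1, 2)   # (batch, h, seq_len, d_k)

      # head 0 is the same as projecting x with only the first d_k columns of W_q
      q_head_0 = x @ W_q[:, :d_k]
      assert torch.allclose(q_heads[:, 0], q_head_0, atol=1e-5)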

  • @juliosantisteban8452
    @juliosantisteban8452 Місяць тому

    Excellent lecture. What lib or method do you recommend to parallelise your code using more than 1 GPU?

  • @keflat23
    @keflat23 10 місяців тому +1

    what to say.. just WOW! thank you so much !!

  • @peregudovoleg
    @peregudovoleg 10 місяців тому +1

    Thank you for your straight-to-the-point, no-BS videos. Good code-alongs and commentary.
    But it looks like the positional encoding isn't exactly as in the paper. There is a power in the denominator: denominator = np.power(10000, 2*i/d). I get that you decided to use the exp+log pair for stability, but there is no mention of where the power went.
    And there is an extra layer norm after the encoder. That is, we "norm + add" (as you defined it in the video, instead of "add & norm" as per the paper, but you said "let's stick with it", which I understand), and then we norm again after the last encoder block. Like so:
    layer norm + add (in the last encoder block, inside the residual connection) + layer norm again (inside the Encoder class).
    (I could be wrong, but it looks like that.)
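
    On the positional-encoding point, a small sketch verifying that the power has not gone anywhere: the exp/log form is the same denominator 1/10000^(2i/d_model), just computed in log space for numerical stability.

      import math
      import torch

      d_model = 512
      two_i = torch.arange(0, d_model, 2).float()          # the exponents 2i

      direct   = 1.0 / torch.pow(torch.tensor(10000.0), two_i / d_model)
      log_form = torch.exp(two_i * (-math.log(10000.0) / d_model))

      print(torch.allclose(direct, log_form, atol=1e-6))   # True: identical denominators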

  • @mohsinansari3584
    @mohsinansari3584 Рік тому +1

    Just finished watching. Thanks so much for the detailed video. I plan to spend this weekend on coding this model. How long did it take to train on your hardware?

    • @umarjamilai
      @umarjamilai  Рік тому +1

      Hi Mohsin! Good job! It took around 3 hours to train 30 epochs on my computer. You can train even for 20 epochs to see good results.
      Have a wonderful day!

    • @NaofumiShinomiya
      @NaofumiShinomiya Рік тому

      @@umarjamilai What is your hardware? I just started studying deep learning a few days ago and I didn't know transformers could take this long to train.

    • @umarjamilai
      @umarjamilai  Рік тому

      @@NaofumiShinomiya Training time depends on the architecture of the network, on your hardware and the amount of data you use, plus other factors like learning rate, learning scheduler, optimizer, etc. So many conditions to factor in.

  • @ZhenjiaoDu
    @ZhenjiaoDu 8 місяців тому +1

    At 13:13 (of 2:59:23), when we build the PositionalEncoding module,
    in the line x = x + (self.pe[:,:x.shape[1],:]).requires_grad_(False), the x.shape[1] doesn't seem to be needed in this transformer model, because when we build dataset.py we pad all the sentences to the same length and then load (batch, seq_len, embedding_dim) into PositionalEncoding, where x.shape[1] is always seq_len for every item in the batch, instead of varying with the original sentence length.

    • @mittcooper
      @mittcooper Місяць тому

      @umarjamilai I have the same question. x.shape[1] in this case will always equal seq_len, so every time this will just return the entire pe tensor. Wondering if this is unique to this use-case example??
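
      A tiny sketch of the behaviour being discussed (illustrative shapes): with fixed-length padded batches the slice indeed returns the whole buffer, but keeping x.shape[1] lets the same module handle shorter inputs, e.g. a decoder sequence that grows one token at a time during greedy decoding.

        import torch

        max_seq_len, d_model = 350, 512
        pe = torch.zeros(1, max_seq_len, d_model)     # stand-in for the precomputed buffer

        padded = torch.randn(8, 350, d_model)         # padded training batch
        short  = torch.randn(1, 12, d_model)          # e.g. a partially decoded sequence

        print(pe[:, :padded.shape[1], :].shape)       # torch.Size([1, 350, 512]), the whole buffer
        print(pe[:, :short.shape[1], :].shape)        # torch.Size([1, 12, 512]), truncated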

  • @nhutminh1552
    @nhutminh1552 10 місяців тому +1

    Thank you admin. Your video is great. It helps me understand. Thank you very much.

  • @jeremyregamey495
    @jeremyregamey495 11 місяців тому +1

    I love your videos. Thank you for sharing your knowledge; I can't wait to learn more.

  • @babaka1850
    @babaka1850 5 місяців тому

    For determining the max length of the target sentence, I believe you should point to tokenizer_tgt rather than tokenizer_src: tgt_ids = tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids

  • @andybhat5988
    @andybhat5988 3 місяці тому +1

    Thanks for a great video.

  • @oborderies
    @oborderies 11 місяців тому +2

    Sincere congratulations for this fine and very useful tutorial ! Much appreciated 👏🏻

  • @Mostafa-cv8jc
    @Mostafa-cv8jc 11 місяців тому +1

    Very good video. Tysm for making this, you are making a difference

  • @godswillanosike896
    @godswillanosike896 7 місяців тому +1

    Great explanation! Thanks very much

  • @FireFly969
    @FireFly969 6 місяців тому

    Thank you Umar Jamil for this wonderful content. To be honest, as a beginner in PyTorch, I find it hard to keep track of each part and what happens in each line of code.
    I wonder what I need to know before starting one of your videos.
    I think I need to read the paper multiple times until I understand it?

  • @ebadsayed487
    @ebadsayed487 4 місяці тому

    Your video is truly amazing, thanks a lot for this. I want to train this model on a summarization task, so what changes do I need to make?

  • @nareshpant7792
    @nareshpant7792 Рік тому +1

    Thanks so much for such a great video. Really liked it a lot. I have a small query: for ResidualConnection, in the paper the equation is given as "LayerNorm(x + Sublayer(x))". In the code, we have x + self.dropout(sublayer(self.norm(x))). Why is it not self.norm(self.dropout(x + sublayer(x)))?

  • @zhuxindedongchang4229
    @zhuxindedongchang4229 7 місяців тому

    Hello Umar, really impressive work on the Transformer. I have followed your steps in this experiment. One small thing I am not sure about: when you compute the loss, you use nn.CrossEntropyLoss(), and this method already applies the softmax itself. As its documentation says: "The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general)." But the project method of the built Transformer model applies a softmax. I wonder if we should only output the logits, without this softmax, to fit the nn.CrossEntropyLoss() method? Thank you anyway.
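
    For reference, a minimal sketch of the usual arrangement (illustrative sizes, not the repository's training loop): nn.CrossEntropyLoss expects unnormalized logits and applies log-softmax internally, so the projection layer can return raw logits and any softmax can be reserved for inference-time decoding.

      import torch
      import torch.nn as nn

      vocab_size, batch, seq_len, d_model = 1000, 2, 7, 512
      proj = nn.Linear(d_model, vocab_size)           # projection layer returning raw logits
      loss_fn = nn.CrossEntropyLoss(ignore_index=0)   # e.g. ignore a [PAD] id of 0

      decoder_out = torch.randn(batch, seq_len, d_model)
      labels = torch.randint(1, vocab_size, (batch, seq_len))

      logits = proj(decoder_out)                                     # (batch, seq_len, vocab_size)
      loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))   # no explicit softmax here
      print(loss.item())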

  • @toxicbisht4344
    @toxicbisht4344 10 місяців тому +1

    Amazing explanation
    Thank you for this

  • @divyanshbansal2321
    @divyanshbansal2321 9 місяців тому +1

    Thank you mate. You are a godsend!

  • @txxie
    @txxie 11 місяців тому +1

    This video is great! But can you explain how you convert the formula of positional embeddings into log form?
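
    Regarding the question above, the conversion is just the identity a^b = e^(b ln a) applied to the denominator; in LaTeX:

      \frac{1}{10000^{2i/d_{\text{model}}}}
        = \exp\!\left(-\frac{2i}{d_{\text{model}}}\,\ln 10000\right)
        = \exp\!\left(2i \cdot \frac{-\ln 10000}{d_{\text{model}}}\right)

    which is why the divisor can be computed as an exponential of 2i times (-log(10000)/d_model), with sin/cos then applied to position times that divisor; mathematically it is the same denominator, just evaluated in log space.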

  • @rafa_br34
    @rafa_br34 5 місяців тому

    Great video! I'm wondering, is there any reason to save the positional encoding vector? I don't see why you would need to save it since it seems to always be the same value considering the init parameters don't change.

  • @subusrable
    @subusrable 7 днів тому +1

    seriously awesome

  • @shengjiadiao3166
    @shengjiadiao3166 4 місяці тому +1

    the contents are crazy !!!!

  • @daviderizzotti2724
    @daviderizzotti2724 8 місяців тому +1

    Why at 51:08 are you applying an extra normalization at the end of the whole encoder pass?
    The tutorial has been amazing so far ;)

  • @AdityaAgarwal-v3b
    @AdityaAgarwal-v3b Рік тому +1

    One of the best videos; thanks a lot for it.

  • @kindahall666
    @kindahall666 4 місяці тому

    Thank you for such a great video. However, it seems that the softmax layer after the decoder is not included in your code. I tried implementing it myself, but after adding the final softmax, the loss becomes extremely difficult to converge and decreases very slowly. How can this be resolved?

  • @DavideStortoni
    @DavideStortoni Місяць тому

    Great video! Where is the residual connection calculated?

  • @bhuvandwarasila
    @bhuvandwarasila 19 днів тому

    I believe I'm going to have to code and understand all of this to be able to replicate it for other use cases! At the moment I am not able to follow the code, as I am new to Python! I'm going to stick with this and understand it no matter how long it takes! I really did want to get into the vision transformer video, but I believe I should master this first!

  • @minister1005
    @minister1005 Рік тому

    This is definitely one of the best videos for learning about Transformers. Is it normal for me (as a beginner in AI) to watch this 4 or 5 times, struggling for weeks to understand it fully? 😅
    For example, I had to go study the Hugging Face tutorial to understand the tokenizer part, and I wonder if I should learn it 100% or just enough to implement it.

    • @umarjamilai
      @umarjamilai  Рік тому +6

      Hi! At the beginning it's normal to learn things "just to make things work". Most people who start coding use libraries without knowing how they work internally. As time goes on, you will naturally find yourself curious about how things work at a "lower level" and will explore and study them more deeply.
      The goal in life is not to understand everything today, but rather to understand a little more than you did yesterday. Have a nice day!

  • @user-ul2mw6fu2e
    @user-ul2mw6fu2e 10 місяців тому +1

    Wow, your explanation is amazing.

  • @linlinpan3150
    @linlinpan3150 3 місяці тому

    For the Encoder/Decoder code - why is the last step in these a normalization layer? We wrote the ResidualConnection layers with a pre-normalization step (instead of post-normalization as in the original Transformer paper).

  • @kailazarov107
    @kailazarov107 Рік тому

    Really great video - learned a lot. Your inference notebook uses dataset batching, but how can you build inference with user-typed sentences?
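
    Regarding inference on a user-typed sentence, a hedged sketch of the usual greedy-decoding loop; encode_source and decode_step below are placeholders standing in for the trained tokenizers and model (not real functions from the repository), so only the loop structure is meant to carry over:

      import torch

      sos_id, eos_id, max_len, vocab_size = 1, 2, 20, 100   # illustrative ids and sizes

      def encode_source(sentence: str) -> torch.Tensor:
          # placeholder: real code would tokenize the sentence and add [SOS]/[EOS]/[PAD]
          return torch.randint(3, vocab_size, (1, 10))

      def decode_step(src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
          # placeholder: real code would run the decoder + projection on the last position
          return torch.randn(1, vocab_size)

      src_ids = encode_source("I love my dog")       # any user-typed sentence
      tgt_ids = torch.tensor([[sos_id]])             # start with [SOS] only

      while tgt_ids.size(1) < max_len:
          next_id = decode_step(src_ids, tgt_ids).argmax(dim=-1, keepdim=True)   # greedy choice
          tgt_ids = torch.cat([tgt_ids, next_id], dim=1)
          if next_id.item() == eos_id:
              break

      print(tgt_ids)   # token ids to map back to text with the target tokenizer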

  • @neelarahimi1053
    @neelarahimi1053 2 місяці тому +1

    Great video! Thanks :)

  • @cicerochen313
    @cicerochen313 Рік тому +2

    Awesome! Highly appreciate it. Super awesome! Thank you very much.

  • @panchajanya91
    @panchajanya91 10 місяців тому

    Hello Umar, thank you very much for this video. This is one of the best. I took inspiration and tried to implement the paper "Attention Is All You Need" for English-French translation. I used the opus_books en-fr dataset to train the transformer. I ran it for 40 epochs and my batch size was 64. In the end the loss converged to around 3.6. Between epoch 30 and epoch 40, for 10 epochs, the step loss for each mini-batch stayed around 3.6. The model's performance after 40 epochs on the test set was not good. I would be thankful if you could share some details, like how many epochs you trained for and what your step loss was towards the end of training. You probably used the en-it dataset; still, it would be helpful for me to get an idea. Thank you very much.