BERT Research - Ep. 8 - Inner Workings V - Masked Language Model

COMMENTS • 71

  • @sahibsingh1563
    @sahibsingh1563 4 years ago +11

    What an amazing work you have put into making this series.
    Absolutely highly recommendable.
    I can now say:
    "BERT RESEARCH by Chris McCormick" is all you need.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +1

      Wow! Thanks for that. And congrats on making it through the whole series--you did it!

  • @kumudayanayanajith6427
    @kumudayanayanajith6427 1 year ago

    Dear Chris,
    This explanation saves thousands of hours of BERT research for its further development. You give an exceptional review of the core basics of the model's architecture. Thank you for the enormous effort you put into making such a detailed video series and the book that grew out of it. Keep up this phenomenal work.
    Cheers!

  • @simonnick1585
    @simonnick1585 4 years ago +25

    Thank you very much Chris, you helped us a lot to understand BERT and the transformer model. Your Colab notebook on classification is also superb. Best wishes from machine learning students at UCL :)

  • @syedhasany1809
    @syedhasany1809 4 years ago +6

    I followed the entire series, and it was highly enjoyable. Hopefully you'll continue with this format.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thanks Syed, I'm glad you enjoyed it! Definitely more to come.

  • @evanozaroff4742
    @evanozaroff4742 1 year ago

    Just finished this series, it has been extremely helpful. Thanks Chris!

  • @zd676
    @zd676 4 years ago +1

    Amazing series! What's equally valuable here is that the way you learn and teach is a big inspiration for us all! Please keep up with the awesome content!

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +1

      Thank you so much! I hope to circle back to your comment on the other video -- trying to catch up on my backlog of questions! :-O

  • @goelnikhils
    @goelnikhils 2 years ago

    Amazing, amazing work. Too good, and so much hard work. So grateful to you, as this BERT inner workings series clarified so many of my fundamentals.

  • @oostopitre
    @oostopitre 4 years ago

    Thank you so much Chris, for not just going into the concepts, intuition, and implementation details but also the context around how these models evolved with respect to each other. Now these things make so much more sense. I feel like I can now have much more meaningful conversations with peers when talking about these models and concepts.

  • @kayayala9091
    @kayayala9091 3 years ago

    Thank you Chris! It's refreshing to see such a thorough breakdown of an algorithm with code. Your careful reading of the paper and others' breakdowns was a huge help!

  • @abhishek-shrm
    @abhishek-shrm 2 years ago +1

    Great series, Chris. Thanks for making it.

  • @steventhanhlee
    @steventhanhlee 4 years ago +1

    Big thanks to Chris. Through this series, I came to understand the BERT model deeply. Before, I had wasted a lot of time trying to find information about BERT, but I couldn't find any clear sources. Luckily, you're here :)). Hopefully, you will keep producing more valuable research on AI and ML techniques like this. Again, thank you so much.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thanks Thành! I appreciate the encouragement, and I'm so glad it helped clear up confusion.

  • @bhargavasavi
    @bhargavasavi 4 years ago

    Your way of explaining concepts is really nice... I am not a PyTorch person, but you explained the code so well that even I could understand it... I enjoyed this series... Looking forward to more of your videos.

  • @nikhilkumarghanghor524
    @nikhilkumarghanghor524 4 years ago

    Thanks a lot Chris, it was a really helpful series for understanding BERT. Hope you come up with some more series like this in the future.

  • @DarkerThanBlack89
    @DarkerThanBlack89 4 years ago

    Thank you again Chris. I just finished it and you have done an amazing job! Not only did I benefit from it, but I actually enjoyed it a lot. You have a great personality for teaching.

  • @pratik6447
    @pratik6447 3 years ago +1

    The series was awesome. It helped me a lot to understand BERT. I see that using the [CLS] token we can get a sentence or document embedding. Can you also show us how we can fetch that embedding vector for our sentence encoding?
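
    A minimal sketch of what fetching that vector could look like with the Hugging Face library (the model name, output indexing, and example sentence are assumptions, not code from the video):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    model.eval()

    inputs = tokenizer("An example sentence.", return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs[0] is the last hidden state, shape (batch, seq_len, hidden).
    # The [CLS] token sits at position 0, so this is one 768-dim vector per sentence.
    cls_embedding = outputs[0][:, 0, :]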

  • @LourencoVazPato
    @LourencoVazPato 4 years ago +1

    Thank you Chris, this series was really helpful. I really enjoyed your easy-to-understand style.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thanks, Lourenço! Congrats on making it all the way through :)

  • @thebasant
    @thebasant 4 years ago

    Hi Chris, I really enjoyed your BERT series. I tried hard to understand BERT, but I could not understand it well before reaching your videos. Fortunately, I found them. Many thanks to you!!

  • @ax5344
    @ax5344 4 years ago

    Could you illustrate more on why the fine-tuning works? I think I understand the two pre-training tasks: by learning to predict what the masked word is and whether the second sentence is a valid follow-up, we obtain good word embeddings. But why does the same model architecture work for so many downstream tasks? With word2vec, we need to feed the embeddings into new architectures in order to do classification or NER; why, in the case of BERT, is nothing needed for downstream tasks? We only need to organize our data in a certain form and that's all. How come? NER, classification, and QA look so different...
    Thanks so much for your clear and thoughtful lectures. I've read quite a bit on the topic. This one is the clearest! Kudos to your great work!
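
    One way to see why so little changes between tasks: fine-tuning only bolts a tiny task-specific head onto the same pre-trained body, and everything below that head is reused. A rough sketch for classification (the class and variable names here are made up for illustration):

    import torch.nn as nn
    from transformers import BertModel

    class BertClassifier(nn.Module):
        def __init__(self, num_labels=2):
            super().__init__()
            # The entire pre-trained BERT body is reused unchanged.
            self.bert = BertModel.from_pretrained('bert-base-uncased')
            # The only new parameters: one linear layer on top of [CLS].
            self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

        def forward(self, input_ids, attention_mask):
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            cls_vector = outputs[0][:, 0, :]   # contextual [CLS] embedding
            return self.head(cls_vector)       # task-specific logits

    NER would swap in a per-token head and QA a start/end-span head, so the task differences live almost entirely in those last couple of lines.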

  • @jianhuiben5272
    @jianhuiben5272 4 years ago +1

    Hi Chris, I really like your videos! One addition to the BERT tokenization part in the previous videos: you were adding padding with Keras. Actually, the huggingface library can also do it via tokenizer.batch_encode_plus(*args). It automates the token id conversion, padding/truncation, and batching in one step (see the sketch after this thread).

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Thanks, jianhui! I recently updated a number of my Notebooks to use the `encode_plus` function; for example: colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX I'll have to check out the "batch" variant of the function, though!
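
      As a quick sketch of the batched call jianhui mentions (argument names may differ slightly between library versions):

      from transformers import BertTokenizer

      tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

      sentences = ["The first sentence.", "A somewhat longer second sentence."]
      encoded = tokenizer.batch_encode_plus(
          sentences,
          max_length=32,
          padding='max_length',   # pad every sequence out to max_length
          truncation=True,        # truncate anything longer
          return_tensors='pt',    # return PyTorch tensors
      )
      input_ids = encoded['input_ids']            # shape: (2, 32)
      attention_mask = encoded['attention_mask']  # 1 for real tokens, 0 for padding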

  • @FelipeValen
    @FelipeValen 3 years ago

    Thanks a lot, you did a great job explaining the model through the series.

  • @tanmeyrawal644
    @tanmeyrawal644 2 years ago +1

    Thanks man, good quality and crisp content.

  • @yilinghe9178
    @yilinghe9178 4 years ago

    Thank you very much Chris! I finished watching this channel (with some interruptions), and it really helped me a lot!

  • @red__guy
    @red__guy 1 year ago

    I DID IT!!!
    I have a vague understanding
    Thank you so much!

  • @lxkp3233
    @lxkp3233 4 years ago

    Thank you very much for posting these videos about BERT. I followed the whole series, and it helped me a lot to understand the model. If possible, I would like to suggest talking about the RoBERTa model. Best wishes.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +2

      Hi Lis, thank you! I agree, a video or two on RoBERTa definitely seems warranted. It's in the queue! :)

  • @hadeelsaadany1965
    @hadeelsaadany1965 4 years ago +2

    Thank you Chris for the easy-to-understand videos and hands-on code. Could you possibly post a video on using BERT with languages other than English? I tried the multilingual pretrained BERT embeddings on an Arabic dataset (following your Fine-tuning part 3 code), but it seems it needs some tweaks. I hope it can work on other languages as well as it does with English. Thank you again for your great videos, and looking forward to any multilingual applications in future videos.

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +1

      Thanks Hadeel, I hadn’t planned on that, but I could imagine there’s demand for it. I’ll give it some thought!

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +1

      Hi Hadeel,
      So Nick and I did end up digging into multilingual BERT--it’s a really interesting topic!
      We put together a Notebook on using the (relatively new) XLMR model for this--the Notebook supports 15 different languages (including Arabic!). The Notebook’s available to purchase on my website here: bit.ly/3kCBvHB
      I think it turned out really well--thanks again for the inspiration!
      Chris

  • @kleemc
    @kleemc 4 years ago

    This is a great series. I finally understand what's under the hood of BERT. Thank you.
    One question I have is why the position and segment embeddings are added to the token embeddings instead of being concatenated. I'm thinking that once the vectors are added, the position and segment information is "lost" in the result, isn't it? That's why I think concatenation makes more sense.
    And I'm also not sure why they have to use such a complex way to represent a position. Since there are only 512 positions, it could easily be represented as an extra dimension in the vector.
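
    For reference, the composition being asked about is an element-wise sum of three equally sized vectors, not a concatenation; a minimal sketch with made-up tensors (shapes assumed for bert-base):

    import torch

    hidden_size = 768
    seq_len = 10  # illustrative sequence length

    token_emb = torch.randn(1, seq_len, hidden_size)     # looked up per WordPiece token
    position_emb = torch.randn(1, seq_len, hidden_size)  # one learned vector per position
    segment_emb = torch.randn(1, seq_len, hidden_size)   # one vector for segment A or B

    # The input to the first encoder layer keeps the same shape; nothing is concatenated.
    layer_input = token_emb + position_emb + segment_emb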

  • @WildBoy105
    @WildBoy105 4 years ago

    Thank you very much Chris, really appreciated it. I subscribed and also can't wait for your NER video using BERT :)

  • @razor420100
    @razor420100 4 years ago

    Thanks Chris for this series. It's really been helpful for understanding the underpinnings of BERT.
    Is there a residual and layer norm video? I couldn't find one in your library.

  • @syedhamza3314
    @syedhamza3314 1 year ago

    Hi Chris, absolutely amazing series on Transformers. I have a question regarding how transformers handle variable-length inputs. Suppose I set max_length for my sequences to 32 and feed in the input_ids and attention_mask for only 32 tokens during training, where some of those tokens can be padding tokens since each sequence won't be exactly 32 tokens long. Now, for BERT the default max_length is 512 tokens, so my question is: does the transformer implicitly add 512-32 padding tokens so that MHA is computed over 512 tokens (which it would then not attend to, given the padding token IDs)? If that's the case, are we then not updating the parameters directly attached to the remaining 512-32 positional vectors?
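
    A small sketch, assuming the standard Hugging Face behaviour, of what the shapes look like at max_length=32; only the 32 positions that are actually fed in are processed, so nothing is silently padded out to 512:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    enc = tokenizer("a short sequence", max_length=32, padding='max_length',
                    truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])

    # Attention is computed over 32 positions only; padded slots are masked out,
    # and position embeddings 32..511 are never looked up in this forward pass.
    print(out[0].shape)  # torch.Size([1, 32, 768])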

  • @parikshitagarwal3901
    @parikshitagarwal3901 4 years ago

    Thanks a lot Chris, I really appreciate your effort in making this valuable content on BERT. Just a request: could you please upload a video on pre-training multilingual BERT on any one language, to make it robust for that language?

  • @nicolasmontes3123
    @nicolasmontes3123 4 years ago +1

    Hi, excellent tutorial, but I have one important question. In the training step of the masked language model, we construct the embedding of the "masked" token using the embeddings of the contextual words, right? Then with a softmax layer we predict the "masked" word.
    If we construct the "masked" embedding from the contextual tokens, we would need to calculate the dot product of the query of the "masked" embedding with the key of each contextual token. My question is: how can we calculate the query of the "masked" token if we don't know its input embedding (because we "masked" it intentionally)?
    For example, at minute 10:39, when the model tries to calculate the embedding of the "masked" token, in order to calculate the attention weight for the word "like" we need to know the key vector of the word "like" and the query vector of the "masked" token... but what is the embedding of the "mask" token? How can we calculate the query vector of the "masked" word if we don't know the embedding of the "masked" token?

    • @TOpCoder100
      @TOpCoder100 4 years ago

      Nicolas Montes: I ran into the same question and I'm still not sure, but it looks like the model projects an embedding for the ["MASK"] token itself, and when it comes to finally predicting the probability distribution over the vocabulary, the MLM classifier takes the surrounding context (words) into account. Why? Because of the way BERT was trained (MLM). Still unsure, though. Better double-check it.
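
      One detail that may help with this thread: [MASK] is an ordinary entry in the vocabulary with its own learned input embedding, so its query, key, and value vectors are computed from that embedding just like any other token's. A small sketch (attribute names assume the Hugging Face BertModel):

      from transformers import BertTokenizer, BertModel

      tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
      model = BertModel.from_pretrained('bert-base-uncased')

      mask_id = tokenizer.mask_token_id                          # 103 for bert-base-uncased
      mask_row = model.embeddings.word_embeddings.weight[mask_id]
      print(mask_row.shape)                                      # torch.Size([768])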

  • @muckvix
    @muckvix 3 years ago

    20:02 The exact mechanism for the leakage is this. Suppose you have sequence length 3. You could try to predict token 1 using only tokens 2, 3; token 2 using only tokens 1, 3; and token 3 using only tokens 1, 2. However, that only works if you have a single attention layer. If you have 2 or more attention layers, then the information leaks. For example, consider token 1 output from the transformer layer 2. Sure, you masked input 1. But input 1 was used to produce token 2 and token 3 outputs in transformer layer 1. Token 1 layer 2 prediction takes as an input token 2 and token 3 outputs from layer 1, each of which contains information about token 1 input.

  • @ramakantshakya5478
    @ramakantshakya5478 2 years ago

    The series is really amazing, learned a lot 👍

  • @noorulamin9515
    @noorulamin9515 4 years ago +1

    Great work :-)!! Please also make videos on OpenAI GPT-2.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 4 years ago

    wow, this is a great series.

  • @ais3153
    @ais3153 3 years ago

    GREAT! Thank you!... Could you add another payment method to buy BERT collections?

  • @johngrabner
    @johngrabner 3 years ago

    Excellent video. Do you have any insight on how BERT deals with pre-training of words that are represented/encoded as multiple tokens?

  • @themrchappi1
    @themrchappi1 4 years ago

    Very well done series! Thank you!

  • @clementolivier777
    @clementolivier777 4 years ago

    Thank you Chris, amazing content 💪

  • @junbowang9075
    @junbowang9075 4 years ago

    Hi Chris. Thanks for all the great videos on BERT! I really appreciate them and have learned a lot from them.
    I have one question. I am applying the following code from your Colab document:
    with torch.no_grad():
        last_hidden_states = model(input_ids, attention_mask)
    However, my input_ids contain more than 10,000,000 sentences. This creates a huge data frame and can take more than 300 GB of memory. I realized that we only need last_hidden_states[0][:,0,:].numpy(), which is a much smaller portion of the whole data frame. Thus, my question is: is it possible to return only last_hidden_states[0][:,0,:].numpy() from the model? Thanks in advance!
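
    A sketch of one way to avoid holding the full hidden-state tensor in memory: run the model in batches and keep only the [CLS] row from each batch (the batch size is assumed, and the input_ids, attention_mask, and model variables are taken from the question above):

    import numpy as np
    import torch

    cls_vectors = []
    batch_size = 256  # assumed; tune to your GPU/CPU memory

    with torch.no_grad():
        for start in range(0, len(input_ids), batch_size):
            batch_ids = input_ids[start:start + batch_size]
            batch_mask = attention_mask[start:start + batch_size]
            last_hidden = model(batch_ids, attention_mask=batch_mask)[0]
            # Keep only the [CLS] vector from each sequence and drop the rest.
            cls_vectors.append(last_hidden[:, 0, :].cpu().numpy())

    cls_vectors = np.concatenate(cls_vectors, axis=0)  # shape: (num_sentences, 768)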

  • @myonlineschool4759
    @myonlineschool4759 4 years ago

    Thank you so much Chris!

  • @fatemeh23
    @fatemeh23 4 years ago

    Hello Chris! Thank you for your great videos! They helped me a lot!
    I have a question: to find out what label the model is assigning to a data point, we use the argmax of the logits. If the bigger number is in column 0, then the label is 0, and vice versa. How can I find the probability that these labels are correct? I mean, suppose we were using a sigmoid function. It gives a number between 0 and 1, and if that number is close to 0.5 it means the model is not so sure about the label of the corresponding data point. Here, I was thinking about checking how close the numbers in columns 0 and 1 are, but I didn't really find a good metric for that.
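
    One common way to turn the two logits into a confidence score is to push them through a softmax; a minimal sketch, assuming `logits` is the (batch, 2) tensor returned by the classifier:

    import torch
    import torch.nn.functional as F

    probs = F.softmax(logits, dim=1)              # each row now sums to 1.0
    predicted_label = torch.argmax(probs, dim=1)  # same labels as argmax of the logits
    confidence = probs.max(dim=1).values          # values near 0.5 mean the model is unsure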

  • @teetanrobotics5363
    @teetanrobotics5363 4 years ago

    Thank you Chris for this amazing course on BERT. Could you also make similar ones for RBMs, VAEs, and GANs?

  • @ugurkaraaslan9285
    @ugurkaraaslan9285 3 years ago

    Thank you very much.

  • @prakashkafle454
    @prakashkafle454 3 years ago

    If the number of words in a sentence is more than 512, how do we use BERT? For example, in news classification we need to pass the whole news article to train with BERT, but it contains more than 512 words. How do we handle that issue? Watched the entire series but still not clear about that 😃
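
    A sketch of the simplest workaround, truncating to BERT's 512-token limit (the `long_news_article` variable is assumed); splitting the article into overlapping 512-token chunks and pooling the chunk predictions is the other common option:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    encoded = tokenizer(
        long_news_article,     # assumed: a string longer than 512 WordPiece tokens
        max_length=512,
        truncation=True,       # keep only the first 512 tokens
        padding='max_length',
        return_tensors='pt',
    )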

  • @IbrahimSobh
    @IbrahimSobh 3 years ago

    Thank you very much. I believe that GPT is based on decoders, not encoders (18:02).

  • @mohamedramadanmohamed3744
    @mohamedramadanmohamed3744 4 years ago

    Thank you Chris

  • @nikhilgjog
    @nikhilgjog 4 years ago

    Could you create a tutorial showing how to do multi-level classification on text data where the classes are unbalanced? Moreover, there are some numerical inputs along with the text. For example, we want to classify whether a customer liked the movie, was neutral, or didn't like the movie, and we have the customers' text reviews and their age.

  • @anunaysanganal
    @anunaysanganal 4 years ago

    I guess you made a mistake in discussing the architecture of GPT: it has the decoder layers, not the encoder layers, of the transformer. Otherwise great series, learnt a lot!

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago

      Hi Anunay, thanks for your comment! In this video, I’m actually referring to the original GPT, and I think you’re thinking of their newer GPT-2 model from last year.

  • @peterpirog5004
    @peterpirog5004 4 years ago

    Thank you for your videos. Would it be possible to do a video on how to train your own transformer from scratch if you have a language corpus as a huge txt file?

  • @pardisranjbar-noiey2596
    @pardisranjbar-noiey2596 4 years ago

    Thank you!

  • @linduine1054
    @linduine1054 3 years ago

    8:36 here comes the doggie again!

  • @galabpokharel8414
    @galabpokharel8414 4 years ago +1

    Damn, god bless you....

    • @ChrisMcCormickAI
      @ChrisMcCormickAI  4 years ago +2

      Haha, thank you. It can be so hard to find good explanations for this stuff!

    • @galabpokharel8414
      @galabpokharel8414 4 years ago +1

      @@ChrisMcCormickAI I started spending more time with BERT.
      Your videos gave me a kick start earlier.
      Quality content!