What amazing work you have put into making this series
Absolutely, highly recommendable
I can now say
"BERT RESEARCH by Chris McCormick" is all you need.
Wow! Thanks for that. And congrats on making it through the whole series--you did it!
Dear Chris,
This explanation saves thousands of hours of BERT research for its further developments. You give an exceptional review into the core basics of the model's architecture. Thank you for the enormous effort you put into making such a detailed video series and the book that came out of it. Keep up this phenomenal work.
Cheers!
Thank you very much Chris, you helped us a lot to understand BERT and the transformer model. Your colab notebook on classification is also superb. Best wishes from machine learning students at UCL :)
Thanks! Is that University College London?
ChrisMcCormickAI yes exactly
I followed the entire series, and it was highly enjoyable. Hopefully you'll continue with this format.
Thanks Syed, I'm glad you enjoyed it! Definitely more to come.
Just finished this series, it has been extremely helpful. Thanks Chris!
Amazing series! What's equally valuable here is that the way you learn and teach is a big inspiration for us all! Please keep up the awesome content!
Thank you so much! I hope to circle back on your comment on the other video -- trying to catch up on my backlog of questions! :-O
Amazing, amazing work. So good, and so much hard work. So grateful to you, as this BERT inner-workings series clarified so many of my fundamentals.
Thank you so much Chris, for not just going into the concepts, intuition, and implementation details but also the context around how these models evolved with respect to each other. Now these things make so much more sense. I feel like I can now have much more meaningful conversations with peers when talking about these models and concepts.
Thank you Chris! It's refreshing to see such a thorough breakdown of an algorithm with code. Your careful reading of the paper and others' breakdowns was a huge help!
Great series, Chris. Thanks for making it.
Thanks, glad it was helpful!
Big thanks to Chris. Through this series, I gained a deep understanding of the BERT model. Before, I wasted a lot of time trying to find information about BERT but couldn't find any clear sources. Luckily, you're here :)). Hopefully you'll keep making more valuable content about AI and ML techniques like this. Again, thank you so much.
Thanks Thành! I appreciate the encouragement, and I'm so glad it helped clear up confusion.
Your way of explaining concepts is really nice... I am not a PyTorch person, but you explained the code so well that even I could understand it... I enjoyed this series... Looking forward to more of your videos!
Thanks a lot Chris, it was a really helpful series for understanding BERT. Hope you come up with more series like this in the future.
Thank you again Chris. I just finished it and you have done an amazing job! Not only did I benefit from it, I actually enjoyed it a lot. You have a great personality for teaching.
The series was awesome. It helped me a lot to understand BERT. I see that using the [CLS] token we can get a sentence or document embedding. Can you also show us how we can fetch that embedding vector for our sentence encoding?
Thank you Chris, this series was really helpful. I really enjoyed your easy to understand style.
Thanks, Lourenço! Congrats on making it all the way through :)
Hi Chris, I really enjoyed your BERT series. I tried hard to understand BERT, but I couldn't understand it well until I found your videos. Many thanks to you!!
Could you elaborate on why fine-tuning works? I think I understand the two pre-training tasks: by learning to predict what the masked word is and whether the second sentence is a valid follow-up, we obtain good word embeddings. But why does the same model architecture work with so many downstream tasks? With word2vec, we need to feed the embeddings into new architectures in order to do classification or NER; why, in the case of BERT, is nothing needed for downstream tasks? We only need to organize our data in a certain form, and that's all. How come? NER, classification, and QA look so different...
Thanks so much for your clear and thoughtful lectures. I've read quite a bit on the topic. This one is the clearest! Kudos to your great work!
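To illustrate the "nothing is needed" part of the question above: fine-tuning does add one thing per task, just a very small one -- a task-specific output layer on top of the same encoder. Because BERT outputs a contextual 768-dim vector for every token, a single linear layer on [CLS] gives sentence classification, and the same kind of layer applied per token gives NER. A minimal sketch in PyTorch (random tensors stand in for BERT's output; the sizes are just bert-base defaults):

```python
import torch
import torch.nn as nn

hidden_size, num_labels, num_tags = 768, 2, 9
batch, seq_len = 4, 32

# Stand-in for BERT's output: one 768-dim contextual vector per token.
# (A real model would produce this from input_ids.)
encoder_output = torch.randn(batch, seq_len, hidden_size)

# Sentence classification: one linear layer on the [CLS] vector (position 0).
cls_head = nn.Linear(hidden_size, num_labels)
sentence_logits = cls_head(encoder_output[:, 0, :])   # (batch, num_labels)

# NER: the same kind of linear layer, applied to *every* token vector.
ner_head = nn.Linear(hidden_size, num_tags)
token_logits = ner_head(encoder_output)               # (batch, seq_len, num_tags)

print(sentence_logits.shape, token_logits.shape)
```

The heavy lifting is all in the pre-trained encoder; the task heads are tiny, which is why fine-tuning only needs the data organized in the right shape.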
Hi Chris, I really like your video! One addition to the BERT tokenization part in the previous videos: you were adding padding with Keras. Actually, the Hugging Face library can also do it with tokenizer.batch_encode_plus(*args). It automates the token-id conversion, padding/truncating, and batching in one step.
Thanks, jianhui! I recently updated a number of my Notebooks to use the `encode_plus` function; for example: colab.research.google.com/drive/1pTuQhug6Dhl9XalKB0zUGf4FIdYFlpcX I'll have to check out the "batch" variant of the function, though!
Thanks a lot, you did a great job explaining the model through the series.
Thanks man, good quality content and crisp
I appreciate that, thanks!
Thank you very much Chris! I finished watching this whole series, and it really helped me a lot!
I DID IT!!!
I have a vague understanding
Thank you so much!
Thank you very much for posting these videos about BERT. I followed the whole series, and it helped me a lot to understand the model. If possible, I would like to suggest talking about the RoBERTa model. Best wishes.
Hi Lis, thank you! I agree, a video or two on RoBERTa definitely seems warranted. It's in the queue! :)
Thank you Chris for the easy-to-understand videos and hands-on code. Can you possibly post a video on using BERT with languages other than English? I tried the BERT multilingual pretrained embeddings on an Arabic dataset (following your Fine-tuning part 3 code), but it seems to need some tweaks. I hope it can work on other languages as well as it does on English. Thank you again for your great videos, and I'm looking forward to any multilingual applications in future videos!
Thanks Hadeel, I hadn’t planned on that, but I could imagine there’s demand for it. I’ll give it some thought!
Hi Hadeel,
So Nick and I did end up digging into multilingual BERT--it’s a really interesting topic!
We put together a Notebook on using the (relatively new) XLMR model for this--the Notebook supports 15 different languages (including Arabic!). The Notebook’s available to purchase on my website here: bit.ly/3kCBvHB
I think it turned out really well--thanks again for the inspiration!
Chris
This is a great series. I finally understand what's under the hood of BERT. Thank you.
One question I have is: why are the position and segment embeddings added to the token embeddings instead of being concatenated? I'm thinking that once the vectors are added, the position and segment information is "lost" in the result, isn't it? That's why I think concatenation makes more sense.
And I'm also not sure why they have to use such a complex way to represent a position. Since there are only 512 positions, it could be represented easily as an extra dimension in the vector.
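For anyone who wants to poke at the addition question, here is a minimal sketch of BERT-style input embeddings in PyTorch (untrained `nn.Embedding` tables with bert-base sizes standing in for the real learned ones, and made-up token ids). Note all three tables map into the *same* 768-dim space, so element-wise addition keeps the width fixed:

```python
import torch
import torch.nn as nn

vocab_size, max_pos, n_segments, hidden = 30522, 512, 2, 768

tok_emb = nn.Embedding(vocab_size, hidden)
pos_emb = nn.Embedding(max_pos, hidden)
seg_emb = nn.Embedding(n_segments, hidden)

input_ids = torch.tensor([[101, 2023, 2003, 102]])        # hypothetical ids
positions = torch.arange(input_ids.size(1)).unsqueeze(0)  # 0, 1, 2, 3
segments  = torch.zeros_like(input_ids)                   # all "sentence A"

# BERT element-wise *adds* the three embeddings: 768 dims in, 768 dims out.
x = tok_emb(input_ids) + pos_emb(positions) + seg_emb(segments)
print(x.shape)  # torch.Size([1, 4, 768])
```

Concatenation would grow the input width and force a different hidden size (or an extra projection), whereas addition keeps every layer at 768; since the position and segment tables are learned, the network can carve out whatever subspace it needs for that information rather than it being "lost."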
Thank you very much Chris, really appreciated it. I subscribed and also can't wait for your NER video using BERT :)
Thanks Chris for this series. It's really been helpful for understanding the underpinnings of BERT.
Is there a residual-and-layer-norm video? I couldn't find one in your library.
Hi Chris, absolutely amazing series on Transformers. I have a question about how transformers handle variable-length inputs. Suppose I set max_length for my sequences to 32 and feed the input_ids and attention_mask for only 32 tokens during training (some of which may be padding tokens, since not every sequence is exactly 32 long). Now, BERT's default max_length is 512 tokens, so my question is: does the transformer implicitly add 512-32 padding tokens and compute MHA over 512 tokens (while not attending to the padded token IDs)? If that's the case, are we then not updating the parameters directly attached to the remaining 512-32 positional vectors?
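On the question above: in the usual implementation the model never pads out to 512. Attention is computed only over the sequence length actually fed in, and the attention mask is turned into a large negative bias on the padded positions before the softmax, so they receive ~0 weight. A minimal sketch of that masking step (random scores stand in for real query-key products, and the -10000 bias follows the common BERT-style convention):

```python
import torch

seq_len = 32
scores = torch.randn(1, seq_len, seq_len)   # raw attention scores (one head)
attention_mask = torch.ones(1, seq_len)
attention_mask[0, 28:] = 0                  # last 4 tokens are padding

# Padding positions get a large negative bias, so softmax gives them ~0 weight.
bias = (1.0 - attention_mask) * -10000.0
probs = torch.softmax(scores + bias.unsqueeze(1), dim=-1)

print(probs.shape)             # torch.Size([1, 32, 32]) -- never 32 x 512
print(probs[0, 0, 28:].sum())  # ~0: no attention paid to the pad tokens
```

Since positions 32..511 are never looked up, their positional-embedding rows simply receive no gradient on that batch; they are only updated when longer sequences appear.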
Thanks a lot Chris, I really appreciate your effort in making this valuable content on BERT. Just a request: could you please upload a video on pre-training multilingual BERT on a particular language to make it robust for that language?
Hi, excellent tutorial, but I have one important question. In the training step of the masked language model, we construct the embedding of the "masked" token using the embeddings of the contextual words, right? Then with a softmax layer we predict the "masked" word.
If we construct the "masked" embedding from the contextual tokens, we need to calculate the dot product of the query of the "masked" embedding with the key of each contextual token. My question is: how can we calculate the query of the "masked" token if we don't know its input embedding (because we "masked" it intentionally)?
For example, at minute 10:39, when the model tries to calculate the embedding of the "masked" token, in order to calculate the attention weight for the word "like" we need the key vector of the word "like" and the query vector of the "masked" token. But what is the embedding of the [MASK] token? How can we calculate the query vector of the "masked" word if we don't know the embedding of the "masked" token?
Nicolas Montes The same question occurred to me, and I'm still not sure, but it looks like it projects the embedding for the [MASK] token, and when it finally predicts the probability distribution over the vocabulary, the MLM classifier takes the surrounding context (words) into account. Why? Because of the way BERT was trained (MLM). Still unsure, though; better double-check it.
20:02 The exact mechanism for the leakage is this. Suppose you have sequence length 3. You could try to predict token 1 using only tokens 2, 3; token 2 using only tokens 1, 3; and token 3 using only tokens 1, 2. However, that only works if you have a single attention layer. If you have 2 or more attention layers, then the information leaks. For example, consider token 1 output from the transformer layer 2. Sure, you masked input 1. But input 1 was used to produce token 2 and token 3 outputs in transformer layer 1. Token 1 layer 2 prediction takes as an input token 2 and token 3 outputs from layer 1, each of which contains information about token 1 input.
The series is really amazing; learned a lot 👍
Great work :-) !! Please also make videos on OpenAI GPT-2!
wow, this is a great series.
GREAT! Thank you!... Could you add another payment method for buying the BERT collections?
Excellent video. Do you have any insight on how BERT deals with pre-training of words that are represented/encoded as multiple tokens?
Very well done series! Thank you!
Thank you Chris, amazing content 💪
Hi Chris. Thanks for all the great videos on BERT! I really appreciate them and have learned a lot.
I have one question. I am applying the following code from your Colab document:
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask)
However, my input_ids contain more than 10,000,000 sentences. This will create a huge data frame and can take more than 300 GB of memory. I realized that we only need last_hidden_states[0][:,0,:].numpy(), which is a much smaller portion of the whole data frame. So my question is: is it possible to return only last_hidden_states[0][:,0,:].numpy() from the model? Thanks in advance!
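A common pattern for this is to run the model in batches and keep only the [CLS] slice from each batch, so the full (batch, seq_len, 768) hidden states are freed as soon as each iteration ends. A minimal sketch of the batching and slicing (a stand-in function replaces the real model call here, so the numbers are random):

```python
import torch

def encode_batch(input_ids, attention_mask):
    # Stand-in for `model(input_ids, attention_mask)`; a real BERT call
    # returns a tuple whose first element is the (batch, seq_len, 768)
    # last hidden states.
    return (torch.randn(input_ids.size(0), input_ids.size(1), 768),)

# Toy data: 1,000 "sentences" of 64 token ids each.
input_ids = torch.randint(0, 30000, (1000, 64))
attention_mask = torch.ones_like(input_ids)

cls_vectors = []
batch_size = 256
with torch.no_grad():
    for start in range(0, input_ids.size(0), batch_size):
        out = encode_batch(input_ids[start:start + batch_size],
                           attention_mask[start:start + batch_size])
        # Keep only the [CLS] vector (position 0); the rest of the batch's
        # hidden states are released at the end of the iteration.
        cls_vectors.append(out[0][:, 0, :])

cls_vectors = torch.cat(cls_vectors, dim=0)
print(cls_vectors.shape)  # torch.Size([1000, 768])
```

So peak memory is one batch's hidden states plus the growing (n_sentences, 768) result, instead of the full 300 GB tensor.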
Thank you so much Chris!
Hello Chris! Thank you for your great videos! Helped me a lot!
I have a question: to find out what label the model assigns to a data point, we use the argmax of the logits. If the bigger number is in column 0, the label is 0, and vice versa. How can I find the model's confidence in these labels? I mean, suppose we were using a sigmoid function: it gives a number between 0 and 1, and if the number is close to 0.5, the model is not so sure about the label of the corresponding data point. Here, I was thinking about checking how close the numbers in columns 0 and 1 are, but I didn't really find a good metric for that.
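The usual answer to the question above is to pass the logits through a softmax, which turns each row into probabilities that sum to 1; the max probability is then the confidence, and it plays exactly the sigmoid role described (near 0.5 means the model is unsure, for two classes). A minimal sketch with made-up logits:

```python
import torch

logits = torch.tensor([[2.1, -1.3],    # confident: label 0
                       [0.2,  0.1]])   # uncertain: near 50/50

probs = torch.softmax(logits, dim=1)     # each row now sums to 1
labels = torch.argmax(probs, dim=1)      # same labels as argmax on raw logits
confidence = probs.max(dim=1).values     # top-class probability per row

print(labels)       # tensor([0, 0])
print(confidence)   # ~0.97 for row 0, ~0.52 for row 1
```

Note that softmax is monotonic, so the predicted labels never change; only the confidence scores are added.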
Thank you Chris for this amazing course on BERT. Could you also make similar ones for RBMs, VAEs, and GANs?
Thank you very much.
If the number of words in a sentence is more than 512, how do we use BERT? For example, in news classification we need to pass the whole article to train with BERT, but it contains more than 512 words. How do we handle that issue??? Watched the entire series but I'm not clear about that 😃
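One common workaround for the 512-token limit is to split the token ids into overlapping windows that each fit the limit, run BERT on each chunk, and then pool the per-chunk predictions (e.g. average the logits). A minimal sketch of the windowing step (a hypothetical helper; 101/102 are bert-base-uncased's [CLS]/[SEP] ids, and the integer list just stands in for real token ids):

```python
def chunk_token_ids(token_ids, max_len=512, stride=256, cls_id=101, sep_id=102):
    """Split a long token-id sequence into overlapping windows that each
    fit BERT's limit: [CLS] + up to (max_len - 2) tokens + [SEP]."""
    body = max_len - 2                    # room left after [CLS] and [SEP]
    chunks = []
    for start in range(0, max(len(token_ids), 1), stride):
        window = token_ids[start:start + body]
        if not window:
            break
        chunks.append([cls_id] + window + [sep_id])
        if start + body >= len(token_ids):
            break                         # this window reached the end
    return chunks

# 1,200 "tokens" -> four overlapping chunks, each within the 512 limit.
chunks = chunk_token_ids(list(range(1200)))
print([len(c) for c in chunks])  # [512, 512, 512, 434]
```

The overlap (stride < window size) keeps context that would otherwise be cut at a chunk boundary; alternatives include simple truncation or the head+tail trick (first ~128 plus last ~384 tokens).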
Thank you very much. I believe that GPT is based on decoders, not encoders (18:02).
Thank you Chris
Could you create a tutorial showing how to do multilevel classification on text data where the classes are unbalanced? Moreover, there are some numerical inputs along with the text. For example, we want to classify whether a customer liked the movie, was neutral, or didn't like the movie, and we have the customers' text reviews and their age.
I guess you made a mistake in discussing the architecture of GPT: it has the decoder layers of the transformer, not the encoder layers. Otherwise, great series; learned a lot!
Hi Anunay, thanks for your comment! In this video, I’m actually referring to the original GPT, and I think you’re thinking of their newer GPT-2 model from last year.
Thank you for your videos. Would it be possible to do a video on how to train your own transformer from scratch if you have a language corpus as a huge txt file?
Thank you!
8:36 here comes the doggie again!
Damn, god bless you....
Haha, thank you. It can be so hard to find good explanations for this stuff!
@ChrisMcCormickAI I started spending more time with BERT; your video gave me a kick start earlier.
Quality content