Thanks for the explanation. Keep up the excellent work!
Thank you 🙂
Explained well... Thanks for the video
Nice video, but the part I don't understand is why you say the teacher network is kind of ahead of the student network. I mean, we're using the student's weights to update the teacher's weights. Please explain this to me.
Yes, but the teacher is pre-trained, so that's how I see it. :)
I'm asking the audience what sort of content they would like in the coming weeks. Would you like more videos on papers, or hands-on coding-style videos? Your feedback will be valuable. Thx.
@@AIBites Sorry, but I'd have to disagree here. This paper doesn't employ a pretrained teacher; that's mentioned in the paper as well. The teacher is learned on the fly.
@@sushilkhadka8069 I had the same problem, so I went straight to ChatGPT. Here is what I understood from it: the concept doesn't imply that the teacher is learning more or faster. Instead, it means the teacher's parameters are more stable and less noisy. Because the teacher integrates changes gradually over a long period of time (thanks to EMA), it accumulates and reflects a smoother, more generalized version of what the student is learning. That stability and smoothing effect lets the teacher provide consistent, high-quality targets for the student, reducing the risk of overfitting to noisy aspects of the data.
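For anyone who wants to see what that EMA update looks like in practice, here is a minimal PyTorch-style sketch. The momentum value and names (`ema_update`, `teacher`, `student`) are illustrative assumptions, not taken from the paper's actual code:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Update teacher weights as an exponential moving average of student weights.

    The teacher receives no gradients; it only tracks a smoothed copy of the
    student's parameters, which is why its targets are more stable over training.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data.mul_(momentum).add_(s_param.data, alpha=1.0 - momentum)

# Illustrative training-loop usage (loss and optimizer details omitted):
# loss = distillation_loss(student(view1), teacher(view2))
# loss.backward()               # gradients flow only through the student
# optimizer.step()              # the student is updated by the optimizer
# ema_update(teacher, student)  # the teacher is then refreshed via EMA
```

So the teacher isn't "ahead" because it learns faster; it just averages the student's recent states into a steadier set of weights.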
@@AIBites more papers
You are being heard. A paper video is coming next week 😊