Professor is very sneaky and uses the TV series approach. Every time I say "well, one more video and that's it", but in the end we get "John kills Bob" and a big "to be continued" sign. So here I am, sitting for 2 hours straight, because at the end of each video yet another important technique gets mentioned :D Thanks!
Ha ha, I like your comment... how many ways he has!!
Thanks for uploading this concise, core-idea-centric explanation, along with the implementation details of the core elements needed to build an efficient neural network.
For input normalization, x is divided by the variance. Shouldn't it be the standard deviation instead, i.e. not sigma^2 but sqrt(sigma^2), at 0:58 in the video?
What is lowercase m here? Is it the number of hidden units in layer l, or the number of samples in the mini-batch?
It is the batch size
Normally it is the number of training examples; here it is the mini-batch size.
Thank you
It's the number of samples.
The way he explains this detail is confusing, because here he seems to mean that (i) is the i-th of the n neurons in layer [l]. It looks like he made some sort of mistake (a rare case), and in fact, in a subsequent video, he uses (i) to index an example in the batch, not a neuron.
At 7:28 it is said we might want a larger variance for z, but why? Wouldn't that lead to slow learning / vanishing gradients in the case of a sigmoid?
If the variance is small, you will produce inputs near zero, where the sigmoid behaves like a linear function. That would make the whole layer a useless linear transformation.
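A quick way to see this (an illustrative numpy sketch of my own, not from the video): near zero the sigmoid is well approximated by its tangent line 0.5 + x/4, so pre-activations squeezed into that region make the unit essentially linear.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pre-activations with tiny variance cluster around 0 ...
x_small = np.random.normal(0.0, 0.05, size=1000)
# ... where sigmoid(x) is almost exactly its linear approximation 0.5 + x/4.
print(np.max(np.abs(sigmoid(x_small) - (0.5 + x_small / 4))))  # tiny (~1e-4): nearly linear

# With a larger variance the inputs reach the curved, saturating parts of the sigmoid.
x_large = np.random.normal(0.0, 3.0, size=1000)
print(np.max(np.abs(sigmoid(x_large) - (0.5 + x_large / 4))))  # large: clearly non-linear
```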
I thought batch normalization was a regularization technique for neural networks, with the objective of reducing overfitting. Perhaps it both improves training performance and addresses overfitting?
Beautiful explanation. This makes so much more sense now.
What if the layer is not fully connected? How is the batch normalization done?
If we set gamma and beta to get a different mean and variance for each layer, what is the purpose of batch normalization? Or is the effect of batch normalization restricted to each layer individually?
They are trainable so you do not set them
So gamma and beta are learning the true variance and mean of the dataset, right?
Great explanation. Need to make notes.
I'm making notes
What is z? 2:30
It is the value of the neuron before applying the activation function. So a = h(z) for some activation function h. z itself is the dot product of the weights w with the inputs to the neuron.
I found it a bit unclear what the axis of normalisation was (I believe each individual activation is normalised using the mean and standard deviation of that activation over the batch?), and how many learnable parameters there are -- is there a gamma and beta for each activation? It's not clear whether the zs, gammas, betas, etc. are scalars for single activations or vectors for the whole layer.
As I understood it, Z is the vector of the whole output of one layer, and you compute the mean and the variance over this vector too. You have probably found your answer a while ago, but I still wanted to answer.
They are vectors for the whole layer. Four parameters are attached to each z: two learnable parameters, gamma and beta, and two non-learnable ones, the mean and the standard deviation. All the calculations are element-wise, but the representations are all vectors. Hopefully this dissipates your doubts; if it is still not clear, leave your question here.
@beizhou2488 Could you please explain why we don't want the mean and standard deviation to come from the same distribution all the time? I mean, why did we add beta and gamma to the normalized equation?
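For what it's worth, here is a minimal numpy sketch of the batch-norm forward step as presented in the video (the variable names and shapes are my own choice): Z holds the pre-activations of one layer for a mini-batch of m examples, each of the n hidden units is normalized with its own mean and variance computed across the batch, and there is one learnable gamma and one learnable beta per unit.

```python
import numpy as np

m, n = 64, 100                    # mini-batch size, number of hidden units in this layer
Z = np.random.randn(m, n)         # pre-activations z for one layer, one row per example

eps = 1e-8                        # small constant to avoid division by zero
gamma = np.ones(n)                # learnable scale, one per hidden unit
beta = np.zeros(n)                # learnable shift, one per hidden unit

mu = Z.mean(axis=0)               # per-unit mean over the mini-batch, shape (n,)
var = Z.var(axis=0)               # per-unit variance over the mini-batch, shape (n,)

Z_norm = (Z - mu) / np.sqrt(var + eps)  # each unit now has mean ~0 and variance ~1 over the batch
Z_tilde = gamma * Z_norm + beta         # scaled/shifted to whatever mean and variance the network learns

print(Z_tilde.shape)              # (64, 100): same shape as Z; this is what goes into the activation
```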
Amazing thanks
Should it not be sigma^2 = (1/m) * sum_i (z_i^2 - mu^2)?
Since he has already subtracted the mean from X, i.e. X = X - mu, the new mean is 0, so the variance is just the average of X^2 over the N samples.
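Both forms agree; here is a tiny numpy check (my own example, not from the video) of the identity (1/m) * sum((z_i - mu)^2) = (1/m) * sum(z_i^2) - mu^2:

```python
import numpy as np

z = np.random.randn(1000) * 2.0 + 3.0    # arbitrary mini-batch of one unit's pre-activations
mu = z.mean()

var_centered = np.mean((z - mu) ** 2)    # (1/m) * sum((z_i - mu)^2), as written in the video
var_identity = np.mean(z ** 2) - mu**2   # (1/m) * sum(z_i^2) - mu^2, the form asked about above
print(np.isclose(var_centered, var_identity))  # True: the two expressions are equivalent
```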
📺💬 We can use Z̃(i) instead of Z(i).
🧸💬 Does that mean we do not compute the values from all nodes separately, because they start from the same distribution and the update is linear in the beta and gamma parameters⁉
How do you initialize gamma and beta? (gamma = 1, beta = 0?)
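Yes, the common default (used by most frameworks, as far as I know) is to start at the identity transform, so batch norm initially passes through the plain normalized values; a minimal sketch:

```python
import numpy as np

n_units = 100
gamma = np.ones(n_units)   # scale starts at 1 ...
beta = np.zeros(n_units)   # ... and shift at 0, so z_tilde = z_norm at initialization;
                           # gradient descent then learns better values alongside W.
```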
Shouldn't it be x = x / sigma, not x / sigma^2??
Thanks for the video. I have a dumb question: what do you mean by the hidden unit value? Or, what does z refer to? It confused me a lot.
The hidden unit value is the output vector of multiplying the previous layer's activations by the weights, before applying the non-linearity to it.
z = WX + b
Here z = the hidden unit values, i.e. the pre-activation values of some hidden layer -> these are the values that are normalized.
W = the weight matrix of that layer (the weight values are what the network learns).
X = the input to that layer.
b = the bias.
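A small numpy illustration of that (the sizes and names here are my own, just to show the shapes), following the course convention of one column per example:

```python
import numpy as np

n_prev, n_units, m = 5, 3, 4           # previous layer size, this layer size, mini-batch size
X = np.random.randn(n_prev, m)         # activations coming from the previous layer (input to this layer)
W = np.random.randn(n_units, n_prev)   # weight matrix of this layer (learned by the network)
b = np.zeros((n_units, 1))             # bias of this layer

z = W @ X + b                          # hidden unit values, i.e. the pre-activations that batch norm normalizes
a = np.tanh(z)                         # activation; with batch norm, the non-linearity is applied to z_tilde instead of z
```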
The z value comes from normal distribution statistical tables... hidden units are variables in the hidden layers.
"The effect of gamma and beta is to set the mean to whatever you want it to be", you forgot to mention variance. Should've been that the effect of gamma and beta is to set the mean and variance to whatever you want it to be.
Excellent presentation, terrible handwriting :v