
Stochastic Gradient Descent vs Batch Gradient Descent vs Mini Batch Gradient Descent |DL Tutorial 14

  • Published 17 Aug 2020
  • Stochastic gradient descent, batch gradient descent and mini-batch gradient descent are three flavors of the gradient descent algorithm. In this video I will go over the differences among these three and then implement them in Python from scratch using a housing price dataset. At the end of the video there is an exercise for you to solve.
    🔖 Hashtags 🔖
    #stochasticgradientdescentpython #stochasticgradientdescent #batchgradientdescent #minibatchgradientdescent #gradientdescent
    Do you want to learn technology from me? Check codebasics.io/... for my affordable video courses.
    Next Video: • Chain Rule | Deep Lear...
    Previous video: • Implement Neural Netwo...
    Code of this tutorial: github.com/cod...
    Exercise: Go to the end of the above link to find the exercise description
    Deep learning playlist: • Deep Learning With Ten...
    Machine learning playlist : www.youtube.co...
    Prerequisites for this series:
    1: Python tutorials (first 16 videos): www.youtube.co...
    2: Pandas tutorials(first 8 videos): • Pandas Tutorial (Data ...
    3: Machine learning playlist (first 16 videos): www.youtube.co...
    #️⃣ Social Media #️⃣
    🔗 Discord: / discord
    📸 Dhaval's Personal Instagram: / dhavalsays
    📸 Instagram: / codebasicshub
    🔊 Facebook: / codebasicshub
    📝 Linkedin (Personal): / dhavalsays
    📝 Linkedin (Codebasics): / codebasics
    📱 Twitter: / codebasicshub
    🔗 Patreon: www.patreon.co...

COMMENTS • 251

  • @codebasics
    @codebasics  2 years ago +6

    Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

  • @user-zy8sf7tv2f
    @user-zy8sf7tv2f 3 years ago +8

    I followed your words and implemented the mini-batch gradient descent algorithm myself, and I learned a lot after watching your implementation of it,
    thank you very much.

  • @ryansafourr3866
    @ryansafourr3866 2 years ago +30

    The world is better with you in it!

    • @codebasics
      @codebasics  2 years ago +6

      Glad you liked it Ryan and thanks for the donation

  • @prashantbhardwaj7041
    @prashantbhardwaj7041 2 years ago +14

    At about 14:43, a clarification may help someone as to why the transpose is required. For a matrix product, the rule of thumb is that the columns of the 1st matrix must match the rows of the 2nd matrix. Since our "w" has 2 entries, "X_scaled" has to be transposed from a 22x2 matrix into a 2x22 matrix. The resulting product is then a vector of 22 predictions, one per sample.

    • @mikeguitar-michelerossi8195
      @mikeguitar-michelerossi8195 1 year ago +1

      Why don't we use np.dot(scaled_X, w)? It should give the same result, without the transpose operation.

    • @ankitjhajhria7443
      @ankitjhajhria7443 1 year ago

      w.shape is (2,1), which means 1 column, and x_scaled.T is (2,20), which means 2 rows? Your rule doesn't seem to hold; why?
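
      A small shape check for this thread (a minimal sketch with dummy data; the 22x2 scaled_X and 2-element w follow the shapes mentioned in the first comment): np.dot(w, scaled_X.T) and np.dot(scaled_X, w) produce the same 22 predictions, they just match the operand shapes in different orders.
      import numpy as np
      scaled_X = np.random.rand(22, 2)        # 22 samples, 2 features
      w = np.array([0.5, 0.3]); b = 0.1       # one weight per feature
      p1 = np.dot(w, scaled_X.T) + b          # (2,) dot (2, 22) -> (22,) predictions
      p2 = np.dot(scaled_X, w) + b            # (22, 2) dot (2,) -> (22,) predictions
      print(p1.shape, np.allclose(p1, p2))    # (22,) True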

  • @zhaoharry4113
    @zhaoharry4113 4 years ago +22

    love how you always put memes in your videos HAHA, great work!

    • @zhaoharry4113
      @zhaoharry4113 4 years ago

      and thank you for the videos Sir :3

  • @NguyenNhan-yg4cb
    @NguyenNhan-yg4cb 3 years ago +37

    Lol, I don't want to go to sleep and I don't have enough money to watch Netflix, so I'll just take care of my career, sir

  • @sanjivkumar8187
    @sanjivkumar8187 2 years ago +3

    Hello Sir, I am following your tutorials from Germany. You make things so simple, better than Udemy, Coursera, etc. courses. I highly recommend them.
    Please take care of your health as well, and hopefully you will be fitter in the coming videos 🙂

  • @yen__0515
    @yen__0515 2 years ago +4

    Sincerely appreciate your enriching content, it helps me a lot!

    • @codebasics
      @codebasics  2 years ago

      Thanks for the generous donation 🙏👍

  • @nahidakhter8646
    @nahidakhter8646 3 years ago +8

    Video was fun to watch and the jokes helped keep me focused. Thanks for this :)

  • @vincemegasonic
    @vincemegasonic 2 years ago +5

    Good day to you sir! I'm an undergraduate in Computer Science, currently working on a paper that uses this kind of neural network. This tutorial helped me understand neural networks pretty quickly and helped me adjust our software to function how we intend it to. Please keep up the good work, and I hope that other students like me can come across this and use it in their upcoming studies!!
    Godspeed on your future content!!

    • @codebasics
      @codebasics  2 years ago +2

      Best of luck! and I am happy this video helped

  • @girishtripathy275
    @girishtripathy275 3 years ago +2

    After so many videos I watched to learn ML (self-learning, I am a complete noob in ML currently), this playlist might be the best one I've found on YouTube! Kudos man. Much respect.

  • @spiralni
    @spiralni 3 years ago +1

    When you understand a topic you can explain it easily, and you, sir, are a master. Thanks.

  • @siddharthsingh2369
    @siddharthsingh2369 2 years ago +10

    If someone is having trouble with the values of w_grad and b_grad, here is my explanation, please correct me if I am wrong somewhere -
    I think the error is calculated using the formula (y_predicted - y_true)**2, if you notice at the start. Hence the total error in that case will be the mean of all the errors found. However, when you differentiate the squared term, i.e. error**2, the derivative of x**2 also brings a 2 in front, which is why the weight gradient shows a 2. The -ve sign which you see is just the reversal of (y_true - y_predicted) in this video, whereas in the previous video it was (y_predicted - y_true).
    Also, if you are confused by the transpose implementation of the matrix, since the one shown here is a little different from the one in video 13, then you can use the code below for w_grad and b_grad. It will give you the exact same values.
    # Similar to video 13 while finding w1, w2, bias -
    w_grad = (2 / total_samples) * np.dot(np.transpose(x), (y_predicted - y_true))
    b_grad = 2 * np.mean(y_predicted - y_true)
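
    A quick numerical check of the equivalence claimed above (a minimal sketch with made-up numbers; x, y_true and y_predicted are stand-ins for the tutorial's variables):
    import numpy as np
    x = np.array([[0.2, 0.5], [0.4, 0.1], [0.9, 0.8]])   # 3 samples, 2 features
    y_true = np.array([0.3, 0.2, 0.9])
    y_predicted = np.array([0.35, 0.15, 0.8])
    n = len(x)
    g_video = -(2 / n) * x.T.dot(y_true - y_predicted)                    # form used in this video
    g_v13 = (2 / n) * np.dot(np.transpose(x), (y_predicted - y_true))     # form from video 13
    print(np.allclose(g_video, g_v13))   # True: the sign flip cancels the reversed difference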

  • @vishaljaiswar6441
    @vishaljaiswar6441 2 years ago +2

    Thank you so much, sir! I think you taught way better than my university lecturer and helped me understand much better!

  • @watch_tolearn
    @watch_tolearn 4 months ago

    You are the best teacher I have come across. You bring understanding in a humble way. Stay blessed.

  • @kasyapdharanikota8570
    @kasyapdharanikota8570 2 years ago +2

    When you explain, I find deep learning very easy and interesting. Thank you sir!

  • @kaiyunpan358
    @kaiyunpan358 3 years ago +3

    Thank you for your patient and easily understood explanation which solved my question !!!

  • @spg2731476
    @spg2731476 2 years ago +4

    At 3:21, why do you need 20 million derivatives?
    It would just be 3 derivatives: 2 for the features and 1 for the bias. Isn't it? If so, please update it so that the audience is not confused.

    • @JunOfficial16
      @JunOfficial16 2 years ago +1

      I have the same question. For the first epoch, 3 derivatives; the second would be 3 more, and so on. So the number of derivatives depends on how many epochs we go through, right?

    • @JunOfficial16
      @JunOfficial16 2 years ago +1

      And with SGD, at every sample we calculate 3 derivatives until the error is minimized. If the error is not minimized to 0, it would go through 10 million samples, and that would be 10 million x 3 = 30 million derivatives.

  • @shamikgupta2018
    @shamikgupta2018 2 years ago +2

    17:26 --> Sir, it looks like the derivative formulae for w1 and bias are different from what you had shown in the previous video.

  • @bestineouya5716
    @bestineouya5716 1 year ago

    I spent days trying to learn gradient descent and its types. Happy you cleared up the mess. Thanks again, teacher.

  • @priyajain6791
    @priyajain6791 2 years ago +2

    @codebasics Loving your videos so far. The way you present the examples and explanations, things really seem easy to understand. Thanks a lot for the thoughtful content! Just one request: can you please share the PPT you're using as well?

    • @tarunjnv1995
      @tarunjnv1995 1 year ago +1

      @codebasics Yes, your content is really outstanding. Also, for quick revision of all these concepts we need the PPT. Could you please provide it?

  • @shouyudu936
    @shouyudu936 3 years ago +4

    I have a question: why do we also need to divide by n in stochastic gradient descent? Aren't we going through each point individually?

    • @r0cketRacoon
      @r0cketRacoon 1 month ago

      same question, do you have an answer for that?
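
      One way to look at it (a minimal sketch, not the video's code): for a single random sample the sum in the gradient collapses to one term, so dividing by n only rescales the step; many textbook SGD implementations drop the 1/n and fold it into the learning rate.
      import numpy as np
      rng = np.random.default_rng(0)
      X_scaled = rng.random((20, 2))     # dummy scaled features (20 samples, 2 features)
      y_true = rng.random(20)            # dummy scaled targets
      w, b, learning_rate = np.zeros(2), 0.0, 0.5
      i = rng.integers(len(X_scaled))    # pick one random sample
      y_hat = np.dot(w, X_scaled[i]) + b
      w_grad = -2 * X_scaled[i] * (y_true[i] - y_hat)   # gradient of this one sample's squared error, no 1/n
      b_grad = -2 * (y_true[i] - y_hat)
      w, b = w - learning_rate * w_grad, b - learning_rate * b_grad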

  • @spoonstraw7522
    @spoonstraw7522 7 months ago

    Thank you so much, and that cat trying to learn mini-batch gradient descent is so relatable. In fact, that's the reason I'm here. My cat is a nerd. We were partying, and then my cat, the party pooper he is, asked what mini-batch gradient descent is and he kind of ruined the party. He always does this; last time he was annoying everyone by trying to explain what Boolean algebra is. What a nerd.

  • @MrBemnet1
    @MrBemnet1 3 years ago +5

    Question: why do you have to do 20 million derivatives for 10 million samples? The number of derivatives you have to do should be equal to the number of w's and b's.

    • @danielahinojosasada3158
      @danielahinojosasada3158 2 years ago

      Remember that there are multiple features. One sample --> multiple features. This means calculating multiple derivatives per sample.

    • @rahulnarayanan5152
      @rahulnarayanan5152 2 years ago

      @golden water Same question

    • @uttamagrahari
      @uttamagrahari 2 years ago +1

      Here in these 10 million samples there are 10 million weights and 10 million biases. So we have to do derivatives for every weight and bias, so we have to do 20 million derivatives while updating for the new weight and bias.

  • @farrugiamarc0
    @farrugiamarc0 5 months ago

    Thank you for sharing your knowledge on the subject with a very good and detailed explanation. I have a question with reference to the slide shown at time 3:29. When configured to do batch gradient descent, with 2 features and 1 million samples, why is the total number of derivatives equal to 2 million? Isn't it 2 derivatives per epoch? After going through all the 1 million samples you calculate the MSE and then do back-propagation to optimise W1 and W2. Am I missing something?

  • @user-qi8xj8jh9m
    @user-qi8xj8jh9m 1 year ago

    This is called teaching, love your teaching sir!!

  • @malharlumbhani8700
    @malharlumbhani8700 3 years ago

    You teach absolutely brilliantly, sir. Top class :)))))

  • @fahadakhtar6366
    @fahadakhtar6366 13 days ago

    very engaging tutorials!

  • @rofiqulalamshehab8528
    @rofiqulalamshehab8528 1 year ago +1

    Your explanation is excellent. It would be great if you could make a computer vision playlist. Do you have any plans for it?

  • @manishdubey4772
    @manishdubey4772 3 months ago +1

    Sir, I think you missed the sum part in w_grad:
    it should be:
    w_grad = -(2/total_sample)*np.sum((x.T.dot(y_true-y_pred)))🙂🙂

    • @r0cketRacoon
      @r0cketRacoon 1 month ago

      You'd better learn about matrix multiplication.
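
      For reference, a tiny check with made-up numbers (not from the video): the matrix product X.T.dot(y_true - y_pred) already sums over the samples for each feature, so wrapping it in np.sum would merge the per-feature gradients into a single number.
      import numpy as np
      X = np.array([[1., 2.], [3., 4.], [5., 6.]])   # 3 samples, 2 features
      err = np.array([0.1, -0.2, 0.3])               # y_true - y_pred for each sample
      print(X.T.dot(err))           # [1.  1.2]  -> one summed gradient component per feature
      print(np.sum(X.T.dot(err)))   # 2.2        -> np.sum collapses both components into one scalar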

  • @harshalbhoir8986
    @harshalbhoir8986 1 year ago

    Thank you so much sir.
    Now I really don't have a problem with gradient descent,
    and the exercise at the end helps a lot!!

  • @waseemabbas5078
    @waseemabbas5078 2 years ago

    Hi Sir! I am from Pakistan and I am following your tutorials. Thank you very much for such amazing guiding material.

  • @tiyasachakraborty4786
    @tiyasachakraborty4786 2 years ago

    You are my best teacher. I am becoming a big fan of such a great teacher.

  • @ashimanazar1193
    @ashimanazar1193 3 years ago

    The explanation was very clear. What if the input data X has outliers? Then, if one takes a small batch size, one can't just compare the last two values of theta or the cost function. What should the convergence condition be then? Please explain.

  • @fariya6119
    @fariya6119 3 years ago

    I think you have just made everything easy and clear. Thanks a lot. You have allayed my fears about learning deep learning.

  • @optimizedintroverts668
    @optimizedintroverts668 2 months ago

    Hats off to you for making this topic easy to understand.

  • @satinathdebnath5333
    @satinathdebnath5333 2 years ago +2

    Thanks for uploading such informative and helpful videos. I am really enjoying them and looking forward to using them in my MS work. Please let me know where I can find the input data, like the .CSV file. I could not find it in the link provided in the description.

  • @sunilkumar-pp6eq
    @sunilkumar-pp6eq 3 years ago +1

    Your videos are really helpful; you are so good at coding that it takes time for me to understand. But thank you so much for making it simple!

    • @codebasics
      @codebasics  3 years ago +2

      I am happy this was helpful to you.

  • @AJAYKUMAR-gl1vx
    @AJAYKUMAR-gl1vx 2 years ago +1

    And my second doubt is about the SGD implementation: when we are taking only one random sample, why are you dividing the error by the total number of samples?

    • @prasanth123cet
      @prasanth123cet 2 years ago

      I have this doubt too. I reran it without the total number of samples in the denominator; the values were different from what I got with batch gradient descent.

  • @diligentguy4679
    @diligentguy4679 6 months ago +1

    Great content! I have a question as to why you did not use an activation function here. Is it something we can do?

    • @jeethendra374
      @jeethendra374 3 months ago

      Dude, that's the same question I have. In case you know the answer, please share it with me.

    • @diligentguy4679
      @diligentguy4679 3 months ago +1

      @@jeethendra374 nope

  • @kumudr
    @kumudr 3 years ago

    Thanks, I finally understood gradient descent, SGD & mini-batch.

  • @williammartin4416
    @williammartin4416 5 months ago

    Excellent lecture

  • @abhisheknagar9000
    @abhisheknagar9000 4 years ago +1

    Very nice explanation. Could you please let me know the parameter values while training (for SGD, mini-batch and batch) using Keras?

  • @RV-qf1iz
    @RV-qf1iz 1 year ago

    Like your way of teaching: less theory, more coding.

  • @ISandrucho
    @ISandrucho 3 years ago +1

    Thanks for the video. I noticed one thing: in SGD you didn't change the partial derivative formula of the cost function (even though the cost function had changed).

    • @r0cketRacoon
      @r0cketRacoon 1 month ago

      Same question. I wonder why we need the derivatives divided by the total number of samples when we only pick a single stochastic sample. Have you figured out the answer?

  • @otsogileonalepelo9610
    @otsogileonalepelo9610 3 years ago +3

    Great content and tutorials, thank you so much.🙏 But I have a few questions:
    When do you implement early stopping to prevent overfitting?
    Aren't you supposed to stop training the moment the loss function value increases compared to the last iteration? For instance, the zig-zag pattern of the loss displayed by SGD: is that just fine?

  • @piyalikarmakar5979
    @piyalikarmakar5979 3 years ago

    Sir, your videos always answer all my queries about the topics... Thank you so much sir.

  • @fahadreda3060
    @fahadreda3060 4 years ago +2

    Thanks for the video, wish you all the best.

  • @010-haripriyareddy5
    @010-haripriyareddy5 4 months ago

    Can we say that with large training datasets, SGD converges faster compared to batch gradient descent?

  • @raom2127
    @raom2127 2 years ago

    Great videos; the simplicity and detailed explanation with coding are super.

  • @mohdsyukur1699
    @mohdsyukur1699 4 months ago

    You are the best my boss

  • @yogeshbharadwaj6200
    @yogeshbharadwaj6200 3 years ago

    Thanks a lot for the detailed explanation... learned a lot...

  • @vinny723
    @vinny723 1 year ago

    Great series of tutorials. I would like to know, for this tutorial (#14), why the implementations of stochastic gradient descent and batch gradient descent did not include an activation function? Thanks.

    • @r0cketRacoon
      @r0cketRacoon 1 month ago

      No need to do that because this is a regression task; only classification problems use sigmoid or softmax at the output.
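
      A minimal illustration of that point (hypothetical helper names, not the tutorial's code): for regression the linear output is used directly, while a binary classifier would squash the same linear output through a sigmoid.
      import numpy as np
      def predict_regression(X, w, b):
          # identity (linear) output: suitable for predicting house prices
          return np.dot(X, w) + b
      def predict_binary(X, w, b):
          # sigmoid squashes the linear output into (0, 1) for classification
          z = np.dot(X, w) + b
          return 1 / (1 + np.exp(-z))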

  • @adityabhatt4173
    @adityabhatt4173 1 year ago

    Good bro, the way you used memes is exceptional. It makes learning fun.

  • @shuaibalghazali3405
    @shuaibalghazali3405 9 months ago

    Thanks for making this tutorial, I think I'm getting somewhere.

  • @vin-deep
    @vin-deep 1 year ago

    Super explanation skill that you have!!!

  • @suenosn562
    @suenosn562 2 years ago

    You are a great teacher, thank you so much sir.

  • @Vikings_004
    @Vikings_004 1 year ago

    Sir, big fan... the best and simplest explanation.

  • @rociodelarosa1549
    @rociodelarosa1549 2 years ago

    Excellent explanation, keep up the good work 👏

  • @prakashkrishnan7132
    @prakashkrishnan7132 6 months ago

    Instead of implementing it in Python code, why don't we use a Keras model for the same dataset?

  • @aryac845
    @aryac845 1 year ago

    I was following your playlist and it's very helpful. But where can I get the data you used, so that I can work with it?

  • @udaysadhukhan1
    @udaysadhukhan1 4 years ago

    Just one clarification!!! Per epoch there will be [number of independent features + 1 {bias}] derivatives, so number of derivatives = epochs * (number of independent features + 1). As per my understanding it does not depend on the number of samples... the number of calculations (square sums / sums / dot products) depends on the sample size...

    • @codebasics
      @codebasics  4 years ago

      Actually it depends on the features as well as the samples. When you are using numpy for vectorised operations, internally it is still doing a floating point multiplication for every single sample in the vector. Hence the computing requirements are high if you use batch GD on a huge dataset.

    • @udaysadhukhan1
      @udaysadhukhan1 4 years ago

      @@codebasics Yes, agreed, because the partial derivative involves a bunch of sum and dot operations, and those depend on the dataset size. Thanks.
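
      A rough way to reconcile the two views in this thread (illustrative arithmetic only, not the video's exact counting): batch GD produces (features + 1) gradient components per epoch, but each component is a sum over every sample, so the arithmetic work per epoch still scales with the dataset size.
      n_samples, n_features = 10_000_000, 2
      gradient_components_per_epoch = n_features + 1              # w1, w2, bias
      per_sample_terms_per_epoch = n_samples * (n_features + 1)   # multiply-adds hidden inside those sums
      print(gradient_components_per_epoch)   # 3
      print(per_sample_terms_per_epoch)      # 30000000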

  • @palashsrivastav6748
    @palashsrivastav6748 6 months ago

    Sir, why did you use sigmoid_numpy() to calculate y_pred in the previous code but not in this code for batch gradient descent?

  • @very_nice_777
    @very_nice_777 1 year ago

    Thanks a lot sir. Love from Bangladesh!

  • @GojoKamando
    @GojoKamando 4 months ago

    Why did we use mean squared error and not the log loss function?

  • @chalmerilexus2072
    @chalmerilexus2072 1 year ago

    Lucid explanation. Thank you

  • @ashishmalhotra2230
    @ashishmalhotra2230 9 months ago

    Hi, why did you do "y_predicted = np.dot(w, X.T) + b"? Why is the transpose of X required here?

  • @ramimoustafa
    @ramimoustafa 1 year ago

    Thank you man for this perfect explanation

  • @abhaydadhwal1521
    @abhaydadhwal1521 2 years ago

    Sir, I have a question... in stochastic GD you wrote -(2/total_samples) in the formula for w_grad and b_grad, but in mini-batch you wrote -(2/len(Xj)). Why the difference?
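
    One possible reading of the difference (my interpretation, not an official answer): the MSE gradient averages over whichever samples are used in a given update, so the divisor should be the size of that batch, i.e. all samples for batch GD, len(Xj) for a mini-batch, and 1 for pure SGD, where it only rescales the learning rate. A minimal sketch:
    import numpy as np
    def mse_gradients(X_batch, y_batch, w, b):
        # average gradient over the samples actually used in this update
        y_hat = np.dot(X_batch, w) + b
        m = len(X_batch)                    # batch size: all samples, a mini-batch, or 1
        w_grad = -(2 / m) * X_batch.T.dot(y_batch - y_hat)
        b_grad = -(2 / m) * np.sum(y_batch - y_hat)
        return w_grad, b_grad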

  • @shaikansarbasha4169
    @shaikansarbasha4169 3 years ago +1

    Sir, in the equation (-2/len(total_samples))*something, why did you take all the samples? I think we should take 1 instead of len(total_samples),
    because in batch GD we consider all samples so we divide by all samples, in mini-batch GD we consider 5 (in our example) so we divide by 5, and likewise in SGD we should divide by 1. This is my doubt, sir.

    • @r0cketRacoon
      @r0cketRacoon 1 month ago

      Yeah, same question. Have you figured out the answer? Please enlighten me.

  • @surabhisummi
    @surabhisummi 1 year ago

    One request, I am not able to find the csv file which you have used here. Please attach that as well, it would be a great help. Again thanks for teaching!

  • @sahinmuratogur7556
    @sahinmuratogur7556 2 years ago

    I have a question: why do you calculate the cost for each epoch? If you would like to plot the cost every 5 or 10 steps, is it more logical to calculate the cost only at every 5th or 10th step?

  • @tinanajafpour7214
    @tinanajafpour7214 1 year ago

    Sorry, could you please explain why you have put [0][0] in this line? return sy.inverse_transform([[scaled_price]])[0][0]
    I would really appreciate it.🙏🙏
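
    For what it's worth, a small sketch of why the indexing is needed (the fitted prices here are made up; only the shapes matter): scikit-learn scalers expect and return 2D arrays, so inverse_transform gives back a 1x1 array and [0][0] pulls out the plain number.
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    sy = MinMaxScaler()
    sy.fit(np.array([[38.0], [167.0], [120.0]]))     # prices shaped (n_samples, 1)
    scaled_price = 0.5
    out = sy.inverse_transform([[scaled_price]])     # scalers work on 2D arrays
    print(out)         # [[102.5]] -- still a 1x1 2D array
    print(out[0][0])   # 102.5     -- the scalar value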

  • @ahsanshafiqchaudhry
    @ahsanshafiqchaudhry 1 year ago

    I believe there is a mistake. Please correct me if I am wrong.
    In stochastic gradient descent code, when you are calculating the gradients (w_grad and b_grad), I think you should not use "total_samples" in the formula because the sample contains just one example.

    • @r0cketRacoon
      @r0cketRacoon 1 month ago

      Yeah, same question. Have you figured out the answer?

  • @nicholasyuen9206
    @nicholasyuen9206 10 months ago

    Can someone please explain why at 3:30 there will be 20 million derivatives computed in the first epoch? Shouldn't there be just 2 derivatives for the first epoch, since we would only be solving for 2 partial derivatives (with respect to the 2 features) of the MSE computed from all the 10 million samples? Thanks.

  • @swaralipibose9731
    @swaralipibose9731 3 years ago

    You are truly talented in teaching

  • @danielniels22
    @danielniels22 2 years ago

    love the party cat!

  • @Breaking_Bold
    @Breaking_Bold 8 months ago

    Great explanation !!!

  • @9427gyan
    @9427gyan 3 years ago

    I think the explanation of scaling around 9:45 could be improved. As per your explanation, the scaling makes the data look 2D, but I think it is because the data is derived from a column, so it naturally occurs as a column and therefore appears 2D.

  • @nastaran1010
    @nastaran1010 6 months ago

    I never received an answer to my questions from you. It is challenging for me to understand the equation for updating the weights. I mean, for example here (w1 = w1 - rate * x), the value of x: this is the derivative of the loss with respect to the weights, but how do you obtain it?

  • @ritik444
    @ritik444 2 years ago

    You are an actual legend

  • @danianiazi8229
    @danianiazi8229 3 years ago

    Why are we taking the transpose and dot product? Why not simply w * df['area'] + w * df['bedrooms'] + bias?

  • @skhurana333
    @skhurana333 2 months ago

    At 1:55, how did he arrive at -50 and -8?

  • @dutta.alankar
    @dutta.alankar 4 years ago

    Really well explained in simple terms!

  • @dimmak8206
    @dimmak8206 3 years ago

    You have a talent for teaching, cheers!

  • @sindhuswrp
    @sindhuswrp 2 years ago

    For SGD isn’t it supposed to be ‘m’ iterations per epoch? In the video it’s only 1 iteration per epoch.

  • @nasgaroth1
    @nasgaroth1 3 years ago

    Awesome teaching skills, nice work

  • @alidakhil3554
    @alidakhil3554 1 year ago

    Very nice lesson

  • @nsikanessien1116
    @nsikanessien1116 1 year ago

    Please Sir, I need clarity on why -2 was used in this formula: w_grad = -(2/total_samples)*(X.T.dot(y_true-y_predicted))
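
    A brief derivation of where the -2 comes from (standard MSE calculus, not specific to this video): with cost = (1/n) * sum((y_true - y_predicted)^2) and y_predicted = w.x + b, the chain rule gives d(cost)/dw = (1/n) * sum(2 * (y_true - y_predicted) * (-x)) = -(2/n) * sum(x * (y_true - y_predicted)). The 2 comes from differentiating the square, and the minus sign from differentiating -y_predicted with respect to w. A quick numerical check (made-up numbers):
    import numpy as np
    x = np.array([0.5, 1.0, 1.5]); y_true = np.array([1.0, 2.0, 2.5])
    w, b, n = 0.3, 0.1, 3
    cost = lambda w_: np.mean((y_true - (w_ * x + b)) ** 2)
    analytic = -(2 / n) * np.sum(x * (y_true - (w * x + b)))   # -(2/n) * sum x * (y_true - y_pred)
    numeric = (cost(w + 1e-6) - cost(w - 1e-6)) / 2e-6         # finite-difference check
    print(analytic, numeric)   # the two agree, confirming the -2 factor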

  • @shashisaini7919
    @shashisaini7919 1 year ago

    Thank you sir, good tutorial.❣💯

  • @harshalbhoir8986
    @harshalbhoir8986 1 year ago

    Thank you so much sir

  • @tinanajafpour7214
    @tinanajafpour7214 1 year ago

    thank you for the video

  • @vikrantgsai7327
    @vikrantgsai7327 1 year ago

    For mini-batch gradient descent, can the samples for the mini-batch be picked in any order from the main batch?

  • @mohamedyassinehaouam8956
    @mohamedyassinehaouam8956 2 years ago

    very interesting

  • @JH-kj3xk
    @JH-kj3xk 2 years ago

    many thanks!

  • @work-dw2hl
    @work-dw2hl 4 years ago +2

    w_grad = -(2/total_samples)*(X.T.dot(y_true-y_predicted))
    b_grad = -(2/total_samples)*np.sum(y_true-y_predicted)
    Sir, why use a minus before (2/total_samples)*(X.T.dot(y_true-y_predicted))?
    In the previous video:
    w1d=(1/n)*np.dot(np.transpose(Age),(y_pred-y_train))
    w2d=(1/n)*np.dot(np.transpose(Affordability),(y_pred-y_train))
    There we don't use a minus. What is the difference? Why use the minus here, sir?

    • @MuskanMadaan
      @MuskanMadaan 3 years ago

      did you find the reason?

    • @work-dw2hl
      @work-dw2hl 3 years ago

      @@MuskanMadaan no

    • @anishmanandhar1203
      @anishmanandhar1203 3 years ago

      I was also thinking the same

    • @hey_this_is_tanushree
      @hey_this_is_tanushree 3 years ago +1

      @work123 I can understand the minus: in the original formula in this video (ua-cam.com/video/pXGBHV3y8rs/v-deo.html) he used (y_predicted - y_true), but here he did the opposite, i.e. (y_true - y_predicted). But I'm not sure why the 2 appears.

    • @hey_this_is_tanushree
      @hey_this_is_tanushree 3 years ago +1

      @@MuskanMadaan partially, posted it

  • @Mathmagician73
    @Mathmagician73 4 years ago +1

    Waiting 😍........ Also, please make a video on optimizers.

  • @venisc2681
    @venisc2681 2 years ago +1

    kindly provide the dataset for practice

  • @satyamjyotisankar2041
    @satyamjyotisankar2041 3 years ago +1

    Sir, I have a question: why do you divide by -2/n instead of 1/n?

  • @AJAYKUMAR-gl1vx
      @AJAYKUMAR-gl1vx 2 years ago

    Hello Sir,
    I have one doubt.
    You are calculating y_predicted through the simple equation y = mx + c. But you said in one video that we get y_predicted after passing this y through some activation function. So why are you not taking that activation function into consideration?

    • @sostedia
      @sostedia 2 years ago

      Getting y_predicted by passing the linear output through an activation function applies to classification problems.