Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced
man!! I was struggling with how to use statistics in EDA. I knew std, mean n all but couldn't use them in the EDA flow. u just cleared my confusion!!!! u won't believe how long I have been struggling with this.. thank god I found this video.. u r a great teacher.. I had the tools but couldn't use them. u just taught me how to use it..
☺️👍
+1
Can't thank you enough for the amazing work you do. It is explained in such a simple, honest way. Many YouTubers explain things incompletely and then keep referencing their paid courses. You are probably the only one who has a complete course with complete explanations and exercises all available for free, and you even provide some level of feedback to those who interact with you. This is so rare and precious.
I have been learning programming and data science with view to improve my career. As soon as I get a salary from any coding related work, I promise to join your patreons. Can't thank you enough for what you do. All the best for you and your family.
Sultan, you are a very kind person and thanks for all your appreciation :) This kind of feedback motivates me to continue my work on youtube!
Sir i just wanna say that my respect for you is increasing alot.
Keep making such videos.
Thank you for your efforts.🙏
You are such an inspiration for people like me who are looking to transition into data science. Day and night I'm spending my time in this quarantine on data science, and your YouTube videos play a huge role in increasing my caliber. I am a system engineer at CTS and I now wish to move my career towards data science. I'm tirelessly preparing my portfolio and my resume to send for evaluation, as per your latest video.
you made it?
I am totally inspired by dhaval sir and krish naik sir
Thank you very much for sharing your valuable knowledge with us
I have no words how to say Thank You..You always providing Such a knowledge for free all the time...I pray god to keep safe for you and your Family all the Time with Health, Wealth and Prosperity..Thank You once again
Simply Super B Star.You and Krish are two eyes of Data science
we will not stop the video :) we will watch entire video . each info is very valuable to us (learners)
Wah... extra-ordinary explanation sir. Thank you...
One of the finest tutorials. Great teaching style.
Thanks Hardik, Keep learning.
Thankyou for the support and guidance. Your exercise part in tutorials is just awesome. I really loved your way of teaching
Really amazing lecture sir,i increasing interest on Data science sir
Excellent explanation in every topics, it really helps me alot for my data science career.. thanks
Hello,
1st of all, I love your videos. You have a great talent for teaching and are putting it to good use.
Just a small nit: the heights file you're using is not really a normal distribution, but a bi-modal one, as it has 2 modes. And the reason is very simple, it's because you're lumping together males & females. If you use separate data sets for each gender, you get much "cleaner" normal distributions.
Cheers
-CJK
It is a really beneficial and useful video on this topic, thank you!
woww! what a simple and easy to understand tutorial. Love it. Thank you sir.
Thank you so much sir your way of teaching is so clear and easily understandable
Very well explained sir!!
Worth watching
👍😊
Great Greaaaaat and a fulll too Greaaattttt explanation man. Loved it.
best tutorial
thanks alot sir
you are great
i have learnt alot of concept from your videos
GOD bless you
and keep making more videos
you are simply amazing , yr simple explanation helping a lot , thanks a trillion
Your tutorial is so clear. Well done!
Glad it was helpful!
Thanks very much for your simple and clear code.
Your videos are easy to understand. Thanks so much!
TOP content seriously thanks sir waiting for more videos specially EDA
16:38 This is just the trimming technique. If we want to do capping instead, which means replacing outliers with either the lowest or the highest allowed value, how do we do it?
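One possible way to do capping, as a minimal sketch: pandas `Series.clip` replaces anything outside the limits with the limit itself, so no rows are dropped. The sample values and the 5th/95th percentile limits below are made up for illustration and are not from the video.

```python
import pandas as pd

# Hypothetical sample; the video's heights CSV is not reproduced here.
heights = pd.Series([58.0, 62.5, 64.0, 65.5, 66.0, 67.5, 68.0, 70.5, 72.0, 90.0])

# Limits chosen for illustration: the 5th and 95th percentiles.
lower, upper = heights.quantile([0.05, 0.95])

# clip() replaces values below `lower` with `lower` and values above `upper`
# with `upper`, so the number of rows stays the same (unlike trimming).
capped = heights.clip(lower=lower, upper=upper)

print(capped)
```

The same limits could just as well come from mean ± 3 standard deviations; only the way `lower` and `upper` are computed would change.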
Really my sincere thanks for your valuable efforts and im keenly following your guideliness
Very good and neat explanation, but there is one drawback with the Z-score: it relies on the mean, so an extreme outlier or a human data-entry error can distort it. If we use a median-based calculation for outliers instead, it will be robust, because whatever the extreme values are, the median only depends on the middle values. Thanks for teaching the z-score.
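A minimal sketch of the median-based idea, assuming the commonly used "modified z-score" built on the median absolute deviation (MAD). The helper name, the sample data and the 3.5 cutoff are assumptions for illustration (3.5 is a frequently cited rule of thumb, not something from the video).

```python
import numpy as np

def modified_z_scores(values):
    """Median/MAD-based scores (hypothetical helper, not from the video)."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # median absolute deviation
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # on normally distributed data.
    # Caveat: mad is 0 (division by zero) if more than half the values are equal.
    return 0.6745 * (x - median) / mad

sample = [10, 11, 12, 12, 13, 14, 15, 500]  # made-up data with one extreme entry
scores = modified_z_scores(sample)
outliers = [v for v, s in zip(sample, scores) if abs(s) > 3.5]

print(outliers)
```

Because both the center (median) and the spread (MAD) ignore how far away the extreme entry is, the 500 cannot hide itself by inflating the spread.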
Great video - well explained!
How do we determine the Z-Score range for Skewed data? Do I use the same range on either side (like -3 to 3) or can I use different values like -1 to 3 (for left skewed data) after looking at the histogram plot?
Thanks in advance!
same question i don't know what is the right range for my data because the (3 , -3) doesn't work for my case
Great video, Thanks man , keep up the good work
Sir if data is non-normally distributed then which technique we prefer for removing outliers?
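There are ways to transform data towards a normal distribution, for example a log transform. Note that plain scaling only changes the range, not the shape of the distribution.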
Tks for the very detailed explanation sir...
Great tutorial, thanks for using readily available sample CSV as well. ☑☑
Do we remove outliers before feature scaling or after feature scaling?
We don't need to remove them all the time. We need to treat them, which means we might end up changing the value to some reasonable value.
Yes we remove them before feature scaling
Thanks for the video Sir.
I am new to the Machine Learning
Well I use percentile,standard deviation and zscore method
but the problem I get with the standard deviation and z-score methods is that removing the outliers doesn't change the values in our data, i.e. df; the result gets stored in a new frame, df_no_outlier_std_dev. So how do I update df itself after removing the outliers?
please help....
that is because we are storing it in a new dataframe, not the original one. In case you want the changes to be reflected in the original dataframe, assign the filtered result back to it (boolean filtering has no inplace option):
df = df[......code.....]
happy learning
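A minimal sketch of writing the filtered result back to `df`. The data, the `height` column name and the 2-standard-deviation limits are made up for illustration (with only ten rows a 3-standard-deviation limit would not catch the extreme value).

```python
import pandas as pd

# Made-up heights; `height` is an assumed column name.
df = pd.DataFrame({"height": [60.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 110.0]})

lower = df.height.mean() - 2 * df.height.std()
upper = df.height.mean() + 2 * df.height.std()

# Boolean indexing returns a new DataFrame and takes no `inplace` argument,
# so assign the result back to the same name to "update" df.
df = df[(df.height > lower) & (df.height < upper)]

print(df.shape)
```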
@@viveksingh881 Thanks..It was 6months back story..Now I at intermediate level in machine learning 👍
@@harshal_ajetrao thats great bro....clearing some doubts on random yotube videos..happy learning :)
@@viveksingh881 Thanks for helping man..Keep it up 🤘🤘🤘
Nice one Sir, thank you. One thing sir, I would like you to please make a tutorial on SQL.
Thank you sir
thank you very much again... i am really following all your video.. really knowledgeable ... @5:50 of this video, you created the bell curve.. i am aware of one function .kde() which does the same thing. Is it wise to use that? or there is some difference in that to this function you created for drawing bell curve? Thank you very much again. Really appreciate.
Naveen, actually I don't know about kde() function. What does API specification say about that function? Can you try plotting it and see if result is same as mine?
@@codebasics thank you for your reply. I went through your advice and plotted the height using .kde() method and it produced the bell curve same but with a slight difference but plotted the same normal curve.
I just had to write this line to draw it:
df.Height.plot.kde();
But, thank you again for your precious work. Because it's opening up my brain to think the more agile way of drawing it to understand mathematically.
We can also plot it through seaborn using the parameter kde=True
Does this work only if the feature is normally distributed? Most of the features in real world data are not normally distributed.
how can you apply this rule when you have about 10 features? Do you do them one by one?
Is removing outliers the better option, or is replacing them with another value the better option?
how to remove outlier from dataframe which has categorical as well as continuous data, as by percentile technique I am getting NaN value in categorical columns
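One likely cause of the NaN values is comparing the whole dataframe against the limits instead of just the numeric column. A minimal sketch, assuming made-up column names and 10th/90th percentile limits: compute the limits on the numeric column only and use a one-dimensional mask to select rows.

```python
import pandas as pd

# Made-up mixed-type frame; the column names are assumptions.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "M", "F"],
    "height": [5.9, 5.2, 5.1, 6.0, 5.4, 5.8, 9.5, 1.2],
})

# Compute the limits on the numeric column only.
low, high = df["height"].quantile([0.1, 0.9])

# Filter whole rows with a one-dimensional mask built from that column,
# so the categorical column is carried along untouched.
df_clean = df[df["height"].between(low, high)]

print(df_clean)
```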
Question, say you have a df of drink consumption and if you don't want to eliminate the outliers but instead replace them with NaN and keep the zero values of the dataframe, what would you do? Thanks
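A minimal sketch of replacing outliers with NaN while leaving zeros alone, using pandas `Series.mask`. The consumption values and the mean + 3 standard deviations limit are made up for illustration.

```python
import pandas as pd

# Made-up consumption data; the zeros are legitimate values we want to keep.
drinks = pd.Series([0, 0, 1, 2, 2, 3, 3, 4, 0, 2, 1, 3, 2, 60], dtype=float)

# Only an upper limit is used here, since consumption cannot be negative.
upper = drinks.mean() + 3 * drinks.std()

# mask() puts NaN wherever the condition is True and leaves everything else,
# including the zeros, unchanged.
masked = drinks.mask(drinks > upper)

print(masked)
```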
Very good session as always. I came across this situation but couldn't figure out why. Unless we pass this argument "density=True" in matplotlib.pyplot.hist(), it is not possible to see the normal curve and histogram together in the graph. What is the reason for that?
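A short sketch of why `density=True` matters, using `numpy.histogram` on synthetic data (as far as I know, `matplotlib.pyplot.hist` applies the same normalization). Without it the bars are raw counts, often in the hundreds or thousands, while a normal curve is a probability density whose total area is 1, so the curve is squashed flat against the x-axis.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=66, scale=4, size=10_000)  # synthetic data, not the video's CSV

counts, edges = np.histogram(heights, bins=20)              # raw counts per bin
density, _ = np.histogram(heights, bins=20, density=True)   # normalized bar heights

widths = np.diff(edges)
area_counts = (counts * widths).sum()    # n times the bin width, far greater than 1
area_density = (density * widths).sum()  # 1, like the area under a pdf

# Peak of a normal pdf with the sample's std: 1 / (std * sqrt(2 * pi))
pdf_peak = 1.0 / (heights.std() * np.sqrt(2 * np.pi))

print(counts.max(), density.max(), pdf_peak)
```

The tallest raw-count bar is thousands of times higher than the pdf peak, while the tallest normalized bar is on the same scale, which is why the two only line up on one plot with `density=True`.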
How is 3 chosen as the number of standard deviations, and 3 as the z-score cutoff too?
Please someone explain
Long time sir. I wished you took at least dataset with 5-6 features. Nonetheless it's fantastic
what happens when the std deviation is way bigger than the mean? Currently exploring a dataset where mean price is ~220 and std dev is ~395? Evidently, there's some big outliers that can be seen straightaway (i.e. min price of 4 and max price of 36000). Should I remove those 'clear' outliers manually and then apply the remove outliers function? (i presume that if I don't do this, the function will remove a lot of 'non-outliers'?
Sir reviewer has asked me this question I don't know how to address it, can you please guide me "Use some statistical significant test such as T-test or ANOVA to prove you validate the proposed diagnostic model on patients and quality improvements of your method". I have two datasets. Dataset 1 was used to train the model and dataset 2 was used to validate the trained model. I have trained the ML model deployed it and Validated it on new data and presented the results. Actually, I have understood the question. Shall I apply the statistical test between the performance metrics of trained model results and validation results? Please help me, sir.
You can also use seaborn to plot the bell curve. It's much easier than matplotlib method.
seaborn.histplot(data=df.height, kde=True)
kde is the kernel density estimate line
how to select the number of standard deviation in zscore technique to remove outliers?
General guideline is 3 or more. If data set is small people use 2 STD dev too but just be careful that you don't remove data point that can add value to data analysis process
Nice effort
I have question.
Let's assume a Dataframe has some missing values with the presence of outliers and I don't want to just remove the outliers I want to winsorize the outliers. Is it right to treat the missing values first before winsorization or the other way round?
hey. why cant we use 'StandardScaler' and delete all outliers ?
How can we apply this to multiple columns?
Is there any short way or we have to do it manually for every column?
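A minimal sketch of doing all numeric columns in one go rather than one by one. The data is synthetic, and the column names and the cutoff of 3 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": rng.normal(66, 4, 500),
    "weight": rng.normal(160, 30, 500),
})
df.loc[0, "height"] = 120.0  # planted outliers for the demo
df.loc[1, "weight"] = 900.0

# Z-scores for every numeric column at once (pandas broadcasts column-wise).
num_cols = df.select_dtypes(include="number").columns
z = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# Keep a row only if it is within 3 standard deviations in every column.
keep = (z.abs() < 3).all(axis=1)
df_no_outliers = df[keep]

print(len(df), len(df_no_outliers))
```

Since a row is dropped if any single column flags it, the number of removed rows grows with the number of columns, so it is worth checking `len(df_no_outliers)` before committing to this.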
How to decide 3 as a threshold value to calculate zscore values? you have considered ex: zscore >3
Hi Sir,
How do I decide Z score values, does it depend on my data or is it always -3 to +3?
Usually it is between 3 and -3, but yes, it depends on the data. Sometimes people use more than 3 based on the data distribution.
I have a question, kindly answer. Suppose we have 20 columns and we remove outliers from all 20 columns. We exclude a small amount of data for each column, so altogether we are losing a huge amount of data. Is this a correct way to handle outliers?
why we choose height column ??why dont we chose weight column???
Concise Explanation !
Thank you for your lectures! I have learnt a lot from the lectures. We can only apply method of Std and Z score to remove the outliers if the data set is normal distribution or we can apply these two methods to all "types" of data set ( normal or not normal distributions)? Thank you again!.
You would do that if you have normal distribution
@@codebasics Thank you very much! Does that mean we need to test to see if the data set is normal distribution before we apply "Z score or standard deviation " method to remove the outlier?
Sir, can I become a data analyst after 12th?
Hi sir, your explanation is really amazing. I recently started to learn data science and I have a doubt about this video, kindly please explain. The question is: we have a mean of 66.36755, and if we add 3.8475 then it will become 69; how is that one standard deviation?
one standard deviation = 3.8475
It really helped me. Thank You
Glad it helped!
Here is a great explanation:
www.kaggle.com/c0derr/outlier-detection?scriptVersionId=39511980
Thanks so much for explaining in such a easy way. Could you please clarify what would we need to do if other columns contains important values in the same row where outlier exist? Still we can go ahead and remove the entire row?
Everything is good when you are applying Z_score for searching outliers which are either positive or negative outliers. If both positive and negative values are present together then it does not work..!!
data = [1, 2, 2, 2, 3, 1, 1,-19, 2, 2, 2, 3, 1, 1, 2,19,25]
try with this simple dataset.
with IQR method you can detect -19,19,25 all three
but with Z_score it is not working.
I don't know the reason. If you know Sir then let us know.
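A likely reason, sketched with the data above: the three extreme values inflate the standard deviation itself (to roughly 8.6), so even 25 ends up less than 3 standard deviations from the mean. The quartiles used by the IQR method are barely affected by those values. The cutoffs of 3 and 1.5 × IQR are the usual conventions, assumed here.

```python
import numpy as np

data = np.array([1, 2, 2, 2, 3, 1, 1, -19, 2, 2, 2, 3, 1, 1, 2, 19, 25])

# Z-score: the extremes inflate the standard deviation, so no point
# reaches |z| > 3 (this is often called the masking effect).
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR: the quartiles depend only on the middle of the sorted data.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(np.abs(z).max(), z_outliers, iqr_outliers)
```

So it is not about positive and negative values being mixed; with this few points and this many extremes, the z-score cutoff of 3 is simply out of reach.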
can you provide mock interview?
hello sir, can we learn personally from you? and how can we contact you
Hello! Your lesson is very helpful for me. Can you just say how can I find outliers using multiple parameters? Like I want to find the outliers using all the column of data together that I have. What should I do??
Thank you in advance.
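One common option for using all columns together is the Mahalanobis distance, which measures how far a row is from the center while accounting for correlation between columns. This is a swapped-in technique, not something shown in the video; the synthetic data and the cutoff of 3 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, strongly correlated columns (not the video's data).
height = rng.normal(66, 4, 300)
weight = 2.5 * height + rng.normal(0, 5, 300)
X = np.column_stack([height, weight])

# Each value is plausible on its own, but the combination is unusual:
# a short height paired with a high weight.
X[0] = [60.0, 180.0]

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance of every row: sqrt(diff_i^T * inv_cov * diff_i)
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

flagged = np.where(d > 3)[0]  # illustrative cutoff
print(flagged)
```

Row 0 would pass a per-column z-score check, yet it stands out here because it breaks the relationship between the two columns.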
Also make videos regarding Seaborn please
Great Video. Thx!!
Clear and succinct
Very nice
awesome.
Fantastic, many thanks.
Great sir
Fantastic, thank you
Thank you!
Sir, thank you
thanks sir
Thanks!
I noticed he didn't use z-score or cooks in the real estate project
🙌🙌🙌
BRUH.... why would you remove one column .... this just ruins the purpose
Zoom in your screen !!!
You know python, but you dont know much about statistics in identifying the outliers in normal distributed data.
Bruh,Wdym?
content is good but ur delivery is boring
Sir Z- score will work for numeric data ? In case of text data what we can do ?
A little suggestion to make it simpler. In Z-Score method I can calculate its absolute value through np.abs and I can only write < 3 in my condition for the new dataframe.
In addition, to visualize the curve it is better to use sns.histplot with kde=True