Outlier detection and removal: z score, standard deviation | Feature engineering tutorial python # 3

  • Published 19 Dec 2024

COMMENTS • 119

  • @codebasics
    @codebasics 2 years ago +2

    Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

  • @dipto624
    @dipto624 3 years ago +7

    man!! I was struggling with how to use statistics in EDA. I knew std, mean n all but couldn't use them in the EDA flow. u just cleared my confusion!!!! u won't believe how long I have been struggling with this.. thank god I found this video.. u r a great teacher.. I had the tools but couldn't use them. u just taught me how to use it..

  • @sultanhusnoo8552
    @sultanhusnoo8552 2 years ago +3

    Can't thank you enough for the amazing work you do. It is explained in such a simple, honest way. Many YouTubers explain things incompletely and then keep referencing their paid courses. You are probably the only one who has a complete course, with complete explanations and exercises, all available for free, and you even provide some level of feedback to those who interact with you. This is so rare and precious.
    I have been learning programming and data science with view to improve my career. As soon as I get a salary from any coding related work, I promise to join your patreons. Can't thank you enough for what you do. All the best for you and your family.

    • @codebasics
      @codebasics 2 years ago +2

      Sultan, you are a very kind person and thanks for all your appreciation :) This kind of feedback motivates me to continue my work on youtube!

  • @sumitkumarsah8782
    @sumitkumarsah8782 4 years ago +14

    Sir, I just want to say that my respect for you is increasing a lot.
    Keep making such videos.
    Thank you for your efforts.🙏

  • @subuqerpsmja
    @subuqerpsmja 4 years ago +1

    You are such an inspiration for people like me who are looking to transition into data science. Day and night I am spending my time in this quarantine on data science, and your YouTube videos play a huge role in increasing my caliber. I am a system engineer at CTS and now I wish to move my career towards data science. I am tirelessly preparing my portfolio and my resume to send for evaluation, as per your latest video.

  • @krishnanarwade1467
    @krishnanarwade1467 4 years ago +1

    I am totally inspired by dhaval sir and krish naik sir
    Thank you very much for sharing your valuable knowledge with us

  • @kirandeepmarala5541
    @kirandeepmarala5541 4 years ago +1

    I have no words how to say Thank You..You always providing Such a knowledge for free all the time...I pray god to keep safe for you and your Family all the Time with Health, Wealth and Prosperity..Thank You once again

  • @shaiksuleman3191
    @shaiksuleman3191 4 years ago

    Simply Super B Star.You and Krish are two eyes of Data science

  • @jaganinfo
    @jaganinfo 4 years ago

    we will not stop the video :) we will watch entire video . each info is very valuable to us (learners)

  • @kakmca
    @kakmca 2 years ago

    Wah... extra-ordinary explanation sir. Thank you...

  • @hardikatri7803
    @hardikatri7803 4 years ago +1

    One of the finest tutorials. Great teaching style.

    • @codebasics
      @codebasics 4 years ago

      Thanks Hardik, Keep learning.

    • @hardikatri7803
      @hardikatri7803 4 years ago

      Thankyou for the support and guidance. Your exercise part in tutorials is just awesome. I really loved your way of teaching

  • @anandshimpi8011
    @anandshimpi8011 2 years ago

    Really amazing lecture, sir. My interest in data science is increasing.

  • @saifansari6459
    @saifansari6459 2 years ago

    Excellent explanation of every topic; it really helps me a lot in my data science career. Thanks!

  • @cesarkastoun5752
    @cesarkastoun5752 4 years ago +1

    Hello,
    1st of all, I love your videos. You have a great talent for teaching and are putting it to good use.
    Just a small nit: the heights file you're using is not really a normal distribution, but a bi-modal one, as it has 2 modes. And the reason is very simple, it's because you're lumping together males & females. If you use separate data sets for each gender, you get much "cleaner" normal distributions.
    Cheers
    -CJK
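
A minimal sketch of the per-gender idea above, assuming the data has a gender column. The column names and the synthetic numbers are placeholders, not taken from the video's file. Each height is scored against the mean and standard deviation of its own group rather than the pooled values:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heights file; "gender" and "height" are assumed column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": ["Male"] * 500 + ["Female"] * 500,
    "height": np.concatenate([
        rng.normal(69.0, 2.9, 500),  # illustrative parameters only
        rng.normal(63.7, 2.7, 500),
    ]),
})

# z-score of each height relative to its own gender group
grouped = df.groupby("gender")["height"]
df["zscore"] = (df["height"] - grouped.transform("mean")) / grouped.transform("std")

# Keep rows within 3 standard deviations of their group mean
df_no_outliers = df[df["zscore"].abs() < 3]
```

Scoring within groups keeps the gap between the two modes from inflating the pooled standard deviation.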

  • @Hale-xn6ec
    @Hale-xn6ec 3 years ago +2

    It is a really beneficial and useful video on this topic, thank you!

  • @siddharthmodi2740
    @siddharthmodi2740 3 years ago

    woww! what a simple and easy to understand tutorial. Love it. Thank you sir.

  • @likhithsasank8017
    @likhithsasank8017 3 years ago

    Thank you so much sir your way of teaching is so clear and easily understandable

  • @srishtikumari6664
    @srishtikumari6664 4 years ago

    Very well explained sir!!
    Worth watching

  • @abdeali004
    @abdeali004 4 years ago +2

    Great Greaaaaat and a fulll too Greaaattttt explanation man. Loved it.

  • @hasanbutt8622
    @hasanbutt8622 4 years ago

    best tutorial
    thanks a lot sir
    you are great
    i have learnt a lot of concepts from your videos
    GOD bless you
    and keep making more videos

  • @Medjdiptiranjan
    @Medjdiptiranjan 2 years ago

    You are simply amazing, your simple explanation is helping a lot, thanks a trillion

  • @Deepsim
    @Deepsim 3 years ago

    Your tutorial is so clear. Well done!

  • @whimsicalkins5585
    @whimsicalkins5585 1 year ago

    Thanks very much for your simple and clear code.

  • @learnerlearner4090
    @learnerlearner4090 2 years ago

    Your videos are easy to understand. Thanks so much!

  • @bhavindedhia9968
    @bhavindedhia9968 4 years ago

    TOP content seriously thanks sir waiting for more videos specially EDA

  • @akshaypatil8155
    @akshaypatil8155 2 years ago

    16:38 this is just trimming technique. If we want to do capping that means replacing outliers with either lowest defined value or highest defined value, how to do it?
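
One possible way to cap instead of trim, as a minimal sketch: compute the same mean ± 3 standard deviation limits and pass them to pandas' `Series.clip`, which replaces values outside the limits with the limits themselves. The data below is synthetic and the two extreme values are planted for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic heights with two planted extreme values at the end (illustrative only)
rng = np.random.default_rng(0)
heights = pd.Series(np.append(rng.normal(66.0, 4.0, 1000), [20.0, 120.0]))

mean, std = heights.mean(), heights.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Capping: values below `lower` become `lower`, values above `upper` become `upper`
capped = heights.clip(lower=lower, upper=upper)
```

Unlike trimming, the capped series keeps every row; only the extreme values change.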

  • @subuqerpsmja
    @subuqerpsmja 4 years ago

    Really, my sincere thanks for your valuable efforts. I'm keenly following your guidelines.

  • @sa89879
    @sa89879 4 years ago

    Very good and neat explanation, but there is one drawback to the z-score: it relies on the mean, which can be distorted by an extreme outlier or a human data-entry error. If we use the median for outlier detection instead, it is robust, because whatever the extreme values are, it depends only on the middle values. Thanks for teaching z-score.
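
A sketch of the median-based idea above, using the commonly cited "modified z-score" built on the median absolute deviation (MAD). This is a swapped-in technique, not the method from the video; the 0.6745 constant and the 3.5 cutoff are the usual textbook choices, and the heights are made up:

```python
import numpy as np

def modified_zscore(values):
    """Median/MAD-based score; assumes the MAD is non-zero."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    # 0.6745 rescales the MAD so the score is comparable to a z-score on normal data
    return 0.6745 * (values - median) / mad

heights = np.array([64.0, 65.5, 66.0, 66.5, 67.0, 67.5, 68.0, 69.0, 70.0, 150.0])

# Ordinary z-score (sample standard deviation, as pandas uses by default)
plain_z = (heights - heights.mean()) / heights.std(ddof=1)

robust_z = modified_zscore(heights)
outliers = heights[np.abs(robust_z) > 3.5]
```

On this small sample the value 150 inflates the mean and standard deviation so much that its ordinary z-score stays below 3, while the median-based score flags it clearly.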

  • @jp-hm
    @jp-hm 21 days ago

    Great video - well explained!

  • @AryanFelix
    @AryanFelix 3 years ago +2

    How do we determine the Z-Score range for Skewed data? Do I use the same range on either side (like -3 to 3) or can I use different values like -1 to 3 (for left skewed data) after looking at the histogram plot?
    Thanks in advance!

    • @haythemb4214
      @haythemb4214 2 years ago

      same question i don't know what is the right range for my data because the (3 , -3) doesn't work for my case

  • @fahadreda3060
    @fahadreda3060 4 years ago +1

    Great video, Thanks man , keep up the good work

  • @pranjalgupta9427
    @pranjalgupta9427 4 years ago +3

    Sir if data is non-normally distributed then which technique we prefer for removing outliers?

    • @stuttzzzi
      @stuttzzzi 3 years ago

      there are ways to convert data into normal distribution..learn scaling
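
A side note on the reply above: standard scaling (subtract the mean, divide by the standard deviation) shifts and rescales the data but leaves the shape of the distribution unchanged, so on its own it does not make skewed data normal. A transform such as the logarithm can reduce right skew. A rough sketch on synthetic data, purely illustrative:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal); not from the video
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=2000))

scaled = (prices - prices.mean()) / prices.std()  # standard scaling
logged = np.log(prices)                           # log transform

skew_original = prices.skew()
skew_scaled = scaled.skew()   # same as skew_original: scaling keeps the shape
skew_logged = logged.skew()   # close to 0: the log of log-normal data is normal
```

After a transform like this, the 3 standard deviation rule can be applied on the transformed column, or a rule that does not assume normality (such as IQR) can be used on the original one.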

  • @yogeshbharadwaj6200
    @yogeshbharadwaj6200 4 years ago

    Tks for the very detailed explanation sir...

  • @python360
    @python360 2 years ago

    Great tutorial, thanks for using readily available sample CSV as well. ☑☑

  • @pranjalgupta9427
    @pranjalgupta9427 4 years ago +1

    Do we remove outliers before feature scaling or after feature scaling?

    • @codebasics
      @codebasics 4 years ago +1

      We don't need to remove them all the time. We need to treat them, which means we might end up changing the value to some reasonable value.

    • @codebasics
      @codebasics 4 years ago +1

      Yes we remove them before feature scaling

  • @harshal_ajetrao
    @harshal_ajetrao 4 years ago +1

    Thanks for the video, Sir.
    I am new to machine learning.
    I use the percentile, standard deviation and z-score methods,
    but the problem I get with the standard deviation and z-score methods is that removing the outliers doesn't change the values in our data, i.e. df; rather, the result gets stored in a new frame, df_no_outlier_std_dev. So how do I update our data, i.e. df, after removing the outliers?
    please help....

    • @viveksingh881
      @viveksingh881 3 years ago +1

      That is because we are storing the result in a new dataframe, not the original one. If you want the changes reflected in the original dataframe, assign the filtered result back to the original name. Boolean filtering takes no inplace argument, so the reassignment alone is enough:
      df = df[......code.....]
      happy learning

    • @harshal_ajetrao
      @harshal_ajetrao 3 years ago +1

      @@viveksingh881 Thanks..It was 6months back story..Now I at intermediate level in machine learning 👍

    • @viveksingh881
      @viveksingh881 3 years ago +1

      @@harshal_ajetrao that's great bro....clearing some doubts on random YouTube videos..happy learning :)

    • @harshal_ajetrao
      @harshal_ajetrao 3 years ago

      @@viveksingh881 Thanks for helping man..Keep it up 🤘🤘🤘
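
On the question at the top of this thread, a minimal sketch of updating `df` itself: filtering with a boolean mask returns a new dataframe, so the usual approach is to assign the result back to the same name. The values and limits below are placeholders:

```python
import pandas as pd

# Made-up heights with two obviously bad entries
df = pd.DataFrame({"height": [5.0, 62.0, 64.0, 66.0, 67.0, 68.0, 70.0, 71.0, 120.0]})

# Placeholder limits; in practice these would come from mean ± 3 * std or from percentiles
lower, upper = 50.0, 90.0

rows_before = len(df)

# Reassigning to `df` replaces the original with the filtered version
df = df[(df.height > lower) & (df.height < upper)].reset_index(drop=True)
```

The `reset_index(drop=True)` call is optional; it just renumbers the remaining rows from zero.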

  • @chivalrousforlan238
    @chivalrousforlan238 4 years ago +1

    Nice one Sir, thank you. One thing sir, I would like you to please make a tutorial on SQL.
    Thank you sir

  • @naveenkalhan95
    @naveenkalhan95 4 years ago

    thank you very much again... i am really following all your video.. really knowledgeable ... @5:50 of this video, you created the bell curve.. i am aware of one function .kde() which does the same thing. Is it wise to use that? or there is some difference in that to this function you created for drawing bell curve? Thank you very much again. Really appreciate.

    • @codebasics
      @codebasics 4 years ago +1

      Naveen, actually I don't know about kde() function. What does API specification say about that function? Can you try plotting it and see if result is same as mine?

    • @naveenkalhan95
      @naveenkalhan95 4 years ago

      @@codebasics thank you for your reply. I went through your advice and plotted the height using .kde() method and it produced the bell curve same but with a slight difference but plotted the same normal curve.
      I just had to write this line to draw it:
      df.Height.plot.kde();
      But, thank you again for your precious work. Because it's opening up my brain to think the more agile way of drawing it to understand mathematically.

    • @hardikatri7803
      @hardikatri7803 4 years ago

      We can also plot it through seaborn using the parameter (kde = True)

  • @modhua4497
    @modhua4497 3 years ago

    Does this work only if the feature is normally distributed? Most of the features in real world data are not normally distributed.

  • @estherugwueke5409
    @estherugwueke5409 2 years ago

    how can you apply this rule when you have about 10 features? Do you do them one by one?

  • @ajaykushwaha-je6mw
    @ajaykushwaha-je6mw 3 years ago

    Is removing outliers the better option, or is replacing outliers with another value the better option?

  • @shounaksushantadasgupta8440
    @shounaksushantadasgupta8440 3 years ago

    how to remove outlier from dataframe which has categorical as well as continuous data, as by percentile technique I am getting NaN value in categorical columns
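
A possible way around the NaN problem described above, as a sketch: compute the percentile limits on the numeric columns only, turn the comparison into one True/False value per row, and use that to select whole rows. The column names and values are hypothetical, and only an upper limit is applied to keep the example short:

```python
import pandas as pd

# Hypothetical mixed-type data: one categorical column, two numeric columns
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai"],
    "price": [100.0, 110.0, 105.0, 98.0, 5000.0, 102.0],
    "area": [50.0, 55.0, 52.0, 49.0, 51.0, 900.0],
})

numeric_cols = df.select_dtypes(include="number").columns
high = df[numeric_cols].quantile(0.90)  # per-column upper limit

# One boolean per row: True only if every numeric column is within its limit
in_range = (df[numeric_cols] <= high).all(axis=1)

# Selecting rows with a row mask leaves the categorical column untouched
df_clean = df[in_range]
```

Indexing with a whole-dataframe comparison such as `df[df < limit]` masks cell by cell, which is what fills the other columns with NaN; a row mask avoids that.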

  • @tucomax
    @tucomax 4 years ago

    Question, say you have a df of drink consumption and if you don't want to eliminate the outliers but instead replace them with NaN and keep the zero values of the dataframe, what would you do? Thanks
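
One way to do what is asked above, as a hedged sketch: `Series.where` keeps values where a condition holds and puts NaN everywhere else, so the rows (including legitimate zeros) stay in place. The `drinks` column and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented drink-consumption data with several zeros and one extreme value
df = pd.DataFrame({"drinks": [0, 0, 1, 2, 2, 3, 3, 4, 0, 2,
                              1, 3, 2, 1, 0, 2, 3, 1, 2, 60]})

z = (df["drinks"] - df["drinks"].mean()) / df["drinks"].std()

# Keep values whose |z| is at most 3; everything else becomes NaN
df["drinks_clean"] = df["drinks"].where(z.abs() <= 3)
```

The dataframe keeps all 20 rows; only the flagged cell is blanked out, and it can later be imputed or ignored.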

  • @sahanjayawarna4894
    @sahanjayawarna4894 4 years ago +2

    Very good session as always. I came across this situation but couldn't figure out why. Unless we pass this argument "density=True" in matplotlib.pyplot.hist(), it is not possible to see the normal curve and histogram together in the graph. What is the reason for that?
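
A likely explanation for the `density=True` question above: without it the histogram's y-axis shows raw counts (hundreds or thousands per bin), while a normal density curve peaks at a small fraction, so the curve is drawn but squashed flat along the bottom. With `density=True` the bars are rescaled so their total area is 1, the same scale as the curve. The sketch below uses `np.histogram`, which applies the same normalization, on synthetic heights:

```python
import numpy as np

# Synthetic heights; the mean and standard deviation are illustrative
rng = np.random.default_rng(1)
heights = rng.normal(66.4, 3.8, 10_000)

counts, edges = np.histogram(heights, bins=20)
density, _ = np.histogram(heights, bins=20, density=True)
widths = np.diff(edges)

area_counts = (counts * widths).sum()    # large: total count times bin width
area_density = (density * widths).sum()  # 1.0: comparable to a probability density

print("tallest bar, raw counts:", counts.max())
print("tallest bar, density:   ", density.max())
```

The two bar heights differ by several orders of magnitude, which is why the curve and the count histogram cannot share one y-axis.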

  • @piyush_sh98
    @piyush_sh98 1 year ago

    How is 3 selected as the number of standard deviations, and 3 for the z-score too?
    Please someone explain

  • @dhananjaykansal8097
    @dhananjaykansal8097 4 years ago +1

    Long time sir. I wished you took at least dataset with 5-6 features. Nonetheless it's fantastic

  • @GusMD84
    @GusMD84 4 years ago

    What happens when the std deviation is way bigger than the mean? I'm currently exploring a dataset where the mean price is ~220 and the std dev is ~395. Evidently, there are some big outliers that can be seen straightaway (i.e. min price of 4 and max price of 36000). Should I remove those 'clear' outliers manually and then apply the remove outliers function? (I presume that if I don't do this, the function will remove a lot of 'non-outliers'?)

  • @rsinh3792
    @rsinh3792 3 years ago

    Sir reviewer has asked me this question I don't know how to address it, can you please guide me "Use some statistical significant test such as T-test or ANOVA to prove you validate the proposed diagnostic model on patients and quality improvements of your method". I have two datasets. Dataset 1 was used to train the model and dataset 2 was used to validate the trained model. I have trained the ML model deployed it and Validated it on new data and presented the results. Actually, I have understood the question. Shall I apply the statistical test between the performance metrics of trained model results and validation results? Please help me, sir.

  • @hrushik10
    @hrushik10 2 years ago

    You can also use seaborn to plot the bell curve. It's much easier than matplotlib method.
    seaborn.histplot(data=df.height, kde=True)
    kde is the kernel density estimate line

  • @nikhilgaikwad9954
    @nikhilgaikwad9954 4 years ago

    how to select the number of standard deviation in zscore technique to remove outliers?

    • @codebasics
      @codebasics 4 years ago

      General guideline is 3 or more. If data set is small people use 2 STD dev too but just be careful that you don't remove data point that can add value to data analysis process
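
Some background on why 3 is the usual default, assuming the data is roughly normal: the share of a normal distribution within k standard deviations of the mean is erf(k / √2), which gives about 68% for k = 1, 95% for k = 2 and 99.7% for k = 3. A small sketch using only the standard library:

```python
import math

def fraction_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    inside = fraction_within(k)
    print(f"within {k} std: {inside:.4%}, flagged as outliers: {1 - inside:.4%}")
```

With a cutoff of 3, only about 0.3% of genuinely normal points would be flagged; with 2, about 4.6% would be, which is why a cutoff of 2 removes noticeably more data.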

  • @sarfrazhussain9851
    @sarfrazhussain9851 2 years ago

    Nice effort

  • @obigvee
    @obigvee 4 years ago

    I have question.
    Let's assume a Dataframe has some missing values with the presence of outliers and I don't want to just remove the outliers I want to winsorize the outliers. Is it right to treat the missing values first before winsorization or the other way round?

  • @0SIGMA
    @0SIGMA 3 years ago

    hey. why cant we use 'StandardScaler' and delete all outliers ?

  • @trinayanbharadwaj146
    @trinayanbharadwaj146 3 years ago

    How can we apply this to multiple columns?
    Is there any short way or we have to do it manually for every column?
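
A sketch of one short way to handle several columns at once: pandas computes the mean and standard deviation per column, so the z-scores for all columns can be built in one expression and combined into a single row mask. The column names and data are hypothetical, with two outliers planted for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features; rows 0 and 1 get planted outliers
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
df.loc[0, "a"] = 12.0
df.loc[1, "c"] = -15.0

cols = ["a", "b", "c"]

# Column-wise z-scores in one step
z = (df[cols] - df[cols].mean()) / df[cols].std()

# Keep a row only if it is within 3 standard deviations in every column
keep = (z.abs() < 3).all(axis=1)
df_clean = df[keep]

print("rows dropped:", len(df) - len(df_clean))
```

Because a row is dropped if any one column flags it, the number of dropped rows grows with the number of columns; printing the count as above is a quick check on how much data is being lost.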

  • @Artech.Ranjit
    @Artech.Ranjit 3 years ago

    How to decide 3 as a threshold value to calculate zscore values? you have considered ex: zscore >3

  • @ssrriinniivvaass
    @ssrriinniivvaass 4 years ago

    Hi Sir,
    How do I decide Z score values, does it depend on my data or is it always -3 to +3?

    • @codebasics
      @codebasics 4 years ago

      Usually it is between -3 and 3, but yes, it depends on the data. Sometimes people use more than 3, based on the data distribution.

  • @ajaykushwaha-je6mw
    @ajaykushwaha-je6mw 3 years ago

    I have a question, kindly answer. Suppose we have 20 columns and we remove outliers from all 20 columns. We exclude a small amount of data for each column, so altogether we lose a huge amount of data. Is this a correct way to handle outliers?

  • @sadikaljarif9635
    @sadikaljarif9635 1 year ago

    why we choose height column ??why dont we chose weight column???

  • @vishalvig01
    @vishalvig01 4 years ago

    Concise Explanation !

  • @haintuvn
    @haintuvn 4 years ago

    Thank you for your lectures! I have learnt a lot from the lectures. We can only apply method of Std and Z score to remove the outliers if the data set is normal distribution or we can apply these two methods to all "types" of data set ( normal or not normal distributions)? Thank you again!.

    • @codebasics
      @codebasics 4 years ago

      You would do that if you have normal distribution

    • @haintuvn
      @haintuvn 4 years ago

      @@codebasics Thank you very much! Does that mean we need to test to see if the data set is normal distribution before we apply "Z score or standard deviation " method to remove the outlier?

  • @harleyquinn5245
    @harleyquinn5245 4 years ago +1

    Sir can l become data analyst after
    12th

  • @prdfrnd
    @prdfrnd 4 years ago +1

    Hi sir, your explanation is really amazing. I recently started to learn data science and I have a doubt about this video; kindly explain. The question is: we have a mean of 66.36755, and if we add 3.8475 it becomes 69. How is that one standard deviation?

    • @hustleto-n6d
      @hustleto-n6d 4 years ago

      one standard deviation = 3.8475
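
To spell the arithmetic out, taking the two figures quoted in the question at face value: the standard deviation is the step size, and "one standard deviation above the mean" is the mean plus one such step, which comes to about 70.2. The 3 standard deviation limits used for outlier removal follow the same pattern:

```python
mean = 66.36755  # figure quoted in the question
std = 3.8475     # figure quoted in the question: the size of one standard deviation

one_sd_above = mean + std      # about 70.2
lower_3sd = mean - 3 * std     # about 54.8
upper_3sd = mean + 3 * std     # about 77.9

print(round(one_sd_above, 2), round(lower_3sd, 2), round(upper_3sd, 2))
```

Values outside the range from `lower_3sd` to `upper_3sd` are the ones the 3 standard deviation rule treats as outliers.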

  • @bikashpokharel478
    @bikashpokharel478 4 years ago

    It really helped me. Thank You

  • @zehraup4722
    @zehraup4722 4 years ago +1

    Here is a great explanation:
    www.kaggle.com/c0derr/outlier-detection?scriptVersionId=39511980

  • @nareshchinnam8349
    @nareshchinnam8349 4 years ago

    Thanks so much for explaining in such a easy way. Could you please clarify what would we need to do if other columns contains important values in the same row where outlier exist? Still we can go ahead and remove the entire row?

  • @anirbaniitgn8407
    @anirbaniitgn8407 2 years ago

    Everything is good when you are applying Z_score for searching outliers which are either positive or negative outliers. If both positive and negative values are present together then it does not work..!!
    data = [1, 2, 2, 2, 3, 1, 1,-19, 2, 2, 2, 3, 1, 1, 2,19,25]
    try with this simple dataset.
    with IQR method you can detect -19,19,25 all three
    but with Z_score it is not working.
    I don't know the reason. If you know Sir then let us know.
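
A probable reason, checked on the dataset from the comment above: the three extreme values pull the standard deviation up to about 8.6, so even 25 is only about 2.6 standard deviations from the mean and nothing crosses the cutoff of 3. This effect is often called masking. The IQR limits are built from quartiles, which the extremes barely move. A sketch, using the conventional 1.5 × IQR fences:

```python
import numpy as np

data = np.array([1, 2, 2, 2, 3, 1, 1, -19, 2, 2, 2, 3, 1, 1, 2, 19, 25])

# z-score method: the mean and standard deviation are both inflated by the extremes
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]

# IQR method: the quartiles are barely affected by the extremes
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("largest |z|:", np.abs(z).max())
print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

With only 17 points and three of them extreme, the z-score rule has little chance; a lower cutoff or a median-based score would be alternatives for small samples like this.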

  • @ethiotube4805
    @ethiotube4805 1 year ago

    can you provide mock interview?

  • @beautyisinmind2163
    @beautyisinmind2163 2 years ago

    hello sir, can we learn personally from you? and how can we contact you

  • @priyantangupta5176
    @priyantangupta5176 3 years ago

    Hello! Your lesson is very helpful for me. Can you just say how can I find outliers using multiple parameters? Like I want to find the outliers using all the column of data together that I have. What should I do??
    Thank you in advance.

  • @satyavardhan8204
    @satyavardhan8204 4 years ago

    Also make videos regarding Seaborn please

  • @pukyalligator
    @pukyalligator 2 years ago

    Great Video. Thx!!

  • @reshaknarayan3944
    @reshaknarayan3944 4 years ago

    Clear and succinct

  • @pythongui5199
    @pythongui5199 3 years ago

    Very nice

  • @pythonenthusiast9292
    @pythonenthusiast9292 4 years ago +2

    awesome.

  • @boubacaramaiga4408
    @boubacaramaiga4408 4 years ago

    Fantastic, many thanks.

  • @saurabhbarasiya4721
    @saurabhbarasiya4721 4 years ago

    Great sir

  • @marco_6145
    @marco_6145 4 years ago

    Fantastic, thank you

  • @AlonAvramson
    @AlonAvramson 3 years ago

    Thank you!

  • @sayantandas9281
    @sayantandas9281 4 years ago

    Sir, thank you

  • @bhaskarsubbaiah6002
    @bhaskarsubbaiah6002 4 years ago

    thanks sir

  • @renanaoki714
    @renanaoki714 1 year ago

    Thanks!

  • @anthonym9130
    @anthonym9130 4 years ago

    I noticed he didn't use z-score or cooks in the real estate project

  • @barkhapaswan5807
    @barkhapaswan5807 3 years ago

    🙌🙌🙌

  • @research__7644
    @research__7644 2 years ago

    BRUH.... why would you remove one column .... this just ruins the purpose

  • @mohammadfasih7752
    @mohammadfasih7752 4 years ago

    Zoom in your screen !!!

  • @Kingcolumbian
    @Kingcolumbian 3 years ago

    You know python, but you dont know much about statistics in identifying the outliers in normal distributed data.

  • @sushobhan14
    @sushobhan14 4 years ago

    content is good but ur delivery is boring

    • @skcbca8580
      @skcbca8580 3 years ago

      Sir Z- score will work for numeric data ? In case of text data what we can do ?

  • @flaviobrienza7697
    @flaviobrienza7697 2 years ago

    A little suggestion to make it simpler. In Z-Score method I can calculate its absolute value through np.abs and I can only write < 3 in my condition for the new dataframe.
    In addition, to visualize the curve it is better to use sns.histplot with kde=True
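
The `np.abs` suggestion above might look like the following sketch. The `height` and `zscore` column names are assumptions, and the data is synthetic with two planted extremes:

```python
import numpy as np
import pandas as pd

# Synthetic heights plus two planted extreme values (illustrative only)
rng = np.random.default_rng(3)
df = pd.DataFrame({"height": np.append(rng.normal(66.4, 3.8, 1000), [20.0, 110.0])})

df["zscore"] = (df.height - df.height.mean()) / df.height.std()

# One condition on the absolute value replaces the pair (zscore > -3) & (zscore < 3)
df_no_outliers = df[np.abs(df.zscore) < 3]
```

The result is the same set of rows as the two-sided condition; the absolute value just makes the filter shorter to read.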