Finding an outlier in a dataset using Python

  • Published 12 Jul 2019
  • In this video we will understand how we can find an outlier in a dataset using Python.
    ref: #medium articles
    #Outlierdetection
    github url: github.com/krishnaik06/Findin...
    Data Science Projects playlist: • Generative Adversarial...
    NLP playlist: • Natural Language Proce...
    Statistics Playlist: • Population vs Sample i...
    Feature Engineering playlist: • Feature Engineering in...
    Computer Vision playlist: • OpenCV Installation | ...
    Data Science Interview Question playlist: • Complete Life Cycle of...
    You can buy my book on Finance with Machine Learning and Deep Learning from the below url
    amazon url: www.amazon.in/Hands-Python-Fi...

COMMENTS • 118

  • @yourkarma7012
    @yourkarma7012 3 years ago +6

    Clustering techniques are also widely used in industry to detect outliers, especially the Isolation Forest algorithm.

  • @satheeshswaminathan2328
    @satheeshswaminathan2328 4 years ago +1

    Hi Krish, Thank you so much for the tutorial, Very clear and crisp explanation, loved it :)

  • @yuktikhantwal2342
    @yuktikhantwal2342 4 years ago +1

    great video sir. great content, and explained in the cleanest way possible. thanks

  • @srijeetful
    @srijeetful 2 years ago +1

    Very clear and crisp explanation, loved it

  • @doubando
    @doubando 6 months ago +1

    Amazing Krish, now I understand the concept of outliers, thanks

  • @AmitSharma-po1zb
    @AmitSharma-po1zb 3 years ago +1

    Superb explanation...in very simple way..

  • @shujashakir9952
    @shujashakir9952 1 year ago +4

    The tutorial offers a lucid explanation of the complex problem of outliers. It is well presented, with examples that made it easier to follow. However, threshold = 3 isn't working for me. I modified it to threshold = 3+std to make it work properly. Moreover, declaring outliers = [ ] outside the function causes problems if you want to use this function on another dataset in the same notebook. So declaring the outlier list inside the function would be a better approach, I think.
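
The point about the list can be sketched as follows. This is a minimal illustration, not the code from the video: the function name `detect_outliers_zscore`, the sample values, and the use of the standard-library `statistics` module (rather than NumPy) are assumptions made here.

```python
import statistics


def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    outliers = []  # created on every call, so results never leak between datasets
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)  # population standard deviation
    if std == 0:
        return outliers  # all values identical, so nothing stands out
    for value in data:
        z_score = (value - mean) / std
        if abs(z_score) > threshold:
            outliers.append(value)  # append the data point itself
    return outliers


# Hypothetical sample: twenty ordinary values and one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 13, 11, 102]
found = detect_outliers_zscore(sample)
print(found)
```

Because the list is local, calling the function again on a second dataset starts from an empty result each time.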

  • @gyapti-fctfinder3336
    @gyapti-fctfinder3336 3 years ago

    Nice content, and you explained it very well. Thank you so much.

  • @meghnasingh9941
    @meghnasingh9941 4 years ago

    great explanation, kudos !

  • @samarendrapradhan5067
    @samarendrapradhan5067 4 years ago

    Sir, please help. I have a dataset with 10 features, each with a date, for a particular index. How can I detect and see whether outliers occur for an index in one or more features? I have 4000 fixed indexes, and the feature values are updated for each date. Thanks.

  • @Ashokkumar-sc3vt
    @Ashokkumar-sc3vt 5 years ago +3

    Hi Krish, well explained. can you please post a video on how to equate the outliers using any dataset. Thanks in advance.

  • @The.Data.Scientist
    @The.Data.Scientist 11 months ago

    Nice work mate. I also tried something similar but with Upper and Lower Bound on the Return

  • @muhammadmuneebkhanafridi154
    @muhammadmuneebkhanafridi154 4 years ago

    Very well explained.

  • @aws384
    @aws384 4 years ago

    great video and really it is inspiring

  • @adityapradhan8474
    @adityapradhan8474 1 month ago

    Thank you so much sir, I understood everything

  • @saniyamanchekar9978
    @saniyamanchekar9978 4 years ago +1

    How can I find outliers when there are many columns in a large dataset?

  • @aashaygoel7338
    @aashaygoel7338 3 years ago

    During an ML project I ran into a scenario where, after splitting the dataset with train_test_split, the test set contained categories in a categorical column that were not present in the train set when label encoding it. Can you please explain what to do in this type of scenario, and also whether outliers should be detected before or after the train-test split? I have seen that you explain each topic in detail. Please help me with this scenario.

  • @shadrul2783
    @shadrul2783 4 years ago +14

    Here is the correction lower bound = q1 - 1.5*IQR and upper bound = q3 + 1.5*IQR

    • @rohankupate5917
      @rohankupate5917 1 year ago +1

      You mean it's a mistake in the video?

    • @Kishor_D7
      @Kishor_D7 6 months ago

      Yes bro, check statistics playlist by krish naik.
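
The corrected bounds from this thread can be written out as a short sketch. The function name `iqr_bounds` and the sample values are assumptions for illustration, and the quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"`, which may differ slightly from whatever the video's notebook uses.

```python
import statistics


def iqr_bounds(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with two extreme values.
sample = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102,
          12, 14, 17, 19, 107, 10, 13, 12, 14, 12]
lower, upper = iqr_bounds(sample)
outliers = [x for x in sample if x < lower or x > upper]
print(lower, upper, outliers)
```

Note that the lower fence subtracts from Q1 and the upper fence adds to Q3, so the lower bound always sits below the lower quartile.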

  • @sanathdas4071
    @sanathdas4071 4 years ago +2

    Sir, can you please tell me the difference between anomalies and outliers?
    I am confused about these two.
    Please answer me, sir.

  • @sekharpink
    @sekharpink 5 years ago +1

    Hi Krish
    I like ur videos alot..very informative..Could you please put videos related to word2vec models like skipgram, CBOW, gensim, glove..
    Thanks in advance.

  • @dineshlakshitha7309
    @dineshlakshitha7309 3 years ago

    amazing video
    super explanation

  • @nabilahhannani2326
    @nabilahhannani2326 4 years ago

    I've applied both methods to my dataset, but I got different results from them. Which one should I choose? Is it possible for them to give different results?

  • @karishmaqweera3869
    @karishmaqweera3869 4 years ago

    Sir, Are you having handwritten notes of whatever you taught in ML course videos?Please share them Sir.

  • @dikshadhiman2474
    @dikshadhiman2474 3 years ago

    Thankyou sir for this content.

  • @dhivya_animal_lover
    @dhivya_animal_lover 4 years ago +1

    Hi Sir, a small doubt about the part of the video where you talk about the standard normal distribution. You said the graph shows a standard normal distribution, but then you said that when data falls before or beyond the 3rd standard deviation, you will not consider it. Kindly clarify.

  • @deeptijoshi377
    @deeptijoshi377 3 years ago +1

    What will we do in case when outliers are not following gaussian distribution and outlier is present in between the data distribution but not at the extremes

  • @sheetalyoutub
    @sheetalyoutub 2 years ago

    Very helpful !

  • @mridulagarwal5881
    @mridulagarwal5881 4 years ago +27

    You have explained things well. Just one correction - it's inter-quartile range and not inter-quantile range.

    • @FaraazKhanfz
      @FaraazKhanfz 3 years ago

      It's Inter Quartile Range

    • @nosseibagacem9014
      @nosseibagacem9014 1 year ago

      Hello sir, i hope you are doing well, i was hoping if you can help me with OD, I'm doing a thesis on the subject and i'm very new to python and programming, i hope to hear from you and thank you in advance.

  • @jatingupta4026
    @jatingupta4026 3 years ago

    how to remove those values that are more than the upper bound and lower than the lower bound values respectively? Please tell that too sir

  • @adarshrai22
    @adarshrai22 2 years ago

    @krish naik how to remove outliers from non-normal distributed dataset?

  • @rushikeshbulbule8120
    @rushikeshbulbule8120 4 years ago

    Excellent👍👏😆

  • @amitsawant4961
    @amitsawant4961 2 years ago

    insightful for me

  • @mithunkumar7063
    @mithunkumar7063 5 years ago +1

    Thank you

  • @mohanadjibory2191
    @mohanadjibory2191 2 years ago

    Thanks, I wonder how to detect outliers in a NumPy ndarray, I mean an array of shape n by m. You explained it for a 1D array; what about 2D?
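
One possible way to extend the 1D z-score check to an n-by-m table is to apply it to each column separately. The sketch below assumes that per-column treatment is what is wanted, uses plain nested lists instead of a NumPy array, and the name `zscore_outliers_by_column` and the generated sample are hypothetical.

```python
import statistics


def zscore_outliers_by_column(rows, threshold=3.0):
    """Return (row_index, col_index) pairs whose column-wise |z| exceeds threshold."""
    flagged = []
    for col_index, column in enumerate(zip(*rows)):  # zip(*rows) yields columns
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        if std == 0:
            continue  # a constant column has no outliers
        for row_index, value in enumerate(column):
            if abs(value - mean) / std > threshold:
                flagged.append((row_index, col_index))
    return flagged


# Hypothetical 21 x 2 table: the last row has an extreme value in column 0.
rows = [[i % 5 + 10, i % 3 + 100] for i in range(20)] + [[500, 101]]
flagged = zscore_outliers_by_column(rows)
print(flagged)
```

This treats every feature independently, so it will not catch points that are unusual only in combination with other features.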

  • @subhamasthan7294
    @subhamasthan7294 4 years ago

    Hi Krish thank you so much for a nice video can you pls share the link of nxt video where you applied these techniques on kaggle dataset ?

  • @manavagarwal9763
    @manavagarwal9763 8 months ago

    where can i get this jupyter notebook for revision

  • @sakhawathossain3812
    @sakhawathossain3812 1 year ago

    Very helpful...

  • @mdazizulislam9653
    @mdazizulislam9653 4 years ago +2

    Any suggestions for multivariate outliers having mixed variables (continuous & Categorical)?

    • @bonishagarwal9315
      @bonishagarwal9315 4 years ago

      In case of categorical data, it will be better to find the outlier using a scatter plot as sir explained.

  • @ryando4556
    @ryando4556 4 months ago

    Well explained, would be great if you can add some plot for visualization.

  • @dhirendrajha9667
    @dhirendrajha9667 5 years ago

    Hi, Krish, well explained, can you build one video on rasa chatbot.

  • @satyanarayanajammala5129
    @satyanarayanajammala5129 5 years ago +1

    excellent

  • @smalirizvi8026
    @smalirizvi8026 2 years ago +2

    I have a couple of questions.
    1. Is it always better to remove outliers, or could that be a big mistake as well? You gave the example of a fraudulent transaction. An outlier is indeed a hint that the transaction was fraudulent. If I remove all such transactions in the first place, how am I going to achieve my results?
    2. You did not explain how to perform outlier checks on a multivariate dataset, for example the Iris dataset. I have seen a couple of videos here and there, but no proper method emerges. What is the proper way to identify outliers in multivariate datasets?
    Thanks

  • @AbhishekMishra-mq4jw
    @AbhishekMishra-mq4jw 3 years ago +1

    what to do with natural outliers?
    the outliers which are expected to be there which are not because of any artificial errors

  • @kaka83185
    @kaka83185 3 years ago +3

    Just a correction, when calculating z-score , you are doing subtraction of i to an array, you should enumerate on datasets and then subset i from the current index of mean and std.

    • @karimdandachi9200
      @karimdandachi9200 1 year ago

      mean and std are not arrays... the mean of a list of values is a single value and so is the standard deviation

  • @bhagyaraj5506
    @bhagyaraj5506 4 years ago +2

    in z-score threshold value mentioned as 3 , threshold is nothing but 3rd standard deviation is it?

  • @prateeksmithpatra5796
    @prateeksmithpatra5796 3 years ago

    outliers.append(y)
    y is not defined, so how did you compile it?

  • @rizkamilandgamilenio9806
    @rizkamilandgamilenio9806 1 year ago

    Is there any condition better we use one method over another?

  • @PratapO7O1
    @PratapO7O1 3 years ago

    14:06 here it is a single dimension df how to sort multidimensional df. We can't sort all rows at once we need to specify one row or 2 how to do it with multi-dimension df?
    Thank you

  • @vamsinadh100
    @vamsinadh100 3 years ago +10

    13:57 Correction
    Lower bound = Q1 - IQR*1.5
    Upper bound = Q3 + IQR*1.5

    • @aggreykip2006
      @aggreykip2006 1 year ago

      can you use Upper bound in a histogram as a max value?

  • @ahmedbaheeg
    @ahmedbaheeg 1 year ago

    Thanks

  • @pratikramteke3274
    @pratikramteke3274 3 years ago

    How to find outliers in multiple linear regression?

  • @BAIBHAVPATHYBEE
    @BAIBHAVPATHYBEE 1 year ago

    for z score how did you know the threshold
    value ???

  • @niveshtayal979
    @niveshtayal979 4 years ago +1

    Hi Krish
    Thanks for the excellent explanation. But if we get some outliers in a feature, should we remove the records containing those outliers (in which case we lose some data)? If not, how can we handle outliers? Please cover this portion also :)

    • @amanpreetsinghgulati2475
      @amanpreetsinghgulati2475 2 years ago

      Capping (winsorization) is another way to deal with outliers: the values are imputed to lie within the range, so no data is lost.
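
A rough sketch of capping, assuming the caps are the 1.5*IQR fences discussed elsewhere in this thread (classic winsorization usually caps at fixed percentiles instead). The function name `cap_outliers` and the sample values are made up for illustration.

```python
import statistics


def cap_outliers(data, k=1.5):
    """Clamp every value into [Q1 - k*IQR, Q3 + k*IQR]; nothing is dropped."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(x, lower), upper) for x in data]


# Hypothetical sample with one extreme value at the end.
sample = [10, 12, 11, 14, 13, 12, 15, 13, 95]
capped = cap_outliers(sample)
print(capped)
```

The output has the same length as the input; only the extreme value is pulled back to the upper fence.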

  • @otroleonarbe
    @otroleonarbe 2 years ago

    thanks for sharing this video.
    One correction, in the loop it should be outliers.append(i)
    not
    outliers.append(y)

  • @RahulKumar-hj8qk
    @RahulKumar-hj8qk 4 years ago +1

    If we have more than one feature and we remove the outliers, doesn't that affect the other features?

    • @bonishagarwal9315
      @bonishagarwal9315 4 years ago +3

      You need to remove the whole sample of that outlier because if you remove only the outlier from one feature, it results in an empty space leading to inaccurate predictions.
      Eg. if you have Age, Height, and Weight as your input features and u find an outlier in your Age column, you need to remove the whole sample of that particular outlier i.e. remove the complete row of that outlier. Hope I have answered your question.
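
The row-removal idea can be sketched like this. The records, the column names, and the helper `drop_outlier_rows` are hypothetical, and the 1.5*IQR rule is just one possible way to decide which ages count as outliers.

```python
import statistics

# Hypothetical records; the last age is an obvious data-entry error.
rows = [
    {"age": 25, "height": 170, "weight": 65},
    {"age": 31, "height": 165, "weight": 70},
    {"age": 28, "height": 180, "weight": 82},
    {"age": 35, "height": 175, "weight": 77},
    {"age": 29, "height": 160, "weight": 58},
    {"age": 33, "height": 172, "weight": 68},
    {"age": 27, "height": 168, "weight": 63},
    {"age": 240, "height": 174, "weight": 71},
]


def drop_outlier_rows(rows, column, k=1.5):
    """Keep only the rows whose `column` value lies inside the IQR fences."""
    values = [row[column] for row in rows]
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # The whole record is kept or dropped, so no feature is left with a gap.
    return [row for row in rows if lower <= row[column] <= upper]


cleaned = drop_outlier_rows(rows, "age")
print(len(rows), len(cleaned))
```

Height and weight of the dropped record go with it, which is the trade-off the reply describes.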

  • @jakekiddall5108
    @jakekiddall5108 2 years ago +1

    Are there any anomaly detection videos that don't use credit card fraud as an example???

  • @arjyabasu1311
    @arjyabasu1311 4 years ago +2

    Sir, shouldn't the threshold value be 3*std and not just 3? The rule is that a data point is considered an outlier if it falls outside the 3rd standard deviation, not just beyond the value 3.

    • @jondoe3693
      @jondoe3693 4 years ago +1

      Do you mean when z score = 3? Then it is correct to use threshold of 3 because you have standardized the data and standard deviation of z scored values is 1 and its mean is 0.
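
A small check of this reply, under the assumption that the z-score is computed as (x - mean) / std: comparing |z| with 3 selects the same points as comparing the raw distance from the mean with 3*std. The sample values are hypothetical.

```python
import statistics

# Hypothetical sample: twenty ordinary values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 13, 11, 102]
mean = statistics.fmean(data)
std = statistics.pstdev(data)

# Threshold of 3 applied to standardized values (z-scores).
by_z_score = [x for x in data if abs((x - mean) / std) > 3]

# Threshold of 3*std applied to the raw distance from the mean.
by_raw_distance = [x for x in data if abs(x - mean) > 3 * std]

print(by_z_score, by_raw_distance)
```

Dividing by std moves the factor from the threshold into the score, so both rules describe the same cut-off.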

  • @cliffkwok
    @cliffkwok 5 years ago +3

    Hi Krish, I just ordered your finance book in Amazon, which is the newest one in whole amazon about python in finance, will you do more video on finance?

    • @krishnaik06
      @krishnaik06 5 years ago +3

      Thanks Kwok for buying my book...yes I will be uploading more videos on finance.

    • @varunchandrappa5123
      @varunchandrappa5123 3 years ago

      @@krishnaik06 Hands-On Python for Finance is out of stock..Please let us know when it will be available for sale

  • @yomeshyadav3407
    @yomeshyadav3407 3 years ago +1

    sir, I have a doubt, threshold is nothing but 3rd standard deviation as you said so it must be 3 * sigma but here you have taken the threshold as 3 can you please clarify this

    • @somomitachattopadhyay2846
      @somomitachattopadhyay2846 11 months ago

      yes thats because here in standard normal distribution the standard deviation is considered to be having the value 1 , sigma = 1

  • @shishirdixit5996
    @shishirdixit5996 4 years ago

    Sir once we have detected these outliers using z score method and if they are too many outliers how can we drop those outliers

    • @RwSkipper007
      @RwSkipper007 4 years ago +1

      you can use .difference() method to do that
      If A and B are two sets then you can calculate the difference as :
      A.difference(B) , equivalent to (A-B) of the set.
      Similarly (B-A) = B.difference(A)
      Hope this helps
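
A brief sketch of the set-difference suggestion, with made-up values. One caveat worth seeing in code: converting to a set discards duplicates and ordering, so a list comprehension is shown next to it as an alternative that keeps both.

```python
# Hypothetical data and an already-detected list of outliers.
data = [10, 12, 12, 11, 102, 13, 107]
outliers = [102, 107]

# Set difference (A - B): removes the outliers but also collapses duplicates.
remaining_as_set = set(data).difference(outliers)

# List comprehension: removes the outliers and preserves order and duplicates.
outlier_set = set(outliers)
remaining = [x for x in data if x not in outlier_set]

print(remaining_as_set, remaining)
```

Which form fits depends on whether repeated values matter for the analysis that follows.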

  • @ga43ga54
    @ga43ga54 5 years ago

    Please talk about data strategy

  • @terwasevictorsesugh3902
    @terwasevictorsesugh3902 1 year ago

    What if the data does not follow a normal distribution?

  • @muditmathur465
    @muditmathur465 1 year ago

    Why do we use 1.5 times IQR? Can we take any other number?

  • @jayantdikshit4181
    @jayantdikshit4181 3 years ago

    Hi Krish thanks for making such an amazing content. I have a query at 09:35.
    As you have mentioned that we can find outliers using scatter plots. But how can we find outliers if we do have multiple features(more than 2 features)? Your views/response on this would be much appreciated.
    Thanks in advance.

    • @rachittoshniwal
      @rachittoshniwal 3 years ago

      You can try with any two random features from your data
      You'll either see most values following a trend with a few outliers, or you'll see most values cluster at a place with a few outliers. Or maybe something else too!

    • @sanjaysanjay862
      @sanjaysanjay862 2 years ago

      yes, you can do it by plotting each feature with the target.

  • @magicmushroom9670
    @magicmushroom9670 3 years ago +1

    Every single YouTube channel explains this from a univariate perspective. Can you please explain it for the multivariate case? There is very little material about that on the internet.

  • @jorgeeg2668
    @jorgeeg2668 2 years ago

    How do you detect outliers as a function of datetime?

  • @mashirnizami134
    @mashirnizami134 4 years ago

    Gr8

  • @aakashsinghrawat3313
    @aakashsinghrawat3313 4 years ago

    sir, in any dataset like bank loan prediction, what if credit score is beyond its ranging(300-850), will they considered as outliers? if yes, how to handle them?
    great fellows are welcome to help...please

    • @rachittoshniwal
      @rachittoshniwal 3 years ago

      If the range itself is 300-850 and you are having values above or below that range, then that is a data error, and you can drop them unless you can devise a way to find the real value

  • @ksoftqatutorials9251
    @ksoftqatutorials9251 5 years ago +1

    I have been following your videos and I have learnt many things Krish Naik. Could you please tell me have you written any Datascience and machine learning books. I would like to buy your books and follow your videos to clinch Datascience job as soon as possible.

    • @krishnaik06
      @krishnaik06 5 years ago +2

      Hi Kiran,
      I have written a book on finance with ML and DL

    • @ksoftqatutorials9251
      @ksoftqatutorials9251 5 years ago +1

      @@krishnaik06 could you please share the link,so that I would buy that book..looking forward to more videos.

  • @parikshitgupta343
    @parikshitgupta343 3 years ago

    How is the lower bound, which you said is q1*1.5, greater than the lower quartile, which you said is q1?
    The lower bound seems like something that should be less than the lower quartile.

  • @abdulaziz-lh3nb
    @abdulaziz-lh3nb 2 years ago

    what if I have a lot of outliers in the dataset (around 27%), how to handle that?

    • @newbie8051
      @newbie8051 1 year ago

      If I were you, I would go for missing value treatment first, then try to go with outlier treatment, also if I had to deal with such high % of outliers, my first thought would be treat them like normal data points, as deleting outliers would lead to loss of too-many data points.
      Can you share how you solved the problem ?

  • @Getrocknete_Kotze_Schlabbern
    @Getrocknete_Kotze_Schlabbern 1 month ago

    i dont understand why we compute 1.5 * iqr , what does this 1.5 mean where do you get this number?

  • @vishalb1204
    @vishalb1204 5 years ago +3

    Can you please enable English subtitle?

  • @কোরআন-শিখি
    @কোরআন-শিখি 4 years ago

    can u do a ransac

  • @aayushijain2160
    @aayushijain2160 4 years ago

    Sir I understood that how to identify outliers using Z-score and IQR but can you tell us how to fix them like either we should drop that column or what else we should do to remove that outlier from the dataset????

    • @farazmev3430
      @farazmev3430 4 years ago

      drop rows or replace them (mean,mode,median)

  • @chandrasekharpoluboyina8865
    @chandrasekharpoluboyina8865 3 years ago

    Generally we remove this noise, But for fraud detection and identifying a rare disease outliers will be helpful, in such cases how to handle or use them instead of removing them.

  • @chandrasekharpoluboyina8865
    @chandrasekharpoluboyina8865 3 years ago

    tell us about robust outlier

  • @raghavgirigiri1
    @raghavgirigiri1 3 years ago

    Krish i just wanna make a small correction, while saying "less than 2" OR "less than 3" say "10% of the data (or whatever the data is) fall below 2 or 3"....otherwise it's great, Good job !!

  • @deepquest
    @deepquest 2 years ago

    Hi Krish, How can we identify root cause of an outlier?

    • @newbie8051
      @newbie8051 1 year ago

      Due to human error in data entry/recording or maybe due to some error/bug in the Data Pipeline

  • @NickolayGrin
    @NickolayGrin 5 years ago +1

    Using the mean is OK, but it is not the best idea for outlier detection. Median-based methods are usually more robust.
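
One common median-based alternative is the modified z-score built on the median absolute deviation (MAD). The sketch below is an assumption about what such a method could look like, not something shown in the video: the function name `detect_outliers_mad` and the sample are hypothetical, while the constant 0.6745 and the cut-off 3.5 are the values conventionally used with this score.

```python
import statistics


def detect_outliers_mad(data, threshold=3.5):
    """Flag values whose modified z-score, based on the median and MAD, is large."""
    median = statistics.median(data)
    mad = statistics.median([abs(x - median) for x in data])
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    return [x for x in data if abs(0.6745 * (x - median) / mad) > threshold]


# Hypothetical sample with two extreme values out of ten.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 95, 102]

# Plain mean/std z-score for comparison: the extremes inflate the std themselves.
mean = statistics.fmean(sample)
std = statistics.pstdev(sample)
plain = [x for x in sample if abs(x - mean) / std > 3]

robust = detect_outliers_mad(sample)
print(plain, robust)
```

On this sample the mean-based check flags nothing, because the two extreme values pull the mean and standard deviation toward themselves, while the median-based score still picks them out.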

  • @muhammadyazidbaihaqi1479
    @muhammadyazidbaihaqi1479 2 years ago

    why your video no subtitle? please make it, thanks

  • @hritwijkamble9988
    @hritwijkamble9988 1 year ago +2

    Why threshold = 3

    • @Blodia1990
      @Blodia1990 1 month ago

      It represents the quartile

  • @iliyasn2760
    @iliyasn2760 4 years ago +1

    we need to append 'i' value not 'y'

  • @julieohn
    @julieohn 4 years ago

    What to do after detecting outliers? How do we treat them?

  • @ganeshkumarpatel
    @ganeshkumarpatel 4 years ago

    Why do such calculations and looping to find outliers? Just apply standard scaling and create a new conditional dataframe of the scaled data containing values more than 3 std away. Those are the outliers, aren't they?

  • @AmeerulIslam
    @AmeerulIslam 4 years ago

    should be i instead of y in outlier.append(i)

    • @AmeerulIslam
      @AmeerulIslam 4 years ago

      i can see you have fixed it in the video but not in github.

  • @zehraup4722
    @zehraup4722 4 years ago

    codes:
    www.kaggle.com/c0derr/outlier-detection

  • @KNfarming882
    @KNfarming882 3 years ago

    It's not a dataset; it's a data point that lies 3 or more standard deviations away.

  • @econdoc3000
    @econdoc3000 4 years ago +1

    Hi Krish, your definition of quantiles is wrong! If you have 0.1=F(x) with F() being the cumulative density, then its 0.1 = F(x)=P(X

  • @aparnashrivastava5837
    @aparnashrivastava5837 3 years ago

    Thanks