Outlier Detection using the Percentile Method | Winsorization Technique

  • Published 2 Nov 2024

COMMENTS • 29

  • @mithilaagnihotri2960 · 1 month ago +7

    Thanks for the 100 days of ML videos. They are very helpful!

  • @0Fallen0 · 2 years ago +5

    Hey! For those who might find it helpful (I'm sure most will), I made a transformer that handles outliers. You can use it in pipelines/ColumnTransformers and grid-search for the optimal handling of outliers. Everything is explained clearly in the documentation. Let me know if anyone needs help with anything. I'm posting it as a reply to this comment.

    • @0Fallen0 · 2 years ago +10

      import numpy as np
      import pandas as pd
      from sklearn.base import BaseEstimator, TransformerMixin

      class OutlierHandler(BaseEstimator, TransformerMixin):
          """
          Detects and handles outliers.

          Parameters
          ----------
          strategy : {'trim', 'cap', 'nan'}, default='cap'
              How outliers are dealt with after detection.
              'trim' : rows containing outliers are dropped (not recommended).
              'cap'  : outliers are replaced ("capped") with the upper and
                       lower limits computed by the chosen method.
              'nan'  : outliers are replaced with np.nan; the data can then
                       be imputed using any of the various imputation
                       techniques.

          method : {'z_score', 'iqr', 'percentile'}, default='iqr'
              How outliers are detected. Each method takes its own parameter:
              'z_score'    : pass `zstd`, the number of standard deviations
                             above and below the mean beyond which values are
                             outliers (default=3). Optimally used only for
                             normally distributed features.
              'iqr'        : pass `factor`, the multiple of the IQR used to
                             compute the upper and lower limits (default=1.5).
              'percentile' : pass `alpha`, the tail percentile used to detect
                             outliers (default=0.01).

          Returns
          -------
          Transformed data with outliers detected and handled according to
          the chosen method and strategy.
          """

          def __init__(self, strategy='cap', method='iqr', factor=1.5, zstd=3, alpha=0.01):
              self.strategy = strategy
              self.method = method
              self.factor = factor
              self.zstd = zstd
              self.alpha = alpha

          def _handle(self, X):
              # Apply the chosen strategy using the bounds computed by the
              # detection method.
              if self.strategy == 'nan':
                  X.loc[(X < self.lower_bound) | (X > self.upper_bound)] = np.nan
              elif self.strategy == 'trim':
                  X = X.loc[(X > self.lower_bound) & (X < self.upper_bound)]
              else:  # 'cap'
                  X = np.where(X > self.upper_bound, self.upper_bound,
                               np.where(X < self.lower_bound, self.lower_bound, X))
              return pd.Series(X)

          def outlier_iqr(self, X, y=None):
              X = pd.Series(X).copy()
              self.q1 = X.quantile(0.25)
              self.q3 = X.quantile(0.75)
              self.iqr = self.q3 - self.q1
              self.lower_bound = self.q1 - self.factor * self.iqr
              self.upper_bound = self.q3 + self.factor * self.iqr
              return self._handle(X)

          def outlier_zscore(self, X, y=None):
              X = pd.Series(X).copy()
              self.mean = X.mean()
              self.std = X.std()
              self.lower_bound = self.mean - self.zstd * self.std
              self.upper_bound = self.mean + self.zstd * self.std
              return self._handle(X)

          def outlier_percentile(self, X, y=None):
              X = pd.Series(X).copy()
              self.lower_bound = X.quantile(self.alpha)
              self.upper_bound = X.quantile(1.0 - self.alpha)
              return self._handle(X)

          def fit(self, X, y=None):
              return self

          def transform(self, X, y=None):
              if self.method == 'iqr':
                  return X.apply(self.outlier_iqr)
              elif self.method == 'z_score':
                  return X.apply(self.outlier_zscore)
              else:
                  return X.apply(self.outlier_percentile)
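A quick standalone check (with made-up sample values) of the IQR capping arithmetic the class above implements, using pandas `clip` as the equivalent of strategy='cap':

```python
import pandas as pd

# Made-up sample with one extreme value
s = pd.Series([150, 155, 160, 162, 165, 168, 170, 172, 175, 300])

# Same bounds the 'iqr' method computes with factor=1.5
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# strategy='cap' is equivalent to clipping to the bounds
capped = s.clip(lower, upper)
```

Values inside the bounds pass through unchanged; only the extreme value is pulled back to the upper limit.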

    • @123arskas · 1 year ago +2

      Kindly share your GitHub link instead of commenting the code.

  • @mohitkushwaha8974 · 1 year ago +7

    Doubt 1: Sir, is this percentile method applicable to both normal and non-normal distributions?
    Doubt 2: You also said we have to set the threshold symmetrically, say 1% and 99%, or 5% and 95%. What if our data is right- or left-skewed? In that case most of the outliers would be at one extreme end and not at both ends, so if we use a symmetric threshold we might lose some of our non-outliers.
    Doubt 3: If we have to remove outliers before the train/test split, then we would have to fill missing values before the split too, but you taught us to fill missing values after the train/test split.
    Awaiting your kind reply.
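On the skewed-data point, a hedged sketch (synthetic data, not from the video) of using an asymmetric threshold on a right-skewed feature, capping only the heavy upper tail:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed sample: the extreme values sit almost entirely in the upper tail
s = pd.Series(rng.exponential(scale=2.0, size=1000))

# Asymmetric threshold: cap only the top 1%, leave the lower tail untouched
upper = s.quantile(0.99)
capped = s.clip(upper=upper)
```

With a symmetric 1%/99% rule, the bottom 1% would also be altered even though nothing there is anomalous; the one-sided clip avoids that.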

  • @AyushPatel · 3 years ago +3

    Sir, I just wanted to ask: can we write our own machine learning algorithms from scratch instead of using sklearn and TensorFlow? Please make a video about that. I have been following your whole series. Sir, do reply. Thanks for your efforts.

  • @zainfaisal3153 · 8 months ago +1

    Sir! I have a question about this lecture: can we apply this technique to both normal and skewed data?

  • @namansethi1767 · 2 years ago

    Thanks, sir, for these videos; they help us a lot to polish all our skills.

  • @ParthivShah · 8 months ago +1

    Thank You Sir.

  • @230489shraddha · 2 years ago +1

    Thanks for this informative tutorial. I have a doubt: how do we decide what min and max percentile values we should choose to eliminate outliers?

    • @amanpreetsinghgulati2475 · 2 years ago

      As he mentioned, he had already worked with the data, so he roughly knew what the min and max percentiles should be. In a real setup, domain knowledge will come into the picture, and maybe hit-and-trial methods as well (experimental, as he mentioned).
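To make the hit-and-trial point concrete, a small sketch (synthetic normal data, not the video's dataset) that counts how many points each candidate percentile pair would flag before you commit to one:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Trial and error: see how many points each candidate alpha would treat as outliers
for alpha in (0.01, 0.025, 0.05):
    lower, upper = s.quantile(alpha), s.quantile(1 - alpha)
    n_flagged = ((s < lower) | (s > upper)).sum()
    print(f"alpha={alpha}: {n_flagged} points flagged")
```

Comparing the counts (and eyeballing the flagged values) against domain knowledge is one practical way to settle on a threshold.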

  • @ronbiswas8592 · 2 years ago +1

    Can we just cap or remove outliers on the upper side while keeping the lower-side outliers?
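Yes; a minimal sketch (made-up numbers) of treating only the upper side, either by dropping or by capping, while leaving lower values untouched:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])
upper = s.quantile(0.95)

# Option 1: drop only the upper-side outliers
trimmed = s[s <= upper]

# Option 2: cap only the upper side; lower values pass through unchanged
capped = s.clip(upper=upper)
```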

  • @JACKSPARROW-ch7jl · 1 year ago +2

    🎉🎉🎉🎉❤❤❤❤❤Keep it up

  • @nidhisharma302 · 1 month ago

    Sir, you didn't cover outliers for a non-normally distributed plot. In the last video you said you would show how to detect outliers when a column is not normally distributed.

  • @shubhamjain-li5tn · 3 years ago +3

    Sir, is there any percentage of data below which we can accept the outliers in the dataset?
    Say if the outliers are only 2% of the data, then there is no need to remove them and we can build the model.

    • @ParthivShah · 8 months ago

      It would be better if your data didn't have any outliers, because even 2% of outliers can decrease accuracy by some percent.
      So try to find and remove them; if you find it difficult, that's OK.
      I would suggest capping them if the outliers are only 2% (my thoughts, not necessarily right).

  • @vikranttomar8392 · 2 years ago

    Do we have to check every column for outliers? And does the same go for the other methods?

  • @whatdidilearntoday6369 · 1 year ago

    Hi Nitish, I ran your notebook and checked the box plot. It shows one outlier, but if I check for records > Q3 + 1.5*IQR, there are no records. That's strange.

  • @mohammadfarazgoriya5929 · 2 years ago

    Sir, how can we do capping if we have multiple columns?
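A hedged sketch (hypothetical columns, not the video's data) of capping several columns at once with per-column percentile bounds, using pandas `DataFrame.clip`:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],      # upper-side outlier
    "b": [10, 20, 30, 40, -500], # lower-side outlier
})

# Per-column 5th/95th-percentile bounds (one value per column)
lower = df.quantile(0.05)
upper = df.quantile(0.95)

# axis=1 aligns the bound Series with the columns, so each column
# is capped with its own limits
capped = df.clip(lower=lower, upper=upper, axis=1)
```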

  • @heetbhatt4511 · 1 year ago

    Thank you sir

  • @tejaspatil3760 · 2 years ago

    Sir, what do we do when we have 300,000 rows and almost 10k outliers? How can we treat them?

  • @ajaykushwaha-je6mw · 3 years ago

    Sir, shouldn't we have applied the outlier removal on X_train and X_test?

  • @thomsonblaze · 1 year ago

    what to do when the data is not normally distributed?

  • @deeprajmazumder6261 · 1 year ago

    Sir what to do when there are too many outliers?

  • @acharjyaarijit · 1 year ago

    How can we remove multivariate or bivariate outliers?

  • @arshad1781 · 3 years ago

    thanks

  • @Priyam_barman · 9 months ago

    But how are these heights outliers? People really do have heights like that.

  • @Priyam_barman · 9 months ago

    The marks didn't seem like outliers to me; people do score that high.

  • @Turkish811 · 2 months ago

    🤍✨🌹