Thanks for the 100 days of ML videos. They are very helpful!
Hey, for those who might find it helpful (I'm sure most will), I made a transformer that handles outliers. You can use it in pipelines/ColumnTransformers and grid search for the optimal handling of outliers. Everything is explained in the docstring. Let me know if anyone needs help with anything. I'm posting it as a reply to this comment.
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OutlierHandler(BaseEstimator, TransformerMixin):
"""
Description:
Detects and Handles Outliers.
Parameters:
strategy = {'trim', 'cap', 'nan'} default = 'cap'.
strategy sets how outliers will be dealt with after detection.
For strategy='trimming':
Instances with outliers will be dropped (Not recommended).
For strategy='capping':
Instances with outliers will be "capped" or replaced with upper and lower limit
values computed according to the method chosen.
For strategy='nan':
Instnaces with outliers will have the positions of outliers replaced with np.nan
"NaN" values. This data can then be treated and imputed using any of the various
imputation techniques.
method = {'z_score', 'iqr', 'percentile'}
method sets how outliers will be detected. default = 'iqr'.
Depending on method chosen, you will need to pass additional parameter(s).
For method='z_score':
Pass Standard Deviation 'std' above and below which outliers will be
detected and handles according to strategy chosen. Note that this method is
optimally used only for normally distributed features.
default=3.
For method='iqr':
Pass 'factor' by which IQR needs to be multiplied by for computing upper and lower limits.
default=1.5.
For method='percentile':
Pass 'alpha' which will be the percentile that will be used to detect outliers and handle it
according to the method chosen.
default=0.01.
Returns:
numpy.ndarray of transformed data with outliers detected and handles
according to method and strategy chosen respectively.
"""
    def __init__(self, strategy='cap', method='iqr', factor=1.5, zstd=3, alpha=0.01):
        self.strategy = strategy
        self.method = method
        self.factor = factor
        self.zstd = zstd
        self.alpha = alpha
    def outlier_iqr(self, X, y=None):
        # Bounds from Tukey's fences: Q1 - factor*IQR and Q3 + factor*IQR.
X = pd.Series(X).copy()
self.q1 = X.quantile(0.25)
self.q3 = X.quantile(0.75)
self.iqr = self.q3 - self.q1
self.lower_bound = self.q1 - (self.factor * self.iqr)
self.upper_bound = self.q3 + (self.factor * self.iqr)
if self.strategy == 'nan':
X.loc[((X < self.lower_bound) | (X > self.upper_bound))] = np.nan
elif self.strategy == 'trim':
X = X.loc[((X > self.lower_bound) & (X < self.upper_bound))]
else:
X = np.where((X > self.upper_bound), self.upper_bound, np.where((X < self.lower_bound), self.lower_bound, X))
return pd.Series(X)
    def outlier_zscore(self, X, y=None):
        # Bounds at mean +/- zstd standard deviations (assumes roughly normal data).
X = pd.Series(X).copy()
self.mean = X.mean()
self.std = X.std()
self.lower_bound = self.mean - (self.zstd * self.std)
self.upper_bound = self.mean + (self.zstd * self.std)
if self.strategy == 'nan':
X.loc[((X < self.lower_bound) | (X > self.upper_bound))] = np.nan
elif self.strategy == 'trim':
X = X.loc[((X > self.lower_bound) & (X < self.upper_bound))]
else:
X = np.where((X > self.upper_bound), self.upper_bound, np.where((X < self.lower_bound), self.lower_bound, X))
return pd.Series(X)
    def outlier_percentile(self, X, y=None):
        # Bounds at the alpha and (1 - alpha) quantiles of the column.
X = pd.Series(X).copy()
self.lower_bound = X.quantile(0.00+self.alpha)
self.upper_bound = X.quantile(1.00-self.alpha)
if self.strategy == 'nan':
X.loc[((X < self.lower_bound) | (X > self.upper_bound))] = np.nan
elif self.strategy == 'trim':
X = X.loc[((X > self.lower_bound) & (X < self.upper_bound))]
else:
X = np.where((X > self.upper_bound), self.upper_bound, np.where((X < self.lower_bound), self.lower_bound, X))
return pd.Series(X)
    def fit(self, X, y=None):
        # Stateless: nothing is learned here; bounds are recomputed per column in transform().
        return self

    def transform(self, X, y=None):
        # X is expected to be a pandas DataFrame; each column is handled independently.
        if self.method == 'iqr':
            return X.apply(self.outlier_iqr)
        elif self.method == 'z_score':
            return X.apply(self.outlier_zscore)
        else:
            return X.apply(self.outlier_percentile)
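To make the pipeline/grid-search claim above concrete, here is a minimal usage sketch. It assumes the OutlierHandler class above has already been run; the DataFrame, column names, target, and parameter grid are all illustrative, not part of the original comment.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Illustrative data: two numeric columns, one heavy-tailed, plus a noisy target.
rng = np.random.default_rng(42)
X = pd.DataFrame({'age': rng.normal(35, 8, 300),
                  'fare': rng.exponential(30, 300)})
y = 2.0 * X['age'] + 0.5 * X['fare'] + rng.normal(0, 5, 300)

pipe = Pipeline([
    ('outliers', OutlierHandler()),                 # detect/handle outliers per column
    ('impute', SimpleImputer(strategy='median')),   # fills the NaNs when strategy='nan'
    ('model', LinearRegression()),
])

# 'trim' is left out of the grid because it changes the number of rows,
# which would no longer line up with y inside a supervised pipeline.
param_grid = {
    'outliers__strategy': ['cap', 'nan'],
    'outliers__method': ['iqr', 'z_score', 'percentile'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_absolute_error')
search.fit(X, y)
print(search.best_params_)
One design note, hedged: as written, fit() learns nothing, so the bounds are recomputed on whatever data transform() receives (including validation folds); computing the bounds in fit() and reusing them in transform() would avoid that leakage.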
Kindly share your GitHub link instead of commenting the code.
Doubt 1 - Sir, is this percentile method applicable to both kinds of distributions, normal and non-normal curves?
Doubt 2 - Also, you said we have to set the threshold symmetrically, say 1% and 99% or 5% and 95%.
What if our data is right- or left-skewed? In that case most of the outliers would sit at one extreme end and not at both ends, so if we use a symmetrical threshold, wouldn't we lose some of our non-outliers?
Doubt 3 - If we have to remove outliers before the train-test split, then we would also have to fill the missing values before the split, but you taught us to fill missing values after the train-test split.
Awaiting your kind reply.
Sir, I just wanted to ask: can we write our own machine learning algorithms from scratch instead of using sklearn and TensorFlow? Please make a video about that. I have been following your whole series. Sir, do reply. Thanks for your efforts.
Sir! I have a question about this lecture: can we apply this technique to both normal and skewed data?
Thanks sir for these videos, they help us a lot to polish all our skills.
Thank You Sir.
Thanks for this informative tutorial. I have a doubt: how do we decide what min and max percentile values we should choose to eliminate outliers?
As he mentioned, he had already worked with the data, so he roughly knew what the min and max percentiles could be. In a real setup, domain knowledge also comes into the picture, and maybe some hit-and-trial (experimental) methods as well, as he mentioned.
Can we just cap or remove outliers on the upper side while keeping the lower-side outliers?
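Just to illustrate one possibility (a sketch, not something from the video): yes, you can cap only the upper tail, e.g. with Series.clip, and leave the lower side untouched. The column name and the 99th-percentile cutoff below are made up.
import pandas as pd

def cap_upper_only(s: pd.Series, q=0.99) -> pd.Series:
    # Cap only values above the q-th quantile; no 'lower' argument, so small values stay as-is.
    return s.clip(upper=s.quantile(q))

# Example (hypothetical column): df['fare'] = cap_upper_only(df['fare'])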
🎉🎉🎉🎉❤❤❤❤❤ Keep it up
Sir, you didn't cover outliers for non-normally distributed columns. In the last video you had said you would explain how to detect outliers when a column is not normally distributed.
Sir, is there any percentage of data below which we can accept the outliers in the data set?
Say if the outliers are only 2% of the data, then there is no need to remove the outliers and we can build the model.
It would be better if your data didn't have any outliers, because even 2% outliers can decrease accuracy by some percent.
So try to find and handle them; if you find that difficult, it's OK.
I would suggest capping them if the outliers are only 2% (my thoughts, not necessarily right).
Do we have to check every column for outliers? And does the same go for the other methods?
Hi Nitish, I ran your notebook and checked the box plot. It shows one outlier, but if I check for records > Q3 + 1.5*IQR, there are no records. That's strange.
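One thing worth checking (a guess, since I haven't rerun that exact notebook): the boxplot flags points on both sides of the whiskers, so the outlier may lie below Q1 - 1.5*IQR rather than above Q3 + 1.5*IQR. A small sketch for checking both tails of a column (the DataFrame and column name are placeholders):
import pandas as pd

def iqr_outliers(s: pd.Series, factor=1.5) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # Check both tails, not just the upper one, to match what the boxplot shows.
    return s[(s < q1 - factor * iqr) | (s > q3 + factor * iqr)]

# print(iqr_outliers(df['cgpa']))   # 'df' and 'cgpa' are hypothetical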
Sir, how can we do capping if we have multiple columns?
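A minimal sketch of one way to cap several columns at once, assuming a numeric DataFrame 'df' (the 1st/99th percentile cutoffs are arbitrary choices, not from the video):
import pandas as pd

def cap_all_columns(df: pd.DataFrame, lower=0.01, upper=0.99) -> pd.DataFrame:
    capped = df.copy()
    for col in df.select_dtypes('number').columns:
        low, high = df[col].quantile([lower, upper])   # per-column caps
        capped[col] = df[col].clip(low, high)          # cap each column at its own limits
    return capped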
Thank you sir
Sir, what do we do when we have 3 lakh rows and there are almost 10k outliers? How can we treat them?
Sir, shouldn't we have applied the outlier removal on X_train and X_test separately?
What should we do when the data is not normally distributed?
Sir, what should we do when there are too many outliers?
How can we remove multivariate or bivariate outliers?
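For multivariate/bivariate outliers, one common option (not covered in this video, just a hedged sketch) is a model-based detector such as scikit-learn's IsolationForest, which looks at all the columns together; the DataFrame 'df' and the contamination level below are assumptions:
import pandas as pd
from sklearn.ensemble import IsolationForest

def drop_multivariate_outliers(df: pd.DataFrame, contamination=0.02) -> pd.DataFrame:
    iso = IsolationForest(contamination=contamination, random_state=42)
    labels = iso.fit_predict(df)    # -1 marks rows flagged as outliers, 1 marks inliers
    return df[labels == 1]          # keep only the inlier rows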
thanks
But how are these outliers? For height, people genuinely do have heights like that.
I didn't feel there were any outliers in the marks; people do score that high.
🤍✨🌹