Thanks for the 100 days of ML videos. They are very helpful!
Hey, for those who might find it helpful (I'm sure most will), I made a transformer that handles outliers. You can use it in pipelines/ColumnTransformers and grid search for the optimal handling of outliers. Everything is explained in the docstring. Let me know if anyone needs help with anything. I'm posting it as a reply to this comment.
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OutlierHandler(BaseEstimator, TransformerMixin):
"""
Description:
Detects and Handles Outliers.
Parameters:
strategy = {'trim', 'cap', 'nan'} default = 'cap'.
strategy sets how outliers will be dealt with after detection.
For strategy='trimming':
Instances with outliers will be dropped (Not recommended).
For strategy='capping':
Instances with outliers will be "capped" or replaced with upper and lower limit
values computed according to the method chosen.
For strategy='nan':
Instnaces with outliers will have the positions of outliers replaced with np.nan
"NaN" values. This data can then be treated and imputed using any of the various
imputation techniques.
method = {'z_score', 'iqr', 'percentile'}
method sets how outliers will be detected. default = 'iqr'.
Depending on method chosen, you will need to pass additional parameter(s).
For method='z_score':
Pass Standard Deviation 'std' above and below which outliers will be
detected and handles according to strategy chosen. Note that this method is
optimally used only for normally distributed features.
default=3.
For method='iqr':
Pass 'factor' by which IQR needs to be multiplied by for computing upper and lower limits.
default=1.5.
For method='percentile':
Pass 'alpha' which will be the percentile that will be used to detect outliers and handle it
according to the method chosen.
default=0.01.
Returns:
numpy.ndarray of transformed data with outliers detected and handles
according to method and strategy chosen respectively.
"""
    def __init__(self, strategy='cap', method='iqr', factor=1.5, zstd=3, alpha=0.01):
        self.strategy = strategy
        self.method = method
        self.factor = factor
        self.zstd = zstd
        self.alpha = alpha
    def outlier_iqr(self, X, y=None):
        # Bounds from Tukey's fences: Q1 - factor*IQR and Q3 + factor*IQR.
X = pd.Series(X).copy()
self.q1 = X.quantile(0.25)
self.q3 = X.quantile(0.75)
self.iqr = self.q3 - self.q1
self.lower_bound = self.q1 - (self.factor * self.iqr)
self.upper_bound = self.q3 + (self.factor * self.iqr)
if self.strategy == 'nan':
X.loc[((X < self.lower_bound) | (X > self.upper_bound))] = np.nan
elif self.strategy == 'trim':
X = X.loc[((X > self.lower_bound) & (X < self.upper_bound))]
else:
X = np.where((X > self.upper_bound), self.upper_bound, np.where((X < self.lower_bound), self.lower_bound, X))
return pd.Series(X)
    def outlier_zscore(self, X, y=None):
        # Bounds at mean +/- zstd standard deviations (assumes roughly normal data).
X = pd.Series(X).copy()
self.mean = X.mean()
self.std = X.std()
self.lower_bound = self.mean - (self.zstd * self.std)
self.upper_bound = self.mean + (self.zstd * self.std)
if self.strategy == 'nan':
X.loc[((X < self.lower_bound) | (X > self.upper_bound))] = np.nan
elif self.strategy == 'trim':
X = X.loc[((X > self.lower_bound) & (X < self.upper_bound))]
else:
X = np.where((X > self.upper_bound), self.upper_bound, np.where((X < self.lower_bound), self.lower_bound, X))
return pd.Series(X)
    def outlier_percentile(self, X, y=None):
        # Bounds at the alpha and (1 - alpha) quantiles of the column.
X = pd.Series(X).copy()
self.lower_bound = X.quantile(0.00+self.alpha)
self.upper_bound = X.quantile(1.00-self.alpha)
if self.strategy == 'nan':
X.loc[((X < self.lower_bound) | (X > self.upper_bound))] = np.nan
elif self.strategy == 'trim':
X = X.loc[((X > self.lower_bound) & (X < self.upper_bound))]
else:
X = np.where((X > self.upper_bound), self.upper_bound, np.where((X < self.lower_bound), self.lower_bound, X))
return pd.Series(X)
    def fit(self, X, y=None):
        # Stateless: nothing is learned here; bounds are recomputed per column in transform().
        return self

    def transform(self, X, y=None):
        # X is expected to be a pandas DataFrame; each column is handled independently.
        if self.method == 'iqr':
            return X.apply(self.outlier_iqr)
        elif self.method == 'z_score':
            return X.apply(self.outlier_zscore)
        else:
            return X.apply(self.outlier_percentile)
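To make the pipeline/grid-search claim above concrete, here is a minimal usage sketch. It assumes the OutlierHandler class above has already been run; the DataFrame, column names, target, and parameter grid are all illustrative, not part of the original comment.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Illustrative data: two numeric columns, one heavy-tailed, plus a noisy target.
rng = np.random.default_rng(42)
X = pd.DataFrame({'age': rng.normal(35, 8, 300),
                  'fare': rng.exponential(30, 300)})
y = 2.0 * X['age'] + 0.5 * X['fare'] + rng.normal(0, 5, 300)

pipe = Pipeline([
    ('outliers', OutlierHandler()),                 # detect/handle outliers per column
    ('impute', SimpleImputer(strategy='median')),   # fills the NaNs when strategy='nan'
    ('model', LinearRegression()),
])

# 'trim' is left out of the grid because it changes the number of rows,
# which would no longer line up with y inside a supervised pipeline.
param_grid = {
    'outliers__strategy': ['cap', 'nan'],
    'outliers__method': ['iqr', 'z_score', 'percentile'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_absolute_error')
search.fit(X, y)
print(search.best_params_)
One design note, hedged: as written, fit() learns nothing, so the bounds are recomputed on whatever data transform() receives (including validation folds); computing the bounds in fit() and reusing them in transform() would avoid that leakage.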
Kindly share your GitHub link instead of commenting the code.
Doubt 1 - Sir, is this percentile method applicable to both kinds of distributions, normal and non-normal curves?
Doubt 2 - Also, you said we have to set the threshold symmetrically, say 1% and 99% or 5% and 95%.
What if our data is right- or left-skewed? In that case most of the outliers would sit at one extreme end and not at both ends, so if we use a symmetrical threshold, wouldn't we lose some of our non-outliers?
Doubt 3 - If we have to remove outliers before the train-test split, then we would also have to fill the missing values before the split, but you taught us to fill missing values after the train-test split.
Awaiting your kind reply.
Sir, I just wanted to ask: can we write our own machine learning algorithms from scratch instead of using sklearn and TensorFlow? Please make a video about that. I have been following your whole series. Sir, do reply. Thanks for your efforts.
Sir! I have a question about this lecture: can we apply this technique to both normal and skewed data?
Thanks sir for these videos, they help us a lot to polish all our skills.
Thank You Sir.
Thanks for this informative tutorial. I have a doubt: how do we decide what min and max percentile values we should choose to eliminate outliers?
As he mentioned, he had already worked with the data, so he roughly knew what the min and max percentiles could be. In a real setup, domain knowledge also comes into the picture, and maybe some hit-and-trial (experimental) methods as well, as he mentioned.
Can we just cap or remove outliers on the upper side while keeping the lower-side outliers?
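Just to illustrate one possibility (a sketch, not something from the video): yes, you can cap only the upper tail, e.g. with Series.clip, and leave the lower side untouched. The column name and the 99th-percentile cutoff below are made up.
import pandas as pd

def cap_upper_only(s: pd.Series, q=0.99) -> pd.Series:
    # Cap only values above the q-th quantile; no 'lower' argument, so small values stay as-is.
    return s.clip(upper=s.quantile(q))

# Example (hypothetical column): df['fare'] = cap_upper_only(df['fare'])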
🎉🎉🎉🎉❤❤❤❤❤ Keep it up
Sir, you didn't cover outliers for non-normally distributed columns. In the last video you had said you would explain how to detect outliers when a column is not normally distributed.
Sir, is there any percentage of data below which we can accept the outliers in the data set?
Say if the outliers are only 2% of the data, then there is no need to remove the outliers and we can build the model.
It would be better if your data didn't have any outliers, because even 2% outliers can decrease accuracy by some percent.
So try to find and handle them; if you find that difficult, it's OK.
I would suggest capping them if the outliers are only 2% (my thoughts, not necessarily right).
Do we have to check every column for outliers? And does the same go for the other methods?
Hi Nitish, I ran your notebook and checked the box plot. It shows one outlier, but if I check for records > Q3 + 1.5*IQR, there are no records. That's strange.
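One thing worth checking (a guess, since I haven't rerun that exact notebook): the boxplot flags points on both sides of the whiskers, so the outlier may lie below Q1 - 1.5*IQR rather than above Q3 + 1.5*IQR. A small sketch for checking both tails of a column (the DataFrame and column name are placeholders):
import pandas as pd

def iqr_outliers(s: pd.Series, factor=1.5) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # Check both tails, not just the upper one, to match what the boxplot shows.
    return s[(s < q1 - factor * iqr) | (s > q3 + factor * iqr)]

# print(iqr_outliers(df['cgpa']))   # 'df' and 'cgpa' are hypothetical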
Sir, how can we do capping if we have multiple columns?
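A minimal sketch of one way to cap several columns at once, assuming a numeric DataFrame 'df' (the 1st/99th percentile cutoffs are arbitrary choices, not from the video):
import pandas as pd

def cap_all_columns(df: pd.DataFrame, lower=0.01, upper=0.99) -> pd.DataFrame:
    capped = df.copy()
    for col in df.select_dtypes('number').columns:
        low, high = df[col].quantile([lower, upper])   # per-column caps
        capped[col] = df[col].clip(low, high)          # cap each column at its own limits
    return capped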
Thank you sir
Sir, what do we do when we have 3 lakh rows and there are almost 10k outliers? How can we treat them?
Sir, shouldn't we have applied the outlier removal on X_train and X_test separately?
What should we do when the data is not normally distributed?
Sir, what should we do when there are too many outliers?
How can we remove multivariate or bivariate outliers?
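For multivariate/bivariate outliers, one common option (not covered in this video, just a hedged sketch) is a model-based detector such as scikit-learn's IsolationForest, which looks at all the columns together; the DataFrame 'df' and the contamination level below are assumptions:
import pandas as pd
from sklearn.ensemble import IsolationForest

def drop_multivariate_outliers(df: pd.DataFrame, contamination=0.02) -> pd.DataFrame:
    iso = IsolationForest(contamination=contamination, random_state=42)
    labels = iso.fit_predict(df)    # -1 marks rows flagged as outliers, 1 marks inliers
    return df[labels == 1]          # keep only the inlier rows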
thanks
But how are these outliers? For height, people genuinely do have heights like that.
I didn't feel there were any outliers in the marks; people do score that high.
🤍✨🌹