Isolation Forest for Outlier Detection within Python

Andy McDonald

Published 4 Jul 2024
Isolation Forest is a popular unsupervised machine learning algorithm for detecting anomalies (outliers) within datasets. Anomaly detection is a crucial part of any machine learning and data science workflow. Erroneous values that are not identified early on can result in inaccurate predictions from machine learning models, and therefore impact the interpretation of those results.
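As a minimal sketch of the idea, assuming scikit-learn's IsolationForest (the description does not name the library) and synthetic data standing in for the dataset used in the video:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in data (an assumption, not the video's dataset):
# 200 points around the origin plus 5 hand-placed, far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]])
X = np.vstack([inliers, outliers])

# contamination=0.05 is an illustrative choice for the expected outlier fraction.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

labels = model.fit_predict(X)        # 1 = inlier, -1 = flagged as anomaly
scores = model.decision_function(X)  # negative values correspond to flagged points

n_flagged = int((labels == -1).sum())
print(f"Flagged {n_flagged} of {len(X)} points")
```

No labels are needed to fit the model; the contamination value only controls how many of the lowest-scoring points are flagged.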
⭐️ If you haven't already, make sure you subscribe to the channel: / @andymcdonald42
▼ --- SUPPORT THE CHANNEL --- ▼
☕️ BUY ME A COFFEE: www.buymeacoffee.com/andymcdo...
▼ --- RECOMMENDED BOOKS --- ▼
As an Amazon Associate I earn from qualifying purchases. By buying through any of the links below I will earn commission at no extra cost to you.
PYTHON FOR DATA ANALYSIS: Data Wrangling with Pandas, NumPy, and IPython
UK: amzn.to/3HNycJ9
US: amzn.to/3DL7qPv
FUNDAMENTALS OF PETROPHYSICS
UK: amzn.to/3l1PgSf
PETROPHYSICS: Theory and Practice of Measuring Reservoir Rock and Fluid Transport Properties
UK: amzn.to/30UNWZS
US: amzn.to/3DNqBbd
WELL LOGGING FOR EARTH SCIENTISTS
UK: amzn.to/3FHsbfn
US: amzn.to/3CILAuE
GEOLOGICAL INTERPRETATION OF WELL LOGS
UK: amzn.to/3l2v2HV
US: amzn.to/30UOTkU
▼ --- SOCIAL CHANNELS --- ▼
Thanks for watching, if you want to connect you can find me at the links below:
/ andymcdonaldgeo
/ geoandymcd
/ andymcdonaldgeo
www.andymcdonald.scot/
Be sure to sign up for my newsletter to be kept updated when I post and share new content on YouTube and Medium.
www.getrevue.co/profile/andym...
#datascience #petrophysics #python #eda
Science & Technology

COMMENTS • 21

@smn7074 1 year ago ⁺³
Thanks for your great video. Exactly what I needed.
@vitorribeirosa 1 year ago ⁺²
Thanks, Andy!!!
Great video!!!
@mwasimmit 10 months ago ⁺¹
For plotting in 2D, if I reduce the data to 2 dimensions using PCA and plot it with the model result, will it be a good summary plot?
@faicornelius2601 1 year ago ⁺²
Thanks so much for your great videos.
@MonuSaraswati 1 day ago
Hi Andy - Can you please share this dataset? I have not been able to find it online.
@pioner40 1 year ago
Very good video. Do you share the notebook?
@pramishprakash 10 months ago
Great explanation Sir
@fastisslow6177 11 days ago
nice explanation👍
@user-eu5ri8cr1c 11 months ago
Hi, is there any Python library to create a visual family tree from an SQLite DB?
@gourabguha3167 11 months ago
Any chance we can get the github link or the source code .ipynb file along with the dataset
@redpantherofmadrid 5 months ago
Well explained, thanks a lot, and love the accent, it's a bonus :)
@rawabih4026 1 year ago
Thank you from the bottom of my heart
@FxbxxxScxlxrxxnx 1 year ago ⁺³
Got a question: I have created a model using IF and fitted it with my training dataset, and now I want to apply this model to my test dataset. I don't really understand how I should picture this process of "fitting the IF model". When I set contamination to, let's say, 5%, my model calculates the anomaly scores of all values in the training dataset and assigns -1 to the 5% "most anomaly-like" data points, marking them as anomalies, right? After that, when I pass my test dataset to the model, does it just reuse the IF structure trained on the training dataset to calculate the anomaly scores of the test data points, and then check whether any of them exceed the lowest anomaly score among those 5% "most anomaly-like" training data points? And if any test data points exceed that score, are they then marked as anomalies?
@johnbaptistbypassinglife 1 year ago ⁺²
Yes, that's correct! When you fit an Isolation Forest (IF) model to your training data, the model will create a number of decision trees and use them to calculate anomaly scores for each data point in the training set. The data points with the highest anomaly scores will be considered the "most anomaly-like" and will be given a label of -1 to indicate that they are anomalies.
When you apply the model to your test data, the model will use the same decision trees and calculation process to determine the anomaly scores for each data point in the test set. If any data points in the test set have anomaly scores that are higher than the lowest anomaly score of the "most anomaly-like" data points in the training set, they will also be given a label of -1 to indicate that they are anomalies.
This process allows the model to identify anomalies in the test data that are similar to the anomalies identified in the training data. However, it's important to note that the model may also identify anomalies in the test data that were not present in the training data, as the model is designed to detect unusual or unexpected patterns in the data.
I hope this helps to clarify the process of fitting and applying an IF model to your data! Let me know if you have any other questions.
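The fit-then-predict behaviour discussed in this thread can be checked with a short sketch. It assumes scikit-learn's IsolationForest and synthetic data (both are assumptions, since the thread names neither). Note that scikit-learn flips the sign convention used above: `score_samples` returns lower values for more anomalous points, so a test point is flagged when its score falls below the cut-off learned from the training data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data (assumption): the training set is all "normal" points,
# and the test set ends with two points placed far from the training data.
X_train = rng.normal(size=(500, 2))
X_test = np.vstack([rng.normal(size=(50, 2)), [[6.0, 6.0], [-7.0, 5.0]]])

contamination = 0.05
model = IsolationForest(contamination=contamination, random_state=0).fit(X_train)

# Lower score = more anomalous in scikit-learn's convention.
train_scores = model.score_samples(X_train)
test_scores = model.score_samples(X_test)

# With a numeric contamination, the stored cut-off (offset_) should match
# the corresponding percentile of the training scores.
threshold = np.percentile(train_scores, 100 * contamination)
print("offset_:", model.offset_, "percentile:", threshold)

# predict() reuses the fitted trees and the stored cut-off; nothing is refitted.
test_labels = model.predict(X_test)
manual_labels = np.where(test_scores < model.offset_, -1, 1)
print("labels agree:", bool((test_labels == manual_labels).all()))
```

Because the cut-off comes from the training data alone, the share of test points flagged does not have to equal the contamination value.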
@mngreta 6 months ago ⁺¹
Can you please share the code? I took the time and tried to copy from the video but something is still wrong :(
@faicornelius2601 1 year ago
Please Andy, after identifying the outliers, how do we remove them?
@AndyMcDonald42 1 year ago ⁺¹
Removing outliers needs to be done with due consideration. The cause of them being outliers needs to be properly understood and then the appropriate course of action can be taken.
I discuss multiple methods of dealing with outliers in my medium article here: towardsdatascience.com/well-log-data-outlier-detection-with-machine-learning-a19cafc5ea37
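Once the cause of the outliers is understood and dropping them is judged appropriate, the mechanical step is a simple filter. The sketch below assumes pandas and scikit-learn, with synthetic data and hypothetical column names ("a", "b", "anomaly"); it is one possible approach, not necessarily one of the methods covered in the linked article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic data with hypothetical column names; row 0 is made an obvious outlier.
df = pd.DataFrame({"a": rng.normal(size=300), "b": rng.normal(size=300)})
df.loc[0, ["a", "b"]] = [12.0, -12.0]

# contamination=0.02 is an illustrative choice, not a recommendation.
model = IsolationForest(contamination=0.02, random_state=1)
df["anomaly"] = model.fit_predict(df[["a", "b"]])  # 1 = inlier, -1 = flagged

# Keep the flagged rows for inspection instead of discarding them silently.
flagged = df[df["anomaly"] == -1]
cleaned = df[df["anomaly"] == 1].drop(columns="anomaly")

print(f"{len(flagged)} rows flagged, {len(cleaned)} rows kept")
```

Keeping `flagged` as a separate frame makes it easy to review why each row was marked before committing to the cleaned data.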
@faicornelius2601 1 year ago
@@AndyMcDonald42 Thank you so much Andy. I have just followed you on Towards Data Science. You are a great teacher.
@danymerizalde1942 9 months ago
Where is the data?
@lashlarue7924 11 months ago
🫡👏👏👏❤
@nikolanovakovic7591 5 months ago
really struggling to understand this accent
