Isolation Forest for Outlier Detection within Python

  • Published 4 Jul 2024
  • Isolation Forest is a popular unsupervised machine learning algorithm for detecting anomalies (outliers) within datasets. Anomaly detection is a crucial part of any machine learning and data science workflow. Erroneous values that are not identified early on can result in inaccurate predictions from machine learning models, and therefore impact the interpretation of those results.
    #datascience #petrophysics #python #eda
  • Science and technology
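
As a rough illustration of the technique the description refers to, here is a minimal sketch using scikit-learn's `IsolationForest`. The data is synthetic and made up for this example (it is not the dataset from the video), and the parameter values are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data (an assumption, not the video's dataset):
# 200 "normal" points near the origin plus 5 obvious outliers far away.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array(
    [[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0], [-8.5, -9.0], [9.5, 0.0]]
)
X = np.vstack([normal, outliers])

# contamination is the expected fraction of outliers in the training data.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
labels = model.fit_predict(X)        # 1 = inlier, -1 = outlier
scores = model.decision_function(X)  # negative values are flagged as outliers

n_flagged = int((labels == -1).sum())
print(f"Flagged {n_flagged} of {len(X)} points as outliers")
```

Because `contamination` fixes the share of training points that get flagged, some ordinary points near the edge of the cloud will usually be labelled -1 alongside the planted outliers.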

COMMENTS • 21

  • @smn7074
    @smn7074 1 year ago +3

    Thanks for your great video. Exactly what I needed.

  • @vitorribeirosa
    @vitorribeirosa 1 year ago +2

    Thanks, Andy!!!
    Great video!!!

  • @mwasimmit
    @mwasimmit 10 months ago +1

    For plotting in 2D: if I reduce the data to two dimensions using PCA and plot it with the model's results, will that make a good summary plot?
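
One way to try the PCA idea from this question is sketched below. It assumes scikit-learn and matplotlib are available and uses synthetic five-feature data invented for the example. The model is fitted on all features, and PCA is used only to draw the result in two dimensions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic 5-feature data (hypothetical stand-in for real measurements).
X = rng.normal(size=(300, 5))
X[:10] += 6.0  # shift ten rows far from the rest so there is something to find

# Fit the model on the full feature set, not on the PCA projection.
model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # 1 = inlier, -1 = outlier

# Project to 2D purely for display. PCA is scale-sensitive, so standardise first.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))
explained = pca.explained_variance_ratio_.sum()

fig, ax = plt.subplots(figsize=(6, 5))
is_outlier = labels == -1
ax.scatter(X_2d[~is_outlier, 0], X_2d[~is_outlier, 1], s=12, label="inlier")
ax.scatter(X_2d[is_outlier, 0], X_2d[is_outlier, 1], s=24, color="red", label="outlier")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title(f"Isolation Forest labels in PCA space ({explained:.0%} of variance)")
ax.legend()
fig.savefig("isolation_forest_pca.png", dpi=150)  # or plt.show() in a notebook
```

Whether this is a good summary depends on how much variance the first two components capture. If that share is low, points that are anomalous along the discarded directions can appear to sit inside the main cloud.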

  • @faicornelius2601
    @faicornelius2601 1 year ago +2

    Thanks so much for your great videos.

  • @MonuSaraswati
    @MonuSaraswati 1 day ago

    Hi Andy, can you please share this dataset? I have not been able to find it online.

  • @pioner40
    @pioner40 1 year ago

    Very good video. Do you share the notebook?

  • @pramishprakash
    @pramishprakash 10 months ago

    Great explanation, sir.

  • @fastisslow6177
    @fastisslow6177 11 days ago

    nice explanation👍

  • @user-eu5ri8cr1c
    @user-eu5ri8cr1c 11 months ago

    Hi, is there any Python library for creating a visual family tree from a SQLite database?

  • @gourabguha3167
    @gourabguha3167 11 months ago

    Any chance we can get the GitHub link or the source .ipynb file, along with the dataset?

  • @redpantherofmadrid
    @redpantherofmadrid 5 months ago

    Well explained, thanks a lot. And love the accent, it's a bonus :)

  • @rawabih4026
    @rawabih4026 1 year ago

    Thank you from the bottom of my heart.

  • @FxbxxxScxlxrxxnx
    @FxbxxxScxlxrxxnx 1 year ago +3

    Got a question: I have created a model using IF and fitted it on my training dataset, and now I want to apply it to my test dataset. I don't really understand how to picture this process of "fitting the IF model". When I set contamination to, let's say, 5%, the model calculates the anomaly scores of all points in the training dataset and assigns -1 to the 5% "most anomaly-like" data points, marking them as anomalies, right? After that, when I pass my test dataset to the model, does it just reuse the structure of the IF trained on the training dataset to calculate anomaly scores for the test data points, and then check whether any of those scores exceed the lowest score among that 5% of "most anomaly-like" training points? And if a test data point does exceed that lowest anomaly score, is it then labelled as an anomaly?

    • @johnbaptistbypassinglife
      @johnbaptistbypassinglife 1 year ago +2

      Yes, that's correct! When you fit an Isolation Forest (IF) model to your training data, the model will create a number of decision trees and use them to calculate anomaly scores for each data point in the training set. The data points with the highest anomaly scores will be considered the "most anomaly-like" and will be given a label of -1 to indicate that they are anomalies.
      When you apply the model to your test data, the model will use the same decision trees and calculation process to determine the anomaly scores for each data point in the test set. If any data points in the test set have anomaly scores that are higher than the lowest anomaly score of the "most anomaly-like" data points in the training set, they will also be given a label of -1 to indicate that they are anomalies.
      This process allows the model to identify anomalies in the test data that are similar to the anomalies identified in the training data. However, it's important to note that the model may also identify anomalies in the test data that were not present in the training data, as the model is designed to detect unusual or unexpected patterns in the data.
      I hope this helps to clarify the process of fitting and applying an IF model to your data! Let me know if you have any other questions.
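
The fit-then-apply behaviour discussed in this thread can be checked directly in scikit-learn. The sketch below uses synthetic data invented for the example. One detail to keep in mind: scikit-learn's `score_samples` returns the negated score from the original paper, so in this code lower means more anomalous and the threshold test is "below", not "above".

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Hypothetical data: a training set, and a test set that contains
# 50 similar points plus three points placed far away.
X_train = rng.normal(size=(500, 3))
X_test = np.vstack([rng.normal(size=(50, 3)), np.full((3, 3), 7.0)])

model = IsolationForest(contamination=0.05, random_state=1).fit(X_train)

# score_samples: lower = more anomalous (opposite sign to the paper's score).
train_scores = model.score_samples(X_train)

# With a numeric contamination, fit() stores a threshold in offset_,
# chosen so that about 5% of the training scores fall below it.
threshold = model.offset_
print(np.isclose(threshold, np.percentile(train_scores, 5)))

# At predict time the fitted trees and threshold are reused; nothing is refitted.
test_scores = model.score_samples(X_test)
manual_labels = np.where(test_scores - threshold < 0, -1, 1)
print(np.array_equal(manual_labels, model.predict(X_test)))
```

Since the threshold comes from the training data only, the share of test points labelled -1 is not forced to be 5%. It can be higher or lower depending on how the test data compares with the training data.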

  • @mngreta
    @mngreta 6 months ago +1

    Can you please share the code? I took the time and tried to copy from the video but something is still wrong :(

  • @faicornelius2601
    @faicornelius2601 1 year ago

    Please Andy, after identifying the outliers, how do we remove them?

    • @AndyMcDonald42
      @AndyMcDonald42 1 year ago +1

      Removing outliers needs to be done with due consideration. The reason they are outliers needs to be properly understood first, and then the appropriate course of action can be taken.
      I discuss multiple methods of dealing with outliers in my Medium article here: towardsdatascience.com/well-log-data-outlier-detection-with-machine-learning-a19cafc5ea37

    • @faicornelius2601
      @faicornelius2601 1 year ago

      @AndyMcDonald42 Thank you so much, Andy. I have just followed you on Towards Data Science. You are a great teacher.
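
For the mechanical part of the question that opens this thread, and only once the cause of the outliers is understood and dropping them is judged appropriate, one possible approach is a boolean filter on the model's labels. This is a sketch with a made-up pandas DataFrame; the column names are placeholders, not the ones used in the video.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Hypothetical two-column dataset; the column names are placeholders.
df = pd.DataFrame({
    "feature_a": rng.normal(50.0, 5.0, size=200),
    "feature_b": rng.normal(2.5, 0.2, size=200),
})
df.loc[0, ["feature_a", "feature_b"]] = [500.0, 25.0]  # one clearly erroneous row

features = ["feature_a", "feature_b"]
model = IsolationForest(contamination=0.02, random_state=7)
df["anomaly"] = model.fit_predict(df[features])  # 1 = inlier, -1 = outlier

# Keep the flagged rows for review rather than discarding them silently.
flagged = df[df["anomaly"] == -1]
cleaned = df[df["anomaly"] == 1].drop(columns="anomaly")

print(f"{len(flagged)} rows flagged, {len(cleaned)} rows kept")
```

Keeping `flagged` as its own DataFrame makes it easier to inspect why each row was marked before deciding whether to remove, correct, or retain it.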

  • @danymerizalde1942
    @danymerizalde1942 9 months ago

    Where is the data?

  • @lashlarue7924
    @lashlarue7924 11 months ago

    🫡👏👏👏❤

  • @nikolanovakovic7591
    @nikolanovakovic7591 5 months ago

    really struggling to understand this accent