Anomaly Detection with Isolation Forest | Unsupervised Machine Learning with Python

  • Published 5 Sep 2024

COMMENTS • 88

  • @DecisionForest
    @DecisionForest  4 years ago +2

    Hi there! If you want to stay up to date with the latest machine learning and big data analysis tutorials please subscribe here:
    ua-cam.com/users/decisionforest
    Also drop your ideas for future videos, let us know what topics you're interested in! 👇🏻

    • @sushmithapulagam6021
      @sushmithapulagam6021 4 years ago

      Great! Easy to understand. After you get the anomalies, how do you identify which rows in the dataset are labeled as 0? Please add that code as well.

    • @DecisionForest
      @DecisionForest  4 years ago

      Thank you, glad you found it helpful. To do that you simply filter the dataframe with the condition target == 0; in our example, df[df.iforest == 0].
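The filtering step described in this reply can be sketched as follows. Note this is a minimal standalone example, not the video's notebook: the toy data and the `iforest` column name are assumptions mirroring the video's 0/1 convention, while scikit-learn's `predict`/`fit_predict` natively return 1 for inliers and -1 for outliers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy data: two numeric features plus a few injected extreme rows.
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))
extremes = rng.uniform(6, 8, size=(5, 2))
df = pd.DataFrame(np.vstack([normal, extremes]), columns=["f1", "f2"])

model = IsolationForest(contamination=0.03, random_state=0)
# fit_predict returns 1 for inliers and -1 for outliers;
# map that to the 0/1 convention used in the video (column name assumed).
df["iforest"] = (model.fit_predict(df[["f1", "f2"]]) == -1).astype(int)

inliers = df[df.iforest == 0]    # normal rows
anomalies = df[df.iforest == 1]  # flagged rows
print(len(anomalies))
```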

  • @shylilak
    @shylilak 2 years ago +3

    That was a concise and clear tutorial! It would have been lovely to see some metrics on a test data set, and also a visual of what those outliers looked like :)

  • @xianz2609
    @xianz2609 3 years ago +7

    Hi, really enjoyed this tutorial. But may I know if there is any way to output some sort of score for the accuracy of the model, to use for hyperparameter tuning? Much appreciated, thanks.

    • @tonyliu7542
      @tonyliu7542 2 years ago

      AUC is usually used to measure the model accuracy.
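When ground-truth labels happen to be available (as with the isFraud column in the video's dataset), the AUC suggestion above can be sketched like this; the synthetic data here is an assumption for illustration, not the video's dataset:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(300, 3)),
               rng.normal(5, 1, size=(15, 3))])   # last 15 rows play the "fraud" role
y = np.array([0] * 300 + [1] * 15)                # ground-truth labels, held out of training features

model = IsolationForest(random_state=42).fit(X)
# score_samples: higher = more normal, so negate it to get an anomaly score.
anomaly_score = -model.score_samples(X)
print(round(roc_auc_score(y, anomaly_score), 3))
```

Because AUC ranks the scores rather than thresholding them, it sidesteps the contamination setting entirely, which makes it convenient for comparing hyperparameter choices.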

  • @brentrau7570
    @brentrau7570 3 years ago +5

    Thank you for the video, very insightful. My question pertains to the contamination parameter you set: what if we don't have any expectation of what this should be? I'm working on a very high-dimensional, multi-class classification problem and am uncertain how I'd go about setting this parameter.

    • @DecisionForest
      @DecisionForest  3 years ago +2

      Thank you Brent, glad it was helpful. Regarding the contamination, you set that based on your expected percentage of anomalies within the dataset, assuming you have knowledge of expected anomaly rates for that dataset of course. If you have no such information I would suggest you go for a lower value in the range 0.001 to 0.01 and analyse the results, then tweak it based on the relevance of the findings.
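The sweep suggested in this reply can be sketched as below; the data is synthetic and the candidate values simply cover the suggested 0.001–0.01 range. The contamination value sets the score threshold, so it directly controls roughly what fraction of rows gets flagged:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(1000, 4)),
               rng.uniform(8, 10, size=(10, 4))])

# Sweep the suggested range and inspect how many rows each value flags,
# then eyeball the flagged rows for relevance as the reply recommends.
counts = {}
for contamination in (0.001, 0.005, 0.01):
    model = IsolationForest(contamination=contamination, random_state=1)
    counts[contamination] = int((model.fit_predict(X) == -1).sum())
    print(f"contamination={contamination}: {counts[contamination]} rows flagged")
```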

    • @shreyaskulkarni5823
      @shreyaskulkarni5823 1 year ago

      @@DecisionForest Can we do hyperparameter tuning, e.g. by adjusting the contamination value?

  • @ayenewyihune
    @ayenewyihune 2 years ago +1

    Thank you for the helpful video. My question: the method assumes that some of the data points are anomalies. What if I want to check whether the data has anomalies at all? I don't want some percentage of the data to necessarily be classified as anomalous.

  • @mattfinn7884
    @mattfinn7884 2 years ago +1

    Great video! Can we see which tuples are the outliers?

  • @mam7967
    @mam7967 2 years ago +1

    Thanks, this was good.
    How does one see the actual rows or records which are outliers after running iforest?

  • @meetmeraj2000
    @meetmeraj2000 3 years ago +1

    Hi, thanks for the wonderful explanation. How do we verify model performance, and can we do hyperparameter tuning? Thanks.

  • @EroneInnocent
    @EroneInnocent 3 years ago +1

    Thanks for the wonderful video. Could you please do one on supervised anomaly detection, perhaps using one-class SVM with GridSearchCV or RandomizedSearchCV?

  • @casaGnawa
    @casaGnawa 1 year ago

    @DecisionForest Thanks a ton for your valuable video. Your Jupyter notebook link is not working anymore. Can you please update it with a working source URL? That would be highly appreciated 🙏

  • @robertotomas
    @robertotomas 4 years ago +1

    Great, approachable intro! Thank you

  • @aguelejoseph5753
    @aguelejoseph5753 1 year ago

    Please, how would you deploy a trained Isolation Forest model to an Android application?

  • @luispinto4319
    @luispinto4319 3 years ago +1

    You didn't remove the target (isFraud), so your model is actually just using that information.

    • @DecisionForest
      @DecisionForest  3 years ago

      Very good point, and normally you don't even have the target, of course; otherwise you wouldn't need this in the first place. Since this dataset had it, I kept it to see how the Isolation Forest performs and how many anomalies it identifies correctly, even if partially helped (partially helped because we know that's the target; the algorithm sees it as just another binary feature). The isFraud feature is just one feature given to an unsupervised learning algorithm, so the algorithm doesn't output anomalies solely based on it. But it does influence the outcome slightly, and in a real-world scenario you wouldn't use the fraud feature, of course; you wouldn't have it anyway.

  • @ankitranjan30
    @ankitranjan30 2 years ago

    Thank you for the detailed video! You use the terms outliers and anomalies interchangeably but they're different. Can you comment on why it's used interchangeably in this video? Is Isolation Forest used for outlier detection or anomaly detection? Hope this can be answered!

    • @DecisionForest
      @DecisionForest  2 years ago +1

      Thank you Ankit. For sure, I might use them interchangeably for convenience, as the topic is Anomaly Detection with IF.

    • @tonyliu7542
      @tonyliu7542 2 years ago +1

      Anomalous points inside the data are also anomalies, while "outliers" only refers to outlying points.

  • @decoolest6616
    @decoolest6616 1 year ago

    Thanks for the good job here. I have a question: how do you extract the inlier set to continue with the various machine learning algorithms?

  • @sebastianmatt7707
    @sebastianmatt7707 3 years ago +1

    Well explained, thank you!

  • @ayushsrivastav8220
    @ayushsrivastav8220 4 years ago +1

    But how do we evaluate how our model is performing? How do we get an accuracy measure?

    • @DecisionForest
      @DecisionForest  4 years ago

      Check this video out, hope it will help: ua-cam.com/video/i42UzP3mT58/v-deo.html

  • @shivamaggarwal4451
    @shivamaggarwal4451 2 years ago

    Firstly, thanks for the tutorial, it was very informative. My question is related to the visualization of results. Is there a way we can effectively visualize the results for multidimensional data?

  • @niveditadwivedi6466
    @niveditadwivedi6466 3 years ago

    What is the importance of this dataset? Why was it chosen?

  • @nikhilagrawal2423
    @nikhilagrawal2423 3 years ago

    Hey, great video, but is there a way to find which feature contributed most to a row being classified as an outlier?

    • @DecisionForest
      @DecisionForest  3 years ago

      Thanks mate! From my current knowledge the Isolation Forest cannot have feature importance implemented due to the random nature of the splits. I have another video that might be helpful with this: ua-cam.com/video/i42UzP3mT58/v-deo.html. Let me know if that helps.

  • @vittoriarispoli5299
    @vittoriarispoli5299 4 years ago +1

    Hello, I am writing my university thesis on this topic, but I can't find much on the internet. Could you tell me how to print a graph that highlights the isolation of anomalies in Python?

    • @DecisionForest
      @DecisionForest  4 years ago +1

      Well you can plot the normal values vs anomalies per each categorical feature. This would be a nice way to show how different values have more of a predictive capability. Something like here: ua-cam.com/video/i42UzP3mT58/v-deo.html . Hope this helps.

  • @shaikrasool1316
    @shaikrasool1316 4 years ago

    Sir, I have a doubt.
    Suppose we have credit card data.
    First we label the data with anomaly detection, then we go for preprocessing and model building with something like a random forest, am I right?
    So anomaly detection is used to label the outliers correctly.
    That's what I understood from this video.
    Is my understanding correct, sir?

    • @DecisionForest
      @DecisionForest  4 years ago

      Good question, and I hope I understood it correctly. We only have one model: the Isolation Forest algorithm, which uses a Random Forest-style approach to isolate the records from one another. It considers the records with the shorter paths as anomalies, because it took fewer splits to isolate them. So we do the label encoding prior to building the Isolation Forest (not a Random Forest), which leads to getting the possible outliers in the dataset.
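The workflow described in this reply (label-encode the categoricals, build the Isolation Forest, rank rows by how quickly they were isolated) can be sketched as follows. The column names and toy transaction data are illustrative assumptions, not the video's dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

# Hypothetical transaction-style data with two obvious extreme amounts.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "amount": np.concatenate([rng.normal(100, 20, 500), [2000.0, 2500.0]]),
    "type": rng.choice(["payment", "transfer", "cash_out"], size=502),
})

# Label-encode the categorical column so the trees can split on it.
df["type"] = LabelEncoder().fit_transform(df["type"])

model = IsolationForest(contamination=0.01, random_state=7)
model.fit(df[["amount", "type"]])
# score_samples reflects average path length: rows isolated in fewer
# splits get lower (more negative) scores, i.e. look more anomalous.
df["score"] = model.score_samples(df[["amount", "type"]])
print(df.nsmallest(2, "score")["amount"].values)
```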

  • @JonathanAmbriz
    @JonathanAmbriz 3 years ago

    Would this apply to detecting anomalies in IT assets?

  • @som856
    @som856 2 years ago

    Hi,
    Many thanks for the video.
    Can you please provide the code or a GitHub link from which we can download it, along with the dataset?
    That would be very helpful.

  • @anupriyagupta7784
    @anupriyagupta7784 1 year ago

    I am getting an error while executing these commands. Can you please provide the .ipynb file for this code?

  • @h120n
    @h120n 3 years ago +1

    Hey there, you have some mistakes in your video.
    You said that IForest can work with categorical attributes, but that's wrong; it only deals with numerical ones.
    The way you handled missing values is wrong.
    Contrary to your other answers, label encoding is actually very wrong here. By turning categorical attributes numeric the way you did, it simply does not work for Isolation Forest. When it performs the random splits, the frequency of these classes introduces bias, as the algorithm is forced to isolate the most common classes first. This goes against the underlying principles of IForest: the most frequent classes in theory point to inliers, not outliers, and should be isolated deeper into the trees. It also makes your classes ordinal, which is also wrong. You should have used one-hot encoding.
    Finally, I'm not sure you understand what the contamination parameter means. If you already have labelled instances, you can simply use their percentage as your contamination value. Training the algorithm with the labels is simply your biggest mistake. If you wanted to compare your predictions to the actual anomalies, you could just ignore that column during training.
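The one-hot alternative this commenter recommends can be sketched like this; the toy data and column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "amount": rng.normal(100, 20, 300),
    "type": rng.choice(["payment", "transfer", "cash_out"], size=300),
})

# One-hot encode the categorical column instead of label encoding it,
# so no artificial order (or frequency-driven split bias) is introduced.
X = pd.get_dummies(df, columns=["type"])

preds = IsolationForest(contamination=0.05, random_state=3).fit_predict(X)
print((preds == -1).sum())
```

Whether label encoding or one-hot encoding is preferable for Isolation Forest is the subject of the StackExchange discussion linked further down this thread; this snippet only shows the mechanics.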

  • @TooManyPBJs
    @TooManyPBJs 3 years ago

    Is there an error metric that we should be looking for?

  • @sushmithapulagam6021
    @sushmithapulagam6021 4 years ago

    Great tutorial. Thanks!!

  • @thomasberweger3088
    @thomasberweger3088 3 years ago

    How do you get a DataFrame with the rows that have been classified as outliers from the model?

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Simply create a filter where the target == 1

  • @ketan_sahu
    @ketan_sahu 3 years ago

    Well explained!

  • @flamboyantperson5936
    @flamboyantperson5936 3 years ago +1

    Congratulations your channel got monetized so early :))

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Thank you so much! Yees, it took only 15 months haha

    • @flamboyantperson5936
      @flamboyantperson5936 3 years ago

      @@DecisionForest I actually started watching your channel recently. In the past few days you've had a lot of videos regularly on your channel. I didn't know it took 15 months haha, great work.

    • @flamboyantperson5936
      @flamboyantperson5936 3 years ago

      Have you ever tried Google AdSense to promote your videos and get views?

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Yes in the past months I did post more often. It varies though as there are weeks in which I have a lot more work and cannot record as often.

    • @DecisionForest
      @DecisionForest  3 years ago

      I haven’t as I believe that if the content is useful people will eventually find it 😊 And if they like it then it sets up the connection better in the long term.

  • @stravero2503
    @stravero2503 3 years ago

    Hiii!!
    I have a question and I can't find anything online.
    Can an isolation forest work well with a massive number of 0 values and a lot of columns?
    Great video anyway!
    Thanks,
    Veronica

    • @DecisionForest
      @DecisionForest  3 years ago

      Hi Veronica, thanks for the support! To answer your question it will work but not great as there isn’t much information in those features unfortunately. Depends of course on the size of the dataset as well.

    • @stravero2503
      @stravero2503 3 years ago

      @@DecisionForest Ok thanks, I expected that.
      The dataset is big; could PCA be a valid solution? Probably I'll lose too much information.
      Thank you a lot

    • @DecisionForest
      @DecisionForest  3 years ago

      PCA summarizes the variance in the dataset, so it wouldn't negatively affect the results, but it wouldn't help much either.
      The best bet is to try with the full range of columns and then see if you obtain any relevant insights.
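For reference, chaining PCA in front of an Isolation Forest is a one-liner with a scikit-learn pipeline. This is a generic sketch on synthetic wide, mostly-zero data (the shapes and component count are assumptions), not a recommendation from the video:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
# Wide, sparse-ish data: many columns, mostly zeros.
X = rng.binomial(1, 0.05, size=(500, 50)).astype(float)

# Reduce to the top principal components before isolating.
pipe = make_pipeline(PCA(n_components=10, random_state=5),
                     IsolationForest(contamination=0.02, random_state=5))
preds = pipe.fit_predict(X)   # 1 = inlier, -1 = outlier
print((preds == -1).sum())
```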

  • @pabloarriagadaojeda6452
    @pabloarriagadaojeda6452 11 months ago

    if there's a cat, there's a like

  • @Ifbayhaqi
    @Ifbayhaqi 1 year ago

    Hi, can I have the Jupyter notebook code?

  • @gourabguha3167
    @gourabguha3167 1 year ago

    Any chance we can get the source code?

  • @mfavaits
    @mfavaits 3 years ago

    The picture (at 3:00) looks very familiar to me... how did you get a written copy of my course?

    • @DecisionForest
      @DecisionForest  3 years ago

      What do you mean, is the explanation similar to how you present it as well? The picture is just a simple representation of a decision tree in the Isolation Forest algorithm.

    • @mfavaits
      @mfavaits 3 years ago

      @@DecisionForest It's funny, I drew it exactly the same way... anyway, liked your video.

    • @DecisionForest
      @DecisionForest  3 years ago +1

      @@mfavaits great minds think alike :) Glad you enjoyed it, trying to do my best with this channel

  • @carloslopez7204
    @carloslopez7204 3 years ago

    I think it's not correct to use a label encoder on categorical data.
    More details:
    datascience.stackexchange.com/questions/36006/categorical-data-for-sklearns-isolation-forrest

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Hi Carlos, label encoding encodes the labels with numeric values from 0 to n-1, and numeric input is required for the Isolation Forest algorithm. While I agree that for many algorithms the resulting ordinal features impose false orders of magnitude, tree-based algorithms aren't influenced by this.
      The way the Isolation Forest algorithm works is that it isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. So the Isolation Forest doesn't care about the orders of magnitude.

    • @carloslopez7204
      @carloslopez7204 3 years ago

      @@DecisionForest Thanks for the explanation. I'm going to read more about the theory behind how Isolation Forest really works to understand your comment properly; it's my first time using this algorithm.

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Happy to help. 👍🏻

  • @kasperknudsen4876
    @kasperknudsen4876 2 years ago

    King

  • @mohsinkhalid2375
    @mohsinkhalid2375 3 years ago

    How can you give the class feature (isFraud) as input to the model? The model will tell you outliers solely based on this target feature. Your logic is completely wrong.

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Very good question, and normally you don't even have the target, of course; otherwise you wouldn't need this in the first place. Since this dataset had it, I kept it to see how the Isolation Forest performs and how many anomalies it identifies correctly, even if partially helped (partially helped because we know that's the target; the algorithm sees it as just another binary feature). The isFraud feature is just one feature given to an unsupervised learning algorithm, so the algorithm doesn't output anomalies solely based on it. But it does influence the outcome slightly, and in a real-world scenario you wouldn't use the fraud feature, of course; you wouldn't have it anyway.

  • @WeebRipples
    @WeebRipples 2 years ago

    Is the analysis univariate or multivariate when labeling a data point as an outlier or not?

  • @gtownmunda
    @gtownmunda 3 years ago

    Can I get column-wise outliers using this? Can you help, please?

    • @DecisionForest
      @DecisionForest  3 years ago +1

      For column wise outliers you don't need this, you can just explore outliers based on data type.

    • @gtownmunda
      @gtownmunda 3 years ago

      @@DecisionForest Sorry, I didn't understand what you mean by "based on data type". I was trying to apply this algorithm to each column so that I'd get a column of 0s and 1s for each one. Can you suggest the best way to detect column-wise outliers in any dataset, please?

    • @DecisionForest
      @DecisionForest  3 years ago +1

      @@gtownmunda No problem. Well, for continuous variables you can simply use a boxplot or histograms. Categorical variables don't have outliers. So for every column you can just check the data type and look for anomalies with statistical methods.

    • @gtownmunda
      @gtownmunda 3 years ago

      @@DecisionForest Thank you for your response, one last question!
      What is the issue with using Isolation Forest this way? I was thinking of iterating through all the columns, fitting the model on each column separately and getting output for each column one by one.

    • @DecisionForest
      @DecisionForest  3 years ago +2

      @@gtownmunda Anytime, it's good to ask questions. Unsupervised learning algorithms analyse all of the features together, as they check for patterns across multiple features. Analysing a single feature is not a use case for unsupervised machine learning, as you can easily do that with normal EDA.
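The column-wise statistical approach suggested in this thread (boxplot-style checks per continuous column instead of a per-column Isolation Forest) can be sketched with the classic Tukey IQR rule; the toy data and column names here are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
df = pd.DataFrame({
    "amount": np.concatenate([rng.normal(100, 15, 200), [400.0]]),  # last row is extreme
    "age": np.concatenate([rng.normal(40, 10, 200), [39.0]]),       # last row is typical
})

# Classic boxplot (Tukey) rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
def iqr_outliers(series: pd.Series) -> pd.Series:
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

flags = df.apply(iqr_outliers)   # one boolean column of flags per feature
print(flags.sum())
```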

  • @jaysaha1967
    @jaysaha1967 4 years ago

    Link for the notebook?

    • @DecisionForest
      @DecisionForest  4 years ago

      Will add the link in the description as soon as possible.

  • @vijithav5908
    @vijithav5908 3 years ago

    Can I get the source code and data?

    • @DecisionForest
      @DecisionForest  3 years ago +1

      Of course, the link is in the description.

  • @pattiknuth4822
    @pattiknuth4822 3 years ago

    Do not waste your time. His explanation was as clear as mud. I defy anyone to define an isolation forest after listening to the first 5 minutes of this video.

    • @DecisionForest
      @DecisionForest  3 years ago

      Hi Patti, thanks for the comment. Could you please clarify what you didn't understand? A lot of other people found it helpful.