SHAP with Python (Code and Explanations)

  • Published 2 May 2024
  • SHAP is the most powerful Python package for understanding and debugging your machine learning models. It can be used to explain both individual predictions and trends across multiple predictions. We explore how by walking through the code and explanations for the SHAP waterfall plot, force plot, absolute mean plot, beeswarm plot and dependence plots.
    SHAP course: adataodyssey.com/courses/shap...
    XAI course: adataodyssey.com/courses/xai-...
    Newsletter signup: mailchi.mp/40909011987b/signup
    *NOTE*: You will now get the XAI course for free if you sign up (not the SHAP course)
    Read the companion article (no-paywall link):
    towardsdatascience.com/introd...
    SHAP for Categorical Features (no-paywall link): towardsdatascience.com/shap-f...
    Medium: / conorosullyds
    Twitter: / conorosullyds
    Mastodon: sigmoid.social/@conorosully
    Website: adataodyssey.com/

COMMENTS • 70

  • @adataodyssey 2 months ago +1

    *NOTE*: You will now get the XAI course for free if you sign up (not the SHAP course)
    SHAP course: adataodyssey.com/courses/shap-with-python/
    XAI course: adataodyssey.com/courses/xai-with-python/
    Newsletter signup: mailchi.mp/40909011987b/signup

    • @mohadesehkeshavarz9107 1 month ago

      Why can't I get the XAI course for free? Has the offer ended?

    • @adataodyssey 1 month ago

      @mohadesehkeshavarz9107 If you sign up for the newsletter, you will get a coupon that gives you free access to the XAI course. If you are still having trouble, send me your email on Instagram.

  • @tamojitmaiti 2 months ago

    This is so clear and concise! Thank you!

    • @adataodyssey 2 months ago

      No problem Tamojit! This is my goal. More XAI content is on the way.

  • @cutestbear3327 6 months ago +2

    Thank you for the awesome video~ I really like the way you explain everything thoroughly and meticulously. It is really friendly to people like us who have just begun our journey into data science.

    • @adataodyssey 6 months ago

      I'm glad you found it useful! Are there any other related concepts you are interested in learning about?

    • @cutestbear3327 6 months ago +1

      @adataodyssey Hi Conor, thanks for your kind reply. I am happy to go with whatever topic you dive into. Maybe random forest (and its hyperparameter tuning), since it is such a classic?
      May you have fun and enjoy continued success on YouTube~~ Cheers

  • @murilopalomosebilla2999 7 months ago

    Really well explained. Thanks ^^

    • @adataodyssey 7 months ago

      No problem! I'm glad you found it useful

  • @thegerman1239 5 months ago

    Thank you so much for this awesome video! I'm currently writing a term paper about this topic and other machine learning explainability techniques. This helped me out a lot while creating my examples!
    Kind regards from Germany!

    • @adataodyssey 5 months ago +1

      Guten Tag! I'm glad this helped. I also have videos about the maths behind Shapley values:
      ua-cam.com/video/UJeu29wq7d0/v-deo.htmlsi=-s-QTmLoQmSiYwFD
      ua-cam.com/video/b9qqbFudVhI/v-deo.htmlsi=uMpSUk7ue6Tzs8SQ

    • @thegerman1239 4 months ago

      Hey I'm done with the paper! The videos about the math really helped me as well. You're a champ

    • @adataodyssey 4 months ago +1

      @thegerman1239 Great stuff! All the best with the result.

  • @yukiwang5825 9 months ago

    Wonderful video! Thanks for this.

  • @bakerb-rz6lv 11 months ago

    love you, bro.😀

  • @pilarangelicarodriguezcaba8199 3 months ago

    Really easy to understand, a lot better than the official documentation for the SHAP plots.

    • @adataodyssey 3 months ago

      Thank you! This was my motivation for the content. Had to do a lot of work to understand the method fully :)

  • @shotclock5424 2 months ago

    This is the best way to explain explanations 😁
    I would be interested to see a video of yours on more complex models, like deep neural networks on signal data, and how we can use SHAP on them.
    Great work!

    • @adataodyssey 2 months ago

      Thank you! I will keep that in mind

  • @felicebugge 8 days ago

    Really useful, thank you

  • @wangchris5468 9 months ago

    Lovely ~~~~ 👍👍👍

  • @melih6826 7 months ago

    Hi Conor, you mentioned as a limitation of SHAP values that "highly correlated features are a problem when using shap values technique", but in this video the heat map shows that the features are highly correlated?

    • @adataodyssey 7 months ago

      The problem with correlated features is that they can lead to unexpected model predictions. This happens when we sample combinations of feature values that do not exist in the dataset. Some models will still produce reasonable predictions even if there are correlated features.
      The point is you can still use SHAP even if you have correlated features. You just need to be aware that the results may be negatively impacted. It is important to validate the results using other methods and visualisations. For example, it's not included here, but in the course, we use SHAP interaction values to find an interaction between two features. We then confirm this interaction using a scatter plot. In other words, we had a useful result even with highly correlated features.
      I hope that makes sense!

  • @markfedenia3383 9 months ago

    I see that cuML computes Shapley values; however, it does not look like the Explainer object is compatible with shap. Do you know if there is any way to use the cuML Explainer object and model with the shap package? (By the way, excellent videos.)

    • @adataodyssey 8 months ago

      Thanks! I'm not too familiar with cuML but I think it should be possible. You would have to replace the values and base_values in the SHAP values object (the object returned by the SHAP explainer) with those from the cuML explainer.
      It's not exactly what you are looking for but this article explains how you can manipulate the SHAP values object and then use the SHAP plots as normal: towardsdatascience.com/shap-for-categorical-features-7c63e6a554ea?sk=2eca9ff9d28d1c8bfde82f6784bdba19

  • @rafaelagd0 1 year ago +1

    Great video! Could you comment on the future of SHAP? It seems the project was abandoned. The latest commit is from June 2022 and there is a pile of 1.5k issues. I couldn't find much information about it, and the other packages seem to depend on it, so there may be no alternative.

    • @adataodyssey 1 year ago +3

      That is a good point, Rafael! I think SHAP has a good future regardless of the package. The method is widely used in industry and is based on solid theory: it builds on Shapley values, which have been around for a long time.
      For now the package works well for me. The 1.5k issues are more an indication of its popularity than of major problems with the package. Hopefully, if it does run into serious issues, updates will be made. If not, I'm sure something will take its place.
      As I mentioned, it is very popular, so someone is sure to take advantage of that. The code and method are open source, so it shouldn't be too hard to replicate. I know there are already other implementations in R (see the IML package).

  • @apogounte8239 6 months ago

    Hi! Interesting video! Just wanted to mention that if you just run shap.plots.waterfall(shap_values[0]), you don't get the actual feature names on the y-axis; you get feature 5, feature 2, etc. instead. Is there a quick fix?

    • @adataodyssey 6 months ago

      Yes, you should be able to fix that. You can try:
      1) Make sure the X feature matrix that you pass into the explainer (i.e. shap_values = explainer(X)) is a pandas DataFrame and that the column names are the correct feature names. You can check these using X.columns
      2) Update the shap_values after they have been created using something like:
      shap_values.feature_names = ["feature 1", "feature 2", ...]
      It is important to pass the new names as a list.
      Let me know if that helps

  • @ooplectures3828 9 months ago

    Please explain how I can use SHAP to determine feature importance for each class in a multiclass classification problem. I need to know which features, or feature values, contribute to the prediction of each class.

    • @adataodyssey 9 months ago

      This has been on the list for a while. I'm not sure when I'll be able to do it but hopefully soon!

  • @NasirUddin-im2zb 7 months ago

    When I ran my code, I got this warning from shap: FutureWarning: In the future `np.long` will be defined as the corresponding NumPy scalar.
    long_ = _make_signed(np.long)
    I tried installing several versions (1.20.0, 1.24.2, 1.22.2 and so on) with pip, but none of them worked. What can I do? Any suggestion would be great.

    • @adataodyssey 7 months ago

      Hi Nasir, sorry about that. I've never seen that issue before. To confirm, do you mean that you installed different versions of NumPy?
      This link might help: github.com/neonbjb/tortoise-tts/issues/379
      They suggest trying:
      pip install numpy==1.20.0

  • @anki8136 7 months ago

    Hey Conor, thanks for the course.
    I just have one question: how do I interpret the stacked force plot? I am having some problems with it.
    Can you make a video or something?

    • @adataodyssey 7 months ago

      Hi Anki, I am sorry that the explanation was not clear. Yet, I am reluctant to make a video on the stacked force plot. This is because, in practice, I have not found it very useful. It is used to explore relationships between features and shap values. But you can do this using the dependence plots which are also easier to understand.
      In the course, I go into a bit more detail on the stacked force plot. Did you see that section?

    • @anki8136 7 months ago

      @adataodyssey No, I haven't seen that video yet, but I will watch it now.

    • @adataodyssey 7 months ago +1

      @anki8136 Okay, hopefully that clears things up for you. It is in the aggregations lesson.

  • @user-ji3ib1rn8s 3 months ago

    I tried XGBoost on a different dataset, but it gave neither a good scatter plot nor a red line that clearly separates the observations. Which other model should one use if the number of features is 870?

    • @adataodyssey 3 months ago

      This is too many features! You will never be able to get good explanations. Try to reduce the number of features by removing the highly correlated ones.
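
One common way to act on this advice is a correlation filter. The sketch below is an assumption about how that could look, not the method used in the video or the course: the helper `drop_highly_correlated`, the 0.9 threshold and the column names are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic example: "a_copy" is almost a rescaled duplicate of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced = drop_highly_correlated(X)
print(list(reduced.columns))
```

Which column of a correlated pair to keep is a modelling decision; this sketch simply keeps the one that appears first.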

  • @slimanearbaoui1237 1 year ago +1

    Can this library work with an LSTM model?

    • @adataodyssey 1 year ago +1

      Hi Slimane :) I've never applied it to an LSTM model. Applying SHAP to deep learning models can be challenging. You may be able to apply SHAP to an LSTM model with some work.
      I have applied it to convolutional neural networks used for image classification and regression tasks. I've linked to two articles below. I used PyTorch. I know that SHAP also works with Keras.
      towardsdatascience.com/image-classification-with-pytorch-and-shap-can-you-trust-an-automated-car-4d8d12714eea?sk=b04dcbb8a09f049f605d2110b5c8d851
      towardsdatascience.com/using-shap-to-debug-a-pytorch-image-regression-model-4b562ddef30d?sk=7eb3016839186f1ba2a6f1f105f8ff64

  • @shamkhalmammadov4083 1 year ago +2

    Can you please make another example with categorical variables?

    • @adataodyssey 1 year ago +1

      Hi Shamkhal, there is a video in the course that explains categorical features :) Otherwise, you might find this article useful (no-paywall link): towardsdatascience.com/shap-for-categorical-features-7c63e6a554ea?sk=2eca9ff9d28d1c8bfde82f6784bdba19

    • @shamkhalmammadov4083 1 year ago +1

      @adataodyssey Thank you very much! I am a big fan of yours. I loved the way you explained SHAP. I got Medium 3 days ago just to read your article. I still have a big problem with the waterfall plot: my target variable has 3 classes (0, 1, 2), and for some reason I cannot plot a waterfall plot.

    • @adataodyssey 1 year ago

      @shamkhalmammadov4083 Okay, in this case you have a categorical target variable. I assumed you meant a categorical input feature. I have only worked with binary target variables.
      Can you send me a link to your dataset?

  • @mulusewwondieyaltaye4937 29 days ago

    I can't access the SHAP Python course. Could you please give me access?

    • @adataodyssey 29 days ago

      Hi Mulusew, the SHAP course is no longer free. But you will now get free access to my XAI course if you sign up to the newsletter

  • @KOTESWARARAOMAKKENAPHD 8 months ago

    I got an error in the boxplot code.

    • @adataodyssey 8 months ago

      Sorry to hear that. Can you describe the error in more detail?

  • @digitama 4 months ago

    Your explanation is very interesting, but I ran into the error "Numba needs NumPy 1.20 or less". No matter how much I downgrade NumPy and Numba, the problem doesn't go away. Any suggestions?

    • @adataodyssey 4 months ago

      Sorry to hear that! Did you try downgrading only the NumPy package? You could also try upgrading the Numba package instead so it is in line with the latest version of NumPy. Remember to restart your kernel after installing a new package if you are working with a notebook.

    • @digitama 4 months ago

      @adataodyssey I downgraded Numba and haven't tried upgrading it. Which version should I upgrade to?
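
When chasing version conflicts like the one in this thread, it helps to first confirm which versions the running interpreter actually sees. A minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages are illustrative choices, not something from the video.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the packages discussed in this thread.
for name, ver in installed_versions(["numpy", "numba", "shap", "xgboost"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this inside the notebook itself, after restarting the kernel, shows whether an install or downgrade actually reached the environment the notebook uses.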

  • @bakerb-rz6lv 11 months ago +2

    I hit a strange bug. I copied your code and ran it. This morning the code worked correctly, but now it does not, and I did not change anything!
    After I run the code "explainer = shap.Explainer(model)", the error message is:
    TypeError: The passed model is not callable and cannot be analyzed directly with the given masker! Model: XGBRegressor(base_score=None, booster=None, callbacks=None,
    colsample_bylevel=None, colsample_bynode=None,
    colsample_bytree=None, early_stopping_rounds=None,
    enable_categorical=False, eval_metric=None, feature_types=None,
    gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
    interaction_constraints=None, learning_rate=None, max_bin=None,
    max_cat_threshold=None, max_cat_to_onehot=None,
    max_delta_step=None, max_depth=None, max_leaves=None,
    min_child_weight=None, missing=nan, monotone_constraints=None,
    n_estimators=100, n_jobs=None, num_parallel_tree=None,
    predictor=None, random_state=None, ...)

    • @adataodyssey 11 months ago

      Can you try to run this code:
      explainer = shap.Explainer(model, X[0:10])
      where X is the feature matrix used to train your model. For some models, you need to pass this in as a masker (background data). You can see the full example for a random forest here:
      github.com/conorosully/SHAP-tutorial/blob/main/src/project_1_solution.ipynb

    • @bakerb-rz6lv 11 months ago

      @adataodyssey It still does not work. Strangely, it says "AttributeError: module 'numpy' has no attribute 'bool'". I do not understand why this code involves NumPy. All the packages I use are the newest versions.

    • @bakerb-rz6lv 11 months ago

      @adataodyssey And I found another difference. In your GitHub code, at step 9 (Train model), your output is
      XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
      colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
      early_stopping_rounds=None, enable_categorical=False,
      eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
      importance_type=None, interaction_constraints='',
      learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
      max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
      missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
      num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
      reg_lambda=1, ...)
      But my output and your video's output is:
      XGBRegressor(base_score=None, booster=None, callbacks=None,
      colsample_bylevel=None, colsample_bynode=None,
      colsample_bytree=None, early_stopping_rounds=None,
      enable_categorical=False, eval_metric=None, feature_types=None,
      gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
      interaction_constraints=None, learning_rate=None, max_bin=None,
      max_cat_threshold=None, max_cat_to_onehot=None,
      max_delta_step=None, max_depth=None, max_leaves=None,
      min_child_weight=None, missing=nan, monotone_constraints=None,
      n_estimators=100, n_jobs=None, num_parallel_tree=None,
      predictor=None, random_state=None, ...)

    • @adataodyssey 11 months ago +1

      @bakerb-rz6lv Sometimes, if you are using the newest versions, then other packages have not caught up yet. It could be that SHAP uses an older version of numpy. See this similar issue: stackoverflow.com/questions/74893742/how-to-solve-attributeerror-module-numpy-has-no-attribute-bool#:~:text=This%20means%20you%20are%20using,while%20that%20isn't%20fixed.
      The important point is: "Then, in version NumPy 1.24.0, the deprecated np.bool was entirely removed. This means you are using a NumPy version that removed the deprecated ways AND the library you are using wasn't updated to match that version (uses something like np.bool instead of just bool)."
      You could try to install an earlier version of NumPy. But this is just a guess on my part.

    • @bakerb-rz6lv 11 months ago

      @adataodyssey God damn it! You are right. I installed numpy==1.22.3 and it works correctly. Maybe you can pin this comment to the top so other newcomers notice it.
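
A quick way to see whether the installed NumPy still exposes the aliases named in these error messages is to probe for them directly. This is a minimal diagnostic sketch; the helper name `has_numpy_attribute` is made up for illustration, and the result depends entirely on the NumPy release that is installed.

```python
import warnings

import numpy as np


def has_numpy_attribute(name: str) -> bool:
    """Return True if np.<name> can be accessed in the installed NumPy."""
    with warnings.catch_warnings():
        # Deprecated aliases emit warnings on access in some releases.
        warnings.simplefilter("ignore")
        return hasattr(np, name)


print("NumPy version:", np.__version__)
for alias in ("bool", "long"):
    status = "available" if has_numpy_attribute(alias) else "missing"
    print(f"np.{alias}: {status}")
```

If an alias is reported as missing and an installed library still references it, the options discussed in this thread apply: install a NumPy release that still has it, or upgrade the library that uses it.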

  • @noazamstein5795 2 months ago

    What does it mean that being a male increases the prediction by 0.78, AND ALSO not being an infant FURTHER increases it by 0.42? These two are obviously mutually exclusive, so I would expect either one of them being the sum of 0.78+0.42 or something else

    • @adataodyssey 2 months ago

      Your confusion is warranted, as there is no clear interpretation for this feature. In the model, there are three sex features (M, F and I). Together they are mutually exclusive. You are right: by summing up the values you get a clear interpretation of the contribution of the original categorical feature.
      Unfortunately, there is no easy way to do this with the SHAP package. We discuss this in my SHAP course. You can also find a solution in this article:
      towardsdatascience.com/shap-for-categorical-features-7c63e6a554ea?sk=2eca9ff9d28d1c8bfde82f6784bdba19
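
The summing idea in this reply can be sketched with plain NumPy. This is an illustration only, not the solution from the linked article: the column names are hypothetical, and the SHAP values are made up apart from the 0.78 and 0.42 quoted in the question.

```python
import numpy as np

# Hypothetical feature names: one numeric feature plus three one-hot sex columns.
feature_names = ["length", "sex.M", "sex.F", "sex.I"]

# Made-up SHAP values for two observations (rows) and four features (columns).
shap_matrix = np.array([
    [0.10, 0.78, 0.05, 0.42],
    [-0.30, -0.20, 0.10, -0.60],
])

# Locate the one-hot columns that belong to the original "sex" feature.
sex_columns = [i for i, name in enumerate(feature_names) if name.startswith("sex.")]

# Sum them per observation to get one contribution for the categorical feature.
sex_contribution = shap_matrix[:, sex_columns].sum(axis=1)

# Rebuild the matrix with a single "sex" column in place of the three dummies.
other_columns = np.delete(shap_matrix, sex_columns, axis=1)
combined = np.column_stack([other_columns, sex_contribution])
combined_names = [n for n in feature_names if not n.startswith("sex.")] + ["sex"]

print(combined_names)
print(combined)
```

Because SHAP values are additive, each row of `combined` sums to the same total as the corresponding row of `shap_matrix`, so base value plus contributions still reproduces the prediction.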

  • @bakerb-rz6lv 11 months ago

    Hello, teacher. I trained my model using another method. Here is some of the code:
    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    # Extract feature and target arrays
    X, y = df.drop('Grade', axis=1), df[['Grade']]
    # Find the non-numeric (text) feature columns
    cats = X.select_dtypes(exclude=np.number).columns.tolist()
    # Convert them to the pandas category dtype
    for col in cats:
        X[col] = X[col].astype('category')
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
    dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)

    • @bakerb-rz6lv 11 months ago

      import shap
      params = {"objective": "reg:squarederror", "tree_method": "gpu_hist"}
      n = 100
      model = xgb.train(
          params=params,
          dtrain=dtrain_reg,
          num_boost_round=n,
      )
      explainer = shap.Explainer(model)
      shap_values = explainer(X)

    • @bakerb-rz6lv 11 months ago

      And something goes wrong:
      TypeError: The passed model is not callable and cannot be analyzed directly with the given masker! Model:
      How can I fix it?

    • @adataodyssey 10 months ago

      Sorry I missed this comment! But I think I answered your question on the other comment :)