I'm always impressed by how many extraordinary channels there are on YouTube that aren't well known (sorry about my English, Brazilian here).
Glad you find it helpful.
Can you please explain in more depth how GBM works for multiclass classification? I was trying to implement it from scratch, but my accuracy came out slightly different from the sklearn version.
If I understood correctly, we do the following steps:
1) apply one-hot encoding to the target (y) to get one_hot_y
2) apply the softmax function to each column of one_hot_y to get probabilities for each class
3) train a regression tree on X_train and the probabilities for each column of one_hot_y, then make predictions on X_train
4) convert the predictions into probabilities with softmax
5) compute residuals = one_hot_y - the obtained probabilities
6) train a regression tree on X_train and the residuals
7) update the prediction: previous prediction + new prediction * learning rate
8) repeat steps 4-7 in a loop
9) make predictions on X_test with the trained trees for each class in one_hot_y
10) take the class with the max sum of predictions
Is this correct? I couldn't understand how it works in sklearn because their code is hard to follow.
Your outlined steps for GBM in the context of multiclass classification generally capture the essence of the algorithm, but there are some nuances to consider:
1-One-Hot Encoding for Target (y): Your step 1 is correct. The classes are usually one-hot encoded to enable multiclass classification.
2-Softmax Function: Softmax isn't applied to one_hot_y. It is applied to the model's accumulated raw scores (log-odds) whenever probabilities are needed, both when computing residuals during training and when making final predictions (to ensure they sum to 1).
3-Train Regression Tree on X_train: Yes, but the trees are trained against targets derived from the raw scores (log-odds) rather than probabilities, which keeps the problem amenable to regression.
4-Softmax for Predictions: Correct - at each iteration the accumulated raw scores are passed through softmax to get per-class probabilities, and the same conversion is applied when making final predictions.
5-Calculating Residuals: During training the model accumulates raw scores (log-odds), not probabilities. At each iteration the residuals (the negative gradients of the multinomial deviance) are computed as one_hot_y minus the softmax of these accumulated scores, and the next tree is fit to them. This suits tree-based boosting, since trees are inherently regressors and the residuals provide a continuous target. After summing up the contributions from all trees (scaled by a learning rate), you apply the softmax function to convert the aggregated log-odds into probabilities for the final prediction.
6-Train Regression Tree on Residuals: Yes, another tree is fit to the residuals, but note that this is usually scaled by a learning rate.
7-Update Predictions: The predictions are updated based on the newly trained tree and scaled by a learning rate.
8-Repeat: Steps 4-7 are iteratively repeated for each tree in the ensemble.
9-Final Predictions: For each sample, the final prediction is usually the class that has the highest log-odds or probability sum across the trees.
10-Class with Max Sum: Your step 10 is aligned with the general approach of selecting the class that has the highest cumulative prediction (see the sketch right below).
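To make this concrete, here is a minimal sketch of the corrected training loop. The names are illustrative, it uses plain scikit-learn regression trees, and it deliberately omits the per-leaf line search and the init estimator that GradientBoostingClassifier uses, so accuracies will not match sklearn exactly:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def softmax(raw_scores):
    # Numerically stable softmax across classes (columns).
    e = np.exp(raw_scores - raw_scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gbm_multiclass_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Bare-bones multiclass GBM: one regression tree per class per round.
    Assumes y contains integer labels 0..K-1."""
    K = len(np.unique(y))
    one_hot_y = np.eye(K)[y]                      # step 1: one-hot encode the target
    raw_scores = np.zeros((len(y), K))            # accumulated log-odds, start at 0
    trees = []
    for _ in range(n_rounds):
        p = softmax(raw_scores)                   # probabilities from the raw scores
        residuals = one_hot_y - p                 # negative gradient of the deviance
        round_trees = []
        for k in range(K):
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, residuals[:, k])          # regression tree on the residuals
            raw_scores[:, k] += learning_rate * tree.predict(X)
            round_trees.append(tree)
        trees.append(round_trees)
    return trees

def gbm_multiclass_predict(trees, X, learning_rate=0.1):
    K = len(trees[0])
    raw_scores = np.zeros((X.shape[0], K))
    for round_trees in trees:
        for k, tree in enumerate(round_trees):
            raw_scores[:, k] += learning_rate * tree.predict(X)
    return raw_scores.argmax(axis=1)              # class with the highest aggregated score
```

Usage is just trees = gbm_multiclass_fit(X_train, y_train) followed by gbm_multiclass_predict(trees, X_test).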
Differences from scikit-learn:
- Scikit-learn optimizes a specific loss function, typically the multinomial deviance, rather than explicitly dealing with softmax and probabilities at each iteration.
- It includes additional regularization options, such as limits on tree depth and the number of leaf nodes.
The discrepancy in accuracy between your model and sklearn's could be due to a number of factors, such as different hyperparameters, number of trees, learning rates, or even random seed initialization.
I hope this deep dive helps clarify how GBM works for multiclass classification. The scikit-learn source code can indeed be complex, as it's optimized for efficiency and flexibility, which sometimes makes it hard to follow.
@@pedramjahangiry thank you so much for such a detailed answer. It has helped me clarify several aspects of boosting. To be honest, this is the only algorithm that I have failed to implement one-to-one with sklearn in terms of accuracy.
@@hopelesssuprem1867 impressive!
@@pedramjahangiry hello again. I just wanted to say that I've managed to implement the GBM classifier after reading the original paper.
The main idea: we apply one-hot encoding to y, then compute residuals = y - y_pred, train a tree on the residuals, compute the leaf coefficient gamma = ((K - 1) / K) * sum(residuals) / sum(abs(residuals) * (1 - abs(residuals))) per terminal leaf (as in the original paper), and use these gammas as the tree's prediction; then we add the result to a prediction table, apply the softmax function, and repeat this procedure in a loop.
After training all the trees, we build another prediction table accumulating the per-class predictions for each sample, and the argmax of the row sums gives the final prediction.
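In case it helps anyone else, a rough sketch of that per-leaf gamma step could look like this. The names are illustrative, it assumes sklearn-style regression trees where tree.apply returns each sample's leaf index, and at prediction time you would look up the stored gammas rather than calling tree.predict:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_stage_for_class(X, residuals_k, K, learning_rate=0.1, max_depth=3):
    """Fit one boosting stage for a single class k.

    residuals_k = one_hot_y[:, k] - softmax(raw_scores)[:, k]  (negative gradient)
    The per-leaf gamma follows Friedman's multiclass formula:
    gamma = ((K - 1) / K) * sum(r) / sum(|r| * (1 - |r|)), computed per terminal leaf.
    """
    residuals_k = np.asarray(residuals_k)
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, residuals_k)

    leaf_ids = tree.apply(X)                       # terminal leaf index of each sample
    update = np.zeros(X.shape[0])
    leaf_gammas = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        r = residuals_k[mask]
        denom = np.sum(np.abs(r) * (1.0 - np.abs(r)))
        gamma = ((K - 1.0) / K) * r.sum() / (denom + 1e-12)
        leaf_gammas[leaf] = gamma
        update[mask] = gamma
    # raw_scores[:, k] is then incremented by learning_rate * update;
    # leaf_gammas is kept so test samples can be scored via tree.apply(X_test).
    return tree, leaf_gammas, learning_rate * update
```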
I've also applied the LogitBoost idea to the GBM classifier; the main differences are that the residuals are computed with a different formula, per-sample weights are used when fitting each tree, and the gamma coefficient is not used.
In general, these two approaches perform a little better in terms of accuracy than GB and histGB from sklearn.
It wasn't easy to figure out and implement, but I finally got it working :)
@@hopelesssuprem1867 Wow! thank you so much for sharing your insights with me.
Hello Professor,
I am working on a predictive model where there are more features than observations - after data cleaning there are roughly 100 features x 50 observations. All variables are continuous and some of the features are correlated - it is essentially a regression problem. What model would be the best choice? I thought of Elastic Net (it could help with the dimensionality and the correlation between features) and Random Forest (maybe it would cope better with the small number of observations?), and now I am wondering about boosting. Is there a default best approach for such a case (many features but few observations), or should I try out different approaches and compare test-set performance metrics?
Tomasz, you need a model to reduce the dimensionality first. For that you can use either lasso or elastic net. Alternatively, you can use PCA to reduce the dimensions first and then apply any other ML model. Lastly, you can use SVM or RF, as they are both powerful in handling high dimensions. Hope that helps!
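If it helps, here is a minimal sketch of how you might compare those options with cross-validation. The dataset below is a synthetic stand-in for a ~100 features x 50 observations problem (swap in your own X and y), and the hyperparameters are placeholders rather than recommendations:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for ~50 observations x 100 continuous, partly correlated features.
X, y = make_regression(n_samples=50, n_features=100, n_informative=10,
                       noise=5.0, random_state=0)

models = {
    "elastic_net": make_pipeline(StandardScaler(),
                                 ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5)),
    "pca_then_enet": make_pipeline(StandardScaler(), PCA(n_components=10),
                                   ElasticNetCV(cv=5)),
    "random_forest": RandomForestRegressor(n_estimators=300, random_state=0),
}

# With so few observations, cross-validated scores are the fairest comparison.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:15s} mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```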
@@pedramjahangiry Thank You very much!
Hi Pedram, a quick question regarding AdaBoost (or, for that matter, any boosting algorithm). If I understand correctly from your explanation, the dataset for each of the tree models has the same data points as the original dataset; the main difference is the weights assigned to the individual data points, based on whether they were correctly classified by the previous trees' predictions. Kindly let me know if my understanding is correct.
yep! you got this.
@@pedramjahangiry thank you
Hi Professor. I just wanted to let you know that this channel helped me immensely in understanding the ML concepts. I wonder if you have any course available on the practical applications of these algorithms in finance using R, including the pre-processing part. If not, I would appreciate it if you could suggest any resource to check out. It should be in R. Python is a super powerful programming language, but I'm more of an R user and I'd need quite some time to learn another language decently.
I myself am a big fan of R as well. However, all my courses are being taught in Python now. I will be developing courses in Julia but not R.
There are multiple great courses on Udemy which do machine learning in R. Please search within Udemy and see which one is a better fit for your needs. I cannot recommend any specific course here. Thanks,
And also... I am a bit unclear about how the final prediction is made based on the aggregation of weights in AdaBoost...
Sure, each weak learner has its own weight! The aggregation is based on these weights. Here is a simple explanation with pseudo-code: towardsdatascience.com/boosting-and-adaboost-clearly-explained-856e21152d3e
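For reference, here is a bare-bones sketch of the idea: discrete AdaBoost with decision stumps for labels in {-1, +1}. The variable names are illustrative and this is not the exact scikit-learn implementation - it just shows that every round reuses the same data points with updated sample weights, and that the final prediction is an alpha-weighted vote:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Minimal discrete AdaBoost for labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # same data points, new weights
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)      # weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # weak learner's weight
        w *= np.exp(-alpha * y * pred)                 # up-weight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final prediction: sign of the alpha-weighted sum of weak-learner votes.
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(agg)
```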
I would like to contact you by email, please, including for business matters.
Ali, you should be able to message me through my LinkedIn page. Please DM me there: www.linkedin.com/in/pedram-jahangiry-cfa-5778015a/