The coding starts at 16:38
Mr. Landry reviewed accuracy at 27:20 based on the validation dataset, which was used during training to tune the model. That is not a realistic error estimate -- it is too optimistic -- and that matters when the hit_ratio_table is examined. It is better to estimate error on new data rather than on data that was used to tune the model.
Hi @Geoffrey Anderson. A final/new test set is used, actually. It is introduced at about 18:00 and discussed at more length at 32:00, where it is scored for the first time. The model trains on 60% of the data, uses 20% as an internal validation set (for early stopping), and holds out the final 20% to evaluate once all tuning is complete.
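For anyone who wants to try that split themselves, here is a minimal sketch using H2O's Python API (the video may use a different language; the file path, seed, and column choices below are placeholders, not taken from the demo):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("your_data.csv")  # placeholder dataset

# 60% train, 20% validation; the remaining 20% becomes the held-out test set
train, valid, test = df.split_frame(ratios=[0.6, 0.2], seed=1234)

gbm = H2OGradientBoostingEstimator(ntrees=500, seed=1234)
gbm.train(x=df.columns[:-1], y=df.columns[-1],
          training_frame=train, validation_frame=valid)

# The test frame is scored only once, after all tuning is complete
print(gbm.model_performance(test_data=test))
```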
Beautifully explained. Thanks Mark!
You are awesome Mark!
Could someone please elaborate a little more on the hit ratio table starting at 23:45? I am a little confused about what the score represents at k >= 2.
Hi Mark, this was extremely helpful. Can you please share the GitHub link for this? Thanks.
The best explanation.
Hi Mark, I could not find anywhere how to determine the optimal number of rounds for GBM. In xgboost, with cv, we learn at what iteration the model reached the optimal loss, but in H2O, even when I supply a validation set, a stopping metric (logloss), stopping rounds (150), and a stopping tolerance of 0.0001, it does not seem to stop; the number of trees is always whatever is set in ntrees.
Hi @Suresh Chinta.
A stopping_rounds value of 150 is quite high. It may be valid in your case, but H2O will wait until the average of 150 consecutive scoring rounds is within the stopping tolerance (0.0001, it seems, in your case) of the prior 150 consecutive rounds. A round corresponds to a scoring event, and score_tree_interval controls how many trees are part of a round (by default this varies based on scoring-time estimation).
For reference, I typically use 2 for stopping_rounds. I usually set ntrees to a nearly unattainable number (e.g., 2,000 or 10,000), drop the tolerance to 0, and set score_tree_interval to somewhere between 2 and 5. Those models typically stop well before the ntrees limit.
In case it helps: the demo is intended to be fast for a live audience, which makes it a little less indicative of typical modeling, so this is the latest model I've run this week:
gbm
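To make those settings concrete, here is a hedged sketch in H2O's Python API. This is not the model referenced above, just an illustration of a high ntrees ceiling combined with stopping_rounds=2, stopping_tolerance=0, and a small score_tree_interval; the dataset, columns, and seed are placeholders:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("your_data.csv")               # placeholder dataset
train, valid = df.split_frame(ratios=[0.8], seed=42)

gbm = H2OGradientBoostingEstimator(
    ntrees=10000,            # nearly unattainable ceiling; early stopping ends first
    stopping_metric="logloss",
    stopping_rounds=2,       # moving average over 2 scoring rounds must stop improving
    stopping_tolerance=0,    # any improvement at all keeps training going
    score_tree_interval=5,   # score every 5 trees so a "round" is fine-grained
    seed=42,
)
gbm.train(x=df.columns[:-1], y=df.columns[-1],
          training_frame=train, validation_frame=valid)

print(gbm.summary())         # the summary shows how many trees were actually built
```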
Thank you for the video, but can you please talk more slowly?
You can control the playback speed yourself with YouTube's controls.