Thanks for the great explanation! I have a question: is it possible for PBT to learn the structure of the neural network? The number of nodes, hidden layers, etc. can also be considered hyperparameters.
Thanks for the comment! Yes! I am very interested in this idea as well! It is interesting to think of a model with an adaptive capacity that grows and shrinks during training, or maybe even learns new connections to get past difficult moments in learning, especially when it is learning non-stationary distributions, like a discriminator in the GAN framework.
I think the main problem with this is that PBT wants to take advantage of weight sharing through the exploit mechanism. It might be difficult to organize this properly when you're adding layers or units that have no weights associated with them yet. You could just initialize those particular weights randomly and warm-start everything else, but then you need to make sure each uninterruptible step of the PBT algorithm has a sufficient number of internal optimizer steps (SGD, Adam, whatever) between evaluations, so that the random weights get up to par instead of just being discarded.
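To make the warm-start idea concrete, here is a minimal sketch, assuming a network with a single hidden layer stored as plain Python lists. The function name `widen_hidden_layer`, the `scale` parameter, and the weight layout are hypothetical choices for illustration, not part of PBT itself. Existing units keep their trained weights and only the added units get small random values.

```python
import random


def widen_hidden_layer(w_in, b_hidden, w_out, extra_units, rng, scale=0.01):
    """Warm-start a wider one-hidden-layer network from a narrower one.

    w_in:     one row per hidden unit (hidden x inputs)
    b_hidden: one bias per hidden unit
    w_out:    one row per output (outputs x hidden)
    The layout and the `scale` default are assumptions for this sketch.
    """
    n_inputs = len(w_in[0])
    # Existing units keep their trained weights (the warm-started part).
    new_w_in = [row[:] for row in w_in]
    new_b = b_hidden[:]
    new_w_out = [row[:] for row in w_out]
    for _ in range(extra_units):
        # New units have no trained weights yet, so draw small random ones.
        new_w_in.append([rng.gauss(0.0, scale) for _ in range(n_inputs)])
        new_b.append(0.0)
        for row in new_w_out:
            row.append(rng.gauss(0.0, scale))
    return new_w_in, new_b, new_w_out


rng = random.Random(0)
w_in = [[0.5, -0.2], [0.1, 0.3]]  # 2 hidden units, 2 inputs
b_hidden = [0.0, 0.1]
w_out = [[1.0, -1.0]]             # 1 output
wide_w_in, wide_b, wide_w_out = widen_hidden_layer(w_in, b_hidden, w_out, 2, rng)
```

The widened member would then need enough optimizer steps before its next evaluation for the two new units to catch up, which is exactly the scheduling concern raised above.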
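There are two kinds of hyper-parameters: mutable and immutable ones. Mutable ones (e.g. the learning rate or data augmentation coefficients) can be changed in the middle of training while keeping the current weights, whereas immutable ones (e.g. the number of layers or units) change the shape of the network itself. That is why PBT was successfully used to update data augmentation coefficients during the training of one fixed neural network. It is successful in reinforcement learning too, due to the large number of mutable hyper-parameters.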
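A rough sketch of why mutable hyper-parameters fit PBT so well: in one exploit/explore step, a weak member can copy the weights of a strong one unchanged, because only values like the learning rate are perturbed and the weight shapes stay the same. The dictionary layout, the function name `exploit_and_explore`, the `fraction` cutoff, and the perturbation factors below are assumptions chosen for illustration.

```python
import copy
import random


def exploit_and_explore(population, rng, fraction=0.25, factors=(0.8, 1.2)):
    """One PBT-style step over members shaped like
    {"weights": ..., "hparams": {name: float}, "score": float}.
    Only mutable hyper-parameters are touched; the architecture is fixed.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: copy the donor's weights. This works without any
        # reshaping because every member has the same architecture.
        member["weights"] = copy.deepcopy(donor["weights"])
        # Explore: perturb each mutable hyper-parameter by a random factor.
        member["hparams"] = {
            name: value * rng.choice(factors)
            for name, value in donor["hparams"].items()
        }
    return population


rng = random.Random(0)
population = [
    {"weights": [0.1, 0.2], "hparams": {"lr": 0.1}, "score": 0.9},
    {"weights": [0.3, 0.4], "hparams": {"lr": 0.01}, "score": 0.7},
    {"weights": [0.5, 0.6], "hparams": {"lr": 0.5}, "score": 0.6},
    {"weights": [0.7, 0.8], "hparams": {"lr": 1.0}, "score": 0.2},
]
exploit_and_explore(population, rng)
```

With an immutable hyper-parameter such as layer width in `hparams`, the weight copy in the exploit step would no longer line up with the perturbed architecture, which is the difficulty described in the comment above.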