Automated Machine Learning - Tree Parzen Estimator (TPE)

  • Published 28 Jun 2024
  • In this video, we cover another Bayesian Optimization method to perform hyperparameter optimization: Tree Parzen Estimator.
    Original paper where TPE was proposed: proceedings.neurips.cc/paper/...
    The animation of TPE used in the slides was created by Alois Bissuel ( / hyper-parameter-optimi... )
    Liked the video? Share!
    Any comments/feedback/questions? Let me know below!
  • Science & Technology

COMMENTS • 27

  • @dfv2814
    @dfv2814 1 month ago +1

    Such a clear explanation! Out of the woods, at last. Thank you for the time you put into making this video.

  • @fabiansturman7404
    @fabiansturman7404 3 months ago +1

    Thank you! I was reading papers about TPEs, and it seems they are all either very high-level about how the calculations work (the original paper seems to explain some things but not others at all) or like 70 pages long with tons of less useful information added in… this is such a great video :)

    • @aixplained4763
      @aixplained4763  3 months ago +1

      Thank you for your kind comment :) Glad that it was helpful!

  • @luistalavera6392
    @luistalavera6392 9 months ago +2

    The explanation was amazing and very clear, thank you so much :D

  • @amirmn7
    @amirmn7 9 months ago

    really nice explanation! thank you!

  • @Maciek17PL
    @Maciek17PL 1 year ago

    great explanation!!!!! thank you

  • @andrewvirtual
    @andrewvirtual 6 months ago

    Very well explained

  • @TheLazyEngineer
    @TheLazyEngineer 3 months ago +1

    Any rule of thumb for how many observations are needed to optimize N hyperparameters?

    • @aixplained4763
      @aixplained4763  3 months ago +1

      That is a great question! Unfortunately, I do not know of such a rule of thumb. It really depends on the problem and on how important the different hyperparameters are. Normally, hyperparameter tuning is very expensive and we have to make do with a small number of observations. TPE can explore the search space more effectively than, for example, random search. However, TPE is often warm-started with random search: you can, for example, test between 10 and 30 randomly selected configurations and then use TPE to select new promising configurations (see the sketch below).
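
      A minimal sketch of that warm-started setup, assuming the Optuna library is available; the parameter names, ranges, and synthetic objective are made up for illustration and are not from the video:

      ```python
      import optuna

      # Toy stand-in for "train a model and return a validation error":
      # in practice, training and evaluation would happen here.
      def objective(trial):
          lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
          batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
          return (lr - 1e-2) ** 2 + abs(batch_size - 64) / 1e4  # synthetic loss

      # TPE warm-started with random search: the first 20 trials are sampled
      # uniformly at random; after that, TPE proposes promising configurations.
      sampler = optuna.samplers.TPESampler(n_startup_trials=20, seed=0)
      study = optuna.create_study(direction="minimize", sampler=sampler)
      study.optimize(objective, n_trials=100)
      print(study.best_params)
      ```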

    • @TheLazyEngineer
      @TheLazyEngineer 3 months ago

      @@aixplained4763 Thanks! And great video btw.

  • @jacobogerardogonzalezleon2161

    Amazing course! What theoretical implications are there when using TPE as a sampler and Hyperband as a pruner?

    • @aixplained4763
      @aixplained4763  1 year ago +1

      Thank you :) That's a great question! Hyperband has a theoretical guarantee of finding the optimal configuration with probability 1 given an infinite budget (because it does a random search in the final bracket). Using TPE as a sampler changes this, as TPE can get stuck in local optima and never explore parts of the search space (thus potentially missing the optimal configuration). A workaround would be to choose a configuration at random with a small probability (and with TPE otherwise), so that we recover the guarantees of random search (see the sketch below). These theoretical guarantees are, however, not very useful in practice.
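
      A minimal sketch of that workaround, assuming hypothetical sample_with_tpe() and sample_uniformly_at_random() helpers that each return a configuration:

      ```python
      import random

      EPSILON = 0.05  # small probability of falling back to a purely random configuration

      def suggest_configuration(sample_with_tpe, sample_uniformly_at_random):
          """Mix TPE with epsilon-random sampling: with probability EPSILON pick a
          uniformly random configuration, otherwise let TPE propose one."""
          if random.random() < EPSILON:
              return sample_uniformly_at_random()
          return sample_with_tpe()
      ```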

  • @Raven-bi3xn
    @Raven-bi3xn 1 year ago +2

    Nicely done. Thank you. Do you know why it’s called Tree Parzen Estimators?

    • @aixplained4763
      @aixplained4763  1 year ago +3

      Thank you. Great question! It is called "Parzen Estimator" because it builds on Parzen windows for building probabilistic models (stats.stackexchange.com/questions/244012/can-you-explain-parzen-window-kernel-density-estimation-in-laymans-terms); see the small example below. The "tree" part refers to the fact that the configuration space of hyperparameters has a tree structure.
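
      A small illustration of a Parzen-window (kernel density) estimate, assuming scipy is available; the observations are made up:

      ```python
      import numpy as np
      from scipy.stats import gaussian_kde

      # Made-up observations of one hyperparameter (e.g. log learning rates).
      observations = np.array([-4.2, -3.9, -3.1, -2.8, -2.7, -1.5])

      # gaussian_kde places a Gaussian "window" on every observation and
      # averages them, which is a Parzen estimate of the underlying density.
      density = gaussian_kde(observations)
      print(density(np.array([-3.0, -2.0])))  # estimated density at two query points
      ```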

  • @JW-iw9fe
    @JW-iw9fe 1 year ago +2

    Amazing! Thank you. As a newbie, I would like to ask a few questions.
    What is the point of using a Gaussian mixture for each hyperparameter in this paper?
    I know the distributions of good points and bad points are calculated by kernel density estimation (Parzen estimator).
    But I don't understand how the mixture of Gaussians is related to the distributions of good points and bad points.
    I probably don't know much about TPE yet, so the question might be a little weird.
    I hope you can understand. Looking forward to your reply!
    Thank you!

    • @aixplained4763
      @aixplained4763  1 year ago

      That's a good question! In TPE, we create two partitions: the good partition and the bad partition. These contain the observations with good or bad performance, respectively. As you mention, we do kernel density estimation for both partitions. Within kernel density estimation, we basically perform a weighted sum over all observations in a given partition, where the weight is given by the kernel. This summation also occurs naturally in a mixture of Gaussians (we sum up the probabilities of the individual Gaussians). So when each Gaussian corresponds to a given observation, we recover a summation over all data points, as in kernel density estimation (see the sketch below). Hope this helps! :)
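
      A hand-rolled sketch of that correspondence, with made-up observations: each partition's density is literally an equally weighted mixture of Gaussians, one Gaussian per observation.

      ```python
      import numpy as np

      def parzen_density(x, observations, bandwidth=0.5):
          """Kernel density estimate = equally weighted mixture of Gaussians,
          one Gaussian centred on every observation in the partition."""
          x = np.atleast_1d(x)[:, None]                # query points, shape (m, 1)
          centres = np.asarray(observations)[None, :]  # one Gaussian per observation
          kernels = np.exp(-0.5 * ((x - centres) / bandwidth) ** 2)
          kernels /= bandwidth * np.sqrt(2.0 * np.pi)
          return kernels.mean(axis=1)                  # average over the mixture

      # Made-up observations of one hyperparameter, split by performance.
      good = [0.8, 1.0, 1.2, 1.1]      # configurations with good scores
      bad = [3.0, 2.5, 4.0, 3.5, 2.0]  # configurations with bad scores

      x = np.linspace(0.0, 5.0, 6)
      l_x = parzen_density(x, good)    # l(x): density of the good observations
      g_x = parzen_density(x, bad)     # g(x): density of the bad observations
      print(l_x / g_x)                 # TPE favours candidates with a high l(x)/g(x) ratio
      ```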

    • @JW-iw9fe
      @JW-iw9fe 1 year ago

      @@aixplained4763 Thank you for answering! You helped me a lot. I have another question. In Alois Bissuel's blog, he said 'One of the great drawbacks of tree-structured Parzen estimators is that they do not model interactions between the hyper-parameters.' How should I understand this?
      Does it mean there is no relationship between hyperparameters?
      Looking forward to your reply!
      Thank you!

    • @aixplained4763
      @aixplained4763  1 year ago

      @@JW-iw9fe TPE models the densities for the hyperparameters separately (for every hyperparameter, we fit l(x) and g(x)). As a consequence, it does not know how the individual densities are affected by other hyperparameter values, because it models them in isolation. This is a limitation of the method, as in reality the optimal value of a hyperparameter depends on the values of the other hyperparameters.

    • @JW-iw9fe
      @JW-iw9fe 1 year ago

      @@aixplained4763 You are so nice. Thank you so much!
      Your answer completely cleared up my confusion!
      Even with these drawbacks, is TPE still widely used in HPO?
      And can you tell me about some of the most popular techniques in HPO?
      Thank you!

    • @aixplained4763
      @aixplained4763  1 year ago +1

      @@JW-iw9fe In spite of this limitation, TPE is quite commonly used as a surrogate model for computing the expected improvement of configurations. Some popular HPO methods include SMAC (which uses random forests instead of TPE as the surrogate model) and BOHB (which does use TPE). In BOHB, however, a multidimensional TPE is used so that interactions between hyperparameters can be modeled. If you are interested, you can read more here: www.automl.org/automated-algorithm-design/algorithm-configuration/smac/ and arxiv.org/pdf/1807.01774.pdf

  • @jk-cw8ix
    @jk-cw8ix 11 months ago

    Good shit

  • @icelover7166
    @icelover7166 7 months ago

    Your video shows the optimisation of a single hyperparameter using TPE. When TPE optimises a set of hyperparameters, how does it work:
    does it optimise each hyperparameter one by one with the other parameters fixed, or all hyperparameters together? I hope you can answer this; I am confused.

    • @aixplained4763
      @aixplained4763  7 months ago

      That is a great question! When you have multiple hyperparameters to tune, you fit P(x|good) and P(x|bad) for every individual hyperparameter x (for example, we could do this for the learning rate and the batch size; see the sketch below). This means that every hyperparameter is optimized independently of the other hyperparameters and that interactions cannot be modeled. This is one of the main shortcomings of TPE, but it can be addressed by using multivariate distributions like P(x_1, x_2|good), although that brings about new issues :)
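
      A sketch of what "independently per hyperparameter" means in practice, assuming scipy and a made-up observation history; the independent TPE-style score is simply the product of the per-hyperparameter l/g ratios, so interactions between the two hyperparameters are never captured.

      ```python
      import numpy as np
      from scipy.stats import gaussian_kde

      # Made-up past observations: (learning_rate, batch_size, validation_loss).
      history = [
          (1e-2, 32, 0.20), (3e-3, 64, 0.25), (1e-1, 16, 0.60),
          (5e-2, 128, 0.55), (1e-3, 64, 0.30), (2e-1, 32, 0.70),
      ]

      # Split into good and bad partitions by a loss threshold (a quantile, as in TPE).
      losses = np.array([h[2] for h in history])
      threshold = np.quantile(losses, 0.5)
      good = [h for h in history if h[2] <= threshold]
      bad = [h for h in history if h[2] > threshold]

      # Fit l and g separately for every hyperparameter (log learning rate, batch size).
      l_lr = gaussian_kde(np.log10([h[0] for h in good]))
      g_lr = gaussian_kde(np.log10([h[0] for h in bad]))
      l_bs = gaussian_kde([float(h[1]) for h in good])
      g_bs = gaussian_kde([float(h[1]) for h in bad])

      def score(lr, batch_size):
          """Independent per-hyperparameter ratios multiplied together: the score
          of (lr, batch_size) ignores any interaction between the two."""
          ratio_lr = l_lr(np.log10(lr)) / g_lr(np.log10(lr))
          ratio_bs = l_bs(batch_size) / g_bs(batch_size)
          return (ratio_lr * ratio_bs).item()

      print(score(5e-3, 64), score(1e-1, 16))  # higher score = more promising candidate
      ```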

    • @icelover7166
      @icelover7166 7 months ago

      @@aixplained4763 Thank you for your answer. Does this mean that using TPE for hyperparameter optimisation ignores the link between the hyperparameters? For example, when tuning the learning rate and batch size, will the results be significantly different between optimising the learning rate before the batch size using TPE and optimising the batch size before the learning rate?

    • @aixplained4763
      @aixplained4763  7 months ago

      @@icelover7166 Yes, exactly. TPE cannot learn the interaction effects (or links) between different hyperparameters. Concerning the latter example you mention: this can indeed be the case, but it depends on how the tuning is performed. In the regular case, you perform a random search for a few iterations (a warm-up), where you select random hyperparameter configurations (one configuration = a certain value for the learning rate and the batch size in our case). These can be used to fit the densities used by TPE and to select promising configurations to try out next.
      If instead you decide to first tune the batch size while fixing all other hyperparameters, and afterwards tune the learning rate while fixing all other hyperparameters, then the outcome can indeed be very different compared to the case described above. Hope this helps!

    • @icelover7166
      @icelover7166 7 months ago

      @@aixplained4763 This is very helpful. Thank you