The challenges in Variational Inference (+ visualization)

  • Published 28 Nov 2024

COMMENTS • 62

  • @Louis-ml1zr
    @Louis-ml1zr 2 years ago +15

    Just discovered your channel and I've got to say that I am really impressed by the amount of work you put into it. Looking forward to seeing other great videos like this! (Until then I have a lot to catch up on.)

  • @kimyongtan3818
    @kimyongtan3818 1 year ago +4

    I've watched various university course lectures, some papers, and some blog posts for several days, and still couldn't understand "what we have", "what we want to find", etc.
    You explain them concretely and even show a simple example where p(x) is intractable. It instantly made me understand, much appreciated! 🎉

  • @Reexpat
    @Reexpat 1 year ago +1

    The best explanation ever. Thumbs up!

  • @hugogabrielidis8764
    @hugogabrielidis8764 1 year ago +1

    Just found this channel, and I would like to thank you for your thoughtful work.

  • @thebluedragon6385
    @thebluedragon6385 5 months ago +2

    Very helpful video, thank you so much 😊

  • @clairedaddio346
    @clairedaddio346 8 months ago +1

    Amazing video! Great help, thank you for your effort in making such an excellent video!!!

  • @AnasAhmedAbdouAWADALLA
    @AnasAhmedAbdouAWADALLA 1 year ago +1

    Very well explained! Earned my sub. Looking forward to more videos!

  • @soumyasarkar4100
    @soumyasarkar4100 2 years ago +3

    Thanks for the video

  • @addisonweatherhead2790
    @addisonweatherhead2790 2 years ago +2

    Great video, I really liked the concrete example and the actual computation of integral approximations etc. I also really like the amount that you distinguish between what we know and what we don't know when defining the different distributions (i.e. *assuming we have a z*, we can plug in and get p(x, z)).
    On that note, towards the end you talked about p(z, x=D), i.e. the joint over z and x where you've plugged in the observed dataset for x. You showed that this is actually not a valid probability distribution. Can you explain a bit more about why exactly that is the case? Why can't we simply treat the joint p(z, x=D) as the conditional? We are plugging in known data and getting a value representing the probability of the latent.
    Thanks as always for amazing content, keep it up! :)

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      Hey Addison,
      thanks for commenting :). I hope that the video was able to solve some of the open questions from the last video. Thanks a lot for the feedback.
      Regarding your question: I think the crucial observation is that the joint is proportional to the posterior (according to Bayes' rule). Therefore, it exhibits the same features, e.g., minima or maxima. Hence, we could query it to compare certain Z against each other. For example, p(Z=0.2, X=D) = 0.03 and p(Z=0.3, X=D) = 0.12. This would allow us to say which of the two Z is more probable. That is helpful for Maximum A Posteriori estimates. However, we cannot say anything with respect to whether the probability values for both Z are high or low in comparison to the entire space of possible Z. That is what we would need a full distribution for. I hope the most recent video was able to shine some more light on it: ua-cam.com/video/u4BJdBCDR9w/v-deo.html
      Please feel free to ask a follow-up question if something remained unclear.
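
      To make the point-wise comparison concrete, here is a minimal sketch; it assumes a toy Exponential-Normal model (Z ~ Exponential(1), X | Z ~ Normal(Z, 1)) with a hypothetical observation D = 0.5, which may differ from the video's exact setup:

```python
# Compare the unnormalized joint p(Z, X=D) at two proposed latent values.
# Assumed toy model: Z ~ Exponential(1), X | Z ~ Normal(Z, 1); D is hypothetical.
from scipy import stats

D = 0.5  # hypothetical observed data point

def log_joint(z):
    # log p(Z=z, X=D) = log p(z) + log p(D | z)
    return stats.expon.logpdf(z) + stats.norm.logpdf(D, loc=z, scale=1.0)

# Ranks the two proposals (useful for MAP), but says nothing about how probable
# either one is relative to all possible Z without the normalizer p(X=D).
print(log_joint(0.2), log_joint(0.3))
```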

  • @jiahao2709
    @jiahao2709 1 year ago +2

    Really nice! Could you also make something more advanced about sparse Gaussian processes?

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago +2

      Also a great suggestion; I will come back to it once I revive the probabilistic ML series in the future 😊

  • @mickolesmana5899
    @mickolesmana5899 2 years ago +1

    Good explanation as always :)

  • @matej6418
    @matej6418 1 year ago +1

    I am still struggling with the concept that in the beginning we already somewhat have the joint P(Z,D), which we can evaluate for values of Z and D to get probabilities, but we do not yet have the conditional P(Z|D). The joint P(Z,D) itself already encodes the relationship between Z and D, no? Why do we want the conditional, which should effectively encode the same thing? (Perhaps I'll rewatch the part "Why we want the posterior" again.)

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago +1

      You're right. :) The joint encodes this relationship, but only unnormalized. This means that if I propose two latent variables to you, then (given the same observed data) you could compute which of the two has the higher probability. This is also the fact we use to do optimization; it allows us to find MAP estimates. However, you cannot tell the actual (normalized) probability of either of the two Z values.

  • @yccui
    @yccui 9 months ago +1

    Great video! I have a question: why don't we just model p(x) with some known distribution like a Gaussian? Why do we have to compute the integral of p(x,z) w.r.t. z?

    • @MachineLearningSimulation
      @MachineLearningSimulation  9 months ago

      Hi,
      Thanks a lot for the kind words and the great questions 😊
      Do you have timestamps for the points in the video you refer to? (That helps me recap what I said in detail.) Some more general answers:
      The p(x) distribution is a consequence of the full joint model. So, if there is a model p(x,z), it implies a certain functional form for p(x) purely by its definition (as the expectation over z). Maybe you mean whether we could propose a surrogate marginal (similar to the surrogate posterior one commonly sees in VI)? That can certainly also be done, but it might be of less practical use.
      Regarding your second question: p(x) = int_z p(x,z) dz is a fundamental result in probability theory. You can, for instance, check the first chapter of Chris Bishop's "Pattern Recognition and Machine Learning".
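
      As a quick numerical illustration of p(x) = int_z p(x,z) dz, here is a minimal sketch under an assumed toy model (Z ~ Exponential(1), X | Z ~ Normal(Z, 1)); the video's exact setup may differ:

```python
# Approximate the marginal p(x) by integrating the joint over the latent z.
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

x = 0.5                            # hypothetical observed value
z = np.linspace(0.0, 10.0, 1000)   # grid over the latent variable
joint = stats.expon.pdf(z) * stats.norm.pdf(x, loc=z, scale=1.0)  # p(x, z)
p_x = trapezoid(joint, z)          # composite trapezoid rule
print(p_x)
```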

  • @MLDawn
    @MLDawn 1 year ago +1

    Thanks for a great video. You mentioned that in order to make the connection between x and z in the likelihood function p(x|z), we make z the mean of the Gaussian. As you know, in a Gaussian we have the term (x-z)**2. Now, x and z can have very different dimensions! In that case, how on earth can we take their difference, let alone compute p(x|z)? Thanks

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      You're welcome 🤗 thanks for the kind comment.
      For this specific example, it works because both the latent z and the observed x are scalar (or 1-dimensional).
      Generally speaking, there might be two cases you could refer to: either the true dimension of the other variable is different, or the other variable has an additional batch axis (like in a dataset). For the former, yes, there would be an inconsistency, but one would then go back to remodeling to make it work. For the latter, you could use plate notation (ua-cam.com/video/AkqOGQTi9dY/v-deo.htmlsi=phRiOjQKpo3-cxOa ) and compute the likelihood as the product of the probabilities over all samples.

  • @ashitabhmisra9123
    @ashitabhmisra9123 2 years ago +3

    Hello, this might be a trivial doubt, but at 12:01 you estimate P(Z, X=D) for one observed data point. What if we have more than one data point? How will this equation be generalized? Thanks a ton for the video!!

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +2

      That depends a bit on the concrete model; a commonly used approach for a dataset is an i.i.d. assumption. Consequently, you would take the product of the probabilities for each entry in the set (see the sketch below).
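
      A minimal sketch of that i.i.d. generalization, assuming a toy model with Z ~ Exponential(1), X_i | Z ~ Normal(Z, 1) and a hypothetical dataset:

```python
# Unnormalized log-joint for a dataset D = {x_1, ..., x_N} under an i.i.d. assumption:
# log p(Z=z, X=D) = log p(z) + sum_i log p(x_i | z)   (log-space for numerical stability)
import numpy as np
from scipy import stats

data = np.array([0.5, 1.2, 0.9])  # hypothetical observed data points

def log_joint(z, data):
    return stats.expon.logpdf(z) + np.sum(stats.norm.logpdf(data, loc=z, scale=1.0))

print(log_joint(0.8, data))
```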

    • @ashitabhmisra9123
      @ashitabhmisra9123 2 years ago

      @@MachineLearningSimulation makes sense. Thanks!

  • @todianmishtaku6249
    @todianmishtaku6249 2 years ago +1

    Superb!

  • @sfdv1147
    @sfdv1147 1 year ago +1

    Thanks for the video. However, I would like to ask: in the visualization you computed the integral over Z; is this the marginal P(X)? But as you said earlier in the video (7:58), it is intractable to compute. Is there something I'm missing?

    • @sfdv1147
      @sfdv1147 1 year ago +1

      Also, what book would you recommend for Probabilistic Graphical Models? I finished Prof. Koller's lecture on Coursera, but I find there are too many things left out. I know she also wrote a book on this topic, but I find it a bit difficult to read 😅😅😅

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago +1

      You're welcome 😊 thanks for the kind comment.
      The integral value is just an approximation (I also used the approx sign there). In the streamlit script there are only the final computed values. If I remember correctly, I evaluated the integral from 0 to 10 (so not even to infinity, but I believe there are no fat tails, correct me if I'm wrong) with a composite trapezoid rule. Probably I used something like 100-1000 evaluation points.
      Your question is very valid, because it still holds true: in general, those integrals (for the marginal) are intractable. In lower dimensions (let's say below 15) you can often resort to numerical quadrature techniques (like Newton-Cotes, Gauss quadrature, or something else). Beyond that, one can only use Monte Carlo techniques, for which one often uses special Markov chain Monte Carlo (MCMC) approaches. This is a very interesting field in itself, because some MCMC techniques like Hamiltonian Monte Carlo link back to differential equations (which are also a major topic on the channel). I want to create videos on these topics, but the probabilistic topics are a bit on hold at the moment since they are not part of my PhD research. Still, I want to continue with these topics on the channel at some point in the future. Stay tuned ;)
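
      For contrast with the quadrature approach, here is a rough Monte Carlo sketch of the same marginal under the assumed toy model (Z ~ Exponential(1), X | Z ~ Normal(Z, 1)); it is simple but high-variance, which is one reason MCMC or VI are preferred in higher dimensions:

```python
# p(X=D) = E_{z ~ p(z)}[ p(D | z) ]  ≈  (1/N) * sum_i p(D | z_i),  with z_i ~ p(z)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = 0.5                                                          # hypothetical observation
z_samples = stats.expon.rvs(size=100_000, random_state=rng)      # z_i ~ Exponential(1)
p_x_mc = np.mean(stats.norm.pdf(D, loc=z_samples, scale=1.0))    # Monte Carlo estimate
print(p_x_mc)
```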

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago +1

      As you probably figured out, there are many difficult books on these topics. I can recommend Bishop's "Pattern Recognition and Machine Learning".
      Generally speaking, I like a more code-focused approach (which I also try to teach in my videos). For instance, the documentation of probabilistic programming languages (like TFP, Stan, PyMC3, Pyro, or Turing.jl in Julia) comes with many nice examples and use cases. I learned a lot just by replicating these myself, guided by the documentation. 😊
      Good luck on your learning journey 👍

    • @sfdv1147
      @sfdv1147 1 year ago +1

      @@MachineLearningSimulation Thanks for the answer and book suggestions.

  • @junhanouyang6593
    @junhanouyang6593 2 years ago +1

    Really good video as always. Just to make sure I understand the variational inference example: say we are doing a dog-and-cat image classification task, and in the dataset there are 40 percent dog images and 60 percent cat images. Z is the latent variable and X is the image. For the prior, P(Z = dog) = 0.4 and P(Z = cat) = 0.6? P(X|Z) is the likelihood of the data; we won't know the actual probability, but we can approximate it and train an approximator using the negative log-likelihood or some type of likelihood function? And for variational inference we just want to know P(Z|X)? Is my example and understanding correct? Thanks

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Thanks for the comment and the nice feedback :) I really appreciate it.
      In the case of (simple) probabilistic models, you can always express the functional form of the likelihood P(X|Z) and could therefore also compute the likelihood of your data. You could use the likelihood to train your model, that's correct :). This classifier would also be some form of approximator of the posterior, as you have a model for the task "given a new image, tell me whether it is a dog or a cat". For VI, we are interested in a full distribution (surrogate) for one data point (or a bunch of them).
      I think there is a small misconception in your question that is related to the difference between discriminative and generative models. There will be more videos on VI and VAEs in the next weeks. I hope they can clear this up a bit :). Please leave a comment under them as well if they do not fully answer your question.

    • @junhanouyang6593
      @junhanouyang6593 2 years ago

      @@MachineLearningSimulation Thank you, I think I know where my misunderstanding is. In the generative approach, we want to find the probability P(X,Z), while in the discriminative approach we are only interested in P(X|Z)? Since in order to calculate P(X,Z) you also need to know P(X|Z), both approaches let us know the likelihood of the data, correct?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      @@junhanouyang6593 I think there are some more axes to this question. The cat-and-dog example used in this video was in the context of VAEs, where we are not interested in classifying these images with their corresponding labels, but rather trying to find latent information within the images. This could be that the images of cats and dogs are distinct in the sense that they are different types of animals. However, it could also be other information, like the lighting situation in which the picture was taken. In a sense, using VAEs on these problems is unsupervised learning.
      From my point of view, the difference between generative and discriminative models is not necessarily what we are interested in. Rather, it is "how we model it". And I think here is also a bigger catch in your initial question. For classical classification problems (and also classical regression problems) the input and the output are not latent. Often only the parameters of the models are considered latent, often even without a prior on them (see e.g. linear regression: ua-cam.com/video/nYGGq5zTlgs/v-deo.html ). Let's put it into supervised learning with X being the input and Y being the output. Then, generative models would model P(X, Y), whereas discriminative models would model P(Y|X). In other words, generative models model the joint (and the DGM), whereas discriminative models model the posterior only.
      I am sure this might not have been the best answer to your question. Maybe check back on some basic aspects of probabilistic modeling like latent variables: ua-cam.com/video/SNeC_SrbNZw/v-deo.html

  • @Stealph_Delta_3003
    @Stealph_Delta_3003 2 years ago +2

    Awesome

  • @tony0731
    @tony0731 1 year ago

    Thanks for your amazing videos about variational inference, they're extremely helpful! I have a question regarding the joint distribution p(z, x). It seems intuitive to assume the latent variable's distribution p(z) is a given distribution like a normal distribution, but what if we don't know the likelihood p(x|z)? Is it still possible to do variational inference, and how should I understand this in the example of images and the camera settings? Thanks! 😊

    • @MachineLearningSimulation
      @MachineLearningSimulation  1 year ago

      First of all, thanks for the kind feedback and the nice comment 😊.
      It's an interesting question, but unfortunately beyond my knowledge. VI as we looked at it here is based on having a joint distribution that can be factored into a Directed Graphical Model. As such, we will (by assumption) always have access to the distribution over the root nodes and the conditional distributions for all other nodes.
      Ultimately, what is required to perform VI is a (differentiable) implementation of the joint distribution fixed to the data that can be queried for various latent variables. Maybe there is a way to find something like this without the likelihood p(x|z), but I'm unfortunately not aware of one 😅

  • @akshaykiranjose2099
    @akshaykiranjose2099 11 months ago +1

    Thanks!

  • @jason988081
    @jason988081 1 year ago

    Why do we not know the exact value of the posterior when we know the posterior is proportional to the joint distribution, which can be calculated? Could you give a practical example showing where knowing the exact value of the posterior is needed for an application? Thank you.

    • @MachineLearningSimulation
      @MachineLearningSimulation  11 months ago +1

      Thanks for the great question! :)
      With the posterior being proportional to the joint distribution, we can already find MAP estimates, which is a good start. Analogously, we can also compare two different proposals for latent values: if one has a higher joint probability (under the same observed data), it also has a higher posterior probability. The problem is that we can only query point-wise! In other words, we do not have a full probability distribution, meaning that we cannot sample from it. With a full distribution it is also easier to assess credible intervals, which is harder to do by sampling via MCMC.
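
      To illustrate the MAP part, here is a minimal sketch under the assumed toy Exponential-Normal model (Z ~ Exponential(1), X | Z ~ Normal(Z, 1), hypothetical observation D = 0.5); only the unnormalized log-joint is needed:

```python
# MAP estimate: maximize the unnormalized log-joint log p(Z, X=D) over Z.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

D = 0.5  # hypothetical observation

def neg_log_joint(z):
    if z < 0:  # the exponential prior has support z >= 0
        return np.inf
    return -(stats.expon.logpdf(z) + stats.norm.logpdf(D, loc=z, scale=1.0))

z_map = minimize_scalar(neg_log_joint, bounds=(0.0, 10.0), method="bounded").x
print(z_map)  # posterior mode, but no uncertainty estimate and no way to sample
```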

  • @ritupande1420
    @ritupande1420 2 years ago

    Maybe this is a stupid question, but is it true that the intractability of the marginal applies only to continuous distributions of Z? For discrete distributions we can always sum P(x,z) over all values of z to get P(x). Does this imply that variational inference is applicable only to continuous distributions?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago

      Hi,
      thanks for the interesting question. :) Indeed, the video did not say too much about problems involving discrete latent random variables.
      There are two parts to your comment: (1) the claim that you can always sum over all discrete z, and (2) the claim that once you can express the marginal, you can't apply VI anymore.
      (1): This might seem intuitive, since it will probably always work if the discrete z is one-dimensional. A good example is when z represents a class (like in a (Gaussian) Mixture Model). As you correctly mentioned, even if you have many classes (let's say 1000) you can still sum over them. In contrast, even in 1D you could come up with integrals that are intractable. The problem with discrete variables arises once you have higher-dimensional latent spaces, which is due to the combinatorial complexity. Imagine you have multiple latent attributes that can take 10 classes each. If you have two attributes (a 2D latent space) you have 100 possible combinations to sum over, for three attributes (a 3D latent space) it becomes 1000, etc. In essence, it grows exponentially (see the sketch below). Hence, for some smaller discrete latent spaces it might be possible to just sum in order to marginalize, but it quickly becomes infeasible/intractable. On top of that, you would have to do that each time you wanted to query one value of the posterior distribution. With VI, you would get a full surrogate you could do whatever you want with (like finding modes, sampling, etc.).
      (2): I can also understand the thought, especially because I motivated Variational Inference as the remedy to intractable posteriors. However, even if the marginal is tractable, like in Gaussian Mixture Models, you can still use VI. In these cases, it is also Expectation Maximization (=EM).
      Hope that helped :). Let me know if something is unclear.
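
      A tiny sketch of that combinatorial blow-up (the joint below is just a placeholder, not a real model): marginalizing d discrete latent attributes with 10 classes each means summing 10**d joint evaluations.

```python
# Marginalize a discrete latent space by brute-force summation: p(x) = sum_z p(x, z).
from itertools import product

n_classes, dims = 10, 3

def joint(z_combo, x):
    # placeholder for p(X=x, Z=z_combo); a real model would go here
    return 1.0 / n_classes**dims

x = 0.5
p_x = sum(joint(z, x) for z in product(range(n_classes), repeat=dims))
print(p_x)  # already 10**3 = 1000 terms; the count grows exponentially with dims
```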

    • @ritupande1420
      @ritupande1420 2 years ago +1

      @@MachineLearningSimulation Thanks for a very detailed and clear explanation.

  • @alexanderkhokhlov4148
    @alexanderkhokhlov4148 9 months ago

    I'm not sure the claim that we can plug some value into a continuous likelihood and get a probability value is correct. The probability of any single value should be zero, because the measure of a single point is zero. Plus, p(x) can be greater than 1, and it's strange to have a probability of something greater than one. Only the integral of p(x) over the domain has to be one. Or did I miss something?

    • @MachineLearningSimulation
      @MachineLearningSimulation  9 months ago

      Hi, thanks for the comment 😊
      Do you have a timestamp for when I say this in the video? It's been a while since I uploaded it.
      Probably, I referred to the probability density in that case.

  • @besarpria
    @besarpria 2 years ago +1

    What I do not understand: when looking at the ELBO, we still compare the surrogate function q(Z) - which is a valid probability distribution - with the unnormalized probability p(Z,X=D), right? But why does this comparison even make sense? To me, it seems like VI is some magic to compare the surrogate q(Z) to an unnormalized probability instead of to the (unavailable) normalized conditional. Is this actually the gist of it?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      It's actually the gist of it. :D I can understand that this might seem a bit magical.
      The reason, of course, we are doing this comparison is to find (=train/fit/optimize) the surrogate q(Z) to then do fancy things with it.
      One crucial observation is that the unnormalized probability p(Z, X=D) (the joint fixed to the data) is proportional to the (hypothetical, but unavailable) posterior. If a function is proportional to another function, their features are identical. Those features could for instance be maxima and minima. Hence, we could already use the unnormalized probability to do MAP estimates. Here with VI, we are just going one step further to get a full (surrogate) pdf.
      The next video (to be released on Friday) should clear this up. There, we will look at this Exponential-Normal model and the derivation in great detail. I hope this helps :)
      If something is unclear, feel free to leave a follow-up comment.

    • @besarpria
      @besarpria 2 years ago +1

      @@MachineLearningSimulation Thanks for your clarification, I highly appreciate it. Looking forward to the follow-up video!

    • @besarpria
      @besarpria 2 years ago +1

      @@MachineLearningSimulation There is one thing I still can't get my head around though: it is often said that L(q) becomes tractable to compute and to maximize for a reasonable family of surrogate distributions Q. However, computing L(q) requires solving an expectation, i.e., computing an integral over the complete latent space. How can I compute this expectation without evaluating p(Z, X=D) for each possible latent vector Z? Is this even possible in general, or do we need a nice closed-form joint distribution for this to work?

    • @MachineLearningSimulation
      @MachineLearningSimulation  2 years ago +1

      ​@@besarpria That's a great question. It also took me a while to get my head around that, especially because in classes you might only face simple artificial scenarios in which many things can be solved analytically. However, once you apply it to realistic problems, things become more challenging, and you have to "engineer" more often than you might like :D
      You are right, in many applications the integral corresponding to the ELBO (due to the expectation) is intractable. Hence, it does not have a closed-form antiderivative. You then have to resort to sampling techniques to approximately evaluate it, in the sense of Monte Carlo. For me that raised the question: okay, if we have to approximate an integral either way, why can't we just also approximate the marginal by sampling and then normalize the joint to obtain a posterior? The catch is the necessary precision of these approximations. For the marginal, you need quite a high precision since you want your posterior to be a valid PDF (with the integral = 1 condition). On the other hand, for the ELBO it is usually fine to just use a handful of samples, since it is going to be repeatedly evaluated over the course of the optimization. Even 1 sample was considered to be sufficient (take a look at the VAE paper: arxiv.org/pdf/1312.6114.pdf; right below Eq. (8) the authors note this). See the sketch below.
      I hope that could at least give some information regarding the answer to your question :)
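
      A minimal sketch of such a single-sample Monte Carlo ELBO estimate, assuming the toy Exponential-Normal model (Z ~ Exponential(1), X | Z ~ Normal(Z, 1)) and a hypothetical log-normal surrogate q(Z):

```python
# ELBO(q) = E_{z~q}[ log p(X=D, z) - log q(z) ]
#         ≈ log p(X=D, z*) - log q(z*)   with a single sample z* ~ q  (unbiased)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = 0.5                                # hypothetical observation
q = stats.lognorm(s=0.5, scale=1.0)    # surrogate over the positive latent space

z_star = q.rvs(random_state=rng)       # one sample from the surrogate
log_joint = stats.expon.logpdf(z_star) + stats.norm.logpdf(D, loc=z_star, scale=1.0)
elbo_estimate = log_joint - q.logpdf(z_star)
print(elbo_estimate)                   # noisy, but fine inside a stochastic optimizer
```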

    • @besarpria
      @besarpria 2 years ago +1

      @@MachineLearningSimulation Thank you so much for the extremely detailed answer! :) Very interesting to see that in the end it becomes a question of whether fitting to the joint or to the marginal using numerical integration is cheaper.
      I will also have a look at the paper, it looks really cool.