Stanford CS236: Deep Generative Models I 2023 I Lecture 5 - VAEs

  • Published 28 Nov 2024

COMMENTS • 15

  • @420_gunna · 4 months ago · +11

    Students are asking great questions, really makes the class better for watchers. Impressive that they can understand what requires me to pause and chin-scratch (and still only get 80% of it).

  • @yangluo8317 · 3 months ago · +1

    Explained extremely well. Very impressive lecture, thanks a lot to the instructor and students!

  • @CPTSLEARNER · 6 months ago · +1

    29:30 Infinite number of latent variables z
    30:10 Finite gaussians, able to choose parameters arbitrarily, lookup tables
    30:30 Infinite gaussians, not arbitrary, chosen by feeding z through neural network
    39:30 Parameters of infinite gaussian model
    40:30? Positive semi-definite covariance matrix
    41:30? Latent variable represented by part of image obscured
    50:00 Number of latent variables (binary variables, Bernoulli)
    52:00 Naive Monte Carlo approximation of likelihood function for partially observable data
    1:02:30? Modify learning objective to do semi-supervised learning
    1:04:00 Importance sampling with Monte Carlo
    1:07:00? Unbiased estimator, is q(z^(j)) supposed to be maximized?
    1:09:00 Biased estimator when computing log-likelihood, proof by Jensen's inequality for concave functions (log is concave)
    1:14:30 Summary: log p_theta(x) desired. Conditioned on latent variables z, if infinite Gaussians, then intractable. Do importance sampling with Monte Carlo. Base case k=1 shows biased estimator for log p_theta(x). Jensen's inequality yields ELBO. Optimize by choosing q. (See the sketch after these notes.)
    1:17:00 CUBO and other techniques for an upper bound, much trickier to get an upper bound
    1:18:40? Entropy and equality when q is posterior distribution
    1:19:40? E step of EM algorithm
    1:20:30? Loop when training? x to z and z to x
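
A minimal sketch of the point the 1:04:00–1:19:40 notes above summarize (my own illustration, not code from the lecture): for a toy linear-Gaussian model where log p(x) is tractable, the Monte Carlo ELBO sits below log p(x) for a mismatched q and becomes tight when q is the exact posterior p(z|x). All constants (a, b, s, x) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z ~ N(0, 1),  x | z ~ N(a*z + b, s^2)
a, b, s = 2.0, 0.5, 1.0
x = 3.0                                          # a single observed data point

# Exact marginal likelihood: p(x) = N(b, a^2 + s^2)
log_px = -0.5 * (np.log(2 * np.pi * (a**2 + s**2)) + (x - b)**2 / (a**2 + s**2))

# Exact posterior: p(z|x) = N(mu_post, var_post)
var_post = 1.0 / (1.0 + a**2 / s**2)
mu_post = var_post * a * (x - b) / s**2

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean)**2 / var)

def elbo(q_mean, q_var, k=100_000):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)]."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(k)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, a * z + b, s**2)
    return np.mean(log_joint - log_normal(z, q_mean, q_var))

print("exact log p(x):          ", log_px)
print("ELBO with q = N(0, 1):   ", elbo(0.0, 1.0))          # strictly below log p(x)
print("ELBO with q = posterior: ", elbo(mu_post, var_post))  # matches log p(x)
```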

  • @Steven-gy9gx · 1 month ago

    I have a question about the disentangled latents at 1:01:19. May I ask which papers prove the result that it is impossible to make the latents fully meaningful?

  • @Jun_Seok_Kim · 23 days ago

    Muyaho~ (a Korean exclamation of excitement)

  • @phucnguyenthang4808 · 3 months ago

    43:53 I don't understand much about the Z's here. They are latent variables, but are they just the missing pixels? In my opinion, latent variables are meaningful features of the observed X's, so are the Z's missing values as well as features of the X's? And is Z's distribution N(0,1)? In previous lessons I only knew that PixelCNN would predict masked pixels, but I didn't know the masked pixels follow N(0,1).
    Sorry if my question is too stupid, but I'm really confused. Could you explain it to me, please?🥺

    • @qizhang1978 · 2 months ago

      Same question here. I do not quite understand the transition to talking about using Z to simulate the missing pixels. Does it mean that we do not need to rely on a VAE if there are no missing pixels in the training dataset?

    • @aymanhassan8178 · 1 month ago

      I think it's just a prior assumption that the missing pixels follow N(0,1).
      Also, if there are no missing values in the training set, how could we train Z?
      What I think is that we need to add random noise to the data (maybe Gaussian noise?), and with each iteration we try to train Z so that it captures the meaning of those messed-up pixels.
      And as in the 2nd lecture, when he talked about generative and discriminative models, we would use the generative model to train Z and the discriminative model would act like a loss function.
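
A minimal sketch of the idea this thread is discussing, under the same assumption the reply above makes (the unobserved pixels are treated as latent variables z with an N(0, I) prior); the tiny tanh "decoder" and all sizes are made up. It shows the naive Monte Carlo estimate of p(x_obs) = E_{z ~ N(0, I)}[p(x_obs | z)] mentioned in the 52:00 note above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Missing pixels -> latent variables z with prior N(0, I); observed pixels
# x_obs are modeled as Gaussian around a nonlinear function of z (a stand-in
# for a neural-network decoder).
d_z, d_obs, sigma = 4, 8, 0.5
W = rng.standard_normal((d_obs, d_z))          # made-up fixed "decoder" weights

def decoder_mean(z):                           # mu_theta(z), one row per sample
    return np.tanh(z @ W.T)

def log_p_xobs_given_z(x_obs, z):
    mu = decoder_mean(z)
    return -0.5 * np.sum((x_obs - mu)**2 / sigma**2
                         + np.log(2 * np.pi * sigma**2), axis=1)

# Naive Monte Carlo: p(x_obs) ~ (1/k) * sum_j p(x_obs | z^(j)),  z^(j) ~ N(0, I)
x_obs = rng.standard_normal(d_obs)             # a fake "observed" pixel vector
k = 10_000
z = rng.standard_normal((k, d_z))              # samples from the prior
p_hat = np.mean(np.exp(log_p_xobs_given_z(x_obs, z)))
print("naive MC estimate of p(x_obs):", p_hat)
```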

  • @iamnotPi · 2 months ago

    1:14:15 Can somebody explain what stops us from evaluating the LHS in the last inequality?

    • @BeomseoChoi · 1 month ago

      This is my rough understanding.
      The LHS is log p(x). The Monte Carlo estimate of p(x) is unbiased, but its log is not: averaging the sampled values converges to p(x), yet averaging their logs does not converge to log p(x). That is where the inequality comes from. Since we cannot get log p(x) this way, we optimize the lower bound instead (Jensen's inequality gives the lower bound). If we maximize the lower bound, the gap (the KL divergence) is driven toward 0 (minimized).
      I hope this answer is correct.
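
A small numeric check of this point (my own illustration): pretend the importance weights w = p(x, z)/q(z) are lognormal draws with a known mean p(x). The sample average is an unbiased estimate of p(x), but the log of the average is, on average, below log p(x) (Jensen's inequality), with the gap shrinking as the number of samples k grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "importance weights" w with known expectation E[w] = p(x).
mu, sig = -1.0, 1.0
true_px = np.exp(mu + 0.5 * sig**2)           # mean of a lognormal(mu, sig^2)
log_px = np.log(true_px)

for k in [1, 10, 100, 10_000]:
    trials = 2_000
    w = np.exp(mu + sig * rng.standard_normal((trials, k)))
    p_hat = w.mean(axis=1)                    # unbiased estimates of p(x)
    print(f"k={k:>6}:  E[p_hat] ~ {p_hat.mean():.4f} (p(x) = {true_px:.4f}), "
          f"E[log p_hat] ~ {np.log(p_hat).mean():.4f} (log p(x) = {log_px:.4f})")
```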

  • @dohyun0047 · 6 months ago

    The class atmosphere around @56:38 is so nice haha

  • @chongsun7872 · 6 months ago

    @37:30 Why is p(x|z) another Gaussian after the nonlinear transformation of the i.i.d. Gaussian z? Can anyone explain this to me, please?

    • @artemkondratyev2805 · 5 months ago · +1

      The way I understand it:
      - You make a modeling assumption about the distribution p(Z); it can be Gaussian, categorical, or anything else, but it's convenient to pick something "easy" like a Gaussian
      - You make a modeling assumption about p(X|Z); again, it can be Gaussian, Exponential, or anything else, it doesn't have to be the same family of distributions as p(Z), and again, it's convenient to pick something "easy"
      - You make a final modeling assumption that the parameters (theta) of p(X|Z) depend on Z in some unknown and complicated way, i.e. theta = f(z), where f is some complex non-linear transformation (which you approximate with a neural network)
      So you don't really transform the distribution p(Z) into p(X|Z); you transform values of Z into the parameters theta of p(X|Z)
      And all of this is just modeling assumptions
      And then you just hope that your assumptions match reality, and that you can find such Z and such a transformation f(z) = theta that allow a good approximation of p(X) via p(X) = sum over z of [p(X|Z=z) * p(Z=z)] (from the law of total probability)

    • @BeomseoChoi · 1 month ago

      I hope I understood your question well.
      My understanding is that we assume p(x|z) is Gaussian, so p(x|z) follows N(mu, sigma). But mu and sigma depend on z. z can be sampled from any distribution, and a function f(z) gives us mu and sigma. The function f can be anything that maps different z to different (mu, sigma); it can also be a neural network. So the non-linear transformation is not applied to p(x|z) but to z; it is just the function f(z), e.g. a neural network.
      So latent variable models work like this, I guess:
      1. Sample z from some distribution p(z) (we can also model this distribution).
      2. Compute the parameters of the distribution that p(x|z) follows, using f(z).
      3. Construct the distribution p(x|z), then sample (generate) from it.
      I.e., the non-linear transformation is a function that produces the "parameters" of the Gaussian. We apply it to the sampled z, not to the Gaussian p(z) itself.
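
A minimal sketch of the picture both replies above describe (my own illustration; the small random-weight "network" and its sizes are made up): z is drawn from a simple prior, a nonlinear function f maps z to the parameters (mu, sigma) of p(x|z), and x is then sampled from that Gaussian. The transformation produces parameters from z; it does not transform the distribution itself.

```python
import numpy as np

rng = np.random.default_rng(0)

d_z, d_h, d_x = 2, 16, 5                      # made-up sizes
W1 = rng.standard_normal((d_h, d_z))          # random fixed "decoder" weights
W_mu = rng.standard_normal((d_x, d_h))
W_logsig = rng.standard_normal((d_x, d_h))

def f(z):
    """Map z to the parameters (mu, sigma) of the Gaussian p(x|z)."""
    h = np.tanh(W1 @ z)                       # the nonlinear transformation
    mu = W_mu @ h
    sigma = np.exp(0.1 * (W_logsig @ h))      # keep sigma positive
    return mu, sigma

# 1. Sample z from the prior p(z), here N(0, I).
z = rng.standard_normal(d_z)
# 2. Compute the parameters of p(x|z) from z.
mu, sigma = f(z)
# 3. p(x|z) = N(mu, diag(sigma^2)); sample x from it.
x = mu + sigma * rng.standard_normal(d_x)
print("z:", z, "\nmu(z):", mu, "\nsigma(z):", sigma, "\nsampled x:", x)
```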

  • @420_gunna · 4 months ago

    1:08:00 lost my ass though