Dirichlet Process Mixture Models and Gibbs Sampling

COMMENTS • 52

  • @pablotano352
    @pablotano352 8 years ago

    Great explanation!!

  • @amandalevenberg841
    @amandalevenberg841 8 years ago +5

    THANK YOU FOR THIS VIDEO

  • @dungthai762
    @dungthai762 6 years ago +10

    Thank you for the video! On slide 5, it should be (1-Vj) instead of (1-Vk), right?
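
    For reference, the standard stick-breaking (Sethuraman) construction builds the mixture weights as

        \pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j),  with  V_k ~ Beta(1, \alpha),

    so the product runs over the earlier (1 - V_j) factors, consistent with the correction suggested here.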

  • @Kaassap
    @Kaassap 5 months ago

    This was very helpful tyvm!

  • @JordanBoydGraber
    @JordanBoydGraber  3 years ago +5

    Someone asked for a full derivation but deleted their comment, so here's a link:
    www.ncbi.nlm.nih.gov/pmc/articles/PMC6583910/

  • @thegreatlazydazz
    @thegreatlazydazz 5 years ago

    I might be mistaken, but in eqs. (3), (4), and (5) you are not conditioning on the values \theta_i that the Gaussian is producing; you are only conditioning on the parameters of G. This is why you integrate over \theta in (6) and (7). I mean, when you write | \theta, that integration would not make sense.

  • @ejaz629
    @ejaz629 6 years ago +1

    Thank you for the video. What I understand is that in a DPMM you move from parameters to observations (the generative model), and in inference you infer z_i (and then the cluster parameters) from the observations. Now consider a dataset composed of two components (assume normal), i.e., N(5,1) and N(10,1). Can you please explain what the base distribution would be (would it still be normal with zero mean and unit variance?), and in this case what the goal of inference would be, since our real dataset is different from data generated by the DPMM.

  • @a.a3265
    @a.a3265 3 years ago

    Thank you for the video. If I have a linear model and I want to find the prior distribution for its parameters via a Dirichlet process mixture, how would the DPM prior distribution be formed?

    • @JordanBoydGraber
      @JordanBoydGraber  3 years ago

      It's a little trickier, as you need to fit your linear model *given* the table assignments of the CRP. Once you've done that you need to compute the probability of a table assignment from the DP prior and the linear model posterior and multiply those two terms together to sample a new table assignment.
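
      A minimal sketch of that multiply-and-sample step (my own illustration, not code from the video; the function and argument names are hypothetical, and it assumes you can already compute each table's linear-model predictive log-likelihood for the point being reassigned):

          import numpy as np

          def sample_table(counts, logliks, new_loglik, alpha, rng):
              """One Gibbs draw of a table assignment in a CRP mixture of linear models.
              counts[k]  : number of other customers already seated at table k
              logliks[k] : log predictive probability of the current (x_i, y_i) under
                           the linear model fit to the other customers at table k
              new_loglik : log probability of (x_i, y_i) under the prior alone
                           (i.e., opening a brand-new table)
              """
              log_p = np.append(np.log(counts) + logliks, np.log(alpha) + new_loglik)
              p = np.exp(log_p - log_p.max())           # work in log space for stability
              return rng.choice(len(p), p=p / p.sum())  # last index = open a new table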

  • @tobias3112
    @tobias3112 8 years ago +1

    Can you clarify the meaning of the notation N(x, mu, sd)? Later in the video, when you do the actual calculations, you end up with N(x, mu, sd) = (x_i - mu_k)^2 to calculate P(Z_i | Z_-i), and this step was a bit confusing for me. Also, you say that if you find a new cluster you draw mu_0 from the prior; is the prior here just N(0, 1)?

    • @JordanBoydGraber
      @JordanBoydGraber  8 years ago +1

      +Tobias Something Yes, there's a simplification step to assume unit variance that causes the variance to go away. Then for new clusters, we assume a standard normal base distribution.

    • @tobias3112
      @tobias3112 8 years ago +1

      +Jordan Boyd-Graber Is the mean of each cluster drawn from a Normal distribution, or do you just initialize the mean as the value for each point?

    • @JordanBoydGraber
      @JordanBoydGraber  8 years ago +2

      +Tobias Something The posterior predictive distribution is a normal distribution that includes the effect of the current points assigned to the cluster *plus* the prior distribution for each cluster's mean (in this case, zero).
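
      To make this thread concrete, here is a small sketch (my own, not the video's code, under the assumptions discussed above: unit-variance Gaussian components and a standard normal N(0, 1) base distribution). Each existing cluster contributes its posterior predictive, whose mean shrinks toward the prior mean of zero, and the new-cluster term uses the prior alone:

          import numpy as np
          from scipy.stats import norm

          def assignment_probs(x_i, counts, sums, alpha):
              """P(z_i = k | z_-i, x) for a DP mixture of unit-variance Gaussians
              with an N(0, 1) base distribution; counts[k] and sums[k] exclude point i."""
              probs = []
              for n_k, s_k in zip(counts, sums):
                  # Posterior predictive of cluster k: N(s_k / (n_k + 1), 1 + 1 / (n_k + 1))
                  mean = s_k / (n_k + 1.0)
                  var = 1.0 + 1.0 / (n_k + 1.0)
                  probs.append(n_k * norm.pdf(x_i, mean, np.sqrt(var)))
              # Brand-new cluster: the predictive is N(0, 1 + 1) under the N(0, 1) prior.
              probs.append(alpha * norm.pdf(x_i, 0.0, np.sqrt(2.0)))
              p = np.array(probs)
              return p / p.sum()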

  • @franklee813
    @franklee813 8 years ago

    Crystal clear! Thanks!

  •  7 years ago

    Hello, at minute 16:46: in what reading can I see the step between equation 6 and equation 7? Thanks

    • @JordanBoydGraber
      @JordanBoydGraber  6 years ago

      This is replacing the general likelihood distribution with specifically a normal distribution (could be any base distribution).

    • @yiwendong7000
      @yiwendong7000 3 years ago

      @@JordanBoydGraber Thanks for the hint! May I ask for the link to the full derivation? I tried the derivation but failed to integrate the normal distribution formula...

    • @JordanBoydGraber
      @JordanBoydGraber  3 years ago

      @@yiwendong7000 This should be helpful!
      www.ncbi.nlm.nih.gov/pmc/articles/PMC6583910/

  • @Abood123441
    @Abood123441 6 years ago

    Thank you for this video, it's really helpful. But I have some questions:
    Can we use the CRP instead of LDA to discover an optimal number of topics?
    What are the disadvantages of the CRP?

    • @JordanBoydGraber
      @JordanBoydGraber  6 years ago

      The CRP is still sensitive to the Dirichlet process parameter, so in some ways you're selecting a topic number based on that parameter. But it does find a good number of topics with respect to likelihood.
      Disadvantages: the CRP is much slower than LDA.

    • @Abood123441
      @Abood123441 6 years ago

      Thank you for your quick response. Another question, please:
      I believe the CRP tends to assign a customer to a table with many customers rather than few. But if we use the CRP to cluster similar words together, don't you think this will produce clusters with irrelevant words?
      Thank you in advance.

    • @JordanBoydGraber
      @JordanBoydGraber  3 years ago

      @@Abood123441 Yes, it usually does! You probably only want to look at the top words of a cluster.

  • @jacobmoore8734
    @jacobmoore8734 3 years ago +1

    What I got: a major improvement over the EM algorithm, because you don't need to supply the number of clusters a priori as a hyperparameter. The DP figures it out.

  • @cuenta4384
    @cuenta4384 5 years ago

    Random question!!
    I wonder why there aren't many people working with Expectation Propagation?

    • @JordanBoydGraber
      @JordanBoydGraber  5 years ago

      I dunno. There are certainly some people (e.g., Jason Eisner). I think it's a little less friendly for stochastic gradient descent, which is more popular these days thanks to things like Pytorch and Tensorflow.

  • @Blaze098890
    @Blaze098890 3 years ago

    Not sure if the Gibbs sampler makes sense. To my understanding, Gibbs sampling yields samples from the joint distribution; from there we can marginalise out a single variable, divide the two (the joint and the marginal over the remaining variables), and arrive at the posterior for the variable we marginalised over. This leads me to believe that what you say at 14:20 is incorrect, but I might be wrong. Equation 4 to 5 also does not make sense to me, as there is no joint to apply the chain rule to.

    • @JordanBoydGraber
      @JordanBoydGraber  3 years ago

      Eq. 4 to 5 is breaking one conditional apart into two; there's an intermediate step of explicitly writing out the joint that I omitted.
      I'm not quite sure what you're referring to at 14:20. The individual Gibbs draws are not from the joint and I didn't give the proof. Radford Neal gives a good treatment:
      www.cs.toronto.edu/~radford/ftp/review.pdf

    • @Blaze098890
      @Blaze098890 3 years ago

      @@JordanBoydGraber My misunderstanding may lie in how Gibbs sampling works. Let's say for p(x,y) we sequentially make the draws p(x|y) and p(y|x). But since the samples are correlated (as we are conditioning on what we previously sampled for the other variable), this can't be considered a sample from the true posterior (which may be what I misunderstand); however, it is a draw from the joint. So after a single iteration of p(x|y) and p(y|x) we have a single sample from the joint p(x,y) rather than a sample from each conditional. Is it maybe implied in Gibbs sampling, then, that although the samples are from the joint, one can also consider them samples from the posterior, since the conditional and joint are proportional to one another?
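
      To make the joint-vs-conditional point concrete, here is a tiny sketch (not from the video) for a standard bivariate normal with correlation rho, where both conditionals are known in closed form. After burn-in, each (x, y) pair is a (correlated) draw from the joint, and each coordinate is marginally N(0, 1):

          import numpy as np

          def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
              """Gibbs sampler for a standard bivariate normal with correlation rho."""
              rng = np.random.default_rng(seed)
              x, y = 0.0, 0.0
              samples = []
              for t in range(burn_in + n_samples):
                  x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))  # draw from p(x | y)
                  y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))  # draw from p(y | x)
                  if t >= burn_in:
                      samples.append((x, y))
              return np.array(samples)

          # e.g., np.corrcoef(gibbs_bivariate_normal(0.8, 5000).T)[0, 1] comes out near 0.8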

  • @liclaclec
    @liclaclec 1 year ago

    "The Delta is like the Indicator" my heart broke

    • @JordanBoydGraber
      @JordanBoydGraber  1 year ago

      This is why it's useful to have someone next to you while recording these things. My notation isn't always clear, and it's good to have a reality check.

  • @Vb2489
    @Vb2489 7 years ago +3

    What are the prerequisites for learning DPMMs and Gibbs sampling?
    I have to learn this in one week; is that possible? Can someone please guide me?

    • @JordanBoydGraber
      @JordanBoydGraber  6 years ago +4

      Look at "Gibbs Sampling for the Uninitiated" by Resnik and Hardisty.

    • @Jack-lg9mq
      @Jack-lg9mq 6 years ago +3

      I think you were 5 months too late!

  • @ahmedstatistics2838
    @ahmedstatistics2838 1 year ago

    Hi dear:
    Is the base distribution the same as the data distribution???

    • @JordanBoydGraber
      @JordanBoydGraber  1 year ago +2

      Almost certainly not, as the idea is to model the data distribution with the DP. So you typically choose a much simpler distribution (e.g., a Gaussian with wide variance) that describes how new clusters form.
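
      A tiny sketch of that distinction (the numbers are mine, not the video's): the data distribution might be a mixture of several tight Gaussians, while the base distribution is a single wide Gaussian from which the means of new clusters are drawn:

          import numpy as np

          rng = np.random.default_rng(0)
          # Data distribution: a mixture of two tight components (unknown to the model).
          data = np.concatenate([rng.normal(5, 1, 100), rng.normal(10, 1, 100)])
          # Base distribution G_0: one wide Gaussian; each *new* cluster's mean is a draw from it.
          new_cluster_means = rng.normal(0, 10, size=5)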

    • @ahmedstatistics2838
      @ahmedstatistics2838 1 year ago

      @@JordanBoydGraber thank you so much

  • @ahmedalsaleh339
    @ahmedalsaleh339 2 years ago

    What does the baseline (base) distribution mean, and how can I get it?

    • @JordanBoydGraber
      @JordanBoydGraber  2 years ago

      You typically assume it in the model (e.g., a uniform multinomial distribution), but you could also assume it's the unigram distribution inferred from a corpus (e.g., count all the words and divide by the total number of words).
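
      The count-and-normalize option mentioned above is just this (a sketch with a toy corpus):

          from collections import Counter

          def unigram_base_distribution(documents):
              """Empirical unigram distribution: count every word in the corpus and
              divide by the total number of tokens; this can then serve as the base
              distribution over words."""
              counts = Counter(word for doc in documents for word in doc.split())
              total = sum(counts.values())
              return {word: c / total for word, c in counts.items()}

          # unigram_base_distribution(["the cat sat", "the dog"])
          # -> {'the': 0.4, 'cat': 0.2, 'sat': 0.2, 'dog': 0.2}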

  • @ahmedstatistics2838
    @ahmedstatistics2838 1 year ago

    How can I use the Dirichlet process in panel data models? Thank you in advance.

    • @amineounajim9818
      @amineounajim9818 1 year ago +1

      Look up the hierarchical Dirichlet process.

    • @ahmedstatistics2838
      @ahmedstatistics2838 1 year ago

      @@amineounajim9818 Is the Dirichlet process considered an alternative Bayesian approach, in the sense that there is no need to find the maximum likelihood? Is the Dirichlet process alone enough to give a Bayesian approach for nonparametric models?

  • @jasontappan3565
    @jasontappan3565 2 years ago

    I have absolutely no idea how you implemented the chain rule at 15:30.

    • @jasontappan3565
      @jasontappan3565 2 years ago

      It looks like there are a few independence assumptions that I am missing.

    • @jasontappan3565
      @jasontappan3565 2 years ago

      It also looks like the = should rather be a proportional sign. Unless I am missing something completely.

    • @JordanBoydGraber
      @JordanBoydGraber  2 years ago +2

      @@jasontappan3565 The full derivation is here:
      arxiv.org/pdf/1106.2697.pdf
      And you're right, because I don't have a Dirichlet normalizer, it should be proportional to.

    • @jasontappan3565
      @jasontappan3565 2 years ago

      @@JordanBoydGraber Thank you very much; it makes perfect sense now. Thank you for the video. My wife is busy with her Master's and is doing her dissertation on topic modelling. These videos help a lot.

  • @TheSaintsVEVO
    @TheSaintsVEVO 4 years ago +3

    I'm glad you made the video but it is NOT clear or understandable to a beginner. I'm glad it helped everyone else, though.

    • @JordanBoydGraber
      @JordanBoydGraber  4 years ago +2

      Did you watch the previous videos, especially the one on mixture models?
      ua-cam.com/video/qZANeP3Pst8/v-deo.html
      It's part of this course, which has even more context:
      users.umiacs.umd.edu/~jbg/teaching/CMSC_726/