Used this lecture to understand the lecture given by Geoffrey Hinton in his NN course on Coursera. Thanks a ton, you saved me again.
Thanks for your kind words!
@Jim O' Donoghue The numerator can be seen as exp(b+c) = exp(b).exp(c) (the exponential of a sum is the product of the exponentials). When you apply this to the numerator, it turns out to be the equation mentioned at 10:26.
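To make that step concrete, here is a tiny numpy check (a sketch with made-up values for W, b and c; the grouping of terms follows the energy E(x,h) = -h^T W x - c^T x - b^T h from the slides):

import numpy as np
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # 2 hidden units, 3 visible units
b = rng.normal(size=2)        # hidden biases
c = rng.normal(size=3)        # visible biases
x = np.array([1., 0., 1.])
h = np.array([1., 1.])
# exp(-E(x,h)) written as one exponential of a sum...
numerator = np.exp(h @ W @ x + c @ x + b @ h)
# ...and the same quantity written as a product of per-hidden-unit factors
as_product = np.exp(c @ x) * np.prod([np.exp(h[j] * (b[j] + W[j] @ x)) for j in range(2)])
print(np.isclose(numerator, as_product))  # True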
Great lecture Hugo, thanks for putting all this hard work into it - very well taught too!
Thanks for the explanation on the energy function. Everything suddenly starts to make sense.
Glad I could help!
Best lecture on RBM....Thanks Prof.
10:07, I have a problem deciding where to put Bayesian Networks and HMMs. Do they belong to unsupervised learning like in the video above, or to supervised learning, or do they simply form their own category in machine learning?
+Randy Welt Good question! Bayesian Networks and HMMs are in the family of directed graphical models, as opposed to undirected graphical models like the RBM. That's the main distinction.
Note also that we could do either supervised or unsupervised learning, with either undirected or directed graphical models.
I really enjoyed watching this video. As I'm working on a project about DBN, this video is very useful for me. Thanks.
+Yifan Li Thanks for your kind words!
Sorry, I do not understand.
Dear Hugo, I am implementing an RBM but I find the energy function of the joint probability at 6:10 confusing:
E(x,h) = -( sum over j,k of W_{j,k} h_j x_k + sum over k of c_k x_k + sum over j of b_j h_j )
Therefore, how can we define Z, since all of the values of the visible and hidden units have already been used in E(x,h)?
Z is defined as the sum of exp(-E(x,h)), but over all possible values of x and h. There's otherwise no relationship between the x and h in E(x,h) and Z.
@Hugo: Assuming that I train a network with X1 = (0,1,1,1) and X2 = (0,0,0,1), I then get H1 = (0,0,0,0,1) and H2 = (0,1,0,0,0) respectively. So is Z = E(x1,h1) + E(x2,h2), right?
By the way, I am using C++ to implement the RBM, so I am sorry if I ask so many questions and bother you :)
Ah, no! Z is the sum of *the exponential of -E(x,h)* over *all possible values of x and h*, not just the values of x seen in your training set. This is why we can't compute Z exactly in practice for even moderately large RBMs.
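To make "all possible values of x and h" concrete, here is a brute-force sketch for a tiny, made-up RBM (3 visible and 2 hidden binary units, so only 2^3 * 2^2 = 32 configurations to sum over; the parameters are random placeholders):

import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2
W = rng.normal(size=(n_hidden, n_visible))
b = rng.normal(size=n_hidden)   # hidden biases
c = rng.normal(size=n_visible)  # visible biases

def energy(x, h):
    # E(x,h) = -h^T W x - c^T x - b^T h
    return -(h @ W @ x + c @ x + b @ h)

# Z sums exp(-E(x,h)) over every binary configuration of x AND h,
# not just over the training examples.
Z = sum(np.exp(-energy(np.array(x), np.array(h)))
        for x in itertools.product([0, 1], repeat=n_visible)
        for h in itertools.product([0, 1], repeat=n_hidden))
print(Z)

With realistically sized layers this sum has far too many terms, which is exactly why Z is intractable in practice.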
According to "A practical Guide to training RBM", partition function Z is given by summing over all possible pair of visible and hidden vector: Z = sum(exp(-E(x,h)). So how can I understand "all possible values of x and h" since with a certain of training data we have only one vector x.
I notice that "the reconstruction error is actually a very poor measure of the progress of learning" so if we can't compute Z exactly in practice, then what kind of measure can we use?
You're awesome for doing this.
Many thanks for the lecture, I found it really useful. I'm a bit confused about the notation on the slide entitled Markov Network View, though. Firstly, have you split the equation onto multiple lines just to make it a bit more readable, or is it significant that the unary factors are on different lines from the pairwise factors? Secondly, from my understanding of MNs, a distribution can be written as a product of the potentials defined by the cliques of the graph (up to a normalising constant). Since it's a pairwise MN I can see that the pairwise factors are represented in the graph, but I can't see where the unary factors are represented. What am I missing?
Thanks for your kind words!
Indeed, I split them onto different lines for readability; the line on which each term appears doesn't matter.
I agree that the MN representation doesn't make unary factors explicit, and I think that's the main benefit of the factor graph representation, which illustrates all potentials explicitly.
Hope this helps!
@@hugolarochelle great, thanks for clearing that up.
c transpose and b transpose are the biases for the visible and hidden nodes, respectively.
How is the energy function derived at @6:10?
Good question! For this video, we only provide the formula. But in the following videos, we show some properties that can be derived, thanks to the particular formulation of the energy function. So hopefully that'll help understand why this particular energy function is used. Hope this helps!
@@hugolarochelle Thanks. I was wondering if it is the general outcome of a simple MRF calculation of the joint probability. Considering the bipartite nature of the graph, we can assume there are two types of connections: hidden node to observed node, and output from hidden or input to hidden. So the rest of the random variables are independent. So by the sum of products, the product terms would only encode the joint probabilities. Does this make sense?
@@bowindbox1132 Indeed, an RBM is a type of MRF. It is one with a bipartite graph, binary random variables, pairwise potentials that correspond to h_j x_k W_{j,k}, and unary potentials that are h_j b_j and x_k c_k.
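(A quick sketch, in case it helps: taking the exponential of each of those terms as the potential, their product is exactly exp(-E(x,h)), i.e. the unnormalized joint. Made-up numpy check:)

import numpy as np
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3)); b = rng.normal(size=2); c = rng.normal(size=3)
x = np.array([0., 1., 1.]); h = np.array([1., 0.])
# product of pairwise potentials exp(W_{j,k} h_j x_k) and unary potentials exp(b_j h_j), exp(c_k x_k)
pairwise = np.prod([np.exp(W[j, k] * h[j] * x[k]) for j in range(2) for k in range(3)])
unary = np.prod(np.exp(b * h)) * np.prod(np.exp(c * x))
# equals exp(-E(x,h)) with E(x,h) = -h^T W x - c^T x - b^T h, up to the normalizer Z
print(np.isclose(pairwise * unary, np.exp(h @ W @ x + c @ x + b @ h)))  # True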
Hope this helps!
@@hugolarochelle So we might assume the potential of a node is just the value at that node. For example on the input side: in the forward pass, say we are going from x_k to h_j via the edge W_{k,j}. P(x_k, h_j) = potential on x_k \times potential to jump from x_k to h_j \times potential on h_j = x_k \times W_{k,j} \times h_j. The relation would be reversed when we calculate the total energy of the system backward (that is, while backpropagating). Is this correct?
@@bowindbox1132 actually, the potentials can't directly give you these probabilities. Indeed, potentials in an MRF aren't necessarily normalized, as probabilities need to be.
I can't write much on how RBMs and MRFs are related, but I've found this medium article that seems to discuss the relationship, and which perhaps will be useful to you: medium.com/datatype/restricted-boltzmann-machine-a-complete-analysis-part-2-a-markov-random-field-model-1a38e4b73b6d
Don't really get why the numerator turns into a product at around 10:26...
I think the use of single-letter symbols in formulas really obfuscates the meaning. We have much larger screens now and nobody does these calculations by hand, so why can't we just use the full word, or at least shorten it in a more meaningful way? Like, instead of Bj and Ck, use Bh_i (ith bias of a hidden unit) and Bv_j (jth bias of a visible unit), or something like what we do in programming: hu[i].bias and vu[j].bias.
I am attempting to understand deep autoencoders. I've followed chapters 1 and 2. Can I omit chapters 3 and 4 (on CRFs)?
Yes, you should be fine without 3 and 4.
Very good explanation! Thanks a lot!
Please give a link to the next video, so we can understand what this is all for.
I can't see what the vectors *c* and *b* are.
I watched the videos of the series on autoencoders first and I understood them but I didn't watch the videos preceding this one. Did I miss something ?
c and b are vectors of parameters, exactly like in autoencoders. In RBMs, they will be used differently than in autoencoders, but in both cases they are vectors of parameters.
Hope this helps!
How can I decide a cutoff point for RBM results in the case of unsupervised learning?
Great question! Unfortunately there is no universal answer. It depends on what you are doing. For instance, if you are training features for classification, then you should periodically check on how discriminative the features are for your task, even if that's on a small subset of data.
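For instance, a rough sketch of that kind of check using scikit-learn (placeholder data and hyperparameters, just to show the idea): periodically fit a simple classifier on the RBM's hidden features and track its held-out accuracy.

from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# placeholder binary data and labels
X = (np.random.rand(500, 64) > 0.5).astype(float)
y = np.random.randint(0, 2, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rbm = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=5, random_state=0)
H_tr = rbm.fit_transform(X_tr)   # hidden unit probabilities as features
H_te = rbm.transform(X_te)
clf = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print("feature quality proxy:", clf.score(H_te, y_te))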
Thank you professor for this video. I have two questions:
1- How do I compute the distribution of the input vector x, for instance p(x = (1,0,1))?
2- Is it possible to feed the RBM with a multi-valued input vector, i.e. where the possible values for each visible unit are {0,1,2,3}?
Thank you in advance.
Hi!
For 1, see video 5.3 for computing p(x) (ua-cam.com/video/e0Ts_7Y6hZU/v-deo.html)
For 2, it is indeed possible to have units that aren't binary (e.g. categorical or multinomial). There is more than one way of doing this, not a single one.
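One common option among these (just a sketch, not the only way): one-hot encode each {0,1,2,3}-valued visible unit into 4 binary units (a categorical/"softmax" group), so the RBM still sees binary inputs.

import numpy as np

def one_hot_encode(x, n_values=4):
    # x: vector of categorical visible values in {0, ..., n_values-1}
    # returns a binary vector of length len(x) * n_values
    out = np.zeros((len(x), n_values))
    out[np.arange(len(x)), x] = 1.0
    return out.ravel()

print(one_hot_encode(np.array([0, 3, 1])))  # [1,0,0,0, 0,0,0,1, 0,1,0,0]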
Thank you so much, your lectures are awesome.
p(x) is intractable, isn't it?
It is if both the input layer and the hidden layer are large. But if one of them is small (e.g. ~20 units), then it turns out we can compute the partition function in a reasonable amount of time.
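A sketch of why (not from the video; made-up sizes and parameters): you can sum out the large layer analytically and only enumerate the small one. For example, if the hidden layer is the small one, Z = sum over all h of exp(b^T h) * prod_k (1 + exp(c_k + (W^T h)_k)), which is only about 2^20 terms when there are ~20 hidden units.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_visible, n_hidden = 10, 3   # pretend the hidden layer is the small one
W = rng.normal(size=(n_hidden, n_visible)) * 0.1
b = rng.normal(size=n_hidden)   # hidden biases
c = rng.normal(size=n_visible)  # visible biases

# sum the visible units out analytically, enumerate only the hidden configurations
Z = 0.0
for h in product([0, 1], repeat=n_hidden):
    h = np.array(h, dtype=float)
    Z += np.exp(b @ h) * np.prod(1.0 + np.exp(c + W.T @ h))
print(Z)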
Thank you, Hugo. Do you recommend any books on neural networks?
Oh definitely checkout Goodfellow, Bengio and Courville's Deep Learning book: www.deeplearningbook.org/
What an awesome job! Thanks for your lecture.
Awesome lecture! What software did you use to make this video?
Thanks! I used Camtasia for Mac for the recording. For my slides, I used Keynote and a free app for drawing on the screen (the one I used back then isn't available anymore, but there are other equivalents available).
The actual content starts at 3:14.
thank you
love it !! good job Hugo :)
Thanks!
What is a data-dependent regularizer?
Good question! A "normal" regularizer (like L2 weight decay) will not depend on the input data distribution (L2 weight decay penalizes the sum of the squared parameter values). In contrast, a data dependent regularizer would be one that penalizes certain parameter values but through a term that depends on the input distribution (i.e. the set of x^{(t)} in your dataset).
Hope this helps!
Thanks for the lecture! Recommended!
Where can I find these slides, please?
Here: www.dmi.usherb.ca/~larocheh/neural_networks/content.html
Cheers!
good content
Good lecture! Can you post the slides somewhere? Thx :)
All the slides, and the whole course in fact (with suggested readings and assignments) are available here:
info.usherbrooke.ca/hlarochelle/neural_networks/content.html
Hugo Larochelle Got it. Thanks again.
Hugo Larochelle Hi Hugo, I keep getting a timeout error for the link.
Yeah, my university is doing some maintenance this weekend. It will be back up on Monday, at the latest.
Thank you.
Thank you very much.
Can anyone tell me what the energy function is?
Don't get bogged down by the name. It's just a function. The only reason we call it an energy function is due to the analogy from physics. In physics, a configuration of the environment that has high energy will have low probability of being observed (and this is probably not a super accurate statement on my part... I'm not a physicist :-) ).
In an RBM, it's the same: configurations of x and h which have a high energy (i.e. E(x,h) is high), will have low probability under the RBM model.
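Numerically, the link is just p(x,h) = exp(-E(x,h)) / Z, so higher energy means exponentially lower probability. A toy illustration (pretending these three are the only possible configurations, so Z is just their sum):

import numpy as np
energies = np.array([-2.0, 0.0, 3.0])              # E(x,h) for three toy configurations
p = np.exp(-energies) / np.sum(np.exp(-energies))  # p(x,h) = exp(-E(x,h)) / Z
print(p)  # the high-energy configuration gets the smallest probability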
Hope this helps!
E = energy? I thought it was an error function. Helped me a lot though.
In this case it's the same, because we want to minimize the energy.
good job!
Recommendable
Hi Hugo,
The way you teach in these lectures is fascinating, so thanks a lot.
I was wondering if you also have some source code for implementing RBMs in Python.
I know that scikit-learn already provides an example of how to use it (scikit-learn.org/stable/auto_examples/neural_networks/plot_rbm_logistic_classification.html), but I am looking for examples using Theano, Keras, or a TensorFlow backend. So, I would really appreciate it if you could share some examples of implementing an RBM.
Thank you
Thanks for your kind words!!
For Theano: deeplearning.net/tutorial/rbm.html
For TensorFlow, I don't have any particular recommendation, but I'm sure by googling "TensorFlow RBM" you'll find plenty :-)
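In the meantime, here is a very rough numpy sketch of one contrastive-divergence (CD-1) update, just to show the shape of the computation (binary units assumed; placeholder data and hyperparameters, not production code):

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
b = np.zeros(n_hidden)   # hidden biases
c = np.zeros(n_visible)  # visible biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x):
    global W, b, c
    # positive phase: p(h|x) and a sample of h
    ph0 = sigmoid(W @ x + b)
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    # negative phase: reconstruct x, then recompute p(h|reconstruction)
    px1 = sigmoid(W.T @ h0 + c)
    x1 = (rng.random(n_visible) < px1).astype(float)
    ph1 = sigmoid(W @ x1 + b)
    # approximate gradient step on the log-likelihood
    W += lr * (np.outer(ph0, x) - np.outer(ph1, x1))
    b += lr * (ph0 - ph1)
    c += lr * (x - x1)

# placeholder "data": random binary vectors
for _ in range(100):
    cd1_update((rng.random(n_visible) > 0.5).astype(float))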
Thanks,
I really learned a lot from you. I hope you focus more on definitions and use less math.