Neural networks [5.1] : Restricted Boltzmann machine - definition

  • Published 27 Dec 2024

COMMENTS • 78

  • @gautamkarmakar3443
    @gautamkarmakar3443 8 years ago +4

    Used this lecture to understand the lecture given by Geoffrey Hinton in the NN course on Coursera. Thanks a ton, you saved me again.

  • @RudramurthyV
    @RudramurthyV 10 years ago

    @Jim O' Donoghue The numerator can be seen via exp(b+c) = exp(b)·exp(c) (the exponential of a sum is the product of the exponentials). When you apply this to the numerator, it turns out to be the equation mentioned at 10:26.
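
    Written out in the lecture's notation (with W_{j·} denoting the j-th row of W; the c^T x terms cancel between numerator and denominator), the step referred to above is:

      \exp(b^\top h + h^\top W x) = \exp\Big(\sum_j h_j (b_j + W_{j\cdot} x)\Big) = \prod_j \exp\big(h_j (b_j + W_{j\cdot} x)\big)

    so the numerator factorizes into one term per hidden unit, which is why p(h|x) becomes a product over the hidden units.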

  • @peterd3125
    @peterd3125 10 years ago +5

    great lecture Hugo, thanks for putting all this hard work into it - very well taught too!

  • @TheReficul
    @TheReficul 8 years ago +2

    Thanks for the explanation on the energy function. Everything suddenly starts to make sense.

  • @JaydeepDe
    @JaydeepDe 8 years ago +2

    Best lecture on RBM....Thanks Prof.

  • @randywelt8210
    @randywelt8210 8 years ago

    10:07, I have a problem deciding where to put Bayesian Networks and HMMs. Do they belong to unsupervised learning like in the video above, or to supervised learning, or do they simply form their own category in Machine Learning?

    • @hugolarochelle
      @hugolarochelle  8 years ago +1

      +Randy Welt Good question! Bayesian Networks and HMMs are in the family of directed graphical models, as opposed to undirected graphical models like the RBM. That's the main distinction.
      Note also that we could do either supervised or unsupervised learning, with either undirected or directed graphical models.

  • @yifanli2673
    @yifanli2673 8 years ago

    I really enjoyed watching this video. As I'm working on a project about DBN, this video is very useful for me. Thanks.

  • @keghnfeem4154
    @keghnfeem4154 9 years ago +33

    Sorry, I do not understand.

  • @minh1391993
    @minh1391993 8 years ago

    Dear Hugo, I am implementing an RBM, but I find the energy function of the joint probability at 6:10 confusing:
    E(x,h) = -( sum_{j,k} W_{j,k} h_j x_k + sum_k c_k x_k + sum_j b_j h_j ),
    i.e. the pairwise terms, plus the visible unit values times their biases, plus the hidden unit values times their biases.
    Therefore, how can we define Z, since all of the values of the visible and hidden units have already been used in E(x,h)?

    • @hugolarochelle
      @hugolarochelle  8 years ago +1

      Z is defined as the sum of exp(-E(x,h)), but over all possible values of x and h. There's otherwise no relationship between the x and h in E(x,h) and Z.

    • @minh1391993
      @minh1391993 8 years ago

      @Hugo: Assuming that I train a network with X1(0,1,1,1) and X2(0,0,0,1), and I then get H1(0,0,0,0,1) and H2(0,1,0,0,0) respectively, is Z = E(x1,h1) + E(x2,h2), right?

    • @minh1391993
      @minh1391993 8 years ago

      btw I am using C++ to implement RBM, so I am sorry if I ask so many questions and bother you :)

    • @hugolarochelle
      @hugolarochelle  8 years ago +1

      Ah, no! Z is the sum of exp(-E(x,h)) over *all possible values of x and h*, not just the values of x seen in your training set (see the brute-force sketch at the end of this thread). This is why we can't compute Z exactly in practice for even moderately large RBMs.

    • @minh1391993
      @minh1391993 8 years ago

      According to "A Practical Guide to Training RBMs", the partition function Z is given by summing over all possible pairs of visible and hidden vectors: Z = sum over (x,h) of exp(-E(x,h)). So how should I understand "all possible values of x and h", since for a given training example we have only one vector x?
      I noticed that "the reconstruction error is actually a very poor measure of the progress of learning", so if we can't compute Z exactly in practice, then what kind of measure can we use?
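
    To make the exchange above concrete, here is a minimal brute-force sketch in Python/NumPy (tiny, made-up dimensions; W, b, c are the weights and the hidden/visible biases as discussed above). It literally sums exp(-E(x,h)) over every binary configuration, which is exactly why Z becomes intractable for realistically sized RBMs:

      import itertools
      import numpy as np

      rng = np.random.default_rng(0)
      n_visible, n_hidden = 4, 3                              # tiny on purpose: 2^(4+3) = 128 terms
      W = rng.normal(scale=0.1, size=(n_hidden, n_visible))   # pairwise weights
      b = rng.normal(scale=0.1, size=n_hidden)                # hidden biases
      c = rng.normal(scale=0.1, size=n_visible)               # visible biases

      def energy(x, h):
          # E(x,h) = -h^T W x - c^T x - b^T h
          return -(h @ W @ x + c @ x + b @ h)

      # Z sums exp(-E(x,h)) over ALL 2^n_visible * 2^n_hidden binary configurations,
      # not just the (x,h) pairs obtained from a training set.
      Z = 0.0
      for x in itertools.product([0, 1], repeat=n_visible):
          for h in itertools.product([0, 1], repeat=n_hidden):
              Z += np.exp(-energy(np.array(x), np.array(h)))

      print("partition function Z =", Z)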

  • @valken666
    @valken666 10 years ago +6

    You're awesome for doing this.

  • @dombat44
    @dombat44 1 year ago

    Many thanks for the lecture, I found it really useful. I'm a bit confused about the notation on the slide entitled Markov Network View, though. Firstly, have you split the equation onto multiple lines just to make it a bit more readable, or is it significant that the unary factors are on different lines to the pairwise factors? Secondly, from my understanding of MNs, a distribution can be written as a product of the potentials defined by the cliques of the graph (up to a normalising constant). Since it's a pairwise MN, I can see that the pairwise factors are represented in the graph, but I can't see where the unary factors are represented. What am I missing?

    • @hugolarochelle
      @hugolarochelle  1 year ago

      Thanks for your kind words!
      Indeed, I split it across different lines for readability; the line on which each term appears doesn't matter.
      I agree that the MN representation doesn't make unary factors explicit, and I think that's the main benefit of the factor graph representation, which illustrates all potentials explicitly.
      Hope this helps!

    • @dombat44
      @dombat44 1 year ago

      @@hugolarochelle great, thanks for clearing that up.

  • @nigeldupaigel
    @nigeldupaigel 6 years ago

    b transpose and c transpose are the biases for the hidden and visible nodes, respectively

  • @bowindbox1132
    @bowindbox1132 4 years ago

    How is the energy function derived at @6:10?

    • @hugolarochelle
      @hugolarochelle  4 years ago

      Good question! For this video, we only provide the formula. But in the following videos, we show some properties that can be derived, thanks to the particular formulation of the energy function. So hopefully that'll help understand why this particular energy function is used. Hope this helps!

    • @bowindbox1132
      @bowindbox1132 4 years ago

      @@hugolarochelle Thanks. I was wondering if it is the general outcome of a plain MRF calculation of the joint probability. Considering the bipartite nature of the graph, we can assume there are two types of connections: hidden node to observed node, and output from hidden or input to hidden. So the rest of the random variables are independent. So by the sum of products, the product terms would only encode the joint probabilities. Does this make sense?

    • @hugolarochelle
      @hugolarochelle  4 years ago

      @@bowindbox1132 Indeed, an RBM is a type of MRF. It is one with a bipartite graph, binary random variables, pairwise potentials that correspond to h_j x_k W_{j,k}, and unary potentials that are h_j b_j and x_k c_k.
      Hope this helps!

    • @bowindbox1132
      @bowindbox1132 4 years ago

      @@hugolarochelle So we might assume the potential of a node is just the value at that node. For example, in a forward pass, say we are going from x_k to h_j via edge W_{k,j}: P(x_k, h_j) = potential on x_k \times potential to jump from x_k to h_j \times potential on h_j = x_k \times W_{k,j} \times h_j. The relation would be reversed when we calculate the total energy of the system backward (that is, while backpropagating). Is this correct?

    • @hugolarochelle
      @hugolarochelle  4 years ago

      @@bowindbox1132 Actually, the potentials can't directly give you these probabilities. Indeed, potentials in an MRF aren't necessarily normalized, as probabilities need to be.
      I can't write much here on how RBMs and MRFs are related, but I've found this Medium article that seems to discuss the relationship, and which perhaps will be useful to you: medium.com/datatype/restricted-boltzmann-machine-a-complete-analysis-part-2-a-markov-random-field-model-1a38e4b73b6d
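
    Spelling out the MRF view from this thread (this is just the factorization implied by the energy function, in the notation used above):

      p(x,h) = \frac{1}{Z} \exp(-E(x,h)) = \frac{1}{Z} \prod_{j,k} \exp(W_{j,k} h_j x_k) \prod_k \exp(c_k x_k) \prod_j \exp(b_j h_j)

    so the pairwise potentials are \exp(W_{j,k} h_j x_k) and the unary potentials are \exp(c_k x_k) and \exp(b_j h_j); as noted in the last reply, none of these potentials are normalized, which is why the global constant Z is needed.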

  • @JimODonoghue
    @JimODonoghue 10 years ago +1

    Don't really get why the numerator turns into a product around 10:26...

  • @XiaosChannel
    @XiaosChannel 9 years ago +1

    I think the use of single-letter symbols in formulas really obfuscates the meaning. We have much larger screens now and nobody does these calculations by hand, so why can't we just use the full word, or at least shorten it in a more meaningful way? Like instead of b_j and c_k, use Bh_i (i-th bias of hidden unit) and Bv_j (j-th bias of visible unit), or something like what we do in programming: vu[i].bias and hu[j].bias.

  • @pi5549
    @pi5549 8 years ago +1

    I am attempting to understand deep autoencoders. I've followed chapters 1 and 2. Can I omit chapters 3 and 4 (on CRFs)?

    • @hugolarochelle
      @hugolarochelle  8 years ago

      Yes, you should be fine without 3 and 4.

  • @igorjouravlev2643
    @igorjouravlev2643 4 years ago

    Very good explanation! Thanks a lot!

  • @stivstivsti
    @stivstivsti 7 years ago

    Please give a link to the next video, so we can understand what this is all for.

  • @janvonschreibe3447
    @janvonschreibe3447 6 years ago

    I can't see what the vectors *c* and *b* are.
    I watched the videos of the series on autoencoders first and I understood them, but I didn't watch the videos preceding this one. Did I miss something?

    • @hugolarochelle
      @hugolarochelle  6 years ago +1

      c and b are vectors of parameters, exactly like in autoencoders. In RBMs, they will be used differently than in autoencoders, but in both cases they are vectors of parameters.
      Hope this helps!

  •  7 years ago

    How can I decide a cut-off point for RBM results in the case of unsupervised learning?

    • @hugolarochelle
      @hugolarochelle  7 years ago

      Great question! Unfortunately there is no universal answer. It depends on what you are doing. For instance, if you are training features for classification, then you should periodically check on how discriminative the features are for your task, even if that's on a small subset of data.

  • @mahmoudalbardan2730
    @mahmoudalbardan2730 6 years ago

    Thank you, professor, for this video. I have two questions:
    1- How do we compute the distribution of the input vector x, for instance p(1,0,1)?
    2- Is it possible to feed the RBM a multi-valued input vector, i.e. where the possible values for each visible unit are {0,1,2,3}?
    Thank you in advance.

    • @hugolarochelle
      @hugolarochelle  6 years ago

      Hi!
      For 1, see video 5.3 for computing p(x) (ua-cam.com/video/e0Ts_7Y6hZU/v-deo.html)
      For 2, it is indeed possible to have units that aren't binary (e.g. categorical or multinomial). There is more than one way of doing this, not a single one.
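
      One common way to handle the second question (an illustrative choice, not something prescribed in the video): encode each multi-valued visible unit as a one-hot group of binary units, with one weight vector and bias per possible value, so that its conditional given h becomes a softmax:

        p(x_k = v \mid h) = \frac{\exp(c_{k,v} + \sum_j h_j W_{j,k,v})}{\sum_{v'} \exp(c_{k,v'} + \sum_j h_j W_{j,k,v'})}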

  • @osamahabdullah3715
    @osamahabdullah3715 3 years ago

    Thank you so much, your lectures are awesome.

  • @MLDawn
    @MLDawn 2 years ago

    p(x) is intractable, isn't it?

    • @hugolarochelle
      @hugolarochelle  2 years ago

      It is if both the input layer and the hidden layer are large. But if one is small (e.g. ~20 units), then it turns out we can compute the partition function in a reasonable amount of time.
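
      For context on that reply: the sum over one layer can be done analytically, unit by unit (this marginalization is derived later in the series, in video 5.3):

        \sum_h \exp(-E(x,h)) = \exp(c^\top x) \prod_j \big(1 + \exp(b_j + W_{j\cdot} x)\big)

      so computing Z only requires enumerating the configurations of the other layer, which is feasible when that layer is small (e.g. about 2^20 terms for ~20 visible units); the symmetric formula applies when it is the hidden layer that is small.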

  • @MatthewKleinsmith
    @MatthewKleinsmith 8 years ago

    Thank you, Hugo. Do you recommend any books on neural networks?

    • @hugolarochelle
      @hugolarochelle  8 years ago +3

      Oh definitely checkout Goodfellow, Bengio and Courville's Deep Learning book: www.deeplearningbook.org/

  • @王国鑫-m6t
    @王国鑫-m6t 9 years ago

    What an awesome job! Thanks for your lecture.

  • @anchitbhattacharya9125
    @anchitbhattacharya9125 5 years ago

    Awesome lecture! What software did you use for making this video?

    • @hugolarochelle
      @hugolarochelle  5 years ago

      Thanks! I used Camtasia for mac for the recording. For my slides, I used Keynote and some free app for supporting the drawing on the screen (the one I used then isn't available anymore, but there are other equivalents available).

  • @ghosh5908
    @ghosh5908 4 years ago

    Actual content starts at 3:14.

  • @simple_akira
    @simple_akira 8 years ago

    love it !! good job Hugo :)

  • @chiru6753
    @chiru6753 5 years ago

    What is a data-dependent regularizer?

    • @hugolarochelle
      @hugolarochelle  5 years ago +1

      Good question! A "normal" regularizer (like L2 weight decay) will not depend on the input data distribution (L2 weight decay penalizes the sum of the squared parameter values). In contrast, a data dependent regularizer would be one that penalizes certain parameter values but through a term that depends on the input distribution (i.e. the set of x^{(t)} in your dataset).
      Hope this helps!
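
      As a concrete illustration (an example only, not taken from the video): an L2 regularizer is \Omega(\theta) = \lambda \sum_i \theta_i^2, which involves only the parameters, whereas a data-dependent regularizer has the form \Omega(\theta) = \lambda \sum_t r(\theta; x^{(t)}), i.e. the penalty itself is computed on the training inputs, so it changes when the input distribution changes.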

  • @brunocosta8974
    @brunocosta8974 9 years ago

    Thanks for the lecture! Recommended!

  • @osamahabdullah3715
    @osamahabdullah3715 5 years ago

    Where can I find these slides, please?

    • @hugolarochelle
      @hugolarochelle  5 years ago

      Here: www.dmi.usherb.ca/~larocheh/neural_networks/content.html
      Cheers!

  • @quranicscience9631
    @quranicscience9631 5 years ago

    good content

  • @zejiazheng1573
    @zejiazheng1573 10 years ago +2

    Good lecture! Can you post the slides somewhere? Thx :)

    • @hugolarochelle
      @hugolarochelle  10 years ago +3

      All the slides, and the whole course in fact (with suggested readings and assignments) are available here:
      info.usherbrooke.ca/hlarochelle/neural_networks/content.html

    • @zejiazheng1573
      @zejiazheng1573 10 years ago +1

      Hugo Larochelle Got it. Thanks again.

    • @liltlefruitfly
      @liltlefruitfly 9 years ago

      Hugo Larochelle Hi Hugo I keep getting a timeout error for the link

    • @hugolarochelle
      @hugolarochelle  9 years ago

      Yeah, my university is doing some maintenance this weekend. It will be back up on Monday, at the latest.

  • @louatimohamedkameleddine6857
    @louatimohamedkameleddine6857 4 years ago

    Thank you.

  • @rafaellima8146
    @rafaellima8146 10 years ago

    Thank you very much.

  • @rayeric6323
    @rayeric6323 6 years ago

    Can anyone tell me what the energy function is?

    • @hugolarochelle
      @hugolarochelle  6 years ago +2

      Don't get bogged down by the name. It's just a function. The only reason we call it an energy function is due to the analogy from physics. In physics, a configuration of the environment that has high energy will have low probability of being observed (and this is probably not a super accurate statement on my part... I'm not a physicist :-) ).
      In an RBM, it's the same: configurations of x and h which have a high energy (i.e. E(x,h) is high), will have low probability under the RBM model.
      Hope this helps!

  • @abdirahmanhashi5800
    @abdirahmanhashi5800 6 years ago

    E = energy? I thought it was an error function. Helped me a lot though.

    • @MrCmon113
      @MrCmon113 4 years ago

      In this case it's the same, because we want to minimize the energy.

  • @jimmytsai8069
    @jimmytsai8069 11 years ago +1

    good job!

  • @bingeltube
    @bingeltube 6 years ago

    Recommendable

  • @bahareh_mtl3472
    @bahareh_mtl3472 7 years ago

    Hi Hugo,
    Your way of teaching in these lectures is so fascinating, so thanks a lot.
    I was wondering if you also have some source code for implementing RBMs in Python.
    I know that scikit-learn already provides an example of how to use its implementation (scikit-learn.org/stable/auto_examples/neural_networks/plot_rbm_logistic_classification.html), but I am looking for examples using Theano, Keras, or a TensorFlow backend. So, I would really appreciate it if you could share some examples of implementing RBMs.
    Thank you

    • @hugolarochelle
      @hugolarochelle  7 years ago

      Thanks for your kind words!!
      For Theano: deeplearning.net/tutorial/rbm.html
      For TensorFlow, I don't have any particular recommendation, but I'm sure by googling "TensorFlow RBM" you'll find plenty :-)
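
      For readers who want a framework-free starting point, here is a minimal NumPy sketch (sizes and initialization are made up) of the two conditionals, p(h|x) and p(x|h), that an RBM implementation is built around, plus one Gibbs step. It is only a sketch in the notation used in these lectures, not the tutorial code linked above:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        rng = np.random.default_rng(0)
        n_visible, n_hidden = 784, 100                           # made-up sizes
        W = rng.normal(scale=0.01, size=(n_hidden, n_visible))   # connection weights
        b = np.zeros(n_hidden)                                   # hidden biases
        c = np.zeros(n_visible)                                  # visible biases

        def sample_h_given_x(x):
            # p(h_j = 1 | x) = sigm(b_j + W_{j.} x); hidden units are conditionally independent given x
            p = sigmoid(b + W @ x)
            return (rng.random(n_hidden) < p).astype(float), p

        def sample_x_given_h(h):
            # p(x_k = 1 | h) = sigm(c_k + h^T W_{.k})
            p = sigmoid(c + W.T @ h)
            return (rng.random(n_visible) < p).astype(float), p

        # one step of Gibbs sampling starting from a random visible vector
        x0 = (rng.random(n_visible) < 0.5).astype(float)
        h0, _ = sample_h_given_x(x0)
        x1, p_x1 = sample_x_given_h(h0)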

  • @ahmedabdelfattah443
    @ahmedabdelfattah443 9 years ago

    Thanks,
    I really learned a lot from you. Hope you focus more on definitions and use less math.