CS231n Winter 2016: Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM

  • Published 7 Feb 2016
  • Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 10.
    Get in touch on Twitter @cs231n, or on Reddit /r/cs231n.
    Our course website is cs231n.stanford.edu/

COMMENTS • 51

  • @leixun
    @leixun 3 years ago +14

    *My takeaways:*
    1. Recurrent Neural Networks (RNNs) 2:05
    2. Image captioning 31:10
    3. Long short term memory (LSTM) 43:31
    4. Summary 1:08:21

  • @jiahanliu
    @jiahanliu 4 years ago +7

    Even though new architectures such as attention networks are coming into play, these lectures convey information that's still relevant to understanding the newest models. There's something timeless about them. CS231n by Karpathy is the best overall NN material I've ever come across.

  • @jion1ar
    @jion1ar 6 years ago +27

    LOL when Andrej said he's confused by LSTM diagrams.
    Same feeling here, and you gave a very awesome LSTM tutorial. Good job, man!

  • @mannemsaisivadurgaprasad8987
    @mannemsaisivadurgaprasad8987 7 months ago

    One of the best videos on RNNs; it explains the code perfectly from scratch.

  • @Gods_Of_Multiverses661
    @Gods_Of_Multiverses661 8 years ago +1

    Can't wait to see future videos. Looking forward to the video applications.

  • @lukealexanderhwilson
    @lukealexanderhwilson 2 years ago +3

    I was trying to conceptualize why features arise in an RNN the way key features seem to arise in CNNs, which use filters to preserve spatial relationships. (27:10)
    If you think about it, a recurrent neural network works, and succeeds, for reasons very similar to a convolutional neural network. A CNN preserves spatial information using filters that slide across an image to produce outputs; these sliding filters allow features to be trained and reused downstream as inputs to broader classifications. An RNN is essentially doing the same thing, sliding one shared set of weights over a sequence of inputs. The difference is that the input isn't a predefined image to be swept by sliding filters; instead the sequence has an indeterminate length, and each output feeds back into the network as an additional input.
    For me, that's what made it click why this neural network architecture fundamentally works.
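
    A minimal numpy sketch of that view (variable names like W_xh and W_hh are mine, not the lecture code): one shared set of weights is applied at every position in the sequence, much like one conv filter is applied at every spatial location.

      import numpy as np

      # one shared set of parameters, reused at every position in the sequence
      W_xh = np.random.randn(64, 10) * 0.01   # input (10-d) -> hidden (64-d)
      W_hh = np.random.randn(64, 64) * 0.01   # hidden -> hidden (the recurrence)
      b_h = np.zeros((64, 1))

      def rnn_step(x, h_prev):
          # the same weights at every step, analogous to a filter sliding over an image
          return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

      h = np.zeros((64, 1))
      sequence = [np.random.randn(10, 1) for _ in range(5)]   # any length works
      for x in sequence:
          h = rnn_step(x, h)   # each step's output feeds back in as the next step's state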

  • @marxman1010
    @marxman1010 7 years ago

    At 22:54, there is a sample of text generated by the RNN. After looking carefully at the source shown in the video (which is on GitHub), I found that all the generated text is seeded from the first character of the input text. The whole process is roughly:
    1. Use the input text (the samples in the video, like the algebra geometry book, etc.) to train the RNN.
    2. Feed the first character back into the RNN to generate text (or feed in any single character from the input text).
    The point is that a single character of the input text can generate input-text-like output.
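
    For reference, the generation step looks roughly like this (a paraphrased sketch in the style of min-char-rnn, with the weights passed in explicitly; not the exact gist code):

      import numpy as np

      def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by, vocab_size):
          # step 2: seed with a single character index, then sample n characters
          x = np.zeros((vocab_size, 1))
          x[seed_ix] = 1
          ixes = []
          for _ in range(n):
              h = np.tanh(Wxh @ x + Whh @ h + bh)      # recurrence learned in step 1 (training)
              y = Why @ h + by
              p = np.exp(y) / np.sum(np.exp(y))        # softmax over the vocabulary
              ix = np.random.choice(range(vocab_size), p=p.ravel())
              x = np.zeros((vocab_size, 1))
              x[ix] = 1                                # the sampled character becomes the next input
              ixes.append(ix)
          return ixes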

  • @theempire00
    @theempire00 8 years ago +15

    After these lectures I'll be a deep learning expert!

    • @top5samples
      @top5samples 8 years ago +10

      +apple-sauce You are so naive :)

    • @theempire00
      @theempire00 8 years ago +15

      +Qlt Trash I know I know, was being facetious ;-)

  • @FalguniDasShuvo
    @FalguniDasShuvo 2 years ago +3

    Awesome lecture!

  • @VolatilityEdgeRunner
    @VolatilityEdgeRunner 8 years ago +1

    Thanks! RNN, finally!

  • @clairelolification
    @clairelolification 4 years ago +1

    we need more of this

  • @praveenkumarchandaliya1900
    @praveenkumarchandaliya1900 6 years ago +1

    Excellent lecture!
    Can you make a video about the internal workings of an LSTM encoder and decoder, with images?

  • @hanyel-ghaish6836
    @hanyel-ghaish6836 8 years ago

    Thanks for this series. I want to ask how I can apply an RNN or LSTM to action classification. I also have trouble combining a CNN + RNN for that same purpose. I wish you had examples that I could follow and learn from.

  • @Prithviization
    @Prithviization 8 years ago

    Hi! Can you please give me the code for backpropagating through both networks? How do you simultaneously update the weights of both the RNN and VGGNet?
    Also, how would this ensemble learn which features to use, based on its *caption*?

  • @chiragmittal5744
    @chiragmittal5744 6 years ago

    So what are the matrices we get after building the RNN/LSTM model? Referring to the code, we require W_xh, W_hh and W_hy, right? Don't we also require the hidden state value of the model after it is trained?

  • @DeepGamingAI
    @DeepGamingAI 8 years ago

    How exactly does this approach capture what's going on in the image? To me it seems like it just learns the objects (classes) in the input image and, based on the training data, guesses the interaction between those objects. I'm not too sure how it is able to capture the "verb" from an image, i.e. how the objects are actually interacting. Does anyone have any clue about that?

  • @eason_longleilei
    @eason_longleilei 6 years ago

    AWESOME

  • @JasonBlank
    @JasonBlank 2 years ago

    The convergence of LSTMs and ResNets makes sense if you consider that they seem a lot more biological: forget gates are like inhibitors, and skip connections are like long dendrites. I'm using "like" very loosely here. Little wonder.

  • @MrScattterbrain
    @MrScattterbrain 4 years ago +1

    An RNN trained on Shakespeare's sonnets somehow generated "Natasha" and "Pierre"? That sounds weird; I think what is shown around 22:55 actually comes from Tolstoy.

  • @ralphblanes8370
    @ralphblanes8370 8 years ago

    Why does an RNN perform better at image captioning when you only feed it the image data at the start? Also, thanks for making these lectures available publicly!

    • @Prithviization
      @Prithviization 8 years ago +2

      +Rafael Blanes They fed the captions as well in the training phase. EXAMPLE=> (target0: "", target1 : "", target2: "", target3: "end")
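
      In other words, at training time the ground-truth caption is shifted by one position and used as both input and target (teacher forcing). A hypothetical sketch, with tokens invented for illustration:

        # caption "a cat sits", wrapped in start/end tokens
        inputs  = ['<START>', 'a', 'cat', 'sits']   # what the RNN sees at each timestep
        targets = ['a', 'cat', 'sits', '<END>']     # what it is trained to predict next
        # the image features enter only once (e.g. to condition the initial hidden state),
        # and the training loss is the sum of per-step softmax losses over `targets`.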

  • @susiesargsyan2965
    @susiesargsyan2965 7 years ago

      Could you please explain why the dimension of W for the LSTM is 4n x 2n? I was thinking that you still stack the W_hh and W_xh matrices together, and that should form an n x 2n matrix, but I am probably missing something. Thanks.

    • @min-hobyun6753
      @min-hobyun6753 6 years ago +1

      The shape of W is (output dimension) x (input dimension). In an LSTM you need 4 outputs (i, f, o, g), each an n-dimensional vector, which gives the 4n rows; the input is the stacked [h, x], with each part n-dimensional, which gives the 2n columns.
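
      Concretely, as a sketch (assuming x is also n-dimensional, as on the slide): stack [h_prev, x] into a 2n-vector, make W a 4n x 2n matrix, and split the 4n-dimensional result into the four gates.

        import numpy as np

        n = 128
        W = np.random.randn(4 * n, 2 * n) * 0.01       # 4n x 2n

        def lstm_step(x, h_prev, c_prev):
            z = W @ np.concatenate([h_prev, x])        # (4n,) = (4n x 2n) @ (2n,)
            i = 1 / (1 + np.exp(-z[0*n:1*n]))          # input gate
            f = 1 / (1 + np.exp(-z[1*n:2*n]))          # forget gate
            o = 1 / (1 + np.exp(-z[2*n:3*n]))          # output gate
            g = np.tanh(z[3*n:4*n])                    # candidate cell update
            c = f * c_prev + i * g                     # new cell state
            h = o * np.tanh(c)                         # new hidden state
            return h, c

        h, c = lstm_step(np.random.randn(n), np.zeros(n), np.zeros(n))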

    • @susiesargsyan2965
      @susiesargsyan2965 6 years ago

      Thanks!

  • @jeetsensarma3033
    @jeetsensarma3033 6 years ago

    Can anyone help me with the GitHub link used in this video?

  • @randywelt8210
    @randywelt8210 8 years ago

    14:50 Is the hidden layer model also called a hidden Markov model?

    • @cadeop
      @cadeop 8 years ago +2

      +Randy Welt Nope, HMMs are other stuff!

  • @essamal-mansouri2689
    @essamal-mansouri2689 6 years ago

    Can someone help me understand why dss[t] at (1:04:20) is calculated that way? Shouldn't it be hs[t] * dhs[t]?

    • @AvinashRanganath
      @AvinashRanganath 2 years ago

      From the second line in the first for-loop (the forward pass) you can see that a ReLU activation is applied after the linear operation on the previous line. So the first step in the backward pass has to go back through this ReLU, which is what the first line in the second for-loop (the backward pass) does.
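
      A tiny sketch of that first backward step, assuming ss[t] is the pre-activation and hs[t] = max(0, ss[t]) in the forward pass (names follow the comment above, not necessarily the slide exactly):

        import numpy as np

        hs_t = np.array([0.7, 0.0, 2.1])     # post-ReLU hidden state at step t (illustrative values)
        dhs_t = np.array([0.3, -1.2, 0.5])   # gradient flowing back into hs[t]

        # ReLU only passes gradient where the unit was active, so the backward step
        # is a mask applied to dhs[t], not a multiplication by hs[t]:
        dss_t = dhs_t * (hs_t > 0)           # -> array([0.3, 0. , 0.5])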

  • @krishnamishra8598
    @krishnamishra8598 4 years ago

    HELP!!! In an RNN we have only 3 unique weight matrices, so during backprop there will be only 3 parameter groups to update. Why, then, does the RNN backpropagate all the way to the 1st input, creating long-term dependencies and thereby the vanishing gradient problem?

    • @AvinashRanganath
      @AvinashRanganath 2 years ago +1

      Let's consider an RNN with one hidden layer of 10 hidden units and a sequence length of 25. Unrolling this network, you can think of it as a network that is 25 hidden layers deep (call them hiddenLayer_t=1, hiddenLayer_t=2, …, hiddenLayer_t=25). Each hidden layer has two output connections: one to the layer above (as in a plain feed-forward network; the last hidden layer has no layer above), and one to an output layer (call it outputLayer_t, which produces only the t-th element of the output sequence). Each hidden layer also has two input connections: one from the layer below (as in a plain network), and one from the inputs. Crucially, all the hidden-to-hidden connections share one weight matrix (Whh), as do the input-to-hidden connections (Wxh) and the hidden-to-output connections (Why).
      Now, when back-propagating from the final output layer (i.e., outputLayer_t=25), the gradients have to trickle down to the first hidden layer (i.e., hiddenLayer_t=1), and along the way the weight matrices (Why, Whh and Wxh) have to be updated with the gradients calculated at each hidden layer, since they are shared by all the hidden layers! If, on the other hand, the weight matrices were updated based only on the gradients calculated at outputLayer_t=25 and hiddenLayer_t=25, that would be like training the network to predict only the 25th character of the sequence correctly (and it might not even manage that).
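
      A minimal numpy sketch of that point (illustrative names, assuming a tanh nonlinearity and a loss arriving only at the last step): the single shared Whh accumulates a gradient contribution from every one of the 25 unrolled steps, and the repeated multiplication by Whh.T on the way back is what makes gradients vanish or explode.

        import numpy as np

        T, n = 25, 10
        Wxh = np.random.randn(n, n) * 0.01
        Whh = np.random.randn(n, n) * 0.01
        xs = [np.random.randn(n, 1) for _ in range(T)]

        # forward pass through the unrolled network (hiddenLayer_t=1 ... 25)
        hs = {-1: np.zeros((n, 1))}
        for t in range(T):
            hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])

        # backward pass: the single shared Whh gets a contribution at every timestep
        dWhh = np.zeros_like(Whh)
        dh = np.random.randn(n, 1)         # stand-in for the loss gradient at t = T-1
        for t in reversed(range(T)):
            dpre = (1 - hs[t] ** 2) * dh   # backprop through tanh
            dWhh += dpre @ hs[t - 1].T     # shared weights: 25 contributions add up
            dh = Whh.T @ dpre              # repeated multiplication by Whh.T -> vanishing/exploding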

  • @nimarashidi4510
    @nimarashidi4510 2 years ago

    Nice

  • @aashudwivedi
    @aashudwivedi 6 years ago +1

    Here's the gist referred to in the lecture: gist.github.com/karpathy/d4dee566867f8291f086

  • @jg9193
    @jg9193 5 years ago +1

    2.2 is not higher than 4.1

  • @theamazingjonad9716
    @theamazingjonad9716 7 years ago

    I don't understand the CNN + RNN architecture... I'm lost in the transition from the CNN to the RNN...

    • @48956l
      @48956l 7 years ago +2

      hidden state 0 is just a linear transformation of the flattened output of the CNN.
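
      Something like this, as a hypothetical sketch (W_ih, cnn_feat and the sizes are made up for illustration, not the assignment's actual names):

        import numpy as np

        cnn_feat = np.random.randn(4096)           # e.g. a flattened FC-layer output of the CNN
        W_ih = np.random.randn(512, 4096) * 0.01   # learned projection to the RNN's hidden size
        b_ih = np.zeros(512)

        h0 = W_ih @ cnn_feat + b_ih                # initial hidden state of the captioning RNN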

  • @alexandrogomez5493
    @alexandrogomez5493 1 year ago

    Assignment 8

  • @abc3631
    @abc3631 3 years ago

    32:17

  • @victorburnett6329
    @victorburnett6329 6 years ago +8

    This is so much more watchable at 0.75 speed.

    • @asdfasdfuhf
      @asdfasdfuhf 5 years ago +1

      WRONG

    • @Constantinesis
      @Constantinesis 5 years ago +1

      Speaking of that, YouTube should use LSTM machine learning to do better voice processing when the video speed is shifted.

  • @user-bp1lc2px6m
    @user-bp1lc2px6m 11 months ago

    8

  • @48956l
    @48956l 7 years ago +4

    Is it just me, or was that explanation of LSTM pretty unhelpful?

  • @siddharthagrawal8300
    @siddharthagrawal8300 6 years ago

    Can anyone explain the math behind differentiating the LSTM functions, please? I want to know how the backprop works. I need it for my math school project! Please help!!

    • @notabee4532
      @notabee4532 5 years ago +3

      you have LSTMs as your school math project...*mind blown*

    • @siddharthagrawal8300
      @siddharthagrawal8300 5 years ago

      Rishabh Baghel is it a bad one?