CS231n Winter 2016: Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM
- Published Feb 7, 2016
- Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 10.
Get in touch on Twitter @cs231n, or on Reddit /r/cs231n.
Our course website is cs231n.stanford.edu/
*My takeaways:*
1. Recurrent Neural Networks (RNNs) 2:05
2. Image captioning 31:10
3. Long short term memory (LSTM) 43:31
4. Summary 1:08:21
Even though new architectures such as attention networks are coming into play, these lectures convey information that's still relevant to understanding the newest models. There's something timeless about these lectures. CS231n by Karpathy is the best overall material on neural networks I've ever come across.
LOL when Andrej said he's confused by diagrams of LSTM.
Same feeling and you gave a very awesome tutorial of LSTM. Good job man!
One of the best videos on RNNs that explains the code perfectly from scratch.
Can't wait to see future videos. Looking forward to the ones on applications.
I was trying to conceptualize why features would arise in an RNN the way that key features seem to arise in CNNs, which use filters to preserve spatial relationships. (27:10)
If you think about it, a recurrent neural network works, and succeeds, for reasons very similar to a convolutional neural network's. A CNN preserves spatial information using filters that slide across an image to produce outputs. These sliding filters allow features to be trained and used downstream as inputs for broader classifications. An RNN is essentially doing the same thing with a sliding window over a sequence of inputs. The difference is that the input isn't predefined as an image to be traversed by sliding filters; instead the sequence has an indeterminate ending, and each output feeds back into the network as an additional input.
That helped me see why this neural network architecture fundamentally works.
At 22:54, there is a sample of text generated by the RNN. After looking carefully at the source shown in the video (which is open on GitHub), I found that all of the generated text is seeded from the first character of the input text. The whole process is as follows:
1. Use the input text (the samples in the video, like the algebra geometry book etc.) to train the RNN.
2. Feed the first character back into the RNN to generate text. (Or feed in any one character from the input text.)
The point is that a single character of the input text can generate input-text-like text.
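The sampling loop in the min-char-rnn gist referenced in the lecture works roughly like this (a simplified numpy sketch; the weight names Wxh, Whh, Why, bh, by follow the gist, but the tiny vocabulary and random weights below are placeholders for illustration only):

```python
import numpy as np

def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by, vocab_size):
    """Seed the RNN with one character index, then generate n more.

    Each sampled character is fed back in as the next input, which is
    why a single seed character can produce input-text-like text.
    """
    x = np.zeros((vocab_size, 1))
    x[seed_ix] = 1                          # one-hot encode the seed character
    ixes = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # recurrence
        y = Why @ h + by                     # unnormalized log-probs
        p = np.exp(y) / np.sum(np.exp(y))    # softmax over the vocabulary
        ix = np.random.choice(vocab_size, p=p.ravel())
        x = np.zeros((vocab_size, 1))
        x[ix] = 1                            # feed the sample back in
        ixes.append(ix)
    return ixes

# toy usage with random (untrained) weights
rng = np.random.default_rng(0)
V, H = 5, 8
out = sample(np.zeros((H, 1)), 0, 10,
             0.1 * rng.standard_normal((H, V)),
             0.1 * rng.standard_normal((H, H)),
             0.1 * rng.standard_normal((V, H)),
             np.zeros((H, 1)), np.zeros((V, 1)), V)
```

With trained weights, the same loop is what produces the generated text shown at 22:54.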
After these lectures I'll be a deep learning expert!
+apple-sauce You are so naive :)
+Qlt Trash I know I know, was being facetious ;-)
Awesome lecture!
Thanks! RNN, finally!
we need more of this
Excellent Lecture
Can you make a video on the internal workings of an LSTM encoder and decoder, with images?
Thanks for this series. I want to ask how I can apply an RNN or LSTM to action classification. I also have trouble combining a CNN + RNN for the same action-classification task. I wish you had examples that I could follow and learn from.
Hi! Can you please share the code for backpropagating through both networks? How do you simultaneously update the weights of both the RNN and VGGNet?
Also, how would this ensemble learn which features to use, based on its *caption*?
So which matrices do we get after building the RNN/LSTM model? Referring to the code, we require W_xh, W_hh and W_hy. Don't we also require the hidden state of the model after it is trained?
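For the vanilla RNN in the lecture, the learned parameters are just W_xh, W_hh and W_hy (plus biases, omitted here); the hidden state is not a learned parameter — it is recomputed at run time, usually starting from zeros. A minimal sketch (names follow the lecture code; sizes and random weights are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
H, D, V = 4, 3, 6                    # hidden size, input size, output size
Wxh = 0.1 * rng.standard_normal((H, D))
Whh = 0.1 * rng.standard_normal((H, H))
Why = 0.1 * rng.standard_normal((V, H))

def rnn_forward(xs, h=None):
    # h is state, not a parameter: start from zeros and update as we go
    if h is None:
        h = np.zeros((H, 1))
    ys = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)  # the same weights are reused every step
        ys.append(Why @ h)
    return ys, h

xs = [rng.standard_normal((D, 1)) for _ in range(5)]
ys, h_final = rnn_forward(xs)
```

You only need to save h_final if you want to continue the same sequence later; for a fresh sequence you start from zeros again.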
How exactly does this approach capture what's going on in the image? To me it seems like it just learns objects (classes) from the input image and, based on the training data, guesses the interaction between these objects. I'm not sure how it is able to capture the "verb" from an image, i.e. how the objects are actually interacting. Anyone have a clue about that?
AWESOME
LSTM and ResNet convergence makes sense if you consider that they seem a lot more biological: forget gates are like inhibitors, skip connections are like long dendrites. I'm using "like" very loosely here. Little wonder.
An RNN trained on Shakespeare's sonnets somehow generated "Natasha" and "Pierre"? That sounds weird; I think what is shown around 22:55 comes from Tolstoy.
Why does an RNN perform well at image captioning when you only feed it the image data at the start? Also, thanks for making these lectures available publicly!
+Rafael Blanes They fed the captions in as well during the training phase, i.e. the target at each step is the next caption word, with an "end" token as the final target.
Could you please explain why the dimension of W for the LSTM is 4n × 2n? I was thinking that you still stack the W_hh and W_xh matrices together, which should give n × 2n, but I'm probably missing something. Thanks.
The dimension of W is decided as input dimension × output dimension. In an LSTM you need 4 outputs (i, f, o, g), each an n-dimensional vector.
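In code that exchange looks like this: stacking [h; x] (length 2n) and producing all four gate pre-activations (total length 4n) in one matrix multiply is exactly why W is 4n × 2n. A sketch, assuming x has the same dimension n as h, as in the lecture's diagram:

```python
import numpy as np

def lstm_step(x, h, c, W):
    """One LSTM step with a single stacked weight matrix W of shape (4n, 2n)."""
    n = h.shape[0]
    z = W @ np.concatenate([h, x])       # (4n,) pre-activations, one matmul
    i = 1 / (1 + np.exp(-z[:n]))         # input gate (sigmoid)
    f = 1 / (1 + np.exp(-z[n:2*n]))      # forget gate (sigmoid)
    o = 1 / (1 + np.exp(-z[2*n:3*n]))    # output gate (sigmoid)
    g = np.tanh(z[3*n:])                 # candidate cell update (tanh)
    c = f * c + i * g                    # new cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c

n = 3
rng = np.random.default_rng(2)
W = 0.1 * rng.standard_normal((4 * n, 2 * n))
h, c = lstm_step(rng.standard_normal(n), np.zeros(n), np.zeros(n), W)
```

Splitting z into four n-sized chunks recovers the separate i, f, o, g matrices you would otherwise store individually.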
Thanks!
Can anyone help me with the GitHub link used in this video?
14:50 — is the hidden-layer model also called a hidden Markov model?
+Randy Welt Nope, HMMs are a different thing!
Can someone help me understand why is dss[t] at (1:04:20) calculated that way? Shouldn't it be hs[t] * dhs[t]?
From the second line in the first for-loop (the forward pass part) you can see that there is ReLU activation applied, after the linear operation in the previous line. So, the first step in the backward pass would be through this ReLU function, which is what the first line in the second for-loop (the backward pass part) does.
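In code, backpropagating through a ReLU just zeroes the upstream gradient wherever the forward pre-activation was negative (a generic sketch with made-up values, not the exact lecture variables):

```python
import numpy as np

ss = np.array([-1.0, 0.5, 2.0, -0.3])  # pre-activations from the forward pass
hs = np.maximum(0, ss)                  # forward: ReLU
dhs = np.array([1.0, 1.0, 1.0, 1.0])    # upstream gradient on hs
dss = dhs * (ss > 0)                    # backward: gradient passes only where ReLU was active
```

So dss is computed from the upstream gradient and a mask, not as hs[t] * dhs[t].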
HELP!!! In an RNN we have only 3 unique weight matrices, so during backprop there are only 3 parameters to update. Why then does backprop in an RNN go all the way back to the first input, creating long-term dependencies and thereby the vanishing-gradient problem?
Let's consider an RNN with one hidden layer and 10 hidden units, and a sequence length of 25. Unrolling this network, you can think of it as a network that is 25 hidden layers deep (let's label them hiddenLayer_t=1, hiddenLayer_t=2, …, hiddenLayer_t=25). At each hidden layer (except the last), there are two output connections: one connecting to the layer above (as in a plain network), and another connecting to an output layer (call it outputLayer_t, which outputs only the t-th element of the output sequence). Each hidden layer also has two input connections: one from the layer below (as in a plain network), and one from the inputs. Here, all the hidden-to-hidden connections share weights (Whh), as do the input-to-hidden connections (Wxh) and the hidden-to-output connections (Why).
Now, backpropagating from the final output layer (i.e., outputLayer_t=25), the gradients have to trickle down to the first hidden layer (i.e., hiddenLayer_t=1), and in the process the weight matrices (Why, Whh and Wxh) have to be updated with the gradients calculated at each hidden layer, since they are shared by all the hidden layers! On the other hand, if the weight matrices were updated based only on the gradient calculated at outputLayer_t=25 and hiddenLayer_t=25, then it would be like training the network to predict only the 25th character in the sequence correctly (and it might not even be able to do that).
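The unrolled backward pass described above can be sketched as follows (simplified numpy, losses at intermediate steps omitted; the `+=` on the shared matrices is the key point):

```python
import numpy as np

rng = np.random.default_rng(3)
H, D, T = 4, 3, 25
Wxh = 0.1 * rng.standard_normal((H, D))
Whh = 0.1 * rng.standard_normal((H, H))

# forward: run the SAME weights at every timestep, caching the states
xs = [rng.standard_normal((D, 1)) for _ in range(T)]
hs = [np.zeros((H, 1))]
for x in xs:
    hs.append(np.tanh(Wxh @ x + Whh @ hs[-1]))

# backward: gradients from ALL timesteps accumulate into the shared matrices
dWxh, dWhh = np.zeros_like(Wxh), np.zeros_like(Whh)
dh = rng.standard_normal((H, 1))       # stand-in for the loss gradient at the last step
for t in reversed(range(T)):
    draw = (1 - hs[t + 1] ** 2) * dh   # backprop through tanh
    dWxh += draw @ xs[t].T             # += because the weights are shared across time
    dWhh += draw @ hs[t].T
    dh = Whh.T @ draw                  # pass the gradient to the previous timestep
```

The repeated multiplication by Whh.T in the last line is also where the vanishing (or exploding) gradient comes from.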
Can anyone give me a mathematical explanation of differentiating the LSTM functions, please? I want to know how the backprop works. I need it for my school math project. Please help!
Nice
here's the gist referred in the lecture : gist.github.com/karpathy/d4dee566867f8291f086
2.2 is not higher than 4.1
I don't understand the CNN + RNN architecture... I'm lost in the transition from the CNN to the RNN...
Hidden state h0 is just a linear transformation of the flattened output of the CNN.
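That transition is just one extra weight matrix; a sketch (the name Wih is hypothetical, and the random vector v stands in for a CNN feature output such as VGG's fc7). Because h0 = Wih @ v is differentiable, the caption loss can also backpropagate through v into the CNN, which is how the two networks are trained jointly:

```python
import numpy as np

rng = np.random.default_rng(4)
F, H = 4096, 512                  # CNN feature dim (e.g. fc7), RNN hidden dim
v = rng.standard_normal((F, 1))   # flattened CNN output for one image
Wih = 0.01 * rng.standard_normal((H, F))

h0 = Wih @ v                      # initial hidden state of the caption RNN

# backprop: the gradient on h0 flows both into Wih and back into the CNN features
dh0 = rng.standard_normal((H, 1))
dWih = dh0 @ v.T                  # update for the image-to-hidden matrix
dv = Wih.T @ dh0                  # this continues down into the CNN's layers
```

So there is no special glue between the two networks, just a matrix multiply in the middle of one big differentiable graph.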
Homework 8
32:17
This is so much more watchable at 0.75 speed.
WRONG
Speaking of that, YouTube should use LSTM machine learning to do better voice processing when the video speed is shifted.
8
Is it just me or was that explanation of LSTM pretty unhelpful
Can anyone give me a mathematical explanation of differentiating the LSTM functions, please? I want to know how the backprop works. I need it for my math school project. Please help!
you have LSTMs as your school math project...*mind blown*
Rishabh Baghel is it a bad one?