*DeepMind x UCL | Deep Learning Lectures | 6/12 | Sequences and Recurrent Networks*
*My takeaways:*
*1. Plan for this lecture **0:36*
*2. Motivation **1:31*
2.1 What is a sequence and why the methods in previous lectures (Feedforward networks and Convolutional networks) don't work on them 2:06
2.2 Sequences are everywhere 3:33
2.3 Summary 4:45
*3. Fundamentals **5:24*
3.1 Modelling word probabilities is very hard 5:28
3.2 Architectures of sequence models
-Recurrent Neural Networks (RNNs) 23:25; Train RNNs 26:42; N-gram vs Addition vs RNN 35:21
-Long Short-Term Memory (LSTM) networks 37:30
-Gated Recurrent Unit (GRU) nets: GRU can be seen as a simplified LSTM 42:39
3.3 N-gram vs Addition vs RNN vs LSTM 43:11
*4. Generating sequences **43:46*
4.1 Using a trained model 45:00
4.2 Examples:
-Images as sequences: PixelRNN 46:32
-Natural language as sequences: sequence-to-sequence (Seq2Seq) models 50:26
-Other Seq2Seq applications: machine translation; image captioning 53:01
-Audio waves as sequences: using convolutions 1:00:20
--N-gram vs Addition vs RNN vs LSTM vs Conv 1:01:51
-Policies as sequences: reinforcement learning in gaming 1:02:29
-Attention in sequences: transformers 1:04:33
--Convolutions vs transformers: an animation 1:05:05
--e.g. GPT2 1:07:05
--N-gram vs Addition vs RNN vs LSTM vs Conv vs transformers 1:09:02
--RNNs in 2011 vs transformers in 2019 1:09:28
*5. Summary **1:10:52*
*6. Q&A **1:11:51*
The channel needs to add this info to the video description. Thanks for adding it here!
@@Dirhfifkshdi You are welcome!
This is by a wide margin the best lecture in the series so far. Marta explains things well and doesn't come across as insecure, without feeling the need to prove her technical proficiency with cryptic unlabeled equations 🙂 sorry if I sound a little disappointed with the previous lectures 😂
Highly subjective. My favourites until now were lectures 2 and 5. In this lecture I got distracted by the speaker skipping equations (some functions that were to be differentiated were not even properly defined) and asking whether the audience understood what she had just said. (Still, the quality is very good.)
This is the best explanation of RNNs I have ever seen
Please upload the remaining lectures also.
That was incredible, it explained so cleanly the concept I was struggling with for a while, thank you so much Marta
Very clear lecture, thank you. One thing that was somewhat less clear was the transformers part at the end: the comparison to convolutions made it seem like transformers are merely simple densely connected networks.
The quality of GPT2 sequence prediction is truly astonishing!
And GPT3 is nothing short of terrifying!
Amazing buildup of intuition. I wish I saw this long ago.
Amazing series! Although the remaining lectures in the series were not uploaded, I like that each lecture was self-contained. For those looking for more resources, there are a lot of other great posts around YouTube: David Silver's reinforcement learning series, Oxford NLP, and UCLxAdvancedML are other places to get more info. Additionally, Berkeley and Stanford have great classes.
You can turn artificial neural networks inside-out by using fixed dot products (weighted sums) and adjustable (parametric) activation functions. The fixed dot products can be computed very quickly using fast transforms like the FFT, and the number of overall parameters required is vastly reduced. The dot products of the transform act as statistical summary measures, ensuring good behaviour. See Fast Transform (fixed filter bank) neural networks. Also, the electricity in your house is an AC sine wave. Turn on a light and the output of the switch is f(x)=x, the same sine wave as the input; off, f(x)=0. So ReLU is a switch. A ReLU neural network is a switched composition of dot products. If the switch states are known, there exists a linear mapping between the input vector and the output vector, upon which various metrics can be applied to see what is happening.
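A minimal numpy sketch of the "ReLU is a switch" point above, for a tiny bias-free two-layer network (the weights, sizes, and names here are made up for illustration): once the on/off states are fixed for a given input, the whole network collapses to a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))    # first layer weights (hypothetical sizes)
W2 = rng.standard_normal((3, 8))    # second layer weights
x = rng.standard_normal(4)          # one input vector

pre = W1 @ x
mask = (pre > 0).astype(float)      # the ReLU "switch" states for this input
y = W2 @ (mask * pre)               # ordinary forward pass through ReLU

# With the switches frozen, the network is just one matrix acting on x:
A = W2 @ np.diag(mask) @ W1
assert np.allclose(A @ x, y)        # same output via a single linear map
```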
Great lecture and big thanks to DeepMind for sharing this great content.
Marta: "Cool".
When will the remaining 6 videos be uploaded?
Perfect! Please keep the lectures coming.
This helped me a lot. Yep...Great to watch!
The variable-length argument should stop being made. There is a maximum size to the input of everything. If you pad with 0s, you shouldn't call the input variable-length.
Wow! Very clearly explained, right from the start :)
I'm yet to watch the transformer lecture, but initially it seems like that method is a beefed-up N-gram? Is this a sensible way of looking at it, or have I completely missed the mark? The issue with the N-gram seemed to be scalability, but looking at how GPT2 was trained, it seemed to overcome that by just throwing a huge amount of computing power and data at it.
where are the rest of the 6 lectures? :|
What is meant by pairwise encoding?
It's weird that, because of this bulk upload and the different sizes of the videos, they were uploaded in the wrong order.
They're all in order here, ua-cam.com/play/PLqYmG7hTraZCDxZ44o4p3N5Anz3lLRVZF.html
Thank you, very interesting.
Really amazing
17:00 Solution
13:47 Why is the time complexity not n^2?
Because a single word in any language can have 'n' previous words as its context. The image at 13:47 shows the possibilities if 'n=1'. i.e. 1-gram, or p(x1|x2). Now imagine if we have n>1. We have to now determine p(x1|x2,x3,x4,.....). This is huge. Also, there can be various combinations of x2, x3,x4, etc. Therefore, using all previous words in the context is not scalable.
@@samsung6980 @Louis TV Badly formulated question with an even worse answer.
The remark that there is an error @13:47 is correct. Here the context size is N=1 and the table size is vocabulary^2. In general, the table size grows exponentially with the context, namely vocabulary^(N+1).
Ruben N I agree. I believe Louis meant the space complexity instead of time complexity and so I answered for that. Could you please mention what parts of my reply were incorrect or hard to understand?
@@samsung6980 She (and also you) mixed up context size with N (in N-grams), which results in nonsense. p(x1|x2) models a 2-gram and has a context length of 1. At 13:37 the image shows a 2D matrix of probabilities, but according to the text on the slide it should be a vector of probabilities (vocabulary^1).
@@RubenNuredini Okay, it's clear now. I messed up N-gram with context size. Thanks for clearing that up.
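To make the point in the thread above concrete, here is a tiny Python sketch (the vocabulary size is made up, not from the lecture) of how the full n-gram probability table grows as vocabulary^n with the gram order n, i.e. vocabulary^(N+1) for a context of N words, which is why conditioning on long contexts with a table does not scale.

```python
# Rough numbers only; the vocabulary size below is illustrative.
vocab = 10_000

for n in range(1, 6):                 # 1-gram up to 5-gram
    context = n - 1                   # words of context: p(x_t | previous n-1 words)
    entries = vocab ** n              # full table: vocab^context rows x vocab columns
    print(f"{n}-gram (context {context}): {entries:.1e} table entries")
```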
Who at DeepMind is uploading videos at 3am?
You do realise that DeepMind is based in the UK?
@@ukaszszkopinski7310 oh that makes sense then lol
Why are Americans so often self-centered...?
@@chibrax54 Bro, it's not unreasonable to assume DeepMind is based in the same time zone as Google HQ. I was wrong, not self-centered.
Reminds me of how Americans have the “World Series” 🤣
15:57 She says 1 trillion, but it should be 1 billion.
Informative session but where can we find the lecture slides?
Thank you!
See description
@@dHnd2j1u Thank you!
Could you please help me understand what she says at 32:46? Does she say "...however many times 62" as written in the subtitles? I can't hear "62"... Thx!
"however many timesteps you took"
(essentially the power 't' in equation)
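For anyone else puzzled by that remark: the "power t" refers to the recurrent weights being multiplied in at every timestep during backpropagation through time. A minimal numpy sketch (a purely linear recurrence with made-up weights, not the lecture's example) of why that makes gradients vanish or explode:

```python
import numpy as np

# Illustrative recurrent weight matrix (eigenvalues below 1).
W = np.array([[0.9, 0.1],
              [0.0, 0.8]])
T = 50                                      # number of timesteps unrolled

jacobian = np.linalg.matrix_power(W, T)     # d h_T / d h_0 for h_t = W @ h_{t-1}
print(jacobian)                             # entries shrink towards zero: W is applied
                                            # T times, so the gradient vanishes with T
```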
Each time she asks, "Do you understand?", I find it optimizes my activation threshold...
is there a way to access the slides?
See description
Bring back Alphastar !
I'm sure they are working on it
Uh so 7/12??????
Sorry, but apart from high-level intuition, these lectures are pretty bad.
thx :)
40:47...wow, she was not able to answer why/how LSTMs (partially) solve the vanishing gradient problem....
It's because of the gated memory access right?
@Marcos Pereira The presenter should have started with a simpler version (the GRU) before moving to an LSTM. The minimal gated version of the GRU is all you need to demonstrate how/why it alleviates the vanishing gradient issue. In short, yes, it's because it allows SOME access to memory. Mathematically, instead of h(t) as in an RNN, the state is now a weighted average of its updated state h(t) and the previous one, i.e. H(t) = Forget*h(t) + (1-Forget)*H(t-1).
When unrolling this recursive formula, one can see that (1-Forget)^N appears for large N (deep networks), so you will still lose memory (since (1-Forget)^N is close to zero).
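A one-line Python sketch of the unrolling described above, with a fixed, made-up gate value, just to show the (1-Forget)^N factor numerically:

```python
# Scalar sketch with a fixed, illustrative gate value.
forget = 0.1
N = 100

weight_of_initial_state = (1.0 - forget) ** N   # coefficient of H_0 after unrolling N steps
print(weight_of_initial_state)                  # about 2.7e-05: memory still fades,
                                                # just far more gently than repeated
                                                # weight/tanh products in a plain RNN
```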
IMO these lectures are useless for beginners, and I'll tell you why. Many of the lecturers speak much too fast - a firehose of facts and non-stop talking isn't teaching, not cool! Some never repeat questions asked by those sitting in class - thanks! - and assume we recall the basics from prior lectures without recapitulating - poor pedagogy! E.g. the use of the softmax function: where do I go to remind myself what that is and why it is used? If you are just starting out, you will need to do some serious study prior to listening to anything here. There is lots of expertise on display here but no real teaching.
DeepMind project idea: filter-out the sound of Marta swallowing water (@56:20).
1st
she talks too fast