DeepMind x UCL | Deep Learning Lectures | 6/12 | Sequences and Recurrent Networks

  • Published 5 May 2024
  • In this lecture, DeepMind Research Scientist Marta Garnelo focuses on sequential data and how machine learning methods have been adapted to process this particular type of structure. Marta starts by introducing some fundamentals of sequence modeling including common architectures designed for this task such as RNNs and LSTMs. She then moves on to sequence-to-sequence decoding and its applications before finishing with some examples of recent applications of sequence models.
    Download the slides here:
    storage.googleapis.com/deepmi...
    Find out more about how DeepMind increases access to science here:
    deepmind.com/about#access_to_...
    Speaker Bio:
    Marta is a Research Scientist at DeepMind working on deep generative models and meta-learning. During her time at DeepMind she has worked on Generative Query Networks as well as Neural Processes, and recently her research focus has shifted towards multi-agent systems. In addition, she is currently wrapping up her PhD with Prof Murray Shanahan at Imperial College London, where she also did an MSc in Machine Learning.
    About the lecture series:
    The Deep Learning Lecture Series is a collaboration between DeepMind and the UCL Centre for Artificial Intelligence. Over the past decade, Deep Learning has evolved as the leading artificial intelligence paradigm providing us with the ability to learn complex functions from raw data at unprecedented accuracy and scale. Deep Learning has been applied to problems in object recognition, speech recognition, speech synthesis, forecasting, scientific computing, control and many more. The resulting applications are touching all of our lives in areas such as healthcare and medical research, human-computer interaction, communication, transport, conservation, manufacturing and many other fields of human endeavour. In recognition of this huge impact, the 2019 Turing Award, the highest honour in computing, was awarded to pioneers of Deep Learning.
    In this lecture series, leading research scientists from the leading AI research lab, DeepMind, deliver 12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals of training neural networks via advanced ideas around memory, attention, and generative modelling to the important topic of responsible innovation.
  • Science & Technology

COMMENTS • 66

  • @leixun
    @leixun 3 years ago +26

    *DeepMind x UCL | Deep Learning Lectures | 6/12 | Sequences and Recurrent Networks*
    *My takeaways:*
    *1. Plan for this lecture 0:36*
    *2. Motivation 1:31*
    2.1 What is a sequence and why the methods in previous lectures (Feedforward networks and Convolutional networks) don't work on them 2:06
    2.2 Sequences are everywhere 3:33
    2.3 Summary 4:45
    *3. Fundamentals 5:24*
    3.1 Modelling word probabilities is very hard 5:28
    3.2 Architectures of sequence models
    -Recurrent Neural Networks (RNNs) 23:25; Train RNNs 26:42; N-gram vs Addition vs RNN 35:21
    -Long Short-Term Memory (LSTM) networks 37:30
    -Gated Recurrent Unit (GRU) nets: GRU can be seen as a simplified LSTM 42:39
    3.3 N-gram vs Addition vs RNN vs LSTM 43:11
    *4. Generating sequences 43:46*
    4.1 Using a trained model 45:00
    4.2 Examples:
    -Images as sequences: PixelRNN 46:32
    -Natural language as sequences: sequence-to-sequence (Seq2Seq) models 50:26
    -Other Seq2Seq applications: machine translation; image captioning 53:01
    -Audio waves as sequences: using convolutions 1:00:20
    --N-gram vs Addition vs RNN vs LSTM vs Conv 1:01:51
    -Policies as sequences: reinforcement learning in gaming 1:02:29
    -Attention in sequences: transformers 1:04:33
    --Convolutions vs transformers: an animation 1:05:05
    --e.g. GPT2 1:07:05
    --N-gram vs Addition vs RNN vs LSTM vs Conv vs transformers 1:09:02
    --RNNs in 2011 vs transformers in 2019 1:09:28
    *5. Summary 1:10:52*
    *6. Q&A 1:11:51*

    • @Dirhfifkshdi
      @Dirhfifkshdi 3 years ago +1

      The channel needs to add this info to the video description. Thanks for adding it here.

    • @leixun
      @leixun 3 years ago +1

      @@Dirhfifkshdi You are welcome!

  • @Marcos10PT
    @Marcos10PT 3 years ago +26

    This is by a wide margin the best lecture in the series so far. Marta explains things well and doesn't come across as insecure, without feeling the need to prove her technical proficiency with cryptic unlabeled equations 🙂 sorry if I sound a little disappointed with the previous lectures 😂

    • @rydvalj
      @rydvalj 3 years ago +2

      Highly subjective. My favourites so far were lectures 2 and 5. In this lecture I got distracted by the speaker skipping equations (some functions that are to be differentiated were not even properly defined) and asking whether the audience understood what she had just said. (Still, the quality is very good.)

  • @SK-pm4vq
    @SK-pm4vq 3 years ago +24

    Please upload the remaining lectures also.

  • @aliyaamirova753
    @aliyaamirova753 3 years ago +4

    This is the best explanation of RNNs I have ever seen

  • @imranq9241
    @imranq9241 3 years ago

    Amazing series! Although the remaining lectures in the series were not uploaded, I like that each lecture is self-contained. For those looking for more resources, there are a lot of other great courses around YouTube: David Silver's reinforcement learning series, Oxford NLP, and UCLxAdvancedML are other places to get more info. Additionally, Berkeley and Stanford have great classes.

  • @alirisheh8633
    @alirisheh8633 3 years ago

    Perfect! Please keep the lectures coming.

  • @janszczekulski5133
    @janszczekulski5133 8 months ago

    That was incredible; it explained so cleanly the concept I had been struggling with for a while. Thank you so much, Marta.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +1

    Amazing buildup of intuition. I wish I saw this long ago.

  • @sandipk1632
    @sandipk1632 3 years ago

    This helped me a lot. Yep...Great to watch!

  • @lukn4100
    @lukn4100 3 years ago +1

    Great lecture and big thanks to DeepMind for sharing this great content.

  • @francois__
    @francois__ 3 years ago +1

    Very clear lecture, thank you. One thing that was somewhat less clear was the transformers part at the end: with that comparison to convolutions, it made it seem like transformers are merely simple densely connected networks.

  • @Iamine1981
    @Iamine1981 3 years ago

    The quality of GPT2 sequence prediction is truly astonishing!

    • @liambury529
      @liambury529 3 years ago

      And GPT3 is nothing short of terrifying!

  • @TheAero
    @TheAero 7 months ago

    The variable length argument should stop being made. There is a maximum size to the input of everything. If you pad with 0s, you shouldn't call the input variable-length.

  • @saifurshaikh3283
    @saifurshaikh3283 3 years ago +4

    When will the remaining 6 videos be uploaded?

  • @wangshuai161
    @wangshuai161 3 years ago

    Really amazing

  • @nguyenngocly1484
    @nguyenngocly1484 3 years ago

    You can turn artificial neural networks inside out by using fixed dot products (weighted sums) and adjustable (parametric) activation functions. The fixed dot products can be computed very quickly using fast transforms like the FFT, and the overall number of parameters required is vastly reduced. The dot products of the transform act as statistical summary measures, ensuring good behaviour. See Fast Transform (fixed filter bank) neural networks. Also, the electricity in your house is an AC sine wave. Turn on a light and the output of the switch is f(x) = x, the same sine wave as the input; off, it is f(x) = 0. Then ReLU is a switch, and a ReLU neural network is a switched composition of dot products. If the switch states are known, there exists a linear mapping between the input vector and the output vector, to which various metrics can be applied to see what is happening, for example.
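
    A minimal numpy sketch of the "ReLU is a switch" point above (the layer sizes, weights and input are made up for illustration, not taken from the lecture or the comment): once the switch states (the on/off pattern of the ReLUs) are fixed for a given input, the whole network collapses to a single affine map.

        import numpy as np

        rng = np.random.default_rng(0)
        W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # hypothetical 4 -> 8 layer
        W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # hypothetical 8 -> 3 layer
        x = rng.normal(size=4)

        # Forward pass: each ReLU is a switch, either on (f(x) = x) or off (f(x) = 0).
        h_pre = W1 @ x + b1
        mask = (h_pre > 0).astype(float)    # the "switch states" for this input
        y = W2 @ (mask * h_pre) + b2

        # With the switch states frozen, the network is one linear (affine) mapping.
        W_eff = (W2 * mask) @ W1
        b_eff = (W2 * mask) @ b1 + b2
        assert np.allclose(y, W_eff @ x + b_eff)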

  • @lizgichora6472
    @lizgichora6472 3 years ago

    Thank you, very interesting.

  • @franciswebb7522
    @franciswebb7522 3 years ago

    I'm yet to watch the transformer lecture - but initially it seems like that method is a beefed-up N-gram? Is this a sensible way of looking at it, or have I completely missed the mark? The issue with the N-gram seemed to be scalability, but looking at how GPT2 was trained, it seemed to overcome that by just throwing a huge amount of computing power and data at it.

  • @user-or7ji5hv8y
    @user-or7ji5hv8y 3 years ago +1

    What is meant by pairwise encoding?

  • @muhammadharris4470
    @muhammadharris4470 3 years ago +1

    where are the rest of the 6 lectures? :|

  • @soumyadeeproy6611
    @soumyadeeproy6611 3 years ago +1

    Wow! Very clearly explained, right from the start :)

  • @rohandeshpande7515
    @rohandeshpande7515 3 years ago

    Informative session, but where can we find the lecture slides?
    Thank you!

  • @luksdoc
    @luksdoc 3 years ago +4

    Marta: "Cool".

  • @rgzfuentes
    @rgzfuentes 3 years ago

    Could you please help me understand what she says at 32:46? Does she say "...however many times 62", as written in the subtitles? I can't hear "62"... Thx!

    • @dHnd2j1u
      @dHnd2j1u 3 years ago

      "however many timesteps you took"
      (essentially the power 't' in the equation)

  • @l1fecsgo264
      @l1fecsgo264 3 years ago

    is there a way to access the slides?

    • @dHnd2j1u
      @dHnd2j1u 3 years ago

      See description

  • @louiswang538
    @louiswang538 3 years ago +2

    13:47 Why is the time complexity not n^2?

    • @samsung6980
      @samsung6980 3 years ago

      Because a single word in any language can have 'n' previous words as its context. The image at 13:47 shows the possibilities if 'n=1', i.e. 1-gram, or p(x1|x2). Now imagine if we have n>1. We have to now determine p(x1|x2,x3,x4,.....). This is huge. Also, there can be various combinations of x2, x3, x4, etc. Therefore, using all previous words in the context is not scalable.

    • @RubenNuredini
      @RubenNuredini 3 years ago

      @@samsung6980 @Louis TV Badly formulated question with an even worse answer.
      The remark that there is an error @13:47 is correct. Here the context size is N=1 and the table size is vocabulary^2. In general, the table size grows exponentially with the context size, namely vocabulary^(N+1).

    • @samsung6980
      @samsung6980 3 years ago

      Ruben N I agree. I believe Louis meant the space complexity instead of time complexity and so I answered for that. Could you please mention what parts of my reply were incorrect or hard to understand?

    • @RubenNuredini
      @RubenNuredini 3 years ago +1

      @@samsung6980 She (and also you) have mixed up context size with N (in N-grams), which results in nonsense. p(x1|x2) models a 2-gram and has a context length of 1. At @13:37 the image shows a 2D matrix of probabilities, while according to the text on the slide it should be a vector of probabilities (vocabulary^1).

    • @samsung6980
      @samsung6980 3 years ago

      @@RubenNuredini Okay, it's clear now. I messed up N-gram with context size. Thanks for clearing that up.
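
      To make the scaling in this thread concrete: for an N-gram model over a vocabulary of size V, the count table has V^(N-1) possible contexts, each scoring V possible next words, i.e. on the order of V^N entries. A minimal sketch (V is an arbitrary illustrative value, not a number from the lecture):

          # Rough size of an N-gram count table for vocabulary size V:
          # V^(N-1) contexts, each with V possible next words -> V^N entries.
          def ngram_table_size(vocab_size: int, n: int) -> int:
              return vocab_size ** n

          V = 10_000  # illustrative vocabulary size
          for n in (1, 2, 3, 4):
              print(f"{n}-gram: {ngram_table_size(V, n):.1e} entries")
          # 1-gram: 1.0e+04, 2-gram: 1.0e+08, 3-gram: 1.0e+12, 4-gram: 1.0e+16

      This exponential growth in the context length is why the lecture moves from count-based tables to parametric models such as RNNs.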

  • @hong-tv
    @hong-tv 3 years ago

    thx :)

  • @jerrygreenest
    @jerrygreenest 3 years ago +1

    It’s weird that, because of this bulk upload and the different sizes of the videos, they were uploaded in the wrong order.

    • @pramethgaddale8242
      @pramethgaddale8242 3 years ago +3

      They're all in order here, ua-cam.com/play/PLqYmG7hTraZCDxZ44o4p3N5Anz3lLRVZF.html

  • @dbporter
    @dbporter 2 years ago

    17:00 Solution

  • @prismane_
    @prismane_ 3 years ago +3

    Who at DeepMind is uploading videos at 3am?

    • @ukaszszkopinski7310
      @ukaszszkopinski7310 3 years ago +6

      You do realise that DeepMind is based in the UK?

    • @prismane_
      @prismane_ 3 years ago +3

      @@ukaszszkopinski7310 oh that makes sense then lol

    • @chibrax54
      @chibrax54 3 years ago +9

      Why are Americans so often self-centered...?

    • @prismane_
      @prismane_ 3 years ago +4

      @@chibrax54 Bro, it's not unreasonable to assume DeepMind is based in the same time zone as Google HQ. I was wrong, not self-centered.

    • @MO-xi1kv
      @MO-xi1kv 3 years ago +1

      Reminds me of how Americans have the “World Series” 🤣

  • @InfoRanker
    @InfoRanker 3 years ago

    Bring back AlphaStar!

    • @DiapaYY
      @DiapaYY 3 years ago

      I'm sure they are working on it

  • @konataizumi5829
    @konataizumi5829 3 years ago

    Uh so 7/12??????

  • @whitedevil4123
    @whitedevil4123 2 years ago

    15:57 She says 1 trillion, but it should be 1 billion.

  • @MrTubber44
    @MrTubber44 3 years ago +4

    Each time she asks, "Do you understand?", I find it optimizes my activation threshold...

  • @JumpDiffusion
    @JumpDiffusion 3 years ago +1

    40:47...wow, she was not able to answer why/how LSTMs (partially) solve the vanishing gradient problem....

    • @Marcos10PT
      @Marcos10PT 3 years ago +2

      It's because of the gated memory access right?

    • @JumpDiffusion
      @JumpDiffusion 3 years ago +1

      @Marcos Pereira The presenter should have started with a simpler version (the GRU) before moving to the LSTM. The minimal gated version of the GRU is all you need to demonstrate how/why it alleviates the vanishing gradient issue. In short, yes, it's because it allows SOME access to memory. Mathematically, instead of h(t) as in an RNN, the state is now a weighted average of its updated state h(t) and the previous one, i.e. H(t) = Forget * h(t) + (1 - Forget) * H(t-1).
      When unrolling this recursive formula, one can see that (1 - Forget)^N appears for large N (deep networks), so you will still lose memory whenever Forget is not small (since (1 - Forget)^N is then close to zero); the gate only controls how quickly that happens.
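
      A tiny pure-Python sketch of that recursion (the forget value and the number of timesteps are made-up illustrative numbers): it just shows how the factor (1 - Forget)^N weights the contribution of the initial state.

          # Minimal gated recursion from the comment above:
          #   H(t) = forget * h(t) + (1 - forget) * H(t-1)
          # Unrolled, the initial state H(0) is weighted by (1 - forget)^T after T steps.
          forget = 0.1    # illustrative gate value
          T = 50          # illustrative number of timesteps

          H = 1.0         # H(0) = 1, so its surviving weight is directly visible
          for t in range(T):
              h_t = 0.0   # zero updates, so H(T) is exactly (1 - forget)^T * H(0)
              H = forget * h_t + (1 - forget) * H

          print(H, (1 - forget) ** T)   # both ~0.0052

      With forget = 0.1 the initial state is mostly forgotten after 50 steps, while with forget = 0.01 its weight would still be about 0.99^50 ≈ 0.61; that is the sense in which the gate alleviates, rather than removes, the vanishing problem.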

  • @TheAero
    @TheAero 7 months ago

    Sorry, but beyond offering high-level intuition, these lectures are pretty bad.

  • @BlueFan99
    @BlueFan99 3 years ago

    1st

  • @RubenNuredini
    @RubenNuredini 3 years ago +1

    DeepMind project idea: filter out the sound of Marta swallowing water (@56:20).

  • @jamescapparell5673
    @jamescapparell5673 3 years ago

    IMO these lectures are useless for beginners, and I'll tell you why. Many of the lecturers speak much too fast - a firehose of facts and non-stop talking isn't teaching, not cool! Some never repeat questions asked by those sitting in class - thanks! And they assume we recall the basics from prior lectures without recapitulating - poor pedagogy! E.g. the use of the softmax function: where do I go to remind myself what that is and why it is used? If you are just starting out, you will need to do some serious study prior to listening to anything here. There is lots of expertise on display here, but no real teaching.

  • @hemantyadav1047
    @hemantyadav1047 3 years ago

    she talks too fast