DLVU
Lecture 12.4 Scaling up (Mixed precision, Data-parallelism, FSDP)
How to train big models.
slides: dlvu.github.io/sa
course website: dlvu.github.io
lecturer: Peter Bloem
Views: 1,656
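As a hedged illustration of the first topic in the title, mixed precision: the sketch below is not the lecture's code, and the tiny linear model, optimizer and random batches are stand-ins chosen only to make it runnable on a CUDA machine.

import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()                 # placeholder model, not from the lecture
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()              # rescales the loss to avoid fp16 underflow

for step in range(10):
    x = torch.randn(32, 512, device="cuda")       # random stand-in data
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # run the forward pass in half precision where safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                 # backward pass on the scaled loss
    scaler.step(optimizer)                        # unscales gradients, then updates the weights
    scaler.update()                               # adapts the loss scale for the next step

Data parallelism and FSDP then shard this same loop across devices, for example by wrapping the model in torch.nn.parallel.DistributedDataParallel or torch.distributed.fsdp.FullyShardedDataParallel.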

Videos

Lecture 6.3: variational autoencoders
275 views · 9 months ago
Lecture 6.2: Latent variable model
371 views · 9 months ago
Lecture 1.3: Autoencoders
836 views · 10 months ago
slides: dlvu.github.io In the final video of the first lecture, we investigate autoencoders. These are a simple example of the great variety of architectures that we can build out of neural networks. lecturer: Peter Bloem
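As a minimal sketch of the idea (not the lecture's code; the dimensions are arbitrary placeholders), an autoencoder is an encoder that compresses the input to a small code and a decoder that reconstructs the input from it:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))
    def forward(self, x):
        z = self.encoder(x)                        # compress to a low-dimensional code
        return self.decoder(z)                     # reconstruct the input from the code

x = torch.randn(16, 784)                           # a batch of flattened 28x28 inputs
loss = nn.functional.mse_loss(Autoencoder()(x), x) # train by minimizing reconstruction error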
Lecture 1.2: Regression, classification and loss functions
709 views · 10 months ago
slides: dlvu.github.io In the first lecture, we start by reviewing the basics. We expect you know these already, but it helps to review it, and to show what names and notation we use for things. The second video of the lecture shows how we do classification and regression with neural networks, and what loss functions we use. While neural networks aren't that popular for classical machine learni...
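A small illustration of the two standard loss functions (an assumption about which ones the lecture uses, but these are the usual choices): mean squared error for regression and cross-entropy for classification.

import torch
import torch.nn.functional as F

pred, target = torch.randn(8, 1), torch.randn(8, 1)
mse = F.mse_loss(pred, target)                    # regression: mean squared error

logits = torch.randn(8, 3)                        # classification: unnormalized scores for 3 classes
labels = torch.randint(0, 3, (8,))
ce = F.cross_entropy(logits, labels)              # applies log-softmax internally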
Lecture 1.1: Neural networks
1.7K views · 10 months ago
slides: dlvu.github.io In the first lecture, we start by reviewing the basics. We expect you know these already, but it helps to review it, and to show what names and notation we use for things. The first video of the lecture provides a review of what neural networks are, and how they are trained. lecturer: Peter Bloem
Lecture 2.4: Automatic Differentiation (DLVU)
1.6K views · a year ago
In the final video of this lecture, we look at how to make the computer maintain a computation graph for us, so that all we have to do is define operations and define the forward pass. lecturer: Peter Bloem course site: dlvu.github.io
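A minimal sketch of what "the computer maintains the computation graph for us" looks like in PyTorch (an illustration, not the lecture's example): we only write the forward pass, and backward() fills in the gradients.

import torch

w = torch.tensor(2.0, requires_grad=True)         # parameters we want gradients for
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)

y = w * x + b                                     # forward pass: the graph is recorded automatically
loss = (y - 10.0) ** 2
loss.backward()                                   # backward pass: gradients are computed for us

print(w.grad)                                     # d loss / d w = 2*(y - 10)*x = -18
print(b.grad)                                     # d loss / d b = 2*(y - 10)   = -6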
Lecture 2.3: Backpropagation, a tensor view (DLVU)
1.8K views · a year ago
lecturer: Peter Bloem course website: dlvu.github.io In this video, we work out the backpropagation algorithm in a vectorized version, that is, purely in terms of basic linear algebra operations like matrix and vector multiplication. This helps us to express neural networks in a clean notation, and to accelerate their computation.
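A small sketch of this "tensor view" for one linear layer (illustrative, with assumed toy shapes, not the lecture's derivation): the backward pass is just a few matrix products, which autograd reproduces.

import torch

B, n_in, n_out = 4, 5, 3                          # assumed toy shapes
X = torch.randn(B, n_in, requires_grad=True)
W = torch.randn(n_out, n_in, requires_grad=True)
b = torch.randn(n_out, requires_grad=True)

Y = X @ W.T + b                                   # forward: Y = X W^T + b
Y.sum().backward()                                # use loss = sum(Y), so dloss/dY is all ones

dY = torch.ones_like(Y)
print(torch.allclose(W.grad, dY.T @ X))           # dloss/dW = dY^T X
print(torch.allclose(b.grad, dY.sum(dim=0)))      # dloss/db sums dY over the batch
print(torch.allclose(X.grad, dY @ W))             # dloss/dX = dY W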
Lecture 2.2: Backpropagation, scalar perspective (DLVU)
1.7K views · a year ago
Lecturer: Peter Bloem course website: dlvu.github.io In this video, we look at the backpropagation algorithm from a scalar perspective. We dig into the basic steps and follow an example backward pass to develop our intuition for how the algorithm helps us to push the weights of a neural network in the right direction.
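A toy scalar worked example in the same spirit (made-up numbers, not the lecture's): the chain rule applied by hand to a sigmoid unit.

import math

w, x, b = 0.5, 2.0, -1.0                          # made-up scalar values
z = w * x + b                                     # z = 0.0
y = 1.0 / (1.0 + math.exp(-z))                    # sigmoid(z) = 0.5

dy_dz = y * (1.0 - y)                             # local derivative of the sigmoid = 0.25
dz_dw = x                                         # local derivative of z w.r.t. w = 2.0
print(dy_dz * dz_dw)                              # chain rule: dy/dw = 0.25 * 2.0 = 0.5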
Lecture 1.1: Organization of the course
3.7K views · 2 years ago
In this introductory video, we provide information about the course.
Lecture 12.3 Famous transformers (BERT, GPT-2, GPT-3)
18K views · 3 years ago
ERRATA: In the "original transformer" (slide 51), in the source attention, the key and value come from the encoder, and the query comes from the decoder. In this lecture we look at the details of some famous transformer models. How were they trained, and what could they do after they were trained? annotated slides: dlvu.github.io/sa Lecturer: Peter Bloem
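The erratum in code form, as a sketch (the module and shapes below are illustrative, not taken from the lecture): in the encoder-decoder "source" attention, the keys and values come from the encoder output and the queries from the decoder.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
encoder_out = torch.randn(2, 10, 64)              # (batch, source length, embedding)
decoder_x = torch.randn(2, 7, 64)                 # (batch, target length, embedding)

out, _ = attn(query=decoder_x, key=encoder_out, value=encoder_out)
print(out.shape)                                  # (2, 7, 64): one output per decoder position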
Lecture 12.1 Self-attention
71K views · 3 years ago
ERRATA: - In slide 23, the indices are incorrect. The index of the key and value should match (j) and the index of the query should be different (i). - In slide 25, the diagram illustrating how multi-head self-attention is computed is a slight departure from how it's usually done (the implementation in the subsequent slide is correct, but these are not quite functionally equivalent). See the sli...
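For reference, a minimal sketch of basic self-attention with the corrected indexing (illustrative code, not the slide itself): the query carries index i, and the key and value share index j.

import torch

x = torch.randn(5, 16)                            # a sequence of 5 vectors of dimension 16
w = torch.softmax(x @ x.T, dim=1)                 # w[i, j]: query x_i against key x_j, normalized over j
y = w @ x                                         # y[i] = sum_j w[i, j] * x_j (x_j reused as the value)
print(y.shape)                                    # (5, 16)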
Lecture 12.2 Transformers
22K views · 3 years ago
ERRATA: In slide 31, the first part of the transformer block should read y = self.layernorm(x); y = self.attention(y). Also, the code currently suggests that the same layer normalization is applied twice. It is more common to apply different layer normalizations in the same block. How to take the basic self-attention mechanism and build it up into a Transformer. We discuss the basic transformer b...
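Putting the erratum together, a sketch of a pre-norm transformer block with two distinct layer norms (an illustrative reconstruction under assumed dimensions, not the slide's exact code):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, emb=128, heads=4, ff=512):
        super().__init__()
        self.attention = nn.MultiheadAttention(emb, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(emb)            # two separate layer normalizations,
        self.norm2 = nn.LayerNorm(emb)            # as the erratum recommends
        self.ff = nn.Sequential(nn.Linear(emb, ff), nn.ReLU(), nn.Linear(ff, emb))
    def forward(self, x):
        y = self.norm1(x)                         # y = self.layernorm(x)
        y, _ = self.attention(y, y, y)            # y = self.attention(y)
        x = x + y                                 # residual connection
        return x + self.ff(self.norm2(x))         # second norm, feed-forward, residual

print(TransformerBlock()(torch.randn(2, 10, 128)).shape)   # (2, 10, 128)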
Lecture 11.3: World Models
1.6K views · 3 years ago
In this video, we discuss World Models, a fairly recent research area that models the environment using neural networks. This allows us to create state representations that can be used to train an agent by just imagining trajectories in the world model! lecturer: Emile van Krieken course website: dlvu.github.io
Lecture 11.2: Variance Reduction for Policy Gradient (Actor-Critic)
1.1K views · 3 years ago
In this video, we will be discussing variance reduction techniques for policy gradient methods. We will introduce baselines, actor-critic and advantage actor-critic (A2C). We compare how different algorithms choose how to reinforce actions. lecturer: Emile van Krieken course website: dlvu.github.io
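A compact sketch of the idea (illustrative numbers, not from the lecture): REINFORCE weights log-probabilities by the raw return, while an actor-critic subtracts the critic's value estimate and weights them by the advantage instead.

import torch

log_probs = torch.randn(5)                        # log pi(a_t | s_t) for 5 made-up timesteps
returns = torch.tensor([4.0, 3.5, 3.0, 2.0, 1.0]) # observed returns R_t (made up)
values = torch.tensor([3.8, 3.2, 2.9, 2.2, 0.8])  # critic's estimates V(s_t) (made up)

reinforce_loss = -(log_probs * returns).mean()            # high-variance gradient signal
advantages = returns - values                             # A_t = R_t - V(s_t)
actor_loss = -(log_probs * advantages.detach()).mean()    # lower-variance (A2C-style) signal
critic_loss = (returns - values).pow(2).mean()            # regression loss for the critic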
Lecture 11.1: Deep Q-Learning
1.9K views · 3 years ago
Lecture 10.3: ARM & Flows
978 views · 3 years ago
Lecture 10.2: ARM & Flows
1.2K views · 3 years ago
Lecture 10.1: ARM & Flows
1.8K views · 3 years ago
Lecture 9.1: Introduction to Reinforcement Learning
2.3K views · 3 years ago
Lecture 9.2: The REINFORCE algorithm
3.1K views · 3 years ago
Lecture 9.3: Gradient Estimation
1.9K views · 3 years ago
Lecture 8.4: Application - query embedding
1.4K views · 3 years ago
Lecture 8.3: Graph Neural Networks
1.8K views · 3 years ago
Lecture 8.2: Graph and node embedding
4.9K views · 3 years ago
Lecture 8.1a: Introduction - Graphs
2.3K views · 3 years ago
Lecture 8.1b: Introduction - Embeddings
1.5K views · 3 years ago
Lecture 7.2 Implicit models: GANs
1.2K views · 3 years ago
Lecture 7.1 Implicit models: Density Networks
2K views · 3 years ago
Lecture 5.5 ELMo, Word2Vec
9K views · 3 years ago

COMMENTS

  • @DhaneshKasinathanlove · a month ago

    Thanks for the course

  • @igorras-ff7oe · a month ago

    Thank you for this video!

  • @vitaliy_d · a month ago

    Very useful lecture. Thanks for sharing! It may be a small typo (at 18:25), should be l = loss(output,target) # "output" instead of "input"

  • @MrCobraTraders · a month ago

    I didn't understand why adding gaussian noise to image does affect the accuracy of discriminative model. (I think it doesn't) In reality models are robust to noise

  • @user-pe4xm7cq5z · a month ago

    Absolutely amazing! Thank you so much!!

  • @nirajrajai3116 · 2 months ago

    Loved it! Clear explanations with simple examples.

  • @Mars.2024 · 3 months ago

    Finally I have an intuitive view of self-attention. Thank you😇

  • @vesk4000 · 4 months ago

    This is exceptionally well explained. I'm a student at TU Delft and this really helped me understand how to speed up my code, and why it works. Thanks a lot!

  • @olileveque · 5 months ago

    Absolutely amazing series of videos! Congrats!

  • @abhilashbalachandran7160 · 5 months ago

    Very well explained

  • @prateekpatel6082 · 7 months ago

    I don't understand why we have a summation in the conditionals; shouldn't that be a product instead of a summation?

  • @prateekpatel6082 · 7 months ago

    Quite bad explanation, just repeating the slides' text.

  • @MariemStudiesWithMe · 9 months ago

    The approach of the noise filter presented in the video can cause neuron saturation, I guess, because having a highly weighted input will maximize the output of the sigmoid function, which is not desirable.

  • @user-fd5px2cw9v · 9 months ago

    Thanks for your sharing! Nice and clear video!

  • @nadeem1969100 · 9 months ago

    Currently I am doing research work on TCN

  • @37kuba · 9 months ago

    Superb Superb explanation, Superb explanation thanks Superb explanation thanks a Superb explanation thanks a lot! With no predictions now: Wish you all the best and am very grateful for your work.

  • @Yassinius · 10 months ago

    Why do these videos have ads on them?

  • @MrOntologue · 10 months ago

    Google should rank videos according to the likes and the number of previously viewed videos on the same topics: this should go straight to the top for Attention/Transformer searches because I have seen and read plenty, and this is the first time the QKV as dictionary vs RDBMs made sense; that confusion had been so bad it literally stopped me thinking every time I had to consider Q, or K, or V and thus prevented me grokking the big idea. I now want to watch/read everything by you.

  • @saurabhmahra4084 · 11 months ago

    Watching this video feels like trying to decipher alien scriptures with a blindfold on.

  • @user-ir5mu5rc8r · 11 months ago

    Waiting for more lectures..

  • @scienceprojectsofdccpn3430 · 11 months ago

    Self-attention animation: ua-cam.com/video/WusQB464qMY/v-deo.htmlsi=NBxi02yTPzSfMCb6

  • @soumilbinhani8803 · 11 months ago

    Hello, can someone explain this to me: won't the key and the values be the same for each iteration, as we compare it to 5:29? Please help me on this.

  • @fredoliveira7569 · 11 months ago

    Best explanation ever! Congratulations and thank you!

  • @sergionic1821 · a year ago

    What are the dimensions of the first conv filter in AlexNet - is it 11x11x1 or 11x11x3?

    • @linux2650 · 6 months ago

      It's 11x11x3, operating on those three channels.

  • @adrielcabral6634 · a year ago

    I loved u explanation !!!

  • @zadidhasan4698 · a year ago

    You are a great teacher.

  • @somerset006 · a year ago

    Great lecture, thanks!

  • @somerset006 · a year ago

    Really good series of mini-lectures, thanks!

  • @erdemozkol9049 · a year ago

    Brilliant content and explanation! It's unfortunate that the fourth part of this lecture series was never published, this realization has left me very sad in 2023. :(

  • @davealsina848 · a year ago

    love this two times

  • @davealsina848 · a year ago

    Loved this, thanks a lot. Now I understand these things better and feel more confident to jump into the code part.

  • @mahmoudebrahimkhani1384 · a year ago

    Such a clear explanation! Thank you!

  • @user-oq1rb8vb7y · a year ago

    Thanks for the great explanation! Just one question, if simple self-attention has no parameters, how can we expect it to learn? it is not trainable.

  • @Isomorphist · a year ago

    Is this ASMR?

  • @xiaoweidu4667 · a year ago

    good tutorial

  • @senthil2sg · a year ago

    Better than the Karpathy explainer video. Enough said!

  • @HiHi-iu8gf · a year ago

    holy shit, been trying to wrap my head around self-attention for a while, but it all finally clicked together with this video. very well explained, very good video :)

  • @ChrisHalden007 · a year ago

    Great video. Thanks

  • @user-ch3gs7el5k · a year ago

    Hi, I think there are several things in the video that I'm not sure are correct. 1. Computation graph of the transformer block: the original paper says that layer normalization is performed AFTER adding the output of the attention layer to its input, yet in your presentation at 2:22 it seems that you perform layer normalization before self-attention. 2. At 3:10, I'm confused whether gamma and beta are vectors or scalars. If gamma is a vector, then gamma times x should give a scalar, but that seems not to be true... Can you please clarify these questions? Albeit, I'm benefiting so much from your videos. Thanks for sharing!

  • @aiapplicationswithshailey3600

    So far the best video describing this topic. The only question I have is how we get around the fact that a word will have the highest self-attention with itself. You said you would clarify this, but I could not find that point.

  • @ron0studios · a year ago

    extremely underrated resource for learning backpropagation! Thank you for this!

  • @farzinhaddadpour7192 · a year ago

    I think one of the best videos describing self-attention. Thank you for sharing.

  • @RioDeDoro · a year ago

    Great lecture! I really appreciated your presentation by starting with simple self-attention, very helpful.

  • @AlirezaAroundItaly · a year ago

    Best explanation I found for self-attention and multi-head attention on the internet, thank you sir.

  • @SVRamu · a year ago

    wow

  • @markusdicks648 · a year ago

    brilliant....nothing less !

  • @deestort · a year ago

    Models are based on historic information. There’s no bias.

    • @trenvert123 · a year ago

      You clearly have never done any deep learning development. There is bias.

  • @ecehatipoglu209 · a year ago

    Hi, extremely helpful video here, I really appreciate it, but I have a question: I don't understand how multi-head self-attention works if we don't generate extra parameters for each stack of the self-attention layer. What is the difference in each stack, so that we can grasp the different relations of the same word in each layer?

    • @ecehatipoglu209 · a year ago

      Yeah, after 9 days and re-watching this video, I think I grasped why we are not using extra parameters. Let's say you have an embedding dimension of 768 and you want to make 3 attention heads, meaning somehow dividing the 768 vector so you could have a 256x1 vector for each attention head. (This splitting is actually a linear transformation, so there are no weights to be learned here, right.) After that, for each of these 3 attention heads we have 3 sets of parameters [K, Q, V] (superscripted for each attention head). For each attention head, our K will have dimension 256xwhatever, Q will have dimension 256xwhatever and V will have dimension 256xwhatever. And this is for one head. Concatenating all learned vectors K, Q and V will end up 768xwhatever for each of them, the exact size that we would have with a single attention. Voila.

  • @mr.django8409 · a year ago

    Hello sir, I am very thankful for your videos, I learnt a lot from them. I just need some help with how to use a TCN for anomaly detection on multivariate time series data.

  • @iamjerryliu · a year ago

    The best deep learning introductory video I have watched. And I cannot understand why such great content only has 7 likes.