DLVU
Lecture 12.4 Scaling up (Mixed precision, Data-parallelism, FSDP)
How to train big models.
slides: dlvu.github.io/sa
course website: dlvu.github.io
lecturer: Peter Bloem
Views: 1,656
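As a hedged illustration of the first topic in the title, mixed precision: the sketch below is not the lecture's code, and the tiny linear model, optimizer and random batches are stand-ins chosen only to make it runnable on a CUDA machine.

import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()                 # placeholder model, not from the lecture
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()              # rescales the loss to avoid fp16 underflow

for step in range(10):
    x = torch.randn(32, 512, device="cuda")       # random stand-in data
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # run the forward pass in half precision where safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()                 # backward pass on the scaled loss
    scaler.step(optimizer)                        # unscales gradients, then updates the weights
    scaler.update()                               # adapts the loss scale for the next step

Data parallelism and FSDP then shard this same loop across devices, for example by wrapping the model in torch.nn.parallel.DistributedDataParallel or torch.distributed.fsdp.FullyShardedDataParallel.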

Videos

Lecture 6.3: variational autoencoders
275 views · 9 months ago
Lecture 6.2: Latent variable model
371 views · 9 months ago
Lecture 1.3: Autoencoders
836 views · 10 months ago
slides: dlvu.github.io In the final video of the first lecture, we investigate autoencoders. These are a simple example of the great variety of architectures that we can build out of neural networks. lecturer: Peter Bloem
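As a minimal sketch of the idea (not the lecture's code; the dimensions are arbitrary placeholders), an autoencoder is an encoder that compresses the input to a small code and a decoder that reconstructs the input from it:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, input_dim))
    def forward(self, x):
        z = self.encoder(x)                        # compress to a low-dimensional code
        return self.decoder(z)                     # reconstruct the input from the code

x = torch.randn(16, 784)                           # a batch of flattened 28x28 inputs
loss = nn.functional.mse_loss(Autoencoder()(x), x) # train by minimizing reconstruction error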
Lecture 1.2: Regression, classification and loss functions
709 views · 10 months ago
slides: dlvu.github.io In the first lecture, we start by reviewing the basics. We expect you know these already, but it helps to review it, and to show what names and notation we use for things. The second video of the lecture shows how we do classification and regression with neural networks, and what loss functions we use. While neural networks aren't that popular for classical machine learni...
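A small illustration of the two standard loss functions (an assumption about which ones the lecture uses, but these are the usual choices): mean squared error for regression and cross-entropy for classification.

import torch
import torch.nn.functional as F

pred, target = torch.randn(8, 1), torch.randn(8, 1)
mse = F.mse_loss(pred, target)                    # regression: mean squared error

logits = torch.randn(8, 3)                        # classification: unnormalized scores for 3 classes
labels = torch.randint(0, 3, (8,))
ce = F.cross_entropy(logits, labels)              # applies log-softmax internally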
Lecture 1.1: Neural networks
1.7K views · 10 months ago
slides: dlvu.github.io In the first lecture, we start by reviewing the basics. We expect you know these already, but it helps to review it, and to show what names and notation we use for things. The first video of the lecture provides a review of what neural networks are, and how they are trained. lecturer: Peter Bloem
Lecture 2.4: Automatic Differentiation (DLVU)
1.6K views · a year ago
In the final video of this lecture, we look at how to make the computer maintain a computation graph for us, so that all we have to do is define operations and define the forward pass. lecturer: Peter Bloem course site: dlvu.github.io
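A minimal sketch of what "the computer maintains the computation graph for us" looks like in PyTorch (an illustration, not the lecture's example): we only write the forward pass, and backward() fills in the gradients.

import torch

w = torch.tensor(2.0, requires_grad=True)         # parameters we want gradients for
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)

y = w * x + b                                     # forward pass: the graph is recorded automatically
loss = (y - 10.0) ** 2
loss.backward()                                   # backward pass: gradients are computed for us

print(w.grad)                                     # d loss / d w = 2*(y - 10)*x = -18
print(b.grad)                                     # d loss / d b = 2*(y - 10)   = -6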
Lecture 2.3: Backpropagation, a tensor view (DLVU)
1.8K views · a year ago
lecturer: Peter Bloem course website: dlvu.github.io In this video, we work out the backpropagation algorithm in a vectorized version, that is, purely in terms of basic linear algebra operations like matrix and vector multiplication. This helps us to express neural networks in a clean notation, and to accelerate their computation.
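A small sketch of this "tensor view" for one linear layer (illustrative, with assumed toy shapes, not the lecture's derivation): the backward pass is just a few matrix products, which autograd reproduces.

import torch

B, n_in, n_out = 4, 5, 3                          # assumed toy shapes
X = torch.randn(B, n_in, requires_grad=True)
W = torch.randn(n_out, n_in, requires_grad=True)
b = torch.randn(n_out, requires_grad=True)

Y = X @ W.T + b                                   # forward: Y = X W^T + b
Y.sum().backward()                                # use loss = sum(Y), so dloss/dY is all ones

dY = torch.ones_like(Y)
print(torch.allclose(W.grad, dY.T @ X))           # dloss/dW = dY^T X
print(torch.allclose(b.grad, dY.sum(dim=0)))      # dloss/db sums dY over the batch
print(torch.allclose(X.grad, dY @ W))             # dloss/dX = dY W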
Lecture 2.2: Backpropagation, scalar perspective (DLVU)
1.7K views · a year ago
Lecturer: Peter Bloem course website: dlvu.github.io In this video, we look at the backpropagation algorithm from a scalar perspective. We dig into the basic steps and follow an example backward pass to develop our intuition for how the algorithm helps us to push the weights of a neural network in the right direction.
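A toy scalar worked example in the same spirit (made-up numbers, not the lecture's): the chain rule applied by hand to a sigmoid unit.

import math

w, x, b = 0.5, 2.0, -1.0                          # made-up scalar values
z = w * x + b                                     # z = 0.0
y = 1.0 / (1.0 + math.exp(-z))                    # sigmoid(z) = 0.5

dy_dz = y * (1.0 - y)                             # local derivative of the sigmoid = 0.25
dz_dw = x                                         # local derivative of z w.r.t. w = 2.0
print(dy_dz * dz_dw)                              # chain rule: dy/dw = 0.25 * 2.0 = 0.5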
Lecture 1.1: Organization of the course
3.7K views · 2 years ago
In this introductory video, we provide information about the course.
Lecture 12.3 Famous transformers (BERT, GPT-2, GPT-3)
18K views · 3 years ago
ERRATA: In the "original transformer" (slide 51), in the source attention, the key and value come from the encoder, and the query comes from the decoder. In this lecture we look at the details of some famous transformer models. How were they trained, and what could they do after they were trained? annotated slides: dlvu.github.io/sa Lecturer: Peter Bloem
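The erratum in code form, as a sketch (the module and shapes below are illustrative, not taken from the lecture): in the encoder-decoder "source" attention, the keys and values come from the encoder output and the queries from the decoder.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
encoder_out = torch.randn(2, 10, 64)              # (batch, source length, embedding)
decoder_x = torch.randn(2, 7, 64)                 # (batch, target length, embedding)

out, _ = attn(query=decoder_x, key=encoder_out, value=encoder_out)
print(out.shape)                                  # (2, 7, 64): one output per decoder position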
Lecture 12.1 Self-attention
71K views · 3 years ago
ERRATA: - In slide 23, the indices are incorrect. The index of the key and value should match (j) and the index of the query should be different (i). - In slide 25, the diagram illustrating how multi-head self-attention is computed is a slight departure from how it's usually done (the implementation in the subsequent slide is correct, but these are not quite functionally equivalent). See the sli...
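For reference, a minimal sketch of basic self-attention with the corrected indexing (illustrative code, not the slide itself): the query carries index i, and the key and value share index j.

import torch

x = torch.randn(5, 16)                            # a sequence of 5 vectors of dimension 16
w = torch.softmax(x @ x.T, dim=1)                 # w[i, j]: query x_i against key x_j, normalized over j
y = w @ x                                         # y[i] = sum_j w[i, j] * x_j (x_j reused as the value)
print(y.shape)                                    # (5, 16)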
Lecture 12.2 Transformers
22K views · 3 years ago
ERRATA: In slide 31, the first part of the transformer block should read y = self.layernorm(x); y = self.attention(y). Also, the code currently suggests that the same layer normalization is applied twice. It is more common to apply different layer normalizations in the same block. How to take the basic self-attention mechanism and build it up into a Transformer. We discuss the basic transformer b...
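Putting the erratum together, a sketch of a pre-norm transformer block with two distinct layer norms (an illustrative reconstruction under assumed dimensions, not the slide's exact code):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, emb=128, heads=4, ff=512):
        super().__init__()
        self.attention = nn.MultiheadAttention(emb, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(emb)            # two separate layer normalizations,
        self.norm2 = nn.LayerNorm(emb)            # as the erratum recommends
        self.ff = nn.Sequential(nn.Linear(emb, ff), nn.ReLU(), nn.Linear(ff, emb))
    def forward(self, x):
        y = self.norm1(x)                         # y = self.layernorm(x)
        y, _ = self.attention(y, y, y)            # y = self.attention(y)
        x = x + y                                 # residual connection
        return x + self.ff(self.norm2(x))         # second norm, feed-forward, residual

print(TransformerBlock()(torch.randn(2, 10, 128)).shape)   # (2, 10, 128)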
Lecture 11.3: World Models
1.6K views · 3 years ago
In this video, we discuss World Models, a fairly recent research area that models the environment using neural networks. This allows us to create state representations that can be used to train an agent by just imagining trajectories in the world model! lecturer: Emile van Krieken course website: dlvu.github.io
Lecture 11.2: Variance Reduction for Policy Gradient (Actor-Critic)
1.1K views · 3 years ago
In this video, we will be discussing variance reduction techniques for policy gradient methods. We will introduce baselines, actor-critic and advantage actor-critic (A2C). We compare how different algorithms choose how to reinforce actions. lecturer: Emile van Krieken course website: dlvu.github.io
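A compact sketch of the idea (illustrative numbers, not from the lecture): REINFORCE weights log-probabilities by the raw return, while an actor-critic subtracts the critic's value estimate and weights them by the advantage instead.

import torch

log_probs = torch.randn(5)                        # log pi(a_t | s_t) for 5 made-up timesteps
returns = torch.tensor([4.0, 3.5, 3.0, 2.0, 1.0]) # observed returns R_t (made up)
values = torch.tensor([3.8, 3.2, 2.9, 2.2, 0.8])  # critic's estimates V(s_t) (made up)

reinforce_loss = -(log_probs * returns).mean()            # high-variance gradient signal
advantages = returns - values                             # A_t = R_t - V(s_t)
actor_loss = -(log_probs * advantages.detach()).mean()    # lower-variance (A2C-style) signal
critic_loss = (returns - values).pow(2).mean()            # regression loss for the critic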
Lecture 11.1: Deep Q-Learning
1.9K views · 3 years ago
Lecture 10.3: ARM & Flows
978 views · 3 years ago
Lecture 10.2: ARM & Flows
1.2K views · 3 years ago
Lecture 10.1: ARM & Flows
1.8K views · 3 years ago
Lecture 9.1: Introduction to Reinforcement Learning
2.3K views · 3 years ago
Lecture 9.2: The REINFORCE algorithm
3.1K views · 3 years ago
Lecture 9.3: Gradient Estimation
1.9K views · 3 years ago
Lecture 8.4: Application - query embedding
1.4K views · 3 years ago
Lecture 8.3: Graph Neural Networks
1.8K views · 3 years ago
Lecture 8.2: Graph and node embedding
4.9K views · 3 years ago
Lecture 8.1a: Introduction - Graphs
2.3K views · 3 years ago
Lecture 8.1b: Introduction - Embeddings
1.5K views · 3 years ago
Lecture 7.2 Implicit models: GANs
1.2K views · 3 years ago
Lecture 7.1 Implicit models: Density Networks
2K views · 3 years ago
Lecture 5.5 ELMo, Word2Vec
9K views · 3 years ago

COMMENTS

  • @DhaneshKasinathanlove · a month ago

    Thanks for the course

  • @igorras-ff7oe · a month ago

    Thank you for this video!

  • @vitaliy_d · a month ago

    Very useful lecture. Thanks for sharing! It may be a small typo (at 18:25), should be l = loss(output,target) # "output" instead of "input"

  • @MrCobraTraders · a month ago

    I didn't understand why adding gaussian noise to image does affect the accuracy of discriminative model. (I think it doesn't) In reality models are robust to noise

  • @user-pe4xm7cq5z · a month ago

    Absolutely amazing! Thank you so much!!

  • @nirajrajai3116 · 2 months ago

    Loved it! Clear explanations with simple examples.

  • @Mars.2024 · 3 months ago

    Finally I have an intuitive view of self-attention. Thank you😇

  • @vesk4000 · 4 months ago

    This is exceptionally well explained. I'm a student at TU Delft and this really helped me understand how to speed up my code, and why it works. Thanks a lot!

  • @olileveque · 5 months ago

    Absolutely amazing series of videos! Congrats!

  • @abhilashbalachandran7160 · 5 months ago

    Very well explained

  • @prateekpatel6082 · 7 months ago

    I don't understand why we have a summation in the conditionals; shouldn't that be a product instead of a summation?

  • @prateekpatel6082 · 7 months ago

    Quite bad explanation, just repeating the slides' text.

  • @MariemStudiesWithMe · 9 months ago

    The approach of the noise filter presented in the video can cause neuron saturation, I guess, because having a highly weighted input will maximize the output of the sigmoid function, which is not desirable.

  • @user-fd5px2cw9v · 9 months ago

    Thanks for your sharing! Nice and clear video!

  • @nadeem1969100 · 9 months ago

    Currently I am doing research work on TCN

  • @37kuba · 9 months ago

    Superb Superb explanation, Superb explanation thanks Superb explanation thanks a Superb explanation thanks a lot! With no predictions now: Wish you all the best and am very grateful for your work.

  • @Yassinius · 10 months ago

    Why do these videos have ads on them?

  • @MrOntologue · 10 months ago

    Google should rank videos according to the likes and the number of previously viewed videos on the same topics: this should go straight to the top for Attention/Transformer searches because I have seen and read plenty, and this is the first time the QKV as dictionary vs RDBMs made sense; that confusion had been so bad it literally stopped me thinking every time I had to consider Q, or K, or V and thus prevented me grokking the big idea. I now want to watch/read everything by you.

  • @saurabhmahra4084 · 11 months ago

    Watching this video feels like trying to decipher alien scriptures with a blindfold on.

  • @user-ir5mu5rc8r · 11 months ago

    Waiting for more lectures..

  • @scienceprojectsofdccpn3430 · 11 months ago

    Self-attention animation: ua-cam.com/video/WusQB464qMY/v-deo.htmlsi=NBxi02yTPzSfMCb6

  • @soumilbinhani8803 · 11 months ago

    Hello, can someone explain this to me: won't the key and the values be the same for each iteration, as we compare it to 5:29? Please help me on this.

  • @fredoliveira7569 · 11 months ago

    Best explanation ever! Congratulations and thank you!

  • @sergionic1821 · a year ago

    What are the dimensions of the first conv filter in AlexNet - is it 11x11x1 or 11x11x3?

    • @linux2650 · 6 months ago

      It's 11x11x3, operating on those three channels.

  • @adrielcabral6634 · a year ago

    I loved u explanation !!!

  • @zadidhasan4698 · a year ago

    You are a great teacher.

  • @somerset006 · a year ago

    Great lecture, thanks!

  • @somerset006 · a year ago

    Really good series of mini-lectures, thanks!

  • @erdemozkol9049 · a year ago

    Brilliant content and explanation! It's unfortunate that the fourth part of this lecture series was never published, this realization has left me very sad in 2023. :(

  • @davealsina848 · a year ago

    love this two times

  • @davealsina848 · a year ago

    Loved this, thanks a lot. Now I understand these things better and feel more confident to jump into the code part.

  • @mahmoudebrahimkhani1384 · a year ago

    Such a clear explanation! Thank you!

  • @user-oq1rb8vb7y · a year ago

    Thanks for the great explanation! Just one question, if simple self-attention has no parameters, how can we expect it to learn? it is not trainable.

  • @Isomorphist · a year ago

    Is this ASMR?

  • @xiaoweidu4667 · a year ago

    good tutorial

  • @senthil2sg · a year ago

    Better than the Karpathy explainer video. Enough said!

  • @HiHi-iu8gf · a year ago

    holy shit, been trying to wrap my head around self-attention for a while, but it all finally clicked together with this video. very well explained, very good video :)

  • @ChrisHalden007 · a year ago

    Great video. Thanks

  • @user-ch3gs7el5k · a year ago

    Hi, I think there are several things in the video that I'm not sure are correct. 1. Computation graph of the transformer block: the original paper says that layer normalization is performed AFTER adding the output of the attention layer to its input, yet in your presentation at 2:22 it seems that you perform layer normalization before self-attention. 2. At 3:10, I'm confused whether gamma and beta are vectors or scalars. If gamma is a vector, then gamma times x should give a scalar, but that seems not to be true... Can you please clarify these questions? Albeit, I'm benefiting so much from your videos. Thanks for sharing!

  • @aiapplicationswithshailey3600

    So far the best video describing this topic. The only question I have is how we get around the fact that a word will have the highest self-attention with itself. You said you would clarify this, but I could not find that point.

  • @ron0studios · a year ago

    extremely underrated resource for learning backpropagation! Thank you for this!

  • @farzinhaddadpour7192 · a year ago

    I think one of the best videos describing self-attention. Thank you for sharing.

  • @RioDeDoro · a year ago

    Great lecture! I really appreciated your presentation by starting with simple self-attention, very helpful.

  • @AlirezaAroundItaly · a year ago

    Best explanation I found for self-attention and multi-head attention on the internet, thank you sir.

  • @SVRamu · a year ago

    wow

  • @markusdicks648 · a year ago

    brilliant....nothing less !

  • @deestort · a year ago

    Models are based on historic information. There’s no bias.

    • @trenvert123 · a year ago

      You clearly have never done any deep learning development. There is bias.

  • @ecehatipoglu209 · a year ago

    Hi, extremely helpful video here, I really appreciate it, but I have a question: I don't understand how multi-head self-attention works if we don't generate extra parameters for each stack of the self-attention layer. What is the difference in each stack, so that we can grasp the different relations of the same word in each layer?

    • @ecehatipoglu209 · a year ago

      Yeah, after 9 days and re-watching this video, I think I grasped why we are not using extra parameters. Let's say you have an embedding dimension of 768 and you want to make 3 attention heads, meaning somehow dividing the 768 vector so you could have a 256x1 vector for each attention head. (This splitting is actually a linear transformation, so there are no weights to be learned here, right.) After that, for each of these 3 attention heads we have 3 sets of parameters [K, Q, V] (superscripted for each attention head). For each attention head, our K will have dimension 256xwhatever, Q will have dimension 256xwhatever and V will have dimension 256xwhatever. And this is for one head. Concatenating all learned vectors K, Q and V will end up 768xwhatever for each of them, the exact size that we would have with a single attention. Voila.

  • @mr.django8409 · a year ago

    Hello sir, I am very thankful for your videos, I learnt a lot from them. I just need some help with how to use a TCN for anomaly detection on multivariate time series data.

  • @iamjerryliu · a year ago

    The best deep learning introductory video I have watched. And I cannot understand why such great content only has 7 likes.