02L - Modules and architectures

  • Published 17 Dec 2024

COMMENTS • 50

  • @wolfisraging
    @wolfisraging 3 years ago +18

    Been reading and doing ML for the last 5 years, and every time I hear Yan I always get to know something I don't know or missed. Thanks to both Alfredo and Yan for this amazing course. Lovin it!!!!

    • @alfcnz
      @alfcnz  3 years ago +1

      Yann, with two n's 😉
      You're welcome 😊😊😊

    • @wolfisraging
      @wolfisraging 3 years ago

      @@alfcnz mah bad 😄

    • @locutusdiborg88
      @locutusdiborg88 3 years ago

      @@alfcnz then it should also be Lecunn, with 2 n's (to be read in Igor's voice)

    • @alfcnz
      @alfcnz  3 years ago

      Who's Igor? 😮😮😮

    • @wolfisraging
      @wolfisraging 3 years ago

      yeah who's Igor?

  • @hafezfarazi5513
    @hafezfarazi5513 3 years ago +1

    @8:33 One of the reasons why ReLU is better in deep networks than, say, sigmoid is that in the backward pass the gradient gets smaller after each sigmoid nonlinearity (it is multiplied by at most 0.25), whereas with ReLU-like nonlinearities the gradient does not shrink from layer to layer (it is exactly one on the positive part).

    • @alfcnz
      @alfcnz  3 years ago +1

      Assuming no normalisation layer is used in-between, yes.

    • @alfcnz
      @alfcnz  3 years ago

      What would be bounded?
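
    A minimal sketch of the gradient behaviour described in this thread, assuming PyTorch (the snippet is illustrative, not from the lecture): push the same input through a stack of sigmoids and through a stack of ReLUs, then compare the gradient that reaches the input.

    ```python
    import torch

    depth = 20
    x = torch.randn(64, requires_grad=True)

    # Stack `depth` sigmoids on one branch and `depth` ReLUs on another
    # (no weights in between, to isolate the effect of the nonlinearity).
    y_sig, y_rel = x, x
    for _ in range(depth):
        y_sig = torch.sigmoid(y_sig)
        y_rel = torch.relu(y_rel)

    g_sig, = torch.autograd.grad(y_sig.sum(), x, retain_graph=True)
    g_rel, = torch.autograd.grad(y_rel.sum(), x)

    # Each sigmoid multiplies the incoming gradient by its derivative (at most 0.25),
    # so g_sig shrinks roughly geometrically with depth; ReLU passes the gradient
    # through unchanged wherever its input is positive.
    print(g_sig.abs().mean())   # vanishingly small for depth = 20
    print(g_rel.abs().mean())   # about 0.5: half the inputs are positive on average
    ```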

  • @fuzzylogicq
    @fuzzylogicq 2 years ago

    Awesome lecture. Particularly intrigued by the mixture of experts part, definitely trying this out.

    • @alfcnz
      @alfcnz  2 years ago

      😊😊😊

  • @asmabeevi
    @asmabeevi 3 years ago +2

    First of all, thank you very much for doing this. The world owes you!
    What is the meaning of the update rule when the parameter vector is the output of a function [at 1:31:21]? As the name implies, w is the output of a function, so how can you update the output?

    • @alfcnz
      @alfcnz  3 years ago +1

      Through changes to its input.
      Yann is showing how gradient descent changes its direction when an input is a function of another input.
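
    A minimal sketch of Alfredo's answer, assuming PyTorch and a made-up function H (both illustrative, not from the lecture): w is never updated directly; gradient descent updates u, and w follows because it is recomputed from u at every step.

    ```python
    import torch

    u = torch.randn(3, requires_grad=True)     # the actual free parameter
    H = torch.tanh                             # hypothetical function producing w = H(u)
    x, target = torch.randn(3), torch.tensor(1.0)

    optimiser = torch.optim.SGD([u], lr=0.1)
    for _ in range(100):
        w = H(u)                               # w is the *output* of H, recomputed each step
        loss = (w @ x - target) ** 2           # any cost defined through w
        optimiser.zero_grad()
        loss.backward()                        # the gradient flows through H back into u
        optimiser.step()                       # only u is updated; w changes through u
    ```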

  • @AdityaSanjivKanadeees
    @AdityaSanjivKanadeees 3 years ago

    At 1:14:38 Yann says that we can do non-linear classification with a mixture of linear classifiers that are gated. Isn't that still a linear classifier? Why is it non-linear; what is it that makes the classification non-linear?

    • @alfcnz
      @alfcnz  3 years ago

      A linear classifier has a single hyperplane cutting the data space. Here we partition the space in two, and then in two again, so we end up with something that is clearly not linear. It actually smells a lot like a decision tree, where you have an iterative subdivision of the data space.

    • @AdityaSanjivKanadeees
      @AdityaSanjivKanadeees 3 years ago

      @@alfcnz Thank you!!

    • @alfcnz
      @alfcnz  3 years ago +1

      You're welcome 😊

    • @alfcnz
      @alfcnz  3 years ago

      No one mentioned a ReLU here. 🤨🤨🤨

    • @alfcnz
      @alfcnz  3 years ago

      How is that addressing Aditya's question?
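
    A minimal sketch of the construction discussed in this thread, assuming PyTorch (the class name and sizes are made up for illustration): each expert is linear, but because the gating weights depend on the input, the overall decision function is not.

    ```python
    import torch
    from torch import nn

    class GatedMixture(nn.Module):
        """Two (or more) linear experts blended by a linear gater with a softmax."""
        def __init__(self, in_dim=2, n_classes=2, n_experts=2):
            super().__init__()
            self.experts = nn.ModuleList(
                [nn.Linear(in_dim, n_classes) for _ in range(n_experts)]
            )
            self.gater = nn.Linear(in_dim, n_experts)

        def forward(self, x):
            w = torch.softmax(self.gater(x), dim=-1)                 # (B, n_experts), depends on x
            outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n_experts, n_classes)
            # Each expert alone is linear in x, but the product w(x) * expert(x)
            # is not, which is what bends the decision boundary.
            return (w.unsqueeze(-1) * outs).sum(dim=1)               # (B, n_classes)

    logits = GatedMixture()(torch.randn(8, 2))   # (8, 2) class scores
    ```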

  • @AdityaSanjivKanadeees
    @AdityaSanjivKanadeees 3 years ago

    Do the ideas for the update rules for W in the example @1:33:00 come from $dw = \frac{\partial H}{\partial u} \cdot du$?

    • @alfcnz
      @alfcnz  3 years ago +1

      There's a better comment explaining the math under the previous year's video:
      ua-cam.com/video/FW5gFiJb-ig/v-deo.html&lc=UgxPIlrkdcQAncIPyQ14AaABAg
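
    For reference, a short sketch of the chain rule such an update would follow, assuming $w = H(u)$ and a scalar cost $C$ (the symbols are chosen here for illustration, not copied from the slide):

    ```latex
    % w is produced by the module H(u); the cost C sees u only through w,
    % so the gradient is pulled back through the Jacobian of H.
    \[
      w = H(u), \qquad
      \frac{\partial C}{\partial u}
        = \left(\frac{\partial H}{\partial u}\right)^{\!\top}
          \frac{\partial C}{\partial w}, \qquad
      u \leftarrow u - \eta \, \frac{\partial C}{\partial u}.
    \]
    ```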

  • @cristiano24597
    @cristiano24597 1 year ago

    Regarding the mixture of experts, would it make sense to train the expert nets separately?
    I mean something like training Expert 1 on a dataset that is for sure Catalan, and Expert 2 on the other language. After that we'd have those trained models, and a general dataset (with multiple languages) could be used to train the gater only (the experts' weights wouldn't change anymore).

    • @alfcnz
      @alfcnz  1 year ago

      Sure, you can do that. It turns out that the joint model works better because it can exploit the similarity between the two contexts.

    • @cristiano24597
      @cristiano24597 1 year ago

      @@alfcnz makes sense, thanks!

    • @alfcnz
      @alfcnz  1 year ago

      You’re welcome 😊😊😊
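
    A rough sketch of the two-stage recipe proposed in this thread, assuming PyTorch (all names and sizes are illustrative): the pre-trained experts are frozen and only the gater is trained on the mixed-language data. As Alfredo notes, training everything jointly usually works better, since the experts can then share what the two languages have in common.

    ```python
    import torch
    from torch import nn

    in_dim, n_classes = 100, 2
    expert_catalan = nn.Linear(in_dim, n_classes)   # assumed pre-trained on Catalan data
    expert_other   = nn.Linear(in_dim, n_classes)   # assumed pre-trained on the other language
    gater          = nn.Linear(in_dim, 2)           # the only module still to be trained

    # Freeze the experts so their weights no longer change.
    for p in list(expert_catalan.parameters()) + list(expert_other.parameters()):
        p.requires_grad = False

    def forward(x):
        w = torch.softmax(gater(x), dim=-1)                              # (B, 2) mixing weights
        outs = torch.stack([expert_catalan(x), expert_other(x)], dim=1)  # (B, 2, n_classes)
        return (w.unsqueeze(-1) * outs).sum(dim=1)

    # One training step of the gater alone on a mixed-language batch.
    optimiser = torch.optim.SGD(gater.parameters(), lr=0.01)
    x, y = torch.randn(8, in_dim), torch.randint(0, n_classes, (8,))
    loss = nn.functional.cross_entropy(forward(x), y)
    optimiser.zero_grad(); loss.backward(); optimiser.step()
    ```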

  • @asmabeevi
    @asmabeevi 3 years ago

    Also, do you have quizzes for this class? On the course website I mainly see homework problems. Thanks!

  • @НиколайНовичков-е1э

    Hello, Alfredo :) Thank you for the video! :))

    • @alfcnz
      @alfcnz  3 years ago

      Hello 👋🏻 You're most welcome 😇

  • @pypy1285
    @pypy1285 3 years ago

    Thank you, Alfredo.
    I'd like to know what the 'z' is in the attention architecture example Yann introduced. Does the 'z' also come from the training data? Thank you!

    • @alfcnz
      @alfcnz  3 years ago

      You need to point out minutes:seconds for me to be able to address your question.

    • @pypy1285
      @pypy1285 3 years ago

      @@alfcnz ~59:02 the topic is "multiplicative modules", thank you

    • @alfcnz
      @alfcnz  3 years ago

      z is a latent input. Latent means it's missing from the data set. Hence, you need to infer it by minimisation of the energy using GD. We've extensively covered latent variable energy based models in previous lectures.

    • @pypy1285
      @pypy1285 3 years ago +1

      @@alfcnz Thank you!
      Sorry, I didn't catch this information in a previous lecture (if it comes earlier in the video order), but I noticed that there is a lecture called "05.1 - Latent Variable Energy Based Models (LV-EBMs), inference" among the later videos. Thank you!

    • @alfcnz
      @alfcnz  3 years ago +2

      My bad, I apologise.
      I thought this was the lecture on associative memories. These topics are only briefly introduced here and will be extensively covered later on.
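
    A minimal sketch of what "infer z by minimising the energy using GD" looks like in practice, assuming PyTorch and a toy quadratic energy (both purely illustrative): for a fixed pair (x, y), the latent z is found by a few gradient steps on E(x, y, z).

    ```python
    import torch

    x, y = torch.randn(5), torch.randn(5)

    def energy(x, y, z):
        # Toy energy: how badly x, shifted by the latent z, explains y.
        return ((x + z - y) ** 2).sum()

    z = torch.zeros(5, requires_grad=True)    # latent: not provided by the data set
    optimiser = torch.optim.SGD([z], lr=0.1)
    for _ in range(50):
        optimiser.zero_grad()
        energy(x, y, z).backward()
        optimiser.step()
    # z now (approximately) minimises E(x, y, z) for this particular pair.
    ```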

  • @antoniovelag.8080
    @antoniovelag.8080 3 years ago

    Hello Alfredo, I can't find these slides on the website. Am I just not looking in the right place, or are they missing?

    • @alfcnz
      @alfcnz  3 years ago

      Click on the icon next to the lecture title on the website.

    • @antoniovelag.8080
      @antoniovelag.8080 3 years ago

      @@alfcnz There is only a camera icon and it sends me to this video :(

    • @alfcnz
      @alfcnz  3 years ago +1

      Oh, is this link missing?
      drive.google.com/file/d/1IaDI6BJ6g4SJbJLtNjVE_miWRzBH1-MX/
      Feel free to send a PR if it's correct.

  • @SubhomMitra
    @SubhomMitra 3 years ago

    Hey @Alfredo! I'm new to the SP21 course. Is there an order to the videos? You've previously uploaded videos numbered 01, 02, 03... but your recent videos are 01L, 02L, ... What does the "L" mean? Should I watch 01L after 01? I am trying to understand the naming convention here. Thanks! :)

    • @alfcnz
      @alfcnz  3 years ago +1

      L stands for lecture. I was not planning to release them, initially.
      The order / index is on the class website, the official content organisation homepage.

    • @SubhomMitra
      @SubhomMitra 3 years ago

      @@alfcnz Thanks for clarifying! The class website isn't fully updated yet, so I was a bit confused. Will you be uploading more Lecture videos?

    • @alfcnz
      @alfcnz  3 years ago

      I've just published the first theme. You'll have a new one every week. For the latest news, you'll want to follow me on Twitter, where I announce all these things.

    • @SubhomMitra
      @SubhomMitra 3 years ago

      @@alfcnz Thank you very much! 😇

  • @HassanAliAnwar
    @HassanAliAnwar 3 years ago

    Why doesn't ReLU have a variant with ReLU(x) = -x for x < 0?

    • @alfcnz
      @alfcnz  3 years ago +1

      What do you need the identity function for? 🤨🤨🤨

    • @HassanAliAnwar
      @HassanAliAnwar 3 years ago

      @@alfcnz I meant ReLU(x) = abs(x). It will still be non-linear, but I guess it serves no purpose.

    • @alfcnz
      @alfcnz  3 years ago

      The absolute value has been used as a nonlinearity, but it wouldn't let you turn off specific inputs. So the output would always be a non-zero piecewise-linear combination of the inputs.

    • @ryans6946
      @ryans6946 3 years ago

      @@HassanAliAnwar Hey! I think this is covered at the start of the lecture, when Yann takes a question about non-monotonic activation functions. To summarise: intuitively, since there are two solutions for x whenever f(x) equals some value (except at x = 0), the gradient descent step could be taken in multiple directions, which can (but not always) lead to less efficient learning. E.g. if abs(x) = 2, do we walk in the direction where the gradient is -1 or in the direction where it is 1?

    • @alfcnz
      @alfcnz  3 years ago +1

      Yup.
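
    A tiny sketch of both points above, assuming PyTorch: abs(x) never switches an input off (its gradient is ±1 everywhere except at 0) and the sign of that gradient flips across the origin, while ReLU simply zeroes the gradient of every negative input.

    ```python
    import torch

    x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

    torch.relu(x).sum().backward()
    print(x.grad)    # tensor([0., 0., 1., 1.])   -> negative inputs are switched off

    x.grad = None
    x.abs().sum().backward()
    print(x.grad)    # tensor([-1., -1., 1., 1.]) -> every input still contributes, and the
                     #    gradient's sign depends on which side of zero the input sits
    ```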