Stanford CS224W: Machine Learning with Graphs | 2021 | Lecture 7.2 - A Single Layer of a GNN

  • Published 26 Dec 2024

COMMENTS • 22

  • @haongngoc1215
    @haongngoc1215 1 year ago +10

    This lecture is the best one in the whole series.

  • @Ray-b2w
    @Ray-b2w 2 months ago

    Regarding the attention in GAT at 19:00: it cannot reduce the computation cost.

  • @omarelsaka40
    @omarelsaka40 1 month ago

    At 27:00: which other functions (besides softmax) could be used to obtain different alpha coefficients?

  • @maksimkazanskii4550
    @maksimkazanskii4550 1 year ago +4

    I assume "dividing by degree" should be a part of the aggregation instead of messaging. Because the children do not know the degree of the parent vector.

  • @mathmo
    @mathmo 2 years ago +5

    This might be nitpicking, but I would still like to ask for conceptual clarity: doesn't it make more sense (at 10:20, when interpreting the classical GCN layer as a message-transformation + aggregation scheme) to ascribe the normalizing factor 1/|N(v)| to the aggregation phase, since it depends on the node towards which the messages are being passed and not just on the message itself? In other words, pull it out of the sum using the distributive law (see the sketch after this thread). If one does that, the message-transformation step depends only on the message and not on the node it is delivered to, which seems conceptually cleaner for something meant to be thought of as "message transformation".

    • @mathmo
      @mathmo 2 years ago +2

      Later on, when graph attention networks are described, the normalizing factor is interpreted as an attention weight, and so it is clearly part of the aggregation scheme (not the message-transformation scheme).
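
Since the point is just the distributive law, here is a minimal NumPy sketch (illustrative shapes and values, not the lecture's code) showing that the two groupings give the same node update:

```python
import numpy as np

# Toy GCN-style update for a single node v; shapes and values are made up.
rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))        # shared linear transform W^(l)
H_nb = rng.normal(size=(5, d_in))         # previous-layer embeddings h_u^(l-1), u in N(v)
deg_v = H_nb.shape[0]                     # |N(v)|

# Grouping 1 (as in the lecture): 1/|N(v)| is part of each message, aggregation is a plain sum.
messages = (H_nb @ W.T) / deg_v           # m_u = (1/|N(v)|) * W h_u
h_v_1 = messages.sum(axis=0)

# Grouping 2 (the comment's suggestion): messages depend only on u, aggregation is a mean.
transformed = H_nb @ W.T                  # m_u = W h_u
h_v_2 = transformed.sum(axis=0) / deg_v

assert np.allclose(h_v_1, h_v_2)          # identical, by the distributive law
```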

  • @ir0nt0ad
    @ir0nt0ad 19 days ago

    The explanation of multi-head attention is confusing, and I believe the notation is incorrect; I had to look into the GAT paper. Each of the K heads gets its own learnable W^k matrix (not the same W^l as in the equations); this is how they end up computing different functions, while the underlying math stays the same for each head.
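
For reference, a rough sketch of multi-head GAT attention for one node, with a separate learnable W_k and attention vector a_k per head as the GAT paper describes (shapes, the LeakyReLU slope, and all values here are illustrative assumptions):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(1)
d_in, d_out, K = 4, 3, 2                     # K = number of attention heads
h_v = rng.normal(size=d_in)                  # embedding of node v
H_nb = rng.normal(size=(5, d_in))            # embeddings of v's neighbors

head_outputs = []
for k in range(K):
    W_k = rng.normal(size=(d_out, d_in))     # per-head weight matrix (not shared across heads)
    a_k = rng.normal(size=2 * d_out)         # per-head attention vector
    z_v = W_k @ h_v
    Z_nb = H_nb @ W_k.T
    # unnormalized scores e_{vu} = LeakyReLU(a_k^T [W_k h_v || W_k h_u])
    e = leaky_relu(np.array([a_k @ np.concatenate([z_v, z_u]) for z_u in Z_nb]))
    alpha = softmax(e)                       # normalized attention coefficients
    head_outputs.append(alpha @ Z_nb)        # attention-weighted aggregation for head k

h_v_new = np.concatenate(head_outputs)       # heads are concatenated (or averaged in the final layer)
```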

  • @abhishekkumarjha2467
    @abhishekkumarjha2467 2 years ago

    When he says (at 1:30) that we have the value of the self node from the previous layer, is he referring to the value of v from the last iteration? "Layer" here refers to the layer before the current one, where v's new embedding doesn't exist yet. It's drilling into the words too much, but I just want to know for absolute certainty.
    Or is it the self-loop that is being used as the "previous layer" here?
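
For context, the generic single-layer update the lecture builds on can be written in standard message-passing notation (a paraphrase, not a verbatim quote from the slides):

```latex
h_v^{(l)} = \mathrm{COMBINE}\Big(\, h_v^{(l-1)},\; \mathrm{AGG}\big(\{\, \mathrm{MSG}^{(l)}(h_u^{(l-1)}) : u \in N(v) \,\}\big) \Big)
```

Under this reading, "the value of the self node from the previous layer" is h_v^{(l-1)}: the embedding of the same node v computed at the previous message-passing iteration (layer l-1). Some formulations instead include v in its own neighborhood via an explicit self-loop, which corresponds to the second interpretation in the question.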

  • @mathmo
    @mathmo 2 years ago

    Two questions that I had after watching this lecture:
    Is there a query-key-value interpretation of the simple "linear attention" mechanism in graph attention networks?
    At 22:07, when describing how to normalize the unnormalized attention weights e_{vu} in the graph attention network scheme, why do we not just divide by the total sum of all the attention weights? Why use a softmax, which suppresses the non-maximal weights and mostly focuses on a single neighbor?

    • @jaydeeppawar2336
      @jaydeeppawar2336 2 years ago

      Yes, the softmax kind of suppresses lower values more.

    • @BorisVasilevskiy
      @BorisVasilevskiy 1 year ago

      The coefficients e_AB can be negative as defined, because the last step is a matrix multiplication, so dividing by their plain sum would not give valid weights. Perhaps one could apply a ReLU first and then do what you've described; it might be an interesting thing to try in GraphGym. (See the small numeric sketch after this thread.)
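
A small numeric sketch of the point above (made-up scores, not from the lecture): softmax stays well-behaved when some raw scores e_{vu} are negative, whereas dividing by the plain sum does not:

```python
import numpy as np

e = np.array([2.0, 1.5, -3.0])         # raw scores e_{vu}; note they can be negative

# Softmax: always yields positive weights that sum to 1, even with negative inputs.
softmax = np.exp(e) / np.exp(e).sum()
print(softmax)                          # ~[0.62, 0.38, 0.004] -- small scores suppressed, not zeroed

# Naive normalization: breaks with negative scores (weights can be negative,
# and the denominator can shrink toward zero or change sign).
naive = e / e.sum()
print(naive)                            # [4.0, 3.0, -6.0] -- not a valid weighting
```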

  • @maksimkazanskii4550
    @maksimkazanskii4550 1 year ago

    From the lecture it is not clear how Batch Norm works for the mini-batch.
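
For what it's worth, the usual recipe (my assumption of the standard approach, not the lecture's code) is to normalize each feature dimension across all node embeddings in the mini-batch:

```python
import numpy as np

# Sketch of BatchNorm applied to node embeddings; sizes and values are illustrative.
rng = np.random.default_rng(2)
H = rng.normal(loc=3.0, scale=2.0, size=(64, 16))   # mini-batch: 64 nodes x 16 features
gamma, beta, eps = np.ones(16), np.zeros(16), 1e-5  # learnable scale/shift (shown at init values)

mu = H.mean(axis=0)                    # per-feature mean over the nodes in the batch
var = H.var(axis=0)                    # per-feature variance over the nodes in the batch
H_norm = (H - mu) / np.sqrt(var + eps) # zero mean, unit variance per feature dimension
H_out = gamma * H_norm + beta          # rescaled embeddings passed to the next layer
```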

  • @plabanb
    @plabanb 2 years ago +2

    In a sense, GCN also considers the previous-layer embedding of the present node (via a self-loop) while doing aggregation, doesn't it?

    • @jaydeeppawar2336
      @jaydeeppawar2336 2 years ago

      Yes, he mentioned this in his slides.

    • @yb801
      @yb801 2 years ago +1

      Yes, the pytorch_geometric library adds self-loops to the graph's nodes by default.

    • @shoummoahsan3483
      @shoummoahsan3483 1 year ago

      @yb801 That would also mean the normalization is done by dividing by |N(v)| + 1, right? (See the sketch after this thread.)
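
A quick sketch of that effect under the Kipf & Welling normalization that pytorch_geometric's GCNConv implements: adding self-loops (A_hat = A + I) means the degree used for normalization becomes |N(v)| + 1 (toy 3-node graph, illustrative only):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)       # adjacency of a small 3-node graph
A_hat = A + np.eye(3)                        # add self-loops
D_hat = np.diag(A_hat.sum(axis=1))           # degrees now count the self-loop: |N(v)| + 1
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D_hat)))

norm_adj = D_inv_sqrt @ A_hat @ D_inv_sqrt   # D_hat^{-1/2} (A + I) D_hat^{-1/2}
print(np.diag(D_hat))                        # [3. 2. 2.] -- i.e. |N(v)| + 1 for each node
```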

  • @maksimkazanskii4550
    @maksimkazanskii4550 1 year ago

    Why would the multiple attention scores have different values rather than a single value? Probably because of several local minima... If so, I assume the attention scores are not very effective, since they converge to local minima. Wouldn't a single attention mechanism that converges to a quasi-global minimum be more effective?

    • @ducanidaho
      @ducanidaho 1 year ago

      Random initialization leads to different learned weights; the heads start in different basins of attraction.

  • @prachijadhav9098
    @prachijadhav9098 1 year ago

    So the box is interpreted as a neural network, which means "applying a transformation followed by a non-linearity to create the next-level message" (mentioned in a previous lecture). How is it then an aggregation function? Am I missing anything? It would be helpful if anyone could clarify.

    • @ujjawalpanchal
      @ujjawalpanchal 1 year ago

      The transformation applied before the non-linearity can include the aggregation function.
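
A minimal sketch of that reading of the "box" (illustrative code, not the lecture's): the neighbor aggregation sits inside the layer, before the linear transformation and the non-linearity:

```python
import numpy as np

def gnn_layer(h_v, H_neighbors, W, b):
    agg = H_neighbors.mean(axis=0)           # aggregation step (here: mean of neighbor messages)
    z = W @ np.concatenate([h_v, agg]) + b   # linear transformation of (self, aggregated) input
    return np.maximum(z, 0.0)                # non-linearity (ReLU) gives the next-layer embedding

rng = np.random.default_rng(3)
d = 4
h_v = rng.normal(size=d)                     # node v's previous-layer embedding
H_nb = rng.normal(size=(5, d))               # neighbor embeddings from the previous layer
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)
h_v_next = gnn_layer(h_v, H_nb, W, b)
```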

  • @lujiahuang616
    @lujiahuang616 1 year ago

    ua-cam.com/video/247Mkqj_wRM/v-deo.html How could it be both localized AND have inductive capability?