I assume "dividing by degree" should be a part of the aggregation instead of messaging. Because the children do not know the degree of the parent vector.
This might be nitpicking, but I would still like to ask for conceptual clarity: Doesn't it make more sense to ascribe (at 10:20, when trying to understand the classical GCN layer as a message transformation + aggregation scheme) the normalizing factor 1/|N(v)| to the aggregation phase since it depends on the node towards which the messages are being passed and not just the message itself? In other words, pull it out of the sum using the distributive law? If one does that the message transformation step will only depend on the message and not on the node it is being delivered to. That seems conceptually cleaner to me for something to be thought of as "message transformation".
Later on when graph attention networks are described, the normalizing factor is interpreted as an attention weight and so is clearly part of the aggregation scheme (not message transformation scheme).
The explanation of multi-head attention is confusing, and I believe the notation is incorrect, had to look into the GAT paper. Each of the K heads gets its own learnable Wk matrix (not the same Wl as in the equations), this is how they end up computing different functions; the underlying math stays the same in each case.
When he said we have the value of (1:30) the self node from previous later, is he referring to the value of V as in last itteration because here layer is used for actually the layer before it where v doesn't actually exist. Its drilling in to the words too much but just wanna know for absolute certainty. Or it is the self loop that is being used as a previous layer here
Two questions that I had after watching this lecture: Is there a qeuery key value interpretation of the simple "linear attention" mechanism in graph attention networks? At 22:07 when describing how to normalize the unnormalized attention weights a_{vu} in the graph attention network scheme why do we not just divide by the total sum of all the attention weights? Why do we use a softmax to suppresses the non-maximal weights to mostly focus in on a single neighbor only?
Coefficients e_AB can be negative in the way they are defined because the last step is a matrix multiplication. Perhaps, one can use ReLU and then what you've described. It might be an interesting thing to try in GraphGym.
Why the multiple attention scores would have different values and not one value? Probably to several local minima.... If so, I assume attention scores are not very efficient since they converge to local minima. If we have single attention which converges to quasi-global minimum it would be more efficient?
so the box interprets as a neural network which means "applying transformation followed by a non-linearity to create a next level message" (mentioned in a previous lecture). How is it then aggregation function? Am I missing anything, would be helpful if anyone clarifies
This lecture is the best one in the whole series.
About the attention in GAT (19:00): it cannot reduce the computation cost.
At 27:00, which other functions (besides softmax) can be used to get different alpha coefficients?
I assume "dividing by degree" should be a part of the aggregation instead of messaging. Because the children do not know the degree of the parent vector.
This might be nitpicking, but I would still like to ask for conceptual clarity: Doesn't it make more sense to ascribe (at 10:20, when trying to understand the classical GCN layer as a message transformation + aggregation scheme) the normalizing factor 1/|N(v)| to the aggregation phase since it depends on the node towards which the messages are being passed and not just the message itself? In other words, pull it out of the sum using the distributive law? If one does that the message transformation step will only depend on the message and not on the node it is being delivered to. That seems conceptually cleaner to me for something to be thought of as "message transformation".
Later on when graph attention networks are described, the normalizing factor is interpreted as an attention weight and so is clearly part of the aggregation scheme (not message transformation scheme).
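For concreteness, a sketch of the decomposition being proposed, with the 1/|N(v)| factor pulled out of the sum into the aggregation step (standard GCN message-passing notation, not copied verbatim from the slide):

```latex
\text{Message: } m_u^{(l)} = W^{(l)} h_u^{(l-1)}
\qquad
\text{Aggregation: } h_v^{(l)} = \sigma\!\left(\frac{1}{|N(v)|}\sum_{u \in N(v)} m_u^{(l)}\right)
```

Written this way, the message depends only on the sender u, and everything that depends on the receiver v (the degree normalization) sits in the aggregation.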
The explanation of multi-head attention is confusing, and I believe the notation is incorrect; I had to look into the GAT paper. Each of the K heads gets its own learnable W^k matrix (not the same W^l as in the equations); this is how they end up computing different functions. The underlying math stays the same in each case.
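For reference, the multi-head formula as written in the GAT paper, adapting its i, j indices to v, u (a separate W^k per head, α^k the head-k attention weights, || concatenation):

```latex
h_v' = \Big\Vert_{k=1}^{K} \, \sigma\!\left(\sum_{u \in N(v)} \alpha_{vu}^{k}\, W^{k} h_u\right)
```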
When he says (at 1:30) that we have the value of the self node from the previous layer, is he referring to the value of v from the last iteration? Here "layer" is used for the layer before this one, where v's new embedding doesn't exist yet. It's drilling into the words too much, but I just want to know for absolute certainty.
Or is it the self loop that is being used as the previous layer here?
I think he is referring to the feature vector of v from the previous layer.
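To make "previous layer" concrete, a sketch in standard message-passing notation; the B^{(l)} self-transformation term is one common way a GCN-style layer keeps the node's own embedding (an assumption for illustration, not a quote from the slide). Layer 0 is just the input features, so h_v^{(l-1)} always exists:

```latex
h_v^{(0)} = x_v, \qquad
h_v^{(l)} = \sigma\!\left(W^{(l)} \sum_{u \in N(v)} \frac{h_u^{(l-1)}}{|N(v)|} \;+\; B^{(l)} h_v^{(l-1)}\right)
```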
Two questions that I had after watching this lecture:
Is there a query-key-value interpretation of the simple "linear attention" mechanism in graph attention networks?
At 22:07, when describing how to normalize the unnormalized attention weights a_{vu} in the graph attention network scheme, why do we not just divide by the total sum of all the attention weights? Why do we use a softmax, which suppresses the non-maximal weights and mostly focuses on a single neighbor only?
Yes, softmax kind of suppresses lower values more.
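A small numerical sketch of the difference (plain NumPy, hypothetical scores): dividing by the raw sum breaks down when scores can be negative, while softmax exponentiates first, so the weights are always positive and form a proper distribution, with larger scores emphasized more strongly:

```python
import numpy as np

# Hypothetical unnormalized attention scores e_{vu} for three neighbors of v.
e = np.array([2.0, 1.0, -0.5])

# Naive normalization: divide by the raw sum.
# Not a valid distribution when scores are negative
# (weights can fall outside [0, 1], and the sum could even be zero).
naive = e / e.sum()
print(naive)                       # [ 0.8   0.4  -0.2 ]

# Softmax: exponentiate, then normalize.
# Always a valid distribution; larger scores get emphasized more.
alpha = np.exp(e) / np.exp(e).sum()
print(alpha)                       # approx [0.69, 0.25, 0.06]
```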
Coefficients e_AB can be negative in the way they are defined because the last step is a matrix multiplication.
Perhaps one could apply a ReLU first and then do what you've described. It might be an interesting thing to try in GraphGym.
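For what it's worth, the GAT paper itself applies a LeakyReLU to the raw score and then a softmax, so negative scores are not a problem, since the exponential makes every weight positive anyway (adapting the paper's notation to v, u):

```latex
e_{vu} = \mathrm{LeakyReLU}\!\left(a^{\top}\left[\,W h_v \,\Vert\, W h_u\,\right]\right),
\qquad
\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}
```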
From the lecture it is not clear how Batch Norm works for the mini-batch.
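For reference, standard BatchNorm over a mini-batch of N node embeddings normalizes each feature dimension j with the batch statistics and then rescales with learnable parameters (γ, β; ε a small constant):

```latex
\mu_j = \frac{1}{N}\sum_{i=1}^{N} h_{i,j}, \quad
\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} \left(h_{i,j}-\mu_j\right)^2, \quad
\hat{h}_{i,j} = \frac{h_{i,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}}, \quad
y_{i,j} = \gamma_j \hat{h}_{i,j} + \beta_j
```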
In a sense, GCN also considers previous layer embedding (via self loop) of the present node while doing aggregation. Isn't it?
Yes, he mentioned this in his slides.
Yes, the PyTorch Geometric library adds self loops to the graph nodes by default.
@yb801 That would also mean the normalization is done by dividing by |N(v)| + 1, right?
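A minimal sketch with PyTorch Geometric (toy graph, random features): GCNConv adds self-loops by default, so node v's own previous-layer embedding enters the aggregation, and the degrees used for normalization include the self-loop (i.e. |N(v)| + 1):

```python
import torch
from torch_geometric.nn import GCNConv

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])   # undirected path 0 - 1 - 2
x = torch.randn(3, 8)                        # 3 nodes, 8 input features

conv = GCNConv(8, 16)                        # add_self_loops=True by default
out = conv(x, edge_index)
print(out.shape)                             # torch.Size([3, 16])
```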
Why would the multiple attention scores have different values rather than one value? Probably because they converge to several different local minima... If so, I assume multiple attention heads are not very efficient, since they converge to local minima. If we had a single attention head that converged to a quasi-global minimum, wouldn't that be more efficient?
Random initialization leads to different learned weights; the heads start in different basins of attraction.
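A toy illustration (NumPy, hypothetical per-head attention vectors): two heads that are identical in form but differ only in their random initialization already attend to the same neighborhood differently, which is what lets them specialize during training:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))        # embeddings of 4 neighbors, 8 dims

# Two attention heads with the same form, different random init.
a1 = rng.normal(size=8)
a2 = rng.normal(size=8)

print(softmax(h @ a1))             # head 1's weights over the neighbors
print(softmax(h @ a2))             # head 2's weights come out different
```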
So the box is interpreted as a neural network, which means "applying a transformation followed by a non-linearity to create a next-level message" (mentioned in a previous lecture). How is it then an aggregation function? Am I missing anything? It would be helpful if anyone could clarify.
The transformation applied before the non-linearity can include the aggregation function.
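Schematically (generic message-passing notation, not the slide's exact symbols), the box can be read as the linear transformation and non-linearity applied on top of the aggregated incoming messages:

```latex
h_v^{(l)} = \sigma\!\left(W^{(l)} \cdot \mathrm{AGG}\big(\{\, m_u^{(l)} : u \in N(v) \,\}\big)\right)
```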
ua-cam.com/video/247Mkqj_wRM/v-deo.html How could it be both localized AND have inductive capability?