One of the best lectures on SSL ever. Thank you, Alfredo and Ishan, for making this available to everyone.
🥳🥳🥳
I'm sure students from all over the world thank you a lot.
🤗🤗🤗
Awesome lecture covering all the different methods of unsupervised learning! Thank you for making these videos public.
💪🏻💪🏻💪🏻
I have a question regarding Barlow Twins. Q1: For a batch of B samples, the output of the projector networks will be B×D. We have two such projections, A and B. We know that rank(AxB)
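For context, here is a minimal sketch (my own, not from the lecture) of the object this question is about: the D×D cross-correlation matrix that Barlow Twins builds from the two B×D projector outputs. Names and shapes are illustrative.

```python
import torch

def cross_correlation(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Barlow Twins-style cross-correlation of two B x D projector outputs.

    Returns a D x D matrix that the loss pushes toward the identity.
    """
    B = z_a.shape[0]
    # Standardize each dimension over the batch (zero mean, unit variance).
    z_a = (z_a - z_a.mean(dim=0)) / z_a.std(dim=0)
    z_b = (z_b - z_b.mean(dim=0)) / z_b.std(dim=0)
    # C = (1/B) z_a^T z_b is a product of D x B and B x D factors,
    # so rank(C) <= min(B, D): with B < D it cannot be full rank.
    return (z_a.T @ z_b) / B

# Illustrative shapes only: batch of 32, projector dimension 128.
c = cross_correlation(torch.randn(32, 128), torch.randn(32, 128))
print(c.shape)  # torch.Size([128, 128])
```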
Thanks for this! Can't wait to see how the best of all worlds can be combined for SSL!
🔥🔥🔥
Thanks, Ishan. This is excellent.
🥳🥳🥳
Excellent presentation. Thanks
A really informative lecture on self-supervised learning.
Thanks for this. Really terrific content.
🥳🥳🥳
This is excellent, thank you!
You're very welcome! 😇😇😇
This is actually what my research is focusing on. Hopefully I can finish it in time to apply for a PhD at NYU and join you, Alfredo.
😍😍😍
Very informative as usual. Thank you @Alfredo
🤓🤓🤓
Every two minutes he references others' work and his own. I was shocked and overwhelmed by the number of papers referenced in just this one lecture. Haha.
😅😅😅
@@alfcnz I enjoyed this session; very good content. Thanks for organizing it.
Beautiful lecture! Thanks :)
Prego 😇😇😇
Thank you so much sir
You're welcome 🤗🤗🤗
Basically an informative video :-)
🤓🤓🤓
basically agree :)
🤓🤓🤓
People really need to stop using linear classifiers to gauge the “correctness” of representations learned at different layers!!
Use something like a Silhouette score, or anything that measures *local* consistency of the representation (could also use a k-fold Delaunay interpolant approximation if you’re attached to things being locally linear).
ReLU networks capture linearly separable subsets of the data at each layer, which means even the layer two before the output can hold a highly nonlinear representation of the data that the remaining layers can still easily untangle. You won't succumb to this problem if you measure the local continuity of a representation with respect to your target output instead of using a global linear approximation.
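To make this concrete, here is a minimal sketch of the kind of local evaluation being suggested, using scikit-learn's silhouette score on layer activations grouped by target label. The arrays are random placeholders standing in for real activations and labels.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))  # placeholder: N x D layer activations
labels = rng.integers(0, 10, size=500)   # placeholder: N target classes

# Silhouette score in [-1, 1]: how close each point is to its own class
# relative to the nearest other class. High values mean the representation
# is locally consistent with the labels, regardless of linear separability.
print(silhouette_score(embeddings, labels))
```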
Thanks for the comment! I agree that evaluating representations with linear classifiers is not sufficient. Like you suggest, there are many different ways to evaluate them, and each of them tests different aspects of representations. Depending on the comparison/final application, the methodology for evaluating them will change.
How would you implement it? Do you have a use case? I understand the rationale but don't understand how to use something like a Silhouette score here.
@@khushpatelmd Great questions. A simplified example: consider a binary classification problem where the model outputs a single number (the truth is either 0 or 1). Suppose we want to evaluate the amount of information captured by an embedding relative to this downstream prediction task.
Option 1: We could measure the mean squared error of a best-fit linear function over the embedded data. In effect, this measures how "linearly separable" our embedded data is for this classification problem.
Option 2: You compute the average distance to the nearest point of a different class (for all points) minus the average distance to the nearest point of the same class. (similar to the concept behind Silhouette scores, which answer the question "how near is each point to its own cluster relative to other clusters?")
Now imagine that the embedding has data placed perfectly in a separated "three stripe" pattern, where the left stripe is all 0's, the middle stripe is all 1's, and the right stripe is all 0's. The pure linear evaluation (option 1) will tell us that the embedding yields about 66% accuracy (not so good). However, a nearness approach (option 2) would tell us that the embedding is very good, with every point's nearest neighbors in the same class (distance to other class - distance to same class >> 0). Realistically, option 2 is correct here, because there is a very simple 2-hidden-node MLP that can *perfectly* solve the binary classification problem given this embedding.
I realize that some people might say, "well, option 2 is irrelevant if you always know you're going to use a linear last layer." But that's beside the point. In general, we are trying to evaluate how representative the newly learned geometry is for downstream tasks. Restricting ourselves to only linearly-good geometries for evaluation is unnecessary and can be misleading. In the end, most people care about how difficult it would be to take an embedding and train a new accurate model on top of it. I assume few people will arbitrarily restrict themselves to linear models in practice.
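Here is a runnable version of the three-stripe example (my own construction, following the description above): a 1-D embedding with stripes labeled 0, 1, 0, scored once with a linear probe and once with a nearest-neighbor evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Three well-separated stripes on the line: class 0, class 1, class 0.
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 1, 300),   # left stripe,   class 0
                    rng.uniform(2, 3, 300),   # middle stripe, class 1
                    rng.uniform(4, 5, 300)])  # right stripe,  class 0
y = np.array([0] * 300 + [1] * 300 + [0] * 300)
X = x.reshape(-1, 1)

# Option 1: a linear probe. One threshold on the line can get at most
# two of the three stripes right, so accuracy tops out near 2/3.
print("linear probe:", LogisticRegression().fit(X, y).score(X, y))  # ~0.67

# Option 2: a local / nearness evaluation. Every point's nearest
# neighbors share its class, so this is essentially perfect.
print("3-NN:", KNeighborsClassifier(n_neighbors=3).fit(X, y).score(X, y))  # ~1.0
```

The linear probe tops out near the majority-class rate, while the local evaluation is essentially perfect, which is exactly the gap the argument above points at.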
@@tchlux Thanks a lot, Thomas. You explained this very clearly.
Great video!
😎😎😎
Hi, are you planning to add subtitles or enable automatic captions?
Automatic captions should be enabled by default. I'll check later whether and why this is not working. Thanks for the feedback. 🙏🏻🙏🏻🙏🏻
I'm in touch with the YouTube support team. They have identified the issue and are currently working on it. I'll let you know when there is any update.
Thank you for your patience. 😇😇😇
@@alfcnz THANK YOU VERY MUCH!! I really appreciate the length that you're going through just to make sure the auto caption is on 😭 Once again, thank you very much!
😇😇😇
They replied and… I'm losing my patience. YouTube support is not cooperating. I'm escalating this soon.
I'm not sure what part of “feed the audio stream to your speech-to-text model” is hard to comprehend.
Thank you, Alfredo :) It will be very difficult for me to read all the materials mentioned in the video within a week.
I'm now aiming at two videos per week.
Haha, sorry 😅😅😅
So it will be very, very, very difficult for me to read everything, but I will try. Thank you for videos, Alfredo :)
Wait for our CVPR paper that will solve the memory problem. We hope it will be accepted.
🤞🏻🤞🏻🤞🏻
I was wondering about something. In contrastive learning, if one uses a self-attention transformer encoder over the batch dimension before feeding the representations to the contrastive loss, will it ruin the objective of contrastive learning? I ask because the transformer encoder over the batch will basically reweight the representation of each sample according to its dot-product similarity with the other samples.
Thank you for the wonderful introduction btw.
Why would you want to use a transformer “within the batch dimension” (whatever this means)?
Can you clarify what you're trying to do? 🤔🤔🤔
@@alfcnz I sent you an email. Thanks!
I don't have the bandwidth to reply to emails, I'm sorry. I haven't checked them in a few months now, I think.
So I was watching the "Scaling machine learning on graphs" @Scale talk the other day, in which they used the contrastive method with massive parallelism and negative sampling to prevent trivial-solution collapse:
fb.watch/v/1pqXNP5au/
After this lecture, I now wonder whether we can use the other options (clustering, distillation, and redundancy reduction) in the arsenal instead. Has anyone at Facebook tried those for graph-embedding training yet?
Yup, we can indeed use the other techniques, where the positive pairs are defined by the adjacency matrix (connectivity defined by the graph). For the question about whether FB has tried these, I'll let Adam reply. (Let me ping him.)
Hi Chuan-Chih, thanks for watching my talk! I don't know of anyone at Facebook who has applied these unsupervised methods to the problem of learning node features for graphs. The graph embeddings problem is a little different from learning unsupervised image features, so I don't immediately see how these methods would apply, but I wouldn't be surprised if there was a way!
In the type of unsupervised learning described in this talk, you are learning a function f that converts a high-dimensional feature vector x_i into low-dimensional semantic feature z_i. In the graph embedding setting, the nodes don't have input features - you *learn the input features* in order to approximate the adjacency matrix. There are probably ways to apply these methods if you think of the one-hot edge list (aka each node's row of the adjacency matrix) as the features, but I haven't thought about it.
Maybe a better place to start is the graph neural network setting where nodes *do* have input features and you're learning a function f that combines the features over the graph neighborhood to predict some supervised labels. I haven't seen any work on unsupervised graph neural networks, but there probably is some and some of these same approaches may work well!
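To illustrate the suggestion above that positive pairs could be defined by the adjacency matrix, here is a purely hypothetical sketch, not anything from the talk or from Facebook: treat each node's adjacency row as its input feature and apply a Barlow Twins-style redundancy-reduction loss to the embeddings of connected node pairs.

```python
import torch

torch.manual_seed(0)
N, D = 100, 32
adj = (torch.rand(N, N) < 0.05).float()
adj = ((adj + adj.T) > 0).float()              # symmetric, unweighted graph

encoder = torch.nn.Sequential(                 # adjacency row -> embedding
    torch.nn.Linear(N, 64), torch.nn.ReLU(), torch.nn.Linear(64, D))

src, dst = adj.nonzero(as_tuple=True)          # positive pairs = edge endpoints
z = encoder(adj)                               # N x D node embeddings
z_a, z_b = z[src], z[dst]                      # E x D paired "views"

# Redundancy reduction over the pairs: the D x D cross-correlation
# of paired embeddings should approach the identity.
z_a = (z_a - z_a.mean(0)) / z_a.std(0)
z_b = (z_b - z_b.mean(0)) / z_b.std(0)
c = (z_a.T @ z_b) / z_a.shape[0]
on_diag = (torch.diagonal(c) - 1).pow(2).sum()
off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
loss = on_diag + 5e-3 * off_diag               # 5e-3: off-diagonal weight (hyperparameter)
print(loss.item())
```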
Thanks Alf :)
You're welcome 😺😺😺
Awesome!
😻😻😻
That bear, he knows everything.
Indeed he does. He's been present at all my lessons! 🐻🐻🐻