Supervised Contrastive Learning

  • Published 23 Apr 2020
  • The cross-entropy loss has been the default loss for supervised learning in deep learning for the last few years. This paper proposes a new loss, the supervised contrastive loss, and uses it to pre-train the network in a supervised fashion. The resulting model, when fine-tuned on ImageNet, achieves a new state of the art (a minimal sketch of the loss follows the description below).
    arxiv.org/abs/2004.11362
    Abstract:
    Cross entropy is the most widely used loss function for supervised training of image classification models. In this paper, we propose a novel training methodology that consistently outperforms cross entropy on supervised learning tasks across different architectures and data augmentations. We modify the batch contrastive loss, which has recently been shown to be very effective at learning powerful representations in the self-supervised setting. We are thus able to leverage label information more effectively than cross entropy. Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes. In addition to this, we leverage key ingredients such as large batch sizes and normalized embeddings, which have been shown to benefit self-supervised learning. On both ResNet-50 and ResNet-200, we outperform cross entropy by over 1%, setting a new state of the art number of 78.8% among methods that use AutoAugment data augmentation. The loss also shows clear benefits for robustness to natural corruptions on standard benchmarks on both calibration and accuracy. Compared to cross entropy, our supervised contrastive loss is more stable to hyperparameter settings such as optimizers or data augmentations.
    Authors: Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, Dilip Krishnan
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology
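
A minimal PyTorch-style sketch of the supervised contrastive loss described in the abstract above (the multi-positive, in-batch "L_out" form). This is an illustrative reimplementation, not the authors' reference code; the function name and defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, D) embeddings (in practice two augmented views per image,
    concatenated along the batch dimension); labels: (N,) integer class labels.

    For each anchor, every other sample with the same label is a positive;
    all remaining samples in the batch act as negatives via the denominator.
    """
    z = F.normalize(features, dim=1)            # normalized embeddings (unit sphere)
    sim = z @ z.T / temperature                 # pairwise temperature-scaled similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))   # drop the anchor itself

    # log( exp(z_i . z_p / tau) / sum_a exp(z_i . z_a / tau) ) for every pair (i, p)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)      # guard anchors with no positive
    # average over each anchor's positives, then over all anchors
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss.mean()
```

Normalized embeddings and large batches are the two ingredients the abstract highlights; in practice each batch also holds two augmented views of every image so that every anchor has at least one positive.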

COMMENTS • 73

  • @aliawad2244
    @aliawad2244 1 year ago +1

    What an elegant explanation. Huge thanks!

  • @johnkrafnik5414
    @johnkrafnik5414 3 years ago

    Great stuff. I'm impressed by how many videos you have put up.

  • @davidcato6192
    @davidcato6192 3 years ago +1

    Excellent explanation, thank you!

  • @ghostlv4030
    @ghostlv4030 3 years ago +6

    The paper delivers the main idea clearly and effectively, and that idea is that they are rich!!!

  • @mhadnanali
    @mhadnanali 1 year ago

    Thank you very much. I was stuck on a problem in my contrastive learning paper implementation; your explanation helped me understand it better.

  • @aday7475
    @aday7475 1 year ago

    Thank you for the clear, great explanation!

  • @kalastasurepe
    @kalastasurepe 2 years ago +1

    Thanks a lot! I liked it so much! You explained it in a very simple way even though all of this is very complex.

  • @theBatchNorm
    @theBatchNorm 3 years ago

    Thank you for the enjoyable explanation

  • @philippeisen1910
    @philippeisen1910 4 years ago +12

    Nice level of detail for going over papers - really appreciate your work!
    I'm curious, what is your setup to create those nice visualizations?

  • @msfasha
    @msfasha 1 year ago

    Elegant and appreciated, thanx for the effort

  • @igorl01
    @igorl01 1 year ago

    Amazing explanations!

  • @yataoabian7094
    @yataoabian7094 3 years ago +1

    nice talk, Yannic!!

  • @lucazzo1990
    @lucazzo1990 2 years ago

    Thank you, very helpful!

  • @Shujaat-Khan
    @Shujaat-Khan 2 years ago

    Nice explanation 👌

  • @reginaldanderson7218
    @reginaldanderson7218 3 years ago +1

    Cool content

  • @songmeishu5445
    @songmeishu5445 3 years ago

    super great video!!!

  • @wyalexlee8578
    @wyalexlee8578 3 years ago

    Thank you for explaining this!

  • @florianhonicke5448
    @florianhonicke5448 3 years ago +1

    very informative

  • @GradientDude
    @GradientDude 4 years ago +2

    Hey! Thanks for the review. Which software do you use to annotate and draw on PDFs?

  • @JapiSandhu
    @JapiSandhu 1 year ago

    Good explanation, thanks

  • @ruitao2099
    @ruitao2099 3 years ago +3

    Nice talk! But I'm still confused about the motivation for supervised contrastive learning. What are the differences between it and normal supervised learning? We could get an embedding space by training a deep supervised model, taking the feature layers out, and putting them to different uses. Thanks for your reply!

    • @shaikrasool1316
      @shaikrasool1316 2 years ago

      Contrastive supervised learning is used to compare two images, for example in a Siamese network.

  • @user-st3dx8pd1o
    @user-st3dx8pd1o 24 days ago

    Thank you soo much!

  • @amirpourmand
    @amirpourmand 1 year ago

    Thanks! I just wanted to ask if you could make more videos that you actually code in them. I learned a lot from them.

  • @srivatsabhargavajagarlapud2274
    @srivatsabhargavajagarlapud2274 3 years ago +6

    It would have been great to see whether this (pre-training) method achieves, as a by-product, representations that honor semantic-similarity-based inter-class distances. By this I mean that, for example, cats are more similar in a semantic sense to dogs than cars/trucks are to dogs. So, after pre-training here, even though you haven't explicitly sought this in your loss (both in this supervised contrastive loss and in other losses such as the triplet losses more commonly used in Siamese nets), do you by any chance see d(cat, dog) < d(car/truck, dog)?

    • @YannicKilcher
      @YannicKilcher  3 years ago +2

      there is a hierarchy in imagenet, so this would actually be feasible (and I'm sure people have done this) :D

  • @bradhatch8302
    @bradhatch8302 3 years ago +1

    Listening at 1.75 speed it’s like I read and understood this paper in about 18 mins. Mucho thanks!

    • @louislouis7388
      @louislouis7388 3 years ago +1

      The paper tried to make it complicated. Not an interesting direction; it has no advantage over self-supervised learning at all. Reading that paper was just a waste of my time.

    • @Metalwrath2
      @Metalwrath2 3 years ago +1

      @@louislouis7388 Lots of papers do that. One of the reasons why I don't like academia.

  • @waterflarz
    @waterflarz 4 years ago +2

    Great paper review! What software do you use for pdf annotation and recording?

  • @herp_derpingson
    @herp_derpingson 4 years ago +19

    This doesn't sound very novel to me. I swear I saw something similar in an introductory ML course.
    Regardless, I wonder how much of that 1% is from this algorithm and how much is from raw GPU power.

    • @YannicKilcher
      @YannicKilcher  4 years ago

      Yes, I agree. This will have to be replicated before I believe it.

    • @kaushikroy4041
      @kaushikroy4041 4 years ago +5

      Herp Derpingson This sounds like supervised metric learning to me. Then take the last-but-one layer. Done before, to my mind.

    • @markdaoust4598
      @markdaoust4598 3 years ago +2

      Yes. Isn’t this “center loss”: ydwen.github.io/papers/WenECCV16.pdf

    • @delikatus
      @delikatus 3 years ago +1

      ​@@markdaoust4598 As said at 26:08, isn't it also pretty much the same thing as "siamese networks" / "triplet loss"? arxiv.org/pdf/1503.03832.pdf
      Also see: yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf and probably there's some Schmidhuber stuff that's exactly the same, too? :D
      Also relevant:
      arxiv.org/abs/1907.13625
      and
      arxiv.org/pdf/2003.08505.pdf

  • @shreejaltrivedi9731
    @shreejaltrivedi9731 3 years ago +1

    Great video, Yannic.
    I was curious about one thing. In contrastive pretraining, whether it is supervised or unsupervised, they apply the different augmentations and then do the pretraining. What if we apply the same augmentations that unsupervised contrastive pretraining uses to every image in my labeled dataset, and train the network on this new augmented dataset in the simple supervised fashion with the cross-entropy loss? At the end of the day, supervision and the sheer amount of data matter; in DL that is the best path to commendable results.
    What are your thoughts?

    • @YannicKilcher
      @YannicKilcher  3 years ago

      I don't know, but it's a good idea, maybe worth a try

  • @soufianekun11
    @soufianekun11 4 years ago +11

    I wonder why they didn't use the triplet loss of a Siamese network??!

    • @grb321
      @grb321 1 year ago

      The claim in the paper is that supervised contrastive loss is a lot more robust than triplet loss, which usually requires some form of negative example mining to work well. The authors also claim that supervised contrastive loss makes hyperparameter tuning easier, as classification performance is less sensitive on hyperparameter settings.
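
For contrast with the reply above, a minimal PyTorch sketch of the triplet setup: the triplet loss only compares one hand-picked (anchor, positive, negative) triple at a time, which is why it usually needs negative mining, whereas the supervised contrastive loss uses every positive and negative present in the batch. The shapes and margin below are illustrative.

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.2, p=2)

# Triples must be assembled explicitly beforehand, typically with some
# hard- or semi-hard negative mining scheme.
anchor   = torch.randn(32, 128, requires_grad=True)  # embeddings of the anchors
positive = torch.randn(32, 128)                      # same class as each anchor
negative = torch.randn(32, 128)                      # different class, ideally a "hard" one
loss = triplet_loss(anchor, positive, negative)
loss.backward()
```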

  • @user-nl4qz3ej1y
    @user-nl4qz3ej1y 1 year ago +1

    With the supervised contrastive loss, the augmented views of the images seem unnecessary.
    But without the two-crop-transform augmentation, the accuracy on CIFAR-10, CIFAR-100, and tinyImageNet drops by 3-5%, depending on the task.
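
A minimal sketch of the kind of two-crop wrapper the comment refers to, assuming torchvision-style transforms (the class name and the exact augmentation pipeline are illustrative): every image is augmented twice, so each sample contributes two views to the contrastive batch.

```python
from torchvision import transforms

class TwoCropTransform:
    """Return two independently augmented views of the same image."""
    def __init__(self, base_transform):
        self.base_transform = base_transform

    def __call__(self, x):
        return [self.base_transform(x), self.base_transform(x)]

# example augmentation pipeline for 32x32 inputs such as CIFAR-10/100
base = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),
    transforms.ToTensor(),
])
train_transform = TwoCropTransform(base)
```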

  • @patrickjdarrow
    @patrickjdarrow 4 years ago +2

    Couldn't you use a standard training epoch as a proxy for mining hard negatives? Before each next epoch, take the top-n highest-loss samples to use for contrastive learning (sketched below, after the reply).

    • @iloos7457
      @iloos7457 2 years ago

      Probably, a good extension to the triplet loss.. but perhaps unnecessary for SupCon. I feel like SupCon tries to solve the hard-negative problem with contrastive learning.
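
A hypothetical sketch of the idea proposed two comments up: after a standard epoch, rank samples by their individual loss and reuse the highest-loss ones for the contrastive stage. Every name here is illustrative; this is not something from the paper.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset

def select_hard_examples(model, dataset, n, device="cpu", batch_size=256):
    """Return the n highest-loss samples of `dataset` under `model`."""
    model.eval()
    per_sample_losses = []
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device))
            # per-sample cross-entropy, not averaged over the batch
            per_sample_losses.append(
                F.cross_entropy(logits, y.to(device), reduction="none").cpu()
            )
    losses = torch.cat(per_sample_losses)
    hard_idx = torch.topk(losses, k=n).indices.tolist()
    return Subset(dataset, hard_idx)   # feed this subset to the contrastive stage
```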

  • @underlecht
    @underlecht 2 years ago

    The paper on arXiv has changed a bit compared to the one in the video. Why is that? Also, the denominator, which is supposed to be minimized, includes not only the samples other than the positives, but also all positive augmentations except the anchor we are computing the loss for.
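
For reference, the "L_out" form of the loss as written in the arXiv version makes the denominator explicit: it runs over every other sample in the multiviewed batch, positives included, while the outer sums average the log-probability over each anchor's positives.

```latex
\mathcal{L}^{sup}_{out}
  = \sum_{i \in I} \frac{-1}{|P(i)|}
    \sum_{p \in P(i)}
    \log \frac{\exp(z_i \cdot z_p / \tau)}
              {\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},
\qquad
A(i) = I \setminus \{i\}, \quad
P(i) = \{\, p \in A(i) : y_p = y_i \,\}
```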

  • @eduarddurech5188
    @eduarddurech5188 3 years ago +2

    Yannic what did you mean by, "Supervised learning is the only thing right now in deep learning that works"? ;) Thank you for the videos btw!

  • @robertchamoun7914
    @robertchamoun7914 2 years ago +1

    Great explanation, thank you!!
    Can someone please explain to me what the benefit of contrastive pre-training would be compared to autoencoder pretraining for a CNN?

    • @abedog90210
      @abedog90210 2 years ago +2

      I think maybe because here (with contrastive loss) you're explicitly training your model to cluster the same images together,
      whereas in autoencoder pretraining you're training the encoder to extract useful features for reconstruction of the same image, hoping that images from the same class will have similar features in that latent space, but you're not explicitly telling it to do so.

    • @robertchamoun7914
      @robertchamoun7914 2 years ago

      thanks for the explanation.

  • @mattiasfagerlund
    @mattiasfagerlund 2 years ago

    You can still use unlabeled data for the negative samples, because the odds of them being in the same class are minuscule?

  • @loveislulu264
    @loveislulu264 2 months ago

    Can you provide a video on the implementation of supervised contrastive learning?

  • @abirnaskar3458
    @abirnaskar3458 4 months ago

    Nice, I was wondering how it would work on text, I mean if I replace transformers with this.
    Is there any paper that uses a transformer-based model along with contrastive learning?

  • @johnakbar1
    @johnakbar1 3 years ago +1

    Hey.. does the contrastive loss in self-supervised learning require the presence of a minimal number of positive samples in the denominator of the loss function? Would this make it harder to deploy on live unlabelled data or random samples?

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      The numerator is always included in the denominator, so you have some positive samples by construction

  • @DistortedV12
    @DistortedV12 4 years ago +3

    Yannic are you going to ICLR 2020?

    • @YannicKilcher
      @YannicKilcher  4 years ago +4

      If you mean whether I'll be sitting on my couch and on the internet, then yes :D I'll probably follow the interesting bits, panels and such

  • @LouisChiaki
    @LouisChiaki 3 years ago +1

    I was about to try this on the Kaggle competition until I saw their batch size...

  • @jinusbordbar1264
    @jinusbordbar1264 3 years ago

    TnX

  • @thongnguyen1292
    @thongnguyen1292 3 years ago +2

    1:30 "Supervised learning is the only thing right now in deep learning that works"
    Woaah who is making the big claim here :D

  • @XecutionStyle
    @XecutionStyle 3 years ago +1

    But that's the point, right? ImageNet percentages had saturated regardless of hardware. This answers "can we be more efficient?" just as much as "can we incorporate more compute?".

  • @jonatan01i
    @jonatan01i 4 years ago

    So, pre-train on a HUGE image dataset with self-supervised contrastive learning, then start from this network to pre-train on your dataset with the supervised contrastive loss, and only then comes the softmax.

  • @amirafsharmoshtaghpour8895
    @amirafsharmoshtaghpour8895 3 years ago +2

    Another excellent paper explanation. Around 23:00, I wonder why a hard positive amounts to z_i • z_p = 0, not -1.

    • @YannicKilcher
      @YannicKilcher  3 years ago +1

      -1 would be as much aligned as +1

    • @GradientDude
      @GradientDude 3 years ago

      @@YannicKilcher I also noticed that. I don't agree; the sign does matter here. If z_i • z_p = -1, then the loss will be pretty high for such a pair, because exp(z_i • z_p / τ) = exp(-1/τ), which is much, much smaller than exp(1/τ), and in this case the denominator in Eq. 4 will prevail.
      Actually, all the derivations in the supplementary break apart (maybe there is a mistake somewhere) if you consider a hard positive at -1 and a hard negative at 1.
      I'm very surprised that nobody has noticed such a flaw in the paper.

    • @vzoryan1769
      @vzoryan1769 2 years ago

      @@GradientDude They kinda leave this out, but in a high-dimensional space the probability of two random vectors being nearly orthogonal is close to 1. Therefore, it's improbable that a positive example will face the opposite direction, and you don't need to account for that. You can do a little numerical simulation and see for yourself.
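
A quick numerical check of the claim above (a minimal sketch, not from the thread): sample pairs of random unit vectors and look at their dot products; in high dimensions the values concentrate around 0 with standard deviation roughly 1/sqrt(dim).

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 128, 2048):
    # sample 10k pairs of random directions and normalize them to unit length
    u = rng.standard_normal((10_000, dim))
    v = rng.standard_normal((10_000, dim))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    dots = (u * v).sum(axis=1)
    print(f"dim={dim:5d}  mean={dots.mean():+.4f}  std={dots.std():.4f}")
```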

  • @yosefricardochmulek2822
    @yosefricardochmulek2822 3 years ago +1

    You gotta be a time traveler.

  • @Manu-em6ed
    @Manu-em6ed 3 years ago +1

    Isn't that just normal supervised learning with extra steps? :-P