BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper Explained)

  • Published 30 Apr 2024
  • Self-supervised representation learning typically relies on negative samples to keep the encoder from collapsing to trivial solutions. However, this paper shows that negative samples, which are a nuisance to implement, are not necessary for learning good representations, and their algorithm BYOL is able to outperform other baselines using just positive samples.
    OUTLINE:
    0:00 - Intro & Overview
    1:10 - Image Representation Learning
    3:55 - Self-Supervised Learning
    5:35 - Negative Samples
    10:50 - BYOL
    23:20 - Experiments
    30:10 - Conclusion & Broader Impact
    Paper: arxiv.org/abs/2006.07733
    Abstract:
    We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network. While state-of-the-art methods intrinsically rely on negative pairs, BYOL achieves a new state of the art without them. BYOL reaches 74.3% top-1 classification accuracy on ImageNet using the standard linear evaluation protocol with a ResNet-50 architecture and 79.6% with a larger ResNet. We show that BYOL performs on par or better than the current state of the art on both transfer and semi-supervised benchmarks.
    Authors: Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology

COMMENTS • 131

  • @tanmay8639
    @tanmay8639 3 роки тому +90

    This guy is epic.. never stop making videos bro

    • @Denetony
      @Denetony 3 роки тому +4

      Don't say that "never stop" ! He has been posting every single day! He deserves a holiday or at least a day break haha

  • @amitchaudhary9869
    @amitchaudhary9869 3 роки тому +54

    I have an intuition for why they add an MLP after the ResNet-50. Say a student has an exam tomorrow on a subject. Their studying in the few days before will mostly focus on the questions/topics likely to come up the next day rather than on a holistic understanding of the subject itself.
    Similarly, if we used the ResNet-50 representations directly for the task, they would be updated to get better at the contrastive loss rather than to produce useful, generic representations. The SimCLR paper also sheds light on this through its ablation studies: in learning to separate two things, the network could discard properties of the representation that don't help it separate them but would be useful for downstream tasks.
    Adding an MLP projects into a space where the MLP's representations can specialize for the contrastive task, while the ResNet-50 keeps producing generic representations. As an NLP analogy, this reminds me of the query/key/value split in the transformer model.
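
    To make that split concrete, here is a minimal PyTorch-style sketch of the pipeline the comment describes: a ResNet-50 backbone producing the generic representation y, an MLP projector producing z for the bootstrap objective, and (specific to BYOL) a predictor that exists only on the online branch. The layer sizes (2048 -> 4096 -> 256) are the ones reported in the paper, but the module itself is illustrative, not the authors' code, and it assumes a recent torchvision.

      import torch.nn as nn
      import torchvision

      def mlp(in_dim, hidden_dim=4096, out_dim=256):
          # projector / predictor head used on top of the backbone
          return nn.Sequential(
              nn.Linear(in_dim, hidden_dim),
              nn.BatchNorm1d(hidden_dim),
              nn.ReLU(inplace=True),
              nn.Linear(hidden_dim, out_dim),
          )

      class OnlineNetwork(nn.Module):
          def __init__(self):
              super().__init__()
              backbone = torchvision.models.resnet50(weights=None)
              backbone.fc = nn.Identity()      # keep the 2048-d pooled features
              self.encoder = backbone          # f: x -> representation y
              self.projector = mlp(2048)       # g: y -> projection z
              self.predictor = mlp(256)        # q: online branch only

          def forward(self, x):
              y = self.encoder(x)              # what you keep for downstream tasks
              z = self.projector(y)            # what the BYOL objective is computed on
              return y, z, self.predictor(z)

    At evaluation time only self.encoder is kept; the projector and predictor are thrown away, which is exactly the point about y staying generic.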

    • @YannicKilcher
      @YannicKilcher  3 роки тому +3

      Yes that makes sense

    • @tengotooborn
      @tengotooborn 3 роки тому +1

      Why would adding an MLP help that?

    • @3663johnny
      @3663johnny Рік тому

      I actually think this is to reduce the storage for image search engines🧐🧐

  • @BradNeuberg
    @BradNeuberg 3 роки тому +16

    Just a small note on how much I appreciate these in-depth paper reviews you've been doing! Great work, and it's a solid contribution to the broader community.

  • @1Kapachow1
    @1Kapachow1 3 роки тому +11

    There is actual motivation behind the projection step. The idea is that in our self-supervised task, we try to reach a representation which doesn't change with augmentations of the image. However, we don't know the actual downstream task, and we might remove TOO MUCH information from the representation. Therefore, adding a final "projection head" and taking the representation before that projection increases the chance of keeping "a bit more" information that is useful for other tasks as well. It is mentioned/described a bit in SimCLR.

    • @sushilkhadka8069
      @sushilkhadka8069 Місяць тому

      So the vector representation before the projection layer would contain fine-grained information about the input, and the vector after projection would contain more semantic meaning?

  • @PhucLe-qs7nx
    @PhucLe-qs7nx 3 роки тому +23

    This can be seen, following Yann LeCun's talk, as a method in the "Regularized Latent Variable Energy-Based Model" family, instead of the "Contrastive Energy-Based Model" one. In the contrastive setting, you need positive and negative samples to know where to push your energy function up and down. This method doesn't need negative samples because it's implicitly constrained to have the same amount of energy as the "target network", only gradually shifted around by the positive samples.
    So I agree with Yannic, there must be some kind of balance in the initialization for it to work.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +8

      Very good observation

    • @RaviAnnaswamy
      @RaviAnnaswamy 3 роки тому +3

      This is a deeper principle. It is much easier to learn from what is working than from what is NOT correct, but it is slower.

  • @AdamRaudonis
    @AdamRaudonis 3 роки тому +4

    These are my favorite videos! Love your humor as well. Keep it up!

  • @siyn007
    @siyn007 3 роки тому +22

    The broader impact rant is hilarious

    • @veedrac
      @veedrac 3 роки тому +1

      What's wrong with it though? The paper has no clear broader ethical impacts beyond those generally applicable to the field, so it makes sense that they haven't said anything concrete about their work.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +8

      Nothing is wrong. It's just that it's pointless, which was my entire point about broader impact statements.

    • @veedrac
      @veedrac 3 роки тому

      ​@@YannicKilcher If we only had papers like these, sure, but not every paper is neutral-neutral alignment.

    • @theodorosgalanos9663
      @theodorosgalanos9663 3 роки тому +3

      The 'broader impact Turing test': if a paper requires a broader impact statement for readers to see its impact outside of the context of the paper, then it doesn't really have one. The broader impact of your work should be immediately evident after the abstract and introduction, where the real problem you're trying to solve is described.
      Most papers in ML nowadays don't really do that; real problems are tough and they don't let you set a new SOTA on ImageNet.
      P.S. As a practitioner in another domain, I admit my bias towards real-life applications in this rant; I'm not voting for practice over research, I'm just voting against separating the two.

    • @francois__
      @francois__ 3 роки тому

      Probably it's a field required by a journal.

  • @clara4338
    @clara4338 3 роки тому +1

    Awesome explanations, thank you for the video!!

  • @alifarrokh9863
    @alifarrokh9863 11 місяців тому

    Hi Yannic, I really appreciate what you do. Thanks for the great videos.

  • @bensas42
    @bensas42 Рік тому +1

    This video was amazing. Thank you so much!!

  • @actualBIAS
    @actualBIAS 11 місяців тому +1

    Incredible. Your channel is a real gold mine. Thanks for the content!

  • @ayankashyap5379
    @ayankashyap5379 3 роки тому

    Thank you for all your videos

  • @thak456
    @thak456 3 роки тому

    love your honesty....

  • @florianhonicke5448
    @florianhonicke5448 3 роки тому +1

    Great summary!

  • @RohitKumarSingh25
    @RohitKumarSingh25 3 роки тому +10

    I think the idea behind the projection layer might be to avoid the curse of dimensionality for the loss function. In a low-dimensional space, distances between points are more discriminative, so that might help the L2 norm. As for why there is no collapse, no idea. 😅
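
    The "distances are more informative in low dimensions" intuition is easy to check numerically; this is a generic distance-concentration illustration, nothing specific to this paper:

      import torch

      torch.manual_seed(0)
      for dim in (2, 256, 4096):
          x = torch.randn(1000, dim)
          d = torch.cdist(x[:1], x[1:]).squeeze(0)   # distances from one point to all others
          spread = (d.max() - d.min()) / d.mean()    # relative spread of those distances
          print(dim, spread.item())                  # shrinks as the dimension grows

    The relative spread of pairwise distances shrinks as the dimension grows, which is one reason an L2-type loss may behave better in a 256-dimensional projection space than directly on 2048-dimensional backbone features.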

  • @dorbarber1
    @dorbarber1 Рік тому +1

    Hi Yannic, thanks for your great surveys of the best DL papers; I really enjoy them. Regarding your comments on whether the projection is necessary: it is explained better in the SimCLR paper (from Hinton's group), which was published around the same time as this paper. In short, they showed empirically that the projection does help. The reasoning is that you would like the "z" representation to be agnostic to the augmentation, but you don't need the "y" representation to be agnostic to it. In fact you want "y" to still separate the augmentations, so that it is easy to "get rid of" this information in the projection stage. The difference between y and z is not only the dimensionality: z is used for the loss calculation, while y is the representation you will eventually use in the downstream task.

  • @LidoList
    @LidoList Рік тому +1

    Super simple, all credits go to Speaker. Thanks bro ))))

  • @amirpourmand
    @amirpourmand Рік тому

    Thanks again!

  • @sushilkhadka8069
    @sushilkhadka8069 Місяць тому

    Okay, for those of you wondering how it avoids model collapse (learning a constant representation for all inputs) when there is no contrastive setting in the approach:
    it's because of the BatchNormalization layers and the EMA update of the target network.
    Both BN and the EMA encourage variability in the model's representations.
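
    For reference, the "slow-moving average" target update mentioned here is just an exponential moving average of the online weights. A minimal sketch (the decay 0.996 is the paper's base value; in the paper it is additionally annealed towards 1 over training):

      import torch

      @torch.no_grad()
      def ema_update(online_net, target_net, tau=0.996):
          # target <- tau * target + (1 - tau) * online, parameter by parameter
          for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
              p_target.mul_(tau).add_((1.0 - tau) * p_online)
          # buffers (e.g. BatchNorm running stats) are simply copied across here
          for b_online, b_target in zip(online_net.buffers(), target_net.buffers()):
              b_target.copy_(b_online)

    No gradient ever flows into the target network; it only trails the online network through this update.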

  • @Marcos10PT
    @Marcos10PT 3 роки тому +32

    Damn Yannick you are fast! Soon these videos will start being released before the papers 😂

  • @NewTume123
    @NewTume123 3 роки тому +1

    Regarding the projection head. The guys in the SimClr/SupCon paper showed that you get better results by using projection head (just need to remove it for future fine-tuning). That's why in the MoCoV2 they also started applying a projection head

  • @StefanausBC
    @StefanausBC 3 роки тому +1

    Hey Yannic, thank you for all this great videos. Always enjoying it. Yesterday, ImageGPT got published, I would be interested to get your feedback on that on too!

  • @Phobos11
    @Phobos11 3 роки тому +2

    Self-supervised week 👏

  • @daniekpo
    @daniekpo 3 роки тому

    When I first saw the authors, I wondered what DeepMind was doing with computer vision (this is maybe the first vision paper that I've seen from them), but after reading the paper it totally made sense coz they use the same idea as DQN and some other reinforcement learning algorithms.

  • @jwstolk
    @jwstolk 3 роки тому +10

    8 hours on 512 TPU's and still doesn't end up with a constant output. I'd say it works.

  • @JTMoustache
    @JTMoustache 3 роки тому +2

    For the projection, the parameter count goes from O(Y x N) to O(Y x Z + Z x N)... maybe that is the reason?
    Of course, that only makes sense if YN > ZY + ZN => N (Y - Z) > ZY => N > ZY / (Y - Z), with Z < Y.
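
    A quick numeric check of that inequality, with sizes merely in the right ballpark (Y = 2048 backbone features, Z = 256 projection size, N = 4096 for illustration; these are not taken from the paper's tables):

      Y, Z, N = 2048, 256, 4096

      direct = Y * N                 # one wide linear map straight from the backbone
      factored = Y * Z + Z * N       # going through a Z-dimensional bottleneck first

      print(direct, factored)        # 8388608 vs 1572864: the bottleneck is far smaller
      print(N > Z * Y / (Y - Z))     # the commenter's condition, 4096 > ~292.6 -> True

    So for these sizes the inequality holds comfortably, though parameter count is probably not the main motivation the SimCLR/BYOL authors had in mind.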

  • @franklinyao7597
    @franklinyao7597 3 роки тому +3

    Yannic, at 18:01 you said there is no need for a projection. My answer is that maybe the intermediate representation is better than the final one, because earlier layers learn universal representations while the final layers give you more specific representations for specific tasks.

  • @behzadbozorgtabar9413
    @behzadbozorgtabar9413 3 роки тому +1

    It is very similar to MoCo v2; the only differences are a deeper network (the prediction network, which I guess is where the gain comes from) and an MSE loss instead of the InfoNCE loss.
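
    For concreteness, the "MSE instead of InfoNCE" loss is a mean squared error between L2-normalized vectors, which works out to 2 - 2 * cosine similarity, computed on positive pairs only. A minimal sketch with [batch, dim] tensors:

      import torch.nn.functional as F

      def byol_loss(online_prediction, target_projection):
          # normalize both sides, then take the squared error (= 2 - 2 * cosine similarity)
          p = F.normalize(online_prediction, dim=-1)
          z = F.normalize(target_projection.detach(), dim=-1)   # no gradient through the target
          return (2.0 - 2.0 * (p * z).sum(dim=-1)).mean()

    In the paper this term is evaluated for both orderings of the two augmented views (each view plays the online role once) and the two losses are summed.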

  • @teslaonly2136
    @teslaonly2136 3 роки тому +1

    Nice!

  • @asadek100
    @asadek100 3 роки тому +4

    Hi Yannic, Can u make a video to explain Graph Neural Network.

  • @Phobos11
    @Phobos11 3 роки тому +2

    This configuration of the two encoders sounds exactly like stochastic weight averaging, only that the online and sliding-window parameters are both being used actively 🤔. From SWA, the second encoder should sit in a wider minimum, helping it generalize better.

  • @ensabinha
    @ensabinha Місяць тому

    16:20 using the projection at the end allows the encoder to learn and capture more general and robust features beneficial for various tasks, while the projection, closely tied to the specific contrastive learning task, focuses on more task-specific features.

  • @youngjin8300
    @youngjin8300 3 роки тому

    You rolling my friend rolling!

  • @cameron4814
    @cameron4814 3 роки тому +3

    this video is better than the paper. i'm going to try to reimplement this based on only the video.

  • @dcantor
    @dcantor Рік тому

    @Yannic I think the projections are important because otherwise the representations are too sparse to calculate a useful loss.

  • @lamdang1032
    @lamdang1032 3 роки тому +1

    My 2 cents on why collapse doesn't happen: for a ResNet, collapse means one of the final projection layers must have its weights all zeroed out. This is very difficult to reach, since it is only 1 global minimum vs. an infinite number of local minima. The moving average also helps, because for the weights to end up completely zeroed, the moving average must become zeroed too, which means the weights have to stay at zero for a lot of iterations. Mode collapse may have happened partially in the experiment where they remove the moving average.

  • @MH-km9gd
    @MH-km9gd 3 роки тому +1

    So it's basically an iterative mean teacher with a symmetrical loss?
    But the projection layer is a neat thing.
    It would have been nice if they had shown it working on a single GPU.

  • @dshlai
    @dshlai 3 роки тому +1

    I just saw that leaderboard on Papers with Code.

  • @mohammadyahya78
    @mohammadyahya78 Рік тому

    Thank you very much. Can this be used with NLP instead of images?

  • @citiblocsMaster
    @citiblocsMaster 3 роки тому +15

    33:40 Yeah let me just spin up my 512 TPUv3 cluster real quick

  • @herp_derpingson
    @herp_derpingson 3 роки тому +14

    00:00 I wonder if there is a correlation between number of authors and GPU/TPU hours used.
    .
    I don't really see the novelty in this paper. They are just flexing their GPU budget on us.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +6

      No, I think it's more like how many people took 5 minutes to make a spurious comment on the design doc at the beginning of the project.

  • @anthonybell8512
    @anthonybell8512 3 роки тому

    Couldn't you predict the distance between two patches to prevent the network from degenerating to the constant representation c? This seems like it would help the model even more for learning representations than a yes/no label because it would also encode a notion of distance - i.e. I expect patches at opposite corners of the image to be less similar than patches right next to each other.

  • @Schrammletts
    @Schrammletts 3 роки тому +2

    I strongly suspect that normalization is a key part of making this work. If you had a batchnorm on the last layer, then you couldn't map everything to 0, because the output must be mean 0 variance 1. So you have to find the mean 0 variance 1 representation that's invariant to the data augmentations, which starts to sound much more like a good representation
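
    One way to make the normalization intuition tangible (this is an illustration in the spirit of the comment and of the blog post linked further down the thread, not the paper's own argument): in training mode, a BatchNorm layer couples the samples in a batch, so no sample's output is a function of that sample alone.

      import torch
      import torch.nn as nn

      torch.manual_seed(0)
      bn = nn.BatchNorm1d(4).train()        # training mode: normalizes with batch statistics

      x = torch.randn(8, 4)
      out_a = bn(x)[0]                      # output for sample 0 in the original batch

      x_shifted = x.clone()
      x_shifted[1:] += 5.0                  # change every sample except sample 0
      out_b = bn(x_shifted)[0]              # sample 0's output changes anyway

      print(torch.allclose(out_a, out_b))   # False: batch statistics tie the samples together

    Because each output depends on the rest of the batch, mapping every input to exactly the same vector is harder than it looks, which is one candidate explanation for why BYOL does not collapse.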

  • @marat61
    @marat61 3 роки тому

    If you compare against the representation y', which is obtained using an exponential mean of the past parameters, your new parameters will stay as close to the past ones as possible in spite of all the possible augmentations. So the model is indeed forced to stay near the same spot and is pushed in the direction of simply ignoring the augmentations.

  • @roman6575
    @roman6575 3 роки тому +1

    Around the 17:30 mark: I think the idea behind using the projection MLP 'g' was to make the encoder 'f' learn more semantic features by using higher-level features for the contrastive loss.

    • @WaldoCampos1
      @WaldoCampos1 Рік тому

      The idea of using a projection comes from SimCLR's architecture. In their paper they proved that it improves the quality of the representation.

  • @jonatan01i
    @jonatan01i 3 роки тому +1

    The target network's representation of an original image is a planet (moving slowly) that pulls the augmented versions' representations to itself. The other original images' planet representations are far away and only if there are many common things in them do they have a strong attraction towards each other.

    • @dermitdembrot3091
      @dermitdembrot3091 3 роки тому +1

      Agreed! These representations form a cloud where there is a tiny pull of gravity toward the center. This pull is however insignificant compared to the forces acting locally. It would probably take a huge lot more training for a collapse to materialize.

    • @dermitdembrot3091
      @dermitdembrot3091 3 роки тому +1

      I suspect that collapse would first happen locally so that it would be interesting to test whether this method has problems encoding fine differences between similar images.

  • @daisukeniizumi6023
    @daisukeniizumi6023 3 роки тому

    "if you construct augmentations smartly" ;)

  • @maxraphael1
    @maxraphael1 3 роки тому +1

    The idea of weights moving averaging in teacher-student architecture was also used in this other paper arxiv.org/abs/1703.01780

  • @adamrak7560
    @adamrak7560 3 роки тому +2

    You are right in assuming that there are no strong guarantees to stop it from reaching the global minimum, which is degenerate.
    But the way they train it, with the averaged target, creates a _massive_ mountain around the degenerate solution. Gradients will point away from the degenerate case unless you are already too close to it, because of the "target" model. The degenerate case is only a _local_ attractor (and the good solutions are local attractors too, but far more numerous).
    This means that unless the step size is too big, this network should robustly converge to an adequate solution instead of the degenerate one.
    In the original setup, where you differentiate through both paths, you always get a gradient that points somewhat towards the degenerate solution, because you use the same network in the two paths and sum the gradients: in a local linear approximation this tends towards the degenerate case. There the degenerate case is a _global_ attractor (and the good solutions are only local attractors).

  • @darkmythos4457
    @darkmythos4457 3 роки тому +17

    Using stable targets from an EMA model is not really new; it is used a lot in semi-supervised learning, e.g. in the Mean Teacher paper. As for the projection, SimCLR explains why it is important.

    • @DanielHesslow
      @DanielHesslow 3 роки тому +2

      So if I understand it correctly, the projection is useful because we train the projected representations to be invariant to data augmentations; if we instead keep the layer before the projection as the representation, it can also retain information that does vary with the augmentations, which may be useful in downstream tasks?
      In SimCLR they also show that the output size of the projection is largely irrelevant. For this paper, however, I wonder if there is a point to projecting down to a lower dimension, in that it would increase the likelihood of getting stuck in a local minimum rather than degenerating to constant representations. Although 256 dimensions is still fairly large, so maybe that doesn't play a role.

    • @gopietz
      @gopietz 3 роки тому +1

      I still see his argument because to my understanding the q should be able to replace the projection step and do the same thing.

    • @DanielHesslow
      @DanielHesslow 3 роки тому +3

      @@gopietz The prediction network, q, is only applied to the online network, so if we were to yank out the projection network our embeddings would need to be invariant to data augmentations, so I do think there is a point to have it there.

  • @jasdeepsinghgrover2470
    @jasdeepsinghgrover2470 3 роки тому +1

    I guess it's because completely ignoring the input is kind of hard when there is a separate network that is tracking the parameters. Forgetting the input would be a huge change from the initial parameters, so it should increase the loss between the two networks and could thus be stopped.

  • @bzqp2
    @bzqp2 2 роки тому +3

    29:36 All the performance gains are from the seed=1337

  • @krystalp9856
    @krystalp9856 3 роки тому +4

    That broader impact section seems like an undergrad wrote it lol. I would know cuz that's exactly how I write reports for projects in class.

  • @mlworks
    @mlworks 2 роки тому

    It does sound a little like GANs, except instead of random noise and a ground-truth image, we have two variants of the same image learning from each other. Am I misinterpreting this completely?

  • @004307ec
    @004307ec 3 роки тому +1

    Off-policy design from RL?

  • @tsunamidestructor
    @tsunamidestructor 3 роки тому +4

    Hi Yannic, just dropping in (as usual) to say that I love your content!
    I was just wondering - what h/w and/or s/w do you use to draw around the PDFs?

  • @CapHuuQuan
    @CapHuuQuan 2 роки тому

    Wait, at 18:26, shouldn't it be q_theta(f_theta(v)) instead of q_theta(f_theta(z))?
    Anyway, great video!

  • @daisukeniizumi6023
    @daisukeniizumi6023 3 роки тому +2

    Here we have one mystery partly confirmed: this nice article shows that BN in the MLP has an effect like a contrastive loss:
    untitled-ai.github.io/understanding-self-supervised-contrastive-learning.html
    And paper:
    arxiv.org/abs/2010.00578

  • @citiblocsMaster
    @citiblocsMaster 3 роки тому +4

    31:12 I disagree slightly. I think the power of the representations comes from the fact that they throw an insane amount of compute at the problem. Approaches such as Adversarial AutoAugment (arxiv.org/abs/1912.11188) or AutoAugment more broadly show that it's possible to learn such augmentations.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +2

      Yes in part, but I think a lot of papers show just how important the particulars of the augmentation procedure are

  • @qcpwned07
    @qcpwned07 3 роки тому +1

    ua-cam.com/video/YPfUiOMYOEE/v-deo.html
    I don't think the projection can be ignored here. The ResNet has been trained for a long while, so its weights are fairly stable; it would probably take a long time to nudge them in the right direction, and you would lose some of its capacity to represent the input set. By adding the projection, you have a new naive network on which you can apply a greater learning rate without fear of wrecking the structure of the pre-learned model. -- Basically it serves somewhat the same function as the classification layer one would usually add for fine-tuning.

  • @AIology2022
    @AIology2022 3 роки тому +1

    What is the difference from a Siamese neural network? I did not see anything new.

  • @korydonati8926
    @korydonati8926 3 роки тому +7

    I think you pronounce this "B-Y-O-L". It's a play on "B-Y-O-B" for "Bring your own booze". Basically this means to bring your own drinks to a party or gathering.

  • @tuanphamanh7919
    @tuanphamanh7919 3 роки тому

    hay, chúc bạn một ngày tốt lành

  • @Ma2rten
    @Ma2rten 3 роки тому +2

    I'm not sure if I can put a link on YouTube, but if you google "Understanding self-supervised and contrastive learning with Bootstrap Your Own Latent", someone figured out why it works and doesn't produce the degenerate solution.

  • @eelcohoogendoorn8044
    @eelcohoogendoorn8044 3 роки тому +1

    One question that occurs to me is w.r.t. the batch size. They show it matters less than for methods that mine negatives within the batch, obviously. But why does it matter at all? Just because of batchnorm-related concerns? There are some good alternatives to batchnorm available nowadays, are there not? If I used GroupNorm, for instance, and my dataset was not too big / my patience sufficed (kind of the situation I am in), could I make this work on a single GPU with tiny batches? I don't see any reason why not. Just trying to get a sense of how important the bazillion TPUs are in the big picture.
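
    For what it's worth, trying exactly that is a small change: a sketch that swaps every BatchNorm2d in a torchvision ResNet-50 for a GroupNorm (assuming a recent torchvision; 32 groups divides every channel count in ResNet-50). Whether BYOL still avoids collapse afterwards is precisely the open question of this thread.

      import torch.nn as nn
      import torchvision

      def replace_bn_with_gn(module, num_groups=32):
          # recursively swap each BatchNorm2d for a GroupNorm over the same channels
          for name, child in module.named_children():
              if isinstance(child, nn.BatchNorm2d):
                  setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
              else:
                  replace_bn_with_gn(child, num_groups)

      resnet = torchvision.models.resnet50(weights=None)
      replace_bn_with_gn(resnet)     # normalization no longer depends on the batch size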

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 роки тому +1

      Ah right, they explicitly claim it's just the batch norm. Interesting; I would not expect it to matter that much with triple-digit batch sizes.

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      It's also the fact that there are better negatives in larger batches

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 роки тому

      @@YannicKilcher Yeah, it matters a lot if you are mining within the batch; been there, done that. I'm actually surprised at how little it seems to matter for the negatives in their example. Mine is a very big, memory-hungry model in a hardware-constrained situation; I'm having to deal with batches of size 4 a lot of the time. Sadly they don't go that low in this paper, but if it's true that batchnorm is the only bottleneck, that's very promising to me.

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 роки тому

      ...I'm not sampling negatives from the batch if the batch size is 4, obviously. Memory banks perform decently, but I also find them very tuning-sensitive: how tolerant is one supposed to be of the 'staleness' of entries in the memory bank, etc. Actually finding negatives of the right hardness consistently isn't easy. It's kind of interesting that the staleness of the memory bank entries can be a feature, if this paper is to be believed.

  • @bzqp2
    @bzqp2 2 роки тому +1

    10:30 The local minimum idea doesn't make too much sense with so many dimensions. Without any preventive mechanism it would always collapse to a constant. It's really hard for me to believe it's just an initialization balance.

    • @user-rh8hi4ph4b
      @user-rh8hi4ph4b 2 роки тому +1

      I think the different "target parameters" are crucial in preventing collapse here. As long as the target parameters are different from the online parameters, it would be almost impossible for both online and target parameters to produce the same constant output, _unless_ the model is fully converged (as in the parameters no longer change, causing the online and target parameters to become the same over time). So one might argue that the global minimum of "it's just the same constant output regardless of input" doesn't exist, since that behavior could never yield a minimal loss because of the difference in online and target parameters. If that difference is large enough, such as with a very high momentum in the update of the target parameters, the loss of the trivial/collapsed behavior might even be worse than that of non-trivial behaviors, preventing collapse that way.

  • @grafzhl
    @grafzhl 3 роки тому +3

    16:41 4092 is my favorite power of 2 😏

    • @YannicKilcher
      @YannicKilcher  3 роки тому +2

      Haha I realized while recording 😁

  • @snippletrap
    @snippletrap 3 роки тому +4

    Yannic, it's probably pronounced bee-why-oh-ell, after BYOB, which stands for Bring Your Own Beer.

  • @wagbagsag
    @wagbagsag 3 роки тому +1

    Why wouldn't the q(z) network (prediction of the target embedding) just be the identity? q(z) doesn't know which transformations were applied, so the only thing that differs consistently between the two embeddings is that one was produced by a target network whose weights are filtered by an exponential moving average. I would think that the expected difference between a signal and its EMA-filtered version should be zero, since EMA is an unbiased filter...?
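
    A quick toy illustration of the point Yannic makes in the reply below: for a parameter that keeps drifting during training, its exponential moving average systematically lags behind, so the target is not an unbiased copy of the online weights (toy numbers, nothing from the paper):

      tau = 0.99
      ema = 0.0
      for step in range(1, 1001):
          online = 0.01 * step                 # a parameter that drifts steadily upward
          ema = tau * ema + (1.0 - tau) * online

      print(online, ema)                       # ~10.0 vs ~9.01: the EMA trails the online value

    As long as training keeps moving the online parameters, the target stays a slightly "older" network, which gives the predictor q something non-trivial to model.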

    • @YannicKilcher
      @YannicKilcher  3 роки тому +2

      I don't think EMA is an unbiased filter. I mean, I'm happy to be convinced otherwise, but it's just my first intuition.

    • @wagbagsag
      @wagbagsag 3 роки тому

      ​@@YannicKilcher I guess the EMA-filtered params lag the current params, and in general that means that they should be higher loss than the current params. So the q network is basically learning which direction is uphill on the loss surface? I agree that it's not at all clear why this avoids mode collapse

    • @YannicKilcher
      @YannicKilcher  3 роки тому

      @@wagbagsag yes that sounds right. And yes, it's a mystery to me too 😁

  • @MariuszWoloszyn
    @MariuszWoloszyn 3 роки тому

    Actually it's the other way around. Transformers are inspired by transfer learning on image recognition models.

  • @Gabend
    @Gabend 3 роки тому

    "Why is this exactly here? Probably simply because it works." This sums up the whole deep learning paradigm.

  • @3dsakr
    @3dsakr 3 роки тому +3

    Very interesting! Thanks a lot for your videos, I just found your channel yesterday and I've already watched like 5 videos :)
    I'd like your help in understanding just one thing that I feel lost on,
    which is: learning what a single neuron does.
    For example: you have an input JPG image of 800x600, passed to Conv2D layer1, then Conv2D layer2.
    The question is: how do you get the result of layer1 (as an image), and how do you find out what each neuron is doing in that layer? (same for layer2)
    Keep up the good work.
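
    For the first half of the question (getting a layer's output out of the network), a forward hook is usually enough. A minimal PyTorch sketch with a made-up two-layer conv net, not tied to any particular model:

      import torch
      import torch.nn as nn

      net = nn.Sequential(
          nn.Conv2d(3, 16, kernel_size=3, padding=1),   # "layer1"
          nn.ReLU(),
          nn.Conv2d(16, 32, kernel_size=3, padding=1),  # "layer2"
      )

      feature_maps = {}

      def save_output(name):
          def hook(module, inputs, output):
              feature_maps[name] = output.detach()
          return hook

      net[0].register_forward_hook(save_output("layer1"))
      net[2].register_forward_hook(save_output("layer2"))

      image = torch.rand(1, 3, 600, 800)                # a dummy 800x600 RGB input
      net(image)
      print(feature_maps["layer1"].shape)               # 16 channels, each viewable as an image

    Figuring out what each unit responds to is the harder half of the question; that is what the Lucid / OpenAI Microscope links in the replies below are about.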

    • @YannicKilcher
      @YannicKilcher  3 роки тому +2

      That's a complicated question, especially in a convolutional network. Have a look at OpenAI Microscope.

    • @3dsakr
      @3dsakr 3 роки тому

      @@YannicKilcher Thanks, after some search I found this, github.com/tensorflow/lucid
      and this
      microscope.openai.com/models/inceptionv1/mixed5a_3x3_0/56
      and this
      distill.pub/2020/circuits/early-vision/#conv2d1

    • @3dsakr
      @3dsakr 3 роки тому

      @@YannicKilcher and this, interpretablevision.github.io/

  • @TShevProject
    @TShevProject 2 роки тому

    It would be interesting to see what numbers you can get with a 3090 in one week.

  • @CristianGarcia
    @CristianGarcia 3 роки тому

    I asked the question about the mode collapse, and a researcher friend pointed me to Mean Teacher (awesome name), where they also do exponential averaging; they might have some insights into why it works: arxiv.org/abs/1703.01780

  • @AndrewKamenMusic
    @AndrewKamenMusic 3 роки тому +1

    Party at my place.
    BYOL

  • @Agrover112
    @Agrover112 3 роки тому +1

    By the time I enter a Master's, DL will be so fucking different

  • @theodorosgalanos9663
    @theodorosgalanos9663 3 роки тому +2

    Alternate name idea: BYOTPUs

  • @sui-chan.wa.kyou.mo.chiisai
    @sui-chan.wa.kyou.mo.chiisai 3 роки тому +1

    self-supervised is hot

  • @dippatel1739
    @dippatel1739 3 роки тому +2

    As I said earlier.
    Label exists.
    Augmentation: I am about to end this man's career.

  • @larrybird3729
    @larrybird3729 3 роки тому +4

    DeepMind's next paper:
    We solved general intelligence
    === Here is the code ===
    def general_intelligence(information):
        return "intelligence"

  • @BPTScalpel
    @BPTScalpel 3 роки тому +5

    I respectfully disagree about the pseudo-code release: with it, it took me half a day to implement the paper, while it took me way more time to replicate SimCLR because the codebase the SimCLR authors published was disgusting, to say the least.
    One thing that really grinds my gears, though, is that they (deliberately?) ignored the semi-supervised literature.

    • @ulm287
      @ulm287 3 роки тому +2

      Are you able to reproduce the results? I would imagine 512 TPUs must cost a lot of money, or do you just let it run for days?
      My main concern, like in the video, is why it doesn't collapse to a constant representation... From a theoretical perspective, you are literally exploiting the difficulty of optimizing a NN, which is weird. If they used their loss as a validation metric, for example, that would mean they are not cross-validating over the init, as a constant zero init would give 0 loss...
      Did you find any hacks they used to avoid this? Like "anti-" weight decay or so, lol

    • @BPTScalpel
      @BPTScalpel 3 роки тому +5

      ​@@ulm287 I have access to a lot of compute for free thanks to the french gov. Right now, we replicated MoCo, MoCo v2 and SimCLR with our codebase and BYOL is currently running.
      I think the main reason it works is that, because of the EMA and the different views, the collapse is very unlikely: while it seems obvious to us that outputting constant features minimises the loss, this solution is very hard to find because of the EMA updates. To the network, it's just one of the solutions, and not even an easy one. The reason it learns anything at all is that there is already some very weak signal in a randomly initialised network, signal that they refine over time.
      Some patterns are so obvious that even a randomly initialised target network with a modern architecture will output features that contain information related to them (1.4% accuracy on ImageNet with random features). The online network will pick this signal up and refine it to make it robust to the different views (18.4% accuracy on ImageNet for BYOL with a random target network).
      Now you can replace the target network with this trained online network and start again. The idea is that this new target network will pick up signals from more patterns than the original random one. You iterate this process many times until convergence.
      That's basically what they are doing, except that instead of waiting for the online network to converge before updating the target network, they maintain an EMA of it.

    • @eelcohoogendoorn8044
      @eelcohoogendoorn8044 3 роки тому +1

      ​@@BPTScalpel Yeah it sort of clicks but I have to think about this a little longer, before I convince myself I really understand it. Curious to hear if your replication code works; if it does, id love to see a github link. Not sure which is better; really clean pseudocode or trash real code. But why not both?
      Having messed around with quite a lot of negative mining schemes, I know how fragile and prone to collapse that is for the type of problems I've worked on, and what I like about this method is not so much the 1 or 2% this or that, but that it might (emphasis on might) do away with all that tuning.
      So yeah, pretty cool stuff; interesting applications, interesting conceptually... Perhaps the only downside is that it may not follow a super efficient path to its end state, and may require quite some training time. But I can live with that for my applications.

    • @YannicKilcher
      @YannicKilcher  3 роки тому +1

      I'm not against releasing pseudocode. But why can't they just release the actual code?

    • @BPTScalpel
      @BPTScalpel 3 роки тому +1

      @@YannicKilcher My best guess is that the actual code contains a lot of boilerplate code to make it work on their amazing infra and they could not be bothered to clean it =P

  • @sphereron
    @sphereron 3 роки тому +1

    First

  • @RickeyBowers
    @RickeyBowers 3 роки тому

    Augmentations work to focus the network - it's too great an oversimplification to say the algorithm learns to ignore the augmentations, imho.

  • @amitsinghrawat8483
    @amitsinghrawat8483 2 роки тому

    What the hell is this? Why am I here?