Neural Architecture Search without Training (Paper Explained)

  • Published 28 Jun 2024
  • #ai #research #machinelearning
    Neural Architecture Search is typically very slow and resource-intensive. A meta-controller has to train many hundreds or thousands of different models to find a suitable building plan. This paper proposes to use statistics of the Jacobian around data points to estimate the performance of proposed architectures at initialization. This method does not require training and speeds up NAS by orders of magnitude.
    OUTLINE:
    0:00 - Intro & Overview
    0:50 - Neural Architecture Search
    4:15 - Controller-based NAS
    7:35 - Architecture Search Without Training
    9:30 - Linearization Around Datapoints
    14:10 - Linearization Statistics
    19:00 - NAS-201 Benchmark
    20:15 - Experiments
    34:15 - Conclusion & Comments
    Paper: arxiv.org/abs/2006.04647
    Code: github.com/BayesWatch/nas-wit...
    Abstract:
    The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be extremely slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be remedied if we could infer a network's trained accuracy from its initial state. In this work, we examine how the linear maps induced by data points correlate for untrained network architectures in the NAS-Bench-201 search space, and motivate how this can be used to give a measure of modelling flexibility which is highly indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU. Code to reproduce our experiments is available at this https URL.
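The core idea is easy to prototype: for a ReLU network, each input induces a local linear map (the input Jacobian), and the paper scores architectures by how correlated these maps are across a mini-batch at initialization. Below is a minimal NumPy sketch of that idea with a toy MLP; the scoring formula here is a simplification for illustration, not the paper's exact eq. (2):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_mlp_jacobian(x, weights):
    """Input Jacobian of a bias-free ReLU MLP at point x.

    A ReLU net is locally linear, so the Jacobian is the product of the
    weight matrices with the rows of inactive units masked out.
    """
    J = np.eye(x.shape[0])
    h = x
    for i, W in enumerate(weights):
        pre = W @ h
        if i < len(weights) - 1:            # ReLU on hidden layers only
            mask = (pre > 0).astype(float)
            J = (W * mask[:, None]) @ J     # zero out rows of dead units
            h = pre * mask
        else:
            J = W @ J
            h = pre
    return J

def naswot_style_score(weights, X):
    """Score an untrained net by how *decorrelated* its local linear maps
    are across a mini-batch (closer to 0 = more flexible).
    A simplified stand-in for the paper's eq. (2), not a copy of it."""
    jacs = np.stack([relu_mlp_jacobian(x, weights).ravel() for x in X])
    norms = np.linalg.norm(jacs, axis=1, keepdims=True)
    unit = jacs / np.maximum(norms, 1e-12)   # safe cosine similarities
    C = unit @ unit.T                        # N x N similarity matrix
    off = np.abs(C - np.eye(len(X)))
    return -off.sum()                        # highly correlated maps -> low score

# Two random 3-layer nets scored on the same mini-batch of inputs
X = rng.normal(size=(8, 16))
def random_net(widths):
    return [rng.normal(scale=1 / np.sqrt(m), size=(n, m))
            for m, n in zip(widths[:-1], widths[1:])]

net_a = random_net([16, 32, 32, 10])
net_b = random_net([16, 8, 8, 10])
print(naswot_style_score(net_a, X), naswot_style_score(net_b, X))
```

Intuitively, a network whose local linear maps are nearly identical across data points behaves like one global linear model, while decorrelated maps indicate the modelling flexibility that the paper finds predictive of trained accuracy.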
    Authors: Joseph Mellor, Jack Turner, Amos Storkey, Elliot J. Crowley
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: / discord
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
    Parler: parler.com/profile/YannicKilcher
    LinkedIn: / yannic-kilcher-488534136
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar (preferred to Patreon): www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science & Technology

COMMENTS • 71

  • @YeshwanthReddy · 3 years ago · +74

    At 21:55 I think you should draw horizontal lines (not vertical) to discard architectures since score is on the y-axis. No?

    • @herp_derpingson · 3 years ago · +2

      We don't really know what the cutoff score should be. Maybe we can keep the top N highest-scoring models.

    • @tonyrobinson1349 · 3 years ago · +4

      Ah, I was about to say the same! It's confusing when the cause has been plotted on the y-axis and the effect on the x-axis.

    • @pladselsker8340 · 3 years ago

      Isn't the score measurement on the left evaluated from actual training? I think he meant to discard architectures before even training them, which I think means that you have to select a vertical threshold on the validation accuracy, like he did.

    • @Deez-Master · 3 years ago

      Funny though. Yes, you can discard any crap with bad validation accuracy... if only there were some way to predict that without having to train and validate it... :0

    • @hocusbogus7930 · 3 years ago · +10

      @@pladselsker8340 Nope, 'score' is what is determined by eqn (2) for an untrained network, while validation accuracy is for a trained network. Before training, one could calculate 'score' for each network, and they would look like dots plotted on a vertical line. Then, discard all networks below a certain score -- by drawing a horizontal line that cuts this vertical line -- and only train the networks that lie above it.
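The selection procedure being debated (score every candidate untrained, then cut with a horizontal line at some score, or equivalently keep the top N) takes only a few lines. In this sketch, `score` is random stand-in data and `arch_<i>` names are hypothetical, not the paper's actual measure or search space:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: in practice the score would come from the
# paper's training-free measure, and only the survivors get trained.
candidates = [f"arch_{i}" for i in range(100)]
score = {a: rng.normal() for a in candidates}

# Horizontal cut: keep everything whose *score* clears a threshold...
threshold = np.percentile(list(score.values()), 90)
keep = [a for a in candidates if score[a] >= threshold]

# ...or equivalently keep the top-N scorers, as suggested in the thread.
top_n = sorted(candidates, key=score.get, reverse=True)[:10]

print(len(keep), top_n[:3])
```

Either way, the cut is made on the score axis before any training happens; validation accuracy is only measured for the architectures that survive the cut.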

  • @RickeyBowers · 3 years ago · +4

    These long videos are really growing on me. Not just introducing me to papers that I am not familiar with, but also the additional insight of your perspectives. Thank you.

  • @Miestro85h · 3 years ago · +16

    If it's true, then pre-filtering (or rejection sampling) based on this average score is a cheap speedup for any neural architecture search algorithm too.

  • @marohs5606 · 3 years ago · +6

    Thank you for the efforts.. highly appreciated!!

  • @Notshife · 3 years ago · +2

    Yet another area I am most interested to hear about. Thank you

  • @JeroenMW2 · 3 years ago · +1

    Thank you so much, you save tens of thousands of people hours of work. The impact of your work is immense even if you don't get hundreds of thousands of views. Please never stop, you're amazing!

  • @eddtsoi · 3 years ago · +3

    amazing, always like your inspiring interpretation

  • @gauravkoradiya1236 · 3 years ago · +2

    Thanks for sharing. Commendable work.

  • @dipsyteletubbie802 · 1 year ago

    Thank you for the very clear explanation!

  • @user-pr8dr7vw7j · 4 months ago

    Thank you for sharing this. It's very interesting. I learned a lot.

  • @herp_derpingson · 3 years ago · +2

    It is basically an "anti-lottery ticket hypothesis".
    .
    33:00 For the RL based search models, I think we would still need some negative samples. Otherwise the RL model would keep suggesting bad models for the sake of exploration.
    .
    Nice paper, easy to implement. Will definitely use this trick.

  • @albertwang5974 · 3 years ago · +1

    The role of a nonlinear function in a neural network can be seen as the if...else statement in a traditional programming language.
    LSTM, GRU, and attention can be treated the same way; they provide switching-control capability.

  • @lucha6262 · 3 years ago · +3

    This is very exciting and, as others mention in the comments, a super speedy tool to discard faulty architectures. Thanks for the video!

  • @Kram1032 · 3 years ago · +6

    I wonder if this scoring could be improved simply by exchanging regular correlation with distance correlation, since that will also capture non-linearities. It might make a difference in particular in those networks where currently the score no longer tells you much.

    • @herp_derpingson · 3 years ago · +1

      I just read about distance correlation. It makes sense.
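For reference, distance correlation is short to implement from its standard definition (double-centred pairwise distance matrices, Székely et al. 2007); this is a generic sketch, not something from the paper:

```python
import numpy as np

def distance_correlation(x, y):
    """Distance correlation of two 1-D samples.

    0 in the population iff the variables are independent, 1 for exact
    linear relations, and, unlike Pearson correlation, it is also
    sensitive to non-linear dependence.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])      # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre each distance matrix
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                   # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(distance_correlation(x, x**2))         # quadratic link: clearly > 0
```

The quadratic example is the point: Pearson correlation is blind to the x-vs-x² relationship, while distance correlation picks it up, which is why it might help where the linear-map correlations stop being informative.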

  • @TimeofMinecraftMods · 3 years ago · +1

    I think that a lot of the lack of performance compared to other techniques can be explained by the way the NAS-Bench-201 benchmark is constructed:
    We only have 15,625 different architectures: enough for the sample-efficient "train until you're done" NAS systems, but searching without training may just need significantly different architectures. This would also explain why the metric "spreads out" on the more complex tasks: there just aren't enough ways the NAS-201 networks differ to make a meaningful difference that can be observed just by looking at the initial state. Maybe one could combine this approach with something like NEAT to generate a population of architectures and score them pretty much instantly using this. That would allow the system to get away from the "resnet-likes" that make up NAS-Bench.

  • @CosmiaNebula · 3 years ago

    my 6-word slogan for this paper:
    Neural architecture physiognomy! And it works!

  • @deepblender · 3 years ago · +2

    Great video as always!
    Did you activate ads? I don't mind them at all! I'm only asking because you recently mentioned you didn't plan to enable them.

    • @YannicKilcher · 3 years ago

      Yes, I announced that in the latest channel update :)

  • @bluel1ng · 3 years ago

    I somehow like these one-step methods. What I do not directly understand is how this method can predict the generalization capabilities of a network-architecture (e.g. validation set accuracy) from the linear map histogram of one mini-batch.

  • @lugae4619 · 4 months ago

    Great video! Thanks for your personal interpretation too; it helps think things through. I would argue, though, that the interpretation of the pytorchcv results at 25:40 is wrong (admittedly, I don't know if it's your interpretation or the authors', since they seem to have removed this part from the most recent version). It looks like they're showing that their metric gives high scores to methods we already know do well. That is, architectures that humans have found to perform well achieve a high NASWOT score (or whatever they call it).

  • @utku_yucel · 3 years ago

    Thanks!

  • @sweatobertrinderknecht3480 · 3 years ago

    your thumbnails are getting better

  • @dermitdembrot3091 · 3 years ago · +2

    Is J of shape NxD or DxN (where D is the dimensionality of x)? The shape of JJ^T would be NxN and DxD respectively in these two cases. Clearly the first makes more sense in context but the J_i,n in the second line below (1) seems to indicate otherwise.

  • @norik1616 · 3 years ago · +2

    🤔 Add this before HyperBand in keras-tuner + add Bayesian opt after HyperBand.
    1. search without train to get rid of 50-80 % of the really bad architectures
    2. HyperBand to quickly abandon poor architectures
    3. Bayes opt (with all of the HyperBand runs as input) for the fine-search

  • @yashrunwal5111 · 3 years ago

    @Yannick Kilcher. Great job. Thank you! Can you explain EfficientNet/EfficientDet paper?

  • @tonyrobinson1349 · 3 years ago · +4

    The idea is interesting; thanks for making it so accessible. The big question is: is it useful? I read fig. 3 differently from you and so I come to a different conclusion. You think that this method weeds out most (~90%) of the bad architectures; I think it weeds out very few. If we had an uninformative method, the correlation line would be vertical in fig. 3, i.e. the score would not be correlated with accuracy. By eye I integrate vertically to get the distribution of scores that would happen with a useless method. I then do the same again for (say) the top 10% of scores. Scatter plots are terrible at showing density, but it looks to me that all the probability mass is at the top of the plot, so the distributions would be very similar. The authors could have done these basic statistics.

    • @ameetrahane1445 · 3 years ago · +1

      I think the idea has merit, at least intuitively. I'd like your input on why it wouldn't work.

    • @tonyrobinson1349 · 3 years ago · +2

      @@ameetrahane1445 I agree, I do think it's an interesting idea. Please note that my long comment was on fig 3 only, it's easy to have some really bad architectures out of all the 15625 combinations and I believe these are given too much weight by the use of the scatter diagram (which doesn't show density well). Thus I was commenting on specifics, not making a general statement that "it wouldn't work".

    • @YannicKilcher · 3 years ago

      True, you have a good point. Maybe it would be worth investigating this quantitatively.

    • @julespoon2884 · 3 years ago

      I'm kinda disappointed they did not show the scores for the well-performing trained networks, after showing that initialisation affects the score significantly. If trained networks tend to have a greater correlation between their score and accuracy, perhaps this method can be made useful by somehow mitigating the randomness of initialisation. Perhaps train the network a few rounds on random data to maximise the score and use that?
      A tangent: if random data does not affect the score, and the score is correlated with accuracy, then what if a network is trained on random data to maximise the score: would that necessarily increase the accuracy on the actual data? This is a reason I'm skeptical of this method, as I don't think the score is a good indication of the accuracy, since it does not seem to account for the training data much.

    • @annasuperjump · 3 years ago

      Hi @Tony Robinson, I do not quite understand this sentence in your comment: "If we had an uninformative method then the correlation line would be vertical in fig 3, that is the score is not correlated with accuracy." Could you please clarify? Thanks.
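The sanity check proposed above (compare the accuracy distribution of all architectures against that of the top scorers) is easy to run on synthetic data. Everything below is fabricated to illustrate the statistics, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic (score, accuracy) pairs with a weak positive relationship,
# mimicking a scatter plot like the paper's fig. 3. The coefficients
# are arbitrary, chosen only for illustration.
n = 15625
score = rng.normal(size=n)
accuracy = 0.7 + 0.05 * score + 0.1 * rng.normal(size=n)

# Accuracy distribution overall vs. among the top 10% by score
cut = np.quantile(score, 0.9)
top = accuracy[score >= cut]

print(f"overall mean accuracy:          {accuracy.mean():.3f}")
print(f"top-10%-by-score mean accuracy: {top.mean():.3f}")
# If the two summaries are close, the score adds little; if the top-10%
# mean is clearly higher, the score is doing useful filtering.
```

Run on the paper's real (score, accuracy) pairs instead of this synthetic data, the same comparison would settle whether the scatter in fig. 3 reflects genuine filtering power or just a few outliers.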

  • @hannesstark5024 · 3 years ago · +1

    Very interesting video once again! I have to say I thought I was going to like the "historical papers", but I have to admit that I found the recent word2vec and GAN videos boring and did not finish them.
    Just wanted to leave that feedback.

  • @drozen214 · 3 years ago · +1

    I don’t know if you’re aware, but the paper seems to have been edited/updated since you made this video with different graphs, showing correlation matrices instead of histograms, and a different equation for computing the score. Is this common for papers to be changed after publishing? Do you know if the new equation is mathematically equivalent and preferred because it’s easier to calculate? Or is it just a different score that measures approximately the same thing?

  • @111dimka111 · 3 years ago · +1

    Great paper and great review! I wonder if we can replace the gradient w.r.t. input with gradient w.r.t. weights. The updated score can be related to NTK and to the metric over the function space. Would such change produce a better expressiveness score? Insights anyone?

    • @YannicKilcher · 3 years ago

      interesting idea. I know too little about ntk to have an informed opinion :)

  • @astroganov · 3 years ago

    You have a misunderstanding at 21:42 about the axes. The covariance score is on the vertical axis, and validation accuracy (after training) is on the horizontal axis. So if you wish to use the described method to filter out "bad" architectures quickly, you should cut by score (draw a horizontal line at some threshold level) instead of drawing a vertical line as at 22:21. That actually means this method is even further from being precise by itself...

  • @blanamaxima · 3 years ago · +1

    I guess that if the loss landscape is similar to a random spin-glass Hamiltonian, as Yann was saying, then it makes some sense to have some spreading in the orientation of the linearizations... To some extent it is sad that we are basically saying that a convex function has to be discarded from the beginning :) I am curious to see some changes in the loss function as well.

    • @arnoldchen1108 · 3 years ago · +1

      Could you please point me to the reference for Yann's point? Thank you!!

    • @blanamaxima · 3 years ago · +1

      @@arnoldchen1108 it is a pretty old paper by current standards arxiv.org/abs/1412.0233 .

    • @arnoldchen1108 · 3 years ago · +2

      @@blanamaxima Definitely an interesting read though. Thanks a lot!

  • @franciskusxaveriuserick7608 · 3 years ago

    Unrelated dumb comment, but that annotated NAS-Bench-201 diagram at 19:52 roughly looks like a map of Switzerland. On a serious note, I am doing research on network compression and this is a really cool idea; I would like to see more studies relating these parameters, scores, and inference speed, so that we can also optimize NAS for the "smallest" architecture that achieves the best inference speed on embedded systems while still giving reasonably good accuracy.

  • @siyn007 · 3 years ago · +2

    I wonder if there's a way to use that score as the reward for the RL algorithms instead of the accuracy. I think that would lower the computation time without necessarily dropping performance too much, but I might be horribly wrong haha

    • @sheggle · 3 years ago · +1

      Thought so too, should at least be interesting to find out!

  • @bryce-bryce · 3 years ago · +5

    What if you first train for let's say 5 epochs and then compute the score?

    • @dermitdembrot3091 · 3 years ago · +1

      I think this goes in the direction of something people do, where computing power is saved by only doing as few updates as necessary to see whether the architecture is good/ bad. If it's good after five steps you might decide to continue for another bit since good models are harder to tell apart than bad models.
      The paper in this video seems to have found a better predictor of performance at convergence than is the score after five steps.

    • @bryce-bryce · 3 years ago · +2

      @@dermitdembrot3091 I know. I was wondering if the scores get more accurate/reliable when training the network for a few epochs and then looking at the correlations. Because if I understand correctly, the network is just initialized and the correlations are based on the random weights. I just find it hard to understand why correlations of random weights are a good indicator of the final performance. But I did not read the paper, just watched the video, so maybe I did not fully understand the idea.

    • @dermitdembrot3091 · 3 years ago · +2

      Oh sorry I completely misread your comment. Kind of assumed you meant accuracy score. I agree that it's worth investigating whether the "gradient correlation" score improves the evaluation. Quite possibly the authors tried and didn't see an improvement.

  • @norik1616 · 3 years ago

    I think the correlation could be better after training for a few batches on the hard tasks (ImageNet). The lottery ticket also had a similar problem with harder tasks and needed a bit of training.
    Does it make sense?

  • @CosmiaNebula · 3 years ago

    Alternative 6-word slogan:
    First, sanity check neural architecture expressivity!

  • @YeshwanthReddy · 3 years ago · +4

    I can't believe they've trained ImageNet 50,000 times 😝

  • @robbiero368 · 3 years ago

    Am I right in thinking that people always initialize networks with random weights (or weights from a previous training)?
    Has anyone done any work looking at what happens if you use some sort of "less" random values as initial ones?
    Is all randomness created equal, so to speak, and is it important to be completely random in your starting point?
    What happens if you initialize with a regular pattern? Does it fail to train at all?

    • @YannicKilcher · 3 years ago

      There is work in this area, but I think without significant improvement over random init.

  • @rishabhmanishsahlot129 · 3 years ago

    Has anyone used this? Does it actually work? Please let me know!

  • @jasdeepsinghgrover2470 · 3 years ago · +1

    I have observed this in some places. At least on simple datasets, too much non-linear behaviour also allows overfitting, which might cause unexpected behaviour at high scores.

  • @vsiegel · 3 years ago · +1

    So we use an AI to build another AI... why does that feel so spooky...

  • @user-sv5vb1mj1q · 3 years ago · +1

    I think that this article implies some kind of contradiction if we look at it in the context of Manifold Mixup. In that article they claimed (if I am right) that they reduced the number of meaningful eigenvalues, making the manifold itself more linear; here I am hearing exactly the opposite.

  • @DasGrosseFressen · 3 years ago · +5

    Why do guys in ML love to rename already established concepts?... Why "linear map of data" instead of simply the first-order Taylor expansion?