Building makemore Part 2: MLP

  • Published 11 Jan 2025

COMMENTS • 395

  • @OlleMattsson
    @OlleMattsson 2 years ago +407

    0% hype. 100% substance. GOLD!

  • @farhadkarimi
    @farhadkarimi 2 years ago +370

    It’s insanely awesome that you are taking time out of your day to provide the public with educational videos like these.

  • @WouterHalswijk
    @WouterHalswijk 2 years ago +83

    I'm a senior aerospace engineer, so no CS or ML training at all, and I'm now totally fascinated with PyTorch. First that micrograd intro, which totally clicked the methods used for backprop into place. Now this intro with embedding and data preparation etc. I almost feel like transformers are within reach already. Inspiring!

    • @rajaahdhananjey4803
      @rajaahdhananjey4803 2 years ago

      Quality Engineer with a Production Engineering background. Same feeling !

    • @staggeredextreme8213
      @staggeredextreme8213 11 months ago +5

      How did you guys land here? I mean, as a CS graduate, I'd never land directly on an aerospace lecture series that suddenly starts to make sense 🤔

    • @shashankmadan
      @shashankmadan 4 months ago

      @@staggeredextreme8213 aerospace engineers are far more intelligent than CS engineers for sure.

    • @staggeredextreme8213
      @staggeredextreme8213 4 months ago

      @@shashankmadan I wish you were correct, but unfortunately you are wrong

    • @maercaestro
      @maercaestro 3 months ago

      @@staggeredextreme8213 engineers other than computer engineers rarely feel the need to share knowledge freely. closeted

  • @peterwangsc
    @peterwangsc 1 year ago +29

    This is amazing. Using just a little bit of what I was able to learn from part 3, namely the Kaiming init, and turning back on the learning rate decay, I was able to achieve 2.03 and 2.04 on my test and validation sets, with a 1.89 training loss, in just 300k iterations and 23k parameters. I set my block size to 4 and my embeddings to 12 and increased my hidden layer to 300, while decaying my learning rate exponent from -1 to -3 over a linear space across the 300k steps (see the sketch at the end of this thread). All that without even using batch normalization yet. After applying batch norm, I was able to get these down to 1.99 and 1.98 with training loss in the 1.7s after a little more tweaking. Really good content in this lecture; it has me feeling almost like a chef in the kitchen, cooking up a model with a few turns of the knobs... This sounds like a game, or a problem that could be solved by an AI trained to turn knobs.

    • @peterwangsc
      @peterwangsc 1 year ago +4

      Intuition: why a block size of 4 instead of 3? The English language has, I think, an average of somewhere between 3 and 5 characters per syllable, with most one-syllable names falling in that 3-5 character bucket and some two-syllable names falling in the 4-6 character bucket and beyond. I wanted a block size that would give a better indication of whether we're in a one-syllable or two-syllable context, so we could end up with more pronounceable names. It also just made sense to scale up the dimension of embeddings and neurons to give a little more nuance to the relationships between the different context blocks. English has so many different rules when it comes to vowels and silent letters, so I felt we needed enough room for 3-4 degrees of freedom for each character in the context block, and therefore needed more neurons in the net to account for those extra dimensions. Running the model for more steps just allows the convergence to happen. I don't know if it could get much better after more steps, but this took 6-7 minutes to run, so I think I squeezed all that I could out of these hyperparams.

    • @JeavanCooper
      @JeavanCooper 6 months ago +2

      I relate deeply! I always feel like deep learning is a lot like cooking or building blocks.

    • @INGLERAJKAMALRAJENDRA
      @INGLERAJKAMALRAJENDRA 5 months ago

      Thanks for sharing your experience :)
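
    For anyone who wants to reproduce the schedule @peterwangsc describes above: a minimal sketch, assuming a 300k-step training loop like the one in the lecture (the forward/backward pass and the parameters list are elided and only hinted at in comments).

    import torch

    steps = 300_000
    # exponent decays linearly from -1 to -3, so the learning rate decays
    # exponentially from 0.1 down to 0.001 over the full run
    lre = torch.linspace(-1, -3, steps)
    lrs = 10 ** lre

    print(lrs[0].item(), lrs[steps // 2].item(), lrs[-1].item())  # ~0.1, ~0.01, ~0.001

    # inside the training loop, step i would then use:
    # for p in parameters:
    #     p.data += -lrs[i] * p.grad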

  • @rayallinkh
    @rayallinkh 2 years ago +87

    Pls continue this series(and similar ones) to eternity! You are THE teacher which everyone interested/working in AI really needs!

  • @rmajdodin
    @rmajdodin 2 years ago +19

    53:20 To split the data into training, development and test sets, one can also use torch.tensor_split.
    n1 = int(0.8 * X.shape[0])
    n2 = int(0.9 * X.shape[0])
    Xtr, Xdev, Xts = X.tensor_split((n1, n2), dim=0)
    Ytr, Ydev, Yts = Y.tensor_split((n1, n2), dim=0)
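
    For comparison, a sketch of the same three-way split with plain slicing (assuming the same X, Y, n1 and n2 as in the snippet above):

    Xtr, Xdev, Xts = X[:n1], X[n1:n2], X[n2:]
    Ytr, Ydev, Yts = Y[:n1], Y[n1:n2], Y[n2:]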

  • @matjazmuc-7124
    @matjazmuc-7124 2 years ago +107

    I just want to say thank you Andrej, you are the best !
    I've spent the last 2 days going over the first 3 videos (and completing the exercises),
    I must say that this is by far the best learning experience I ever had.
    The quality of the lectures is just immeasurable, in fact you completely ruined
    how I feel about lectures at my University.

    • @ahmedivy
      @ahmedivy 1 year ago +3

      where are the exercises?

    • @sam.rodriguez
      @sam.rodriguez 1 year ago

      Check the comments from Andrej in each video @@ahmedivy

    • @allahm-ast3mnlywlatstbdlny164
      @allahm-ast3mnlywlatstbdlny164 1 year ago

      @@ahmedivy In the description

    • @shaypeleg7812
      @shaypeleg7812 1 year ago +1

      @@ahmedivy Also asked myself, then found them in the video description:
      Exercises:
      - E01: Tune the hyperparameters of the training to beat my best validation loss of 2.2
      - E02: I was not careful with the initialization of the network in this video. (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?
      - E03: Read the Bengio et al. 2003 paper (link above), implement and try any idea from the paper. Did it work?

  • @anveshicharuvaka2823
    @anveshicharuvaka2823 1 year ago +46

    Hi Andrej, Even though I am already familiar with all this I still watch your videos for the pedagogical value and for learning how to do things. But, I still learn many new things about pytorch as well as how to think through things. The way you simplify complex stuff is just amazing. Keep doing this. You said on a podcast that you spend 10 hours for 1 hour of content, but you save 1000s of hours of frustration and make implementing ideas a little bit easier.

    • @Sovereign589
      @Sovereign589 1 year ago +2

      great and true sentence: "You said on a podcast that you spend 10 hours for 1 hour of content, but you save 1000s of hours of frustration and make implementing ideas a little bit easier."

  • @rmajdodin
    @rmajdodin 2 years ago +21

    A two-hour workshop on NLP with transformers costs $149 at the Nvidia GTC conference.
    You tutor us with amazing quality for free. Thank you!🙂

  • @manuthegameking
    @manuthegameking 2 years ago +46

    This is amazing!!! I am an undergraduate student researching deep learning. This series is a gold mine. The attention to detail as well as the intuitive explanations are amazing!!

  • @ncheymbamalu4906
    @ncheymbamalu4906 1 year ago +6

    Much thanks, Andrej! I increased the embedding dimension from 2 to 5, initialized the model parameters from a uniform distribution [0, 1) instead of a standard normal distribution, increased the batch size to 128, and used the sigmoid activation for the hidden layer instead of the hyperbolic tangent, and was able to get the negative log-likelihood down to ~2.15 for both the train and validation sets.
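
    A rough sketch of the initialization change described above, assuming the lecture's block size of 3 and a hypothetical hidden size of 100 (the comment doesn't say which values were used):

    import torch

    g = torch.Generator().manual_seed(2147483647)
    n_embd, n_hidden, block_size = 5, 100, 3

    # uniform [0, 1) initialization instead of torch.randn
    C  = torch.rand((27, n_embd), generator=g, requires_grad=True)
    W1 = torch.rand((block_size * n_embd, n_hidden), generator=g, requires_grad=True)
    b1 = torch.rand(n_hidden, generator=g, requires_grad=True)
    W2 = torch.rand((n_hidden, 27), generator=g, requires_grad=True)
    b2 = torch.rand(27, generator=g, requires_grad=True)

    # sigmoid hidden layer instead of tanh, for a minibatch of inputs Xb:
    # h = torch.sigmoid(C[Xb].view(-1, block_size * n_embd) @ W1 + b1)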

  • @moalimus
    @moalimus 2 years ago +4

    Can't believe the value of these lectures and how helpful they are; you are literally changing the world. Thanks very much for your effort and knowledge.

  • @rezathr8968
    @rezathr8968 2 years ago +26

    Really enjoyed watching these lectures so far :) also +1 for the PyTorch internals video (@25:36)

  • @jackjayden1162
    @jackjayden1162 4 months ago +1

    This whole Neural Networks: Zero to Hero series is just phenomenal, absolutely some of the best content out there on the internet: top-notch knowledge, well taught by an industry-leading figure, and in such a patient and practical manner! We couldn't thank you enough for the remarkable work you have done for the community!

  • @caeras1864
    @caeras1864 2 years ago +5

    Thanks. Seeing things coded from scratch clears up any ambiguities one may have when reading the same material in a book.

  • @tylerxiety
    @tylerxiety 1 year ago +1

    Love all the tips and explanations on PyTorch, training efficiency, and the errors left in on purpose for teaching. I was writing both code and notes and rewatching, enjoyed it, and felt I'd had a fruitful day after finishing. It's like learning with a kind and insightful mentor sitting next to me. Thanks so much Andrej.

  • @vulkanosaure
    @vulkanosaure 2 years ago +9

    Thank you so much, this is gold. I'm watching all of this thoroughly, pausing the video a lot to wrap my head around those tensor manipulations (I didn't know anything about Python/NumPy/PyTorch). I'm also really inspired by how you quickly plot data to get important insights; I'll do that too from now on.

  • @mbpiku
    @mbpiku 1 year ago +2

    Never understood the basics of hyper parameter tuning so well. A sincere Thanks for the foundation and including that part in this video.

  • @SandeepAitha
    @SandeepAitha 9 months ago +3

    Watching your videos constantly reminds me of "There are no bad students but only bad teachers"

  • @bebebewin
    @bebebewin 1 year ago +2

    This is perhaps the best series on YouTube I have ever seen. Without a doubt, I can't recall the last time a 1-hour video was able to teach me so much!

  • @cangozpinar
    @cangozpinar 1 year ago +2

    Thank you very much for taking your time to go step by step whether it be torch API, your code or the math behind things. I really appreciate it.

  • @cristobalalcazar5622
    @cristobalalcazar5622 2 years ago +1

    This lecture compresses an insane amount of wisdom into an hour and 15 minutes! Thanks

  • @louiswang538
    @louiswang538 2 years ago +3

    At 29:20, we can also use torch.reshape() to get the right shape for W. However, there is a difference between torch.view and torch.reshape:
    TL;DR:
    If you just want to reshape tensors, use torch.reshape. If you're also concerned about memory usage and want to ensure that the two tensors share the same data, use torch.view.
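
    A small illustration of that difference, with shapes chosen to mirror the lecture's (32, 3, 2) embedding tensor:

    import torch

    emb = torch.randn(32, 3, 2)   # (batch, block_size, n_embd) as in the lecture

    a = emb.view(32, 6)           # no copy; requires compatible (contiguous) memory
    b = emb.reshape(32, 6)        # returns a view when possible, otherwise copies

    # on a non-contiguous tensor, .view fails while .reshape silently copies
    t = torch.randn(4, 6).t()     # transpose makes the tensor non-contiguous
    c = t.reshape(24)             # fine (makes a copy)
    # t.view(24)                  # would raise a RuntimeError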

  • @bassRDS
    @bassRDS 1 year ago +4

    Thank you Andrej! I find your videos not only educational, but also very entertaining. Learning new things is exciting!

  • @alexandermedina4950
    @alexandermedina4950 2 years ago +1

    This is priceless, you have such a low and high level understanding of the topic, that's just amazing.

  • @siddhantverma532
    @siddhantverma532 1 month ago +2

    1:06:56 Fascinating how the vowels end up clustered together!

  • @joshwolff4592
    @joshwolff4592 1 year ago +2

    The number of times in college we used the PyTorch "view" function with ZERO explanation. And your explanation is not only flawless, you even make the explanation itself look easy! Thank you so much

  • @not_elm0
    @not_elm0 2 years ago +1

    This educational vid will reach more students than a regular teaching job at a regular school. Thanks for sharing & giving back👍

  • @vil9386
    @vil9386 1 year ago +1

    Can't thank you enough. It's such a satisfying feeling to clearly understand the logic behind the ML models. Thank you!

  • @grayboywilliams
    @grayboywilliams 1 year ago +3

    So many insights, I’ll have to rewatch it again to retain them all. Thank you!

  • @timilehinfasipe485
    @timilehinfasipe485 2 years ago +8

    Thank you so much for this, Andrej !! I’m really learning and enjoying this

  • @julian1000
    @julian1000 2 years ago +3

    This is absolutely amazing stuff, thank you so much for putting this out for FREE!!!! I thought your name looked familiar and then I remembered you sparked my initial interest in NNs with "the unreasonable effectiveness of RNNs". It was SO fun and fascinating to just toss any old random text at it and see what it did! Can't believe how much progress has happened so quickly. Really really excited to get a better practical understanding of NNs and how to program them.
    Thank you again!

  • @yiannigeorgantas1551
    @yiannigeorgantas1551 2 years ago +5

    Whoa, you’re putting these out quicker than I can go through them. Thank you!

  • @ShouryanNikam
    @ShouryanNikam 1 year ago +2

    What a time to be alive, someone as smart as Andrej giving away for free probably the best lectures on the subject. Thanks so much!!!

  • @myao8930
    @myao8930 1 year ago +4

    @00:45:40 'Finding a good initial learning rate': each learning rate is used just once, and the update made with one learning rate is applied to parameters already adjusted using the prior, smaller learning rates. I feel that each of the 1,000 learning rate candidates should instead go through the same number of iterations, and then the losses at the end of those iterations should be compared (see the sketch at the end of this thread). Please tell me if I am wrong. Thanks!

    • @wolk1612
      @wolk1612 1 year ago +1

      Each time you make exponentially bigger steps, so you can neglect the previous path. It's like if you take one step toward your goal and then take another 10 steps: your overall path is not really affected by the first step. And generally you want to find the biggest step size (lr) you can take in some direction (the gradient) without overshooting your goal (the best model weights), so you get there faster.

    • @myao8930
      @myao8930 1 year ago

      @@wolk1612 Thanks! The instructor says the test set should not be used many times, since each time the model learns something from the test data. But during testing the parameters are not adjusted, so how can the model learn from the test data?
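
    What the per-candidate sweep suggested above might look like, as a self-contained sketch: it uses random stand-in data in place of the names dataset, re-initializes the lecture's MLP for every candidate rate, trains the same number of steps, and compares the final minibatch losses.

    import torch
    import torch.nn.functional as F

    g = torch.Generator().manual_seed(2147483647)

    # stand-in data with the lecture's shapes (block_size=3, 27 characters);
    # in practice Xtr, Ytr would come from the names dataset
    Xtr = torch.randint(0, 27, (1000, 3), generator=g)
    Ytr = torch.randint(0, 27, (1000,), generator=g)

    def train_with_lr(lr, steps=1000, batch_size=32):
        # fresh initialization for every candidate so the comparison is fair
        C  = torch.randn((27, 2), generator=g, requires_grad=True)
        W1 = torch.randn((6, 100), generator=g, requires_grad=True)
        b1 = torch.randn(100, generator=g, requires_grad=True)
        W2 = torch.randn((100, 27), generator=g, requires_grad=True)
        b2 = torch.randn(27, generator=g, requires_grad=True)
        parameters = [C, W1, b1, W2, b2]
        for _ in range(steps):
            ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
            emb = C[Xtr[ix]]                             # (batch, 3, 2)
            h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
            loss = F.cross_entropy(h @ W2 + b2, Ytr[ix])
            for p in parameters:
                p.grad = None
            loss.backward()
            for p in parameters:
                p.data += -lr * p.grad
        return loss.item()

    for lr in (0.001, 0.01, 0.1, 1.0):
        print(lr, train_with_lr(lr))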

  • @SK-ke8nu
    @SK-ke8nu 3 months ago

    Great work Andrej! It is a rare skill to be able to explain such complex topics to people without your background. Hats off!!

  • @myanxiouslife
    @myanxiouslife 2 years ago +2

    So cool to see the model learn through the embedding matrix that vowels share some similarity, 'q' and '.' are outlier characters, and so on!

  • @pruthvirajr.chavan
    @pruthvirajr.chavan 1 month ago

    About the exercise -
    By setting an exponentially decaying learning rate (as discussed earlier in the video) and a batch size of 48, I achieved a dev loss of 2.13 in just 100k iterations (the rest of the parameters are the same).
    Thanks so much @Andrej for this series! Totally loving it.

    • @Dhadheechi
      @Dhadheechi 14 days ago

      Can you elaborate on the learning rate scheme? How often did you change the learning rates? I tried 200k iterations, with 50k iterations using lr=0.1, the next 100k iterations using lr=0.05, and the next 50k iterations using lr=0.01 and got a dev loss of 2.157. My choice was pretty arbitrary, but I'm sure there must be a systematic way of smoothly decaying the lr after a certain number of iterations

  • @Joseph_Myers
    @Joseph_Myers 1 year ago +1

    I wanted to let you know I listened to the podcast with Lex Fridman, and I now understand how much of a rockstar you are in the artificial intelligence space. Like many others I appreciate you and all you are doing to push this incredible technology forward. Thank you.

  • @mehulsuthar7554
    @mehulsuthar7554 7 months ago

    I am deeply grateful to all the effort and time you have put in this. Thank you so much.
    I tried various kinds of weight initialization and got a loss of 2.03 on test and 2.08 on dev. I am still going, but I wanted to appreciate the work you have done.
    Once again, thank you. Wishing you good health and a good life.

  • @ghaisaniwnl
    @ghaisaniwnl 8 days ago

    That feeling of finally being able to "See it" is so satisfying. Thank you Karpathy.

  • @koenBotermans
    @koenBotermans 1 year ago

    I believe that at 49:22 the losses and the learning rates are misaligned.
    The first loss (computed from completely random weights) is recorded before the first learning rate is used, and therefore the first learning rate should be aligned with the second loss.
    You can simply fix this with the following snippet:
    lri = lri[:-1]
    lossi = lossi[1:]
    Also, thank you so much for these amazing lectures.

  • @pastrop2003
    @pastrop2003 1 year ago +3

    On top of everything else, this is absolutely the best documentation and explainer of PyTorch. It is infinitely better than the PyTorch documentation. In fact, it should be a must-see video for the PyTorch team to show them how to write good documentation. Meta should pay Andrej any fee he asks for the rights to use this video in the PyTorch docs... Thank you Andrej!

  • @oshaya
    @oshaya 2 years ago +34

    Amazing, astounding… Andrej, you’re continuing your revolution for people’s education in ML. You are the “Che” of AI.

    • @isaacfranklin2712
      @isaacfranklin2712 1 year ago +1

      Quite an ominous comparison, especially with Andrej working at OpenAI now.

    • @jeevan288
      @jeevan288 1 year ago

      what does "Che" mean?

    • @gregoriovilardo
      @gregoriovilardo 1 year ago

      ​@@jeevan288 a murderer who fought "for" Cuba: "Che Guevara"

  • @pulkitgarg189
    @pulkitgarg189 6 months ago +4

    Reminder: please create a video on the internals of torch tensors and how they work. Thanks!!

  • @mdmusaddique_cse7458
    @mdmusaddique_cse7458 1 year ago

    I was able to achieve a loss of 2.14 on the test set.
    Some hyperparameters:
    Neurons in hidden layer: 300
    Batch size: 64 for first 400k iterations then 32 for rest
    Total Iterations: 600,000
    Thank you for uploading such insightful explanations. I really appreciate that you explained how things work under the hood and shared insights into PyTorch's internals.

  • @zmm978
    @zmm978 1 year ago

    I watched and followed many such courses; yours are really special, easy to understand yet very in-depth, with many useful tricks.

  • @pedroaugustoribeirogomes7999
    @pedroaugustoribeirogomes7999 1 year ago +11

    Please create the "entire video about the internals of pytorch" that you mentioned in 25:40. And thank you so much for the content, Andrej !!

  • @nginfrared
    @nginfrared 1 year ago

    Your lectures make me feel like I am in an AI Retreat :). I come out so happy and enriched after each lecture.

  • @Yenrabbit
    @Yenrabbit 2 years ago +2

    Really great series of lessons! Lots of gems in here for any knowledge level.
    PS: Increasing the batch size and lowering the LR a little does result in a small improvement in the loss. Throwing out 2.135 as my test score to beat :)

  • @JayPinho
    @JayPinho 1 year ago +8

    Great video! One question, @AndrejKarpathy: around 50:30 or so you show how to graph an optimal learning rate and ultimately you determine that the 0.1 you started with was pretty good. However, unless I'm misunderstanding your code, aren't you iterating over the 1000 different learning rate candidates while *simultaneously* doing 1000 consecutive passes over the neural net? Meaning, the loss will naturally be lower during later iterations since you've already done a bunch of backward passes, so the biggest loss improvements would always be stacked towards the beginning of the 1000 iterations, right? Won't that bias your optimal learning rate calculation towards the first few candidates?

    • @bres6486
      @bres6486 1 year ago +3

      I found this a little confusing too since the expectation is to do 1000 steps of gradient descent with each learning rate separately. I think this trick of simultaneously changing learning rate while training (on mini-batches) is just a quick way to broadly illustrate how learning rate changes impact the loss. If the learning rate is too low initially then the loss will decrease very slowly, which is what happens. When the learning rate increases the loss decrease is more rapid. When the learning rate is too high the loss becomes unstable (can increase).

  • @shaypeleg7812
    @shaypeleg7812 1 year ago

    hi Andrej,
    Your lectures are the best ones I saw.
    It's amazing that you take complex ideas and explain them at a level that even beginners understand.
    Thank you for that.

  • @softwaredevelopmentwiththo9648
    @softwaredevelopmentwiththo9648 2 years ago

    It's one of the great pleasures of YouTube to be taught by someone with Andrej's experience.
    Your series is honestly one of the best on YouTube. It's not too short like the typical DL intro videos, and it's not boring, because you build the solution from the ground up with real code, common errors included. I love the format and the clear and concise structure.
    Thank you for the work that you put into these videos.

  • @PrarthanaShah-nk1xh
    @PrarthanaShah-nk1xh 3 months ago

    We need more of this!!!!! The way he explains it part by part, with the maths, is unreal.

  • @varunjain8981
    @varunjain8981 1 year ago

    Beautiful......The explanation!!!! This builds the intuition to venture out in unknown territories. Thanks from the bottom of my heart.

  • @shreyasdaniel627
    @shreyasdaniel627 2 years ago

    You are amazing! Thank you so much for all your work :) You explain everything very intuitively!!!
    I was able to achieve a train loss of 2.15 and test loss of 2.17 with block_size = 4, 100k iterations and embed dimension = 5.

  • @RickeyBowers
    @RickeyBowers 2 years ago

    Such a valuable resource to help people in other fields get up to speed on these concepts. Thank you.

  • @alexandertaylor4190
    @alexandertaylor4190 2 years ago +4

    I feel pretty lucky that my intro to neural networks is these videos. I've wanted to dive in for a while and I'm hooked already. Absolutely loving this lecture series, thank you, I can't wait for more!
    I'd love to join the discord but the invite link seems to be broken

  • @DrKnowitallKnows
    @DrKnowitallKnows 2 years ago +3

    Thank you for referencing and exploring the Bengio paper. It's great to get academic context on how models like this were developed, and very few people actually do this in contexts like this.

  • @arildboes
    @arildboes 1 year ago

    As a programmer trying to learn ML, this is gold!

  • @kordou
    @kordou 11 months ago

    Andrej, thank you for this great series of lectures. You are a great educator! 100% GOLD material to learn from.

  • @oxente_aquarios
    @oxente_aquarios 1 year ago

    The world needs to know about this YouTube series. I already shared it with my network on LinkedIn.

  • @anangelsdiaries
    @anangelsdiaries 8 months ago

    I am so happy people like you exist. Thank you very much for this video series.

  • @gleb.timofeev
    @gleb.timofeev 2 years ago

    At 45:45 I was waiting for Karpathy's constant to appear. Thank you for the lecture, Andrej.

  • @ilyas8523
    @ilyas8523 1 year ago

    underrated series. Very informative. Watching this series before jumping into the Chatbot video. I am currently building my own poem-gpt

  • @alexanderliapatis9969
    @alexanderliapatis9969 2 years ago

    I've been into neural nets for the last 2 years and I think I know some stuff about them (the basics at least), and I have taken a couple of courses on ML/DL. I was always wondering why I need both a val and a test set, i.e. why test the model on 2 different held-out sets of the same data. So hearing that the val set is for fine-tuning hyperparameters, and that you only use the test set a few times in order to avoid overfitting on it, is a first for me. I am amazed by the content of your videos and the way you teach things. Keep up the good work; you are making the community a better place.

    • @tarakivu8861
      @tarakivu8861 1 year ago

      I don't understand the concern about overusing the test set.
      I mean, we are only forward-passing it to evaluate performance, so we aren't learning anything?
      I can maybe see it when the developer sees the result and changes the network to better fit the test set. But that's good, isn't it?

    • @debdeepsanyal9030
      @debdeepsanyal9030 8 months ago

      @@tarakivu8861 For people who stumble upon this comment later and probably have the same doubt, here's an intuition that gives me a pretty thorough understanding.
      Say you are studying for an exam, and you use your textbooks for learning (note the use of "learning" here as well). Now, you want to know how well you're doing with the content you're learning from the textbooks, so you take a mock exam, which roughly replicates the feeling of the final exam you're going to take. You take the mock test, note the mistakes you are making on it, keep studying the textbooks, and take the mock test again periodically. After some time, you have an estimate of how well you are going to do in the final exam based on the results you are getting on the mock exam.
      Here, learning from the textbooks is the model training on the train set. The mock exam is the validation set. The final exam (which you take just once) is like the test set.
      Note that the dev set doesn't change the network in any way; it just gives us an estimate of how the model may perform on the test set. It's like if you are performing badly on the mock test, you know you can't make things better for the final exam.

  • @jeffreyzhang1413
    @jeffreyzhang1413 2 years ago

    One of the best lectures on the fundamentals of ML

  • @AbhishekVaid
    @AbhishekVaid 5 months ago +1

    37:14: who would tell you this when you are reading from a book? Exceptional teaching ability.

  • @JuanManuelBerros
    @JuanManuelBerros 1 year ago +1

    Awesome stuff, even though I've been studying neural networks and NLP for the last couple of months, this feels like the first time I *truly* understand how stuff works. Amazing.
    PS. At 00:01:34 I was just uber curious about his previous searches, so I googled them:
    proverbs 27:27
    >You will have plenty of goats’ milk to feed your family and to nourish your female servants.
    matthew 27:27-31
    >Then the governor’s soldiers took Jesus into the Praetorium and gathered the whole company of soldiers around him. They stripped him and put a scarlet robe on him, and then twisted together a crown of thorns and set it on his head. They put a staff in his right hand. Then they knelt in front of him and mocked him. “Hail, king of the Jews!” they said. They spit on him, and took the staff and struck him on the head again and again. After they had mocked him, they took off the robe and put his own clothes on him. Then they led him away to crucify him.

  • @ShadoWADC
    @ShadoWADC 2 years ago +3

    Thank you for doing this. This is truly a gift for all the learners and enthusiasts of this field!

  • @pulkitgarg189
    @pulkitgarg189 6 months ago +3

    Also, it would be really helpful if you could make your keyboard input visible on screen, so we can see which shortcuts you use. Thanks!

  • @arunmanoharan6329
    @arunmanoharan6329 1 year ago +1

    Thank you so much Andrej! This is the best NN series. Hope you will create more videos :)

  • @svassilev
    @svassilev 1 year ago

    Great stuff @AndrejKarpathy! I was actually typing along in my own notebook as I trained on a different dataset. Amazing!

  • @ernietam6202
    @ernietam6202 1 year ago

    Wow! I have longed to learn about hyper-parameters and training in a nutshell. Another Aha moment for me in Deep Learning. Thanks a trillion.

  • @DanteNoguez
    @DanteNoguez 2 years ago

    I love the simplicity of your explanations. Thanks a lot!

  • @Democracy_Manifest
    @Democracy_Manifest 1 year ago +1

    What an amazing teacher you are. Thank you

  • @LambrosPetrou
    @LambrosPetrou 1 year ago +3

    Awesome videos, thank you for that! I have a question though about 45:00, "finding a good initial learning rate", which is either a mistake in the video or I misunderstood something.
    In the video, you iterate over the possible learning rates, while we do a single training iteration. This means that the learning rates towards the end are going to have a smaller loss (usually), than the first learning rates.
    This seems to be biasing towards choosing learning rates from the end.
    Shouldn't we have another outer loop that iterates the learning rates, and then an inner loop that does our actual training PER learning rate, and then keep track of the final loss per learning rate?
    Thanks!

    • @pranjaldasghosh2588
      @pranjaldasghosh2588 1 year ago

      i believe this may not prove necessary as he re-initialises the weights for every learning rate, thus starting with a fresh slate

  • @8eck
    @8eck 2 years ago

    I like how Andrej is operating with tensors, that's super cool. I think that we need a separate video about that from Andrej. It is super important.

  • @ayogheswaran9270
    @ayogheswaran9270 1 year ago +1

    Thank you, Andrej!! Thanks a lot for all the efforts you have put in❤

  • @minhajulhoque2113
    @minhajulhoque2113 2 years ago

    Such an amazing educational video. Learned a lot. Thanks for taking the time and explaining many concepts so clearly.

  • @aangeli702
    @aangeli702 1 year ago

    Andrej is the type of person that could make a video titled "Building a 'hello world' program in Python" which a 10x engineer could watch and learn something from it. The quality of these videos is unreal, please do make a video on the internals of torch!

  • @yogendramiraje8962
    @yogendramiraje8962 1 year ago

    If someone sorts all the NN courses, videos, MOOCs by their density of knowledge in descending order, this video will be at the top.

  • @ilovehowilikethat
    @ilovehowilikethat 2 years ago +1

    I've never been this excited for a lecture video before

  • @GraTen
    @GraTen 5 months ago

    Great video!
    I improved the initialization and tweaked the hyperparameters. Finally, got 2.02 on the test set!🥳

  • @punto-y-coma7890
    @punto-y-coma7890 9 months ago

    That was a really awesome explanation by all means!! Thank you very much Andrej for educating us :)

  • @TheEbbemonster
    @TheEbbemonster 1 year ago +1

    I really enjoy these videos! A little note: running through the tutorial requires a fair bit of memory, so an early discussion of batching would be nice :) I ran out of memory when calculating the loss, so I had to reduce the sample size significantly.

  • @flwi
    @flwi 2 years ago

    Great lecture! I also appreciate the effort of putting it on Google Colab. It's way easier to access for people not familiar with Python and its adventurous ecosystem.
    I recently got a new Mac with an M1 processor, and it took me a while to get TensorFlow to run locally with GPU support, since I'm no Python expert and therefore not familiar with all their package managers :-)

  • @JohnDoe-ph6vb
    @JohnDoe-ph6vb 1 year ago +1

    At 21:24 I think it's supposed to be "first letter", not "first word". It's the first word in the paper but the first letter in the example.

    • @jkscout
      @jkscout 7 months ago

      Correct, but if you also inspect the vectors... this does not make sense... there are 32 of them, and remember these were the running characters for the first 5 words... so this has to represent more than the first character.

  • @leiyang2176
    @leiyang2176 1 year ago

    Per my understanding of SGD, doesn't training normally have a tunable number of epochs (for example 1000), where for each epoch we shuffle the training data and take mini-batches of a smaller size (32 for example), and for each mini-batch we compute the gradient and update the parameters to reduce the loss?
    That seems slightly different from the approach presented here. Looking at 45:34, it looks like for each iteration we randomly sample a mini-batch of size 32 from the whole training set, update the parameters, and then go on to the next iteration.
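
    To make the contrast concrete, a small sketch of the two sampling schemes (stand-in data; the actual training step is elided):

    import torch

    N, batch_size = 1000, 32
    Xtr = torch.randn(N, 6)   # stand-in training inputs

    # (a) as in the lecture: sample a random mini-batch (with replacement) every step
    for step in range(5):
        ix = torch.randint(0, N, (batch_size,))
        batch = Xtr[ix]
        # ... forward / backward / update ...

    # (b) classic epoch-based SGD: shuffle once per epoch, then walk through it in order
    for epoch in range(2):
        perm = torch.randperm(N)
        for start in range(0, N, batch_size):
            ix = perm[start:start + batch_size]
            batch = Xtr[ix]
            # ... forward / backward / update ...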

  • @datou666
    @datou666 15 days ago

    Question about 50:00: in the plot, the y axis is the loss and the x axis is the learning rate, but the x axis is also the step number. How do you know whether the change in y is due to the learning rate difference or to the step number increasing?

  • @avishakeadhikary
    @avishakeadhikary 1 year ago

    It is an absolute honor to learn from the very best. Thanks Andrej.

  • @mehdibanatehrani7100
    @mehdibanatehrani7100 13 days ago

    Really enjoyed this part 2 of makemore. I think it would be instructive if you also explained why, when you update parameters, which is equal to [C, W1, b1, W2, b2], you don't need to manually update C, ... and they get updated automatically.
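
    For what it's worth, this is just because the Python list holds references to the same tensor objects, so an in-place update made through the list is visible through the original names. A tiny illustration:

    import torch

    C = torch.randn(27, 2, requires_grad=True)
    W1 = torch.randn(6, 100, requires_grad=True)
    parameters = [C, W1]           # the list stores references to the same tensors

    before = C.data.clone()
    for p in parameters:
        p.data += -0.1             # in-place update through the list reference

    print(torch.equal(C.data, before))   # False: C itself changed, no manual copy needed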

  • @afai264
    @afai264 11 months ago +1

    I'm confused at 56:17 about why care must be taken with how many times you use the test dataset, since the model will learn from it. Is this because there is no equivalent of 'torch.no_grad()' for LLMs, i.e. will the LLM always update its weights when given data?

    • @ankushkothiyal5372
      @ankushkothiyal5372 6 months ago

      No, not really. It is about the general idea of the train/val/test split. If you evaluate your model on the test dataset multiple times, you are effectively retraining your model so that it performs better on the test dataset. So although testing won't update the weights, the fact that you are retraining the model again and again to perform better on the test dataset means you are fitting the model to it, which defeats the whole purpose of keeping a holdout/test dataset. That's why it is advisable to only score the test dataset a few times, once you have a close-to-final model.

  • @ah1548
    @ah1548 6 months ago

    Around 1:05:00: the reason why we're not "overfitting" with the larger number of params might be the context size. With a context of 3, no number of params will remove the inherent uncertainty.

  • @abir95571
    @abir95571 1 year ago +1

    This is what true public service looks like ... kudos Andrej :)

  • @xDMrGarrison
    @xDMrGarrison 2 years ago +3

    I finally beat 2.17, with 2.14.
    With context_size:4, embedding_dimension:5, hidden_dimension:300, total_iterations:200000, batch_size:800.
    And now for practice I am going to make a neural network to predict another kind of sequence. (I'm in the process of preparing/shaping the data, which is not easy) Fun stuff :P
    Really fiending for that next video though xD I'm excited to learn about RNNs and Convnets and especially transformers.

    • @hamza1543
      @hamza1543 2 years ago

      Your batch size should be a power of 2

  • @alelmcity
    @alelmcity 1 year ago +2

    Hi Andrej,
    Really amazing work, big thank you! However, I am a bit confused about specifying a good initial learning rate, because in the presented approach the recorded loss for a learning rate is affected by the previous optimization iterations. Shouldn't we, for each learning rate, optimize for a certain number of iterations, save the average loss, and then reinitialize the params and move on to the next learning rate? Finally, select the best-performing learning rate.

    • @styssine
      @styssine 1 year ago

      I'm confused too. Shouldn't there be a double loop, with outer loop scanning the learning rates, and the inner loop doing a relatively small number of iterations?

  • @ggir9979
    @ggir9979 1 year ago +2

    I got the loss on the training set under 2 (1.9953 to be exact). But this was a clear case of overfitting as the regularisation loss actually increased to 2.1762 🙂
    HyperParameters:
    wordDim ( C ) = 15
    layerDim (W1 output/W2 input)= 500
    iterations = 400000
    batchSize = 100

  • @thegreattheone
    @thegreattheone 1 year ago

    I have nothing to give you except my blessings. God bless Andrej!

  • @mconio
    @mconio 1 year ago

    Re: using the cross_entropy function around 34:05, it sounds like PyTorch takes the derivative of each step of exponentiation and then normalization, instead of simplifying them before taking the derivative. Is that a "soft" limitation of the implementation, in that a procedure could be defined to overcome it, or is there a bit of mathematical intuition needed to understand how to rewrite the function to produce a simpler derivative?
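
    For context, this is roughly the comparison from the lecture in miniature: the manual chain of ops that autograd differentiates step by step versus the fused F.cross_entropy, which computes the same value with a simpler, numerically stable backward pass.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(32, 27, requires_grad=True)
    targets = torch.randint(0, 27, (32,))

    # manual route from the lecture: autograd backprops through every intermediate op
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    manual_loss = -probs[torch.arange(32), targets].log().mean()

    # fused route: same value, single numerically-stable op with a simple backward
    fused_loss = F.cross_entropy(logits, targets)

    print(torch.allclose(manual_loss, fused_loss))   # True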