Deep Q Learning is Simple with PyTorch | Full Tutorial 2020

  • Published 21 Nov 2024

COMMENTS • 136

  • @MachineLearningwithPhil
    @MachineLearningwithPhil  4 years ago +17

    This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.

    • @PhilippDominicSiedler
      @PhilippDominicSiedler 3 years ago +2

      Thank you very much for your content! I can't seem to find the "from paper to code" course in the description or directly on your Udemy profile. Is it not out yet?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +1

      Deep Q Learning:
      www.udemy.com/course/deep-q-learning-from-paper-to-code/?couponCode=DQN-AUG-2021
      Actor Critic Methods:
      www.udemy.com/course/actor-critic-methods-from-paper-to-code-with-pytorch/?couponCode=AC-AUG-2021
      Natural Language Processing from First Principles:
      www.udemy.com/course/natural-language-processing-from-first-principles/?couponCode=NLP1-AUG-2021

  • @eliasebner3595
    @eliasebner3595 3 years ago +4

    This guy is so good he doesn't even need autocomplete.

  • @vijeta268
    @vijeta268 4 years ago +8

    Thanks A LOT for making this tutorial!
    Coming from a non-CS background, coding is always a bottleneck for me, but this video helped me get past that phase with ease.

  • @GaetanoFavoino
    @GaetanoFavoino 4 years ago +7

    Thank you for your clean tutorials, hope you'll make a new one on non-stationary environments soon.

  • @giannischochlakis4121
    @giannischochlakis4121 26 days ago

    Hey Phil, first of all thanks for the tutorial.
    I have two questions regarding some differences between this code and your code on GitHub.
    1) Why do you convert the action_batch into tensors after sampling in your DQN implementation on GitHub, but not in this implementation? Is it because the Pong game requires a multi-dimensional input, which you eventually flatten in your network code, while here the input is a vector so we don't need to do that?
    2) On GitHub you use the max operation in this step: q_next = self.q_next.forward(states_).max(dim=1)[0], but here you use T.max(q_next, dim=1)[0]. Why is that?

    • @giannischochlakis4121
      @giannischochlakis4121 26 days ago

      Also, I have trouble getting the network to converge consistently; it reaches an average score of about 100 to 150 around the 200-episode mark and then drops off.

  • @theuniversevoyager
    @theuniversevoyager 1 month ago

    Hi! Thanks for this tutorial Phil, it's nice to see a compact DQN implementation. I have a question: when you convert arrays to tensors, is there any reason you wrap the arrays in lists? For example, "state = T.tensor([observation])..." instead of "state = T.tensor(observation)..."? PyTorch prints a warning saying that doing it "the list way" is extremely slow.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 month ago +1

      It's not always necessary, but when you're working with CNNs they need a batch dimension.
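
      A minimal sketch of two equivalent ways to add that batch dimension; the 8-element observation is just a stand-in for a LunarLander state:

      import numpy as np
      import torch as T

      observation = np.zeros(8, dtype=np.float32)       # stand-in for one observation
      state = T.tensor(np.array([observation]))         # batch dim added up front, no slow-list warning
      state = T.tensor(observation).unsqueeze(0)        # equivalent: add the batch dim afterwards
      print(state.shape)                                # torch.Size([1, 8])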

  • @alirezamogharabi8733
    @alirezamogharabi8733 4 years ago +4

    Thanks a lot Dr. Phil, please make some videos about multi agent reinforcement learning ❤️❤️🌹🌹

  • @nathanas64
    @nathanas64 3 years ago +1

    Exceptionally clear presentation!! Pure genius! Will definitely take the course

  • @Darkness-l1b
    @Darkness-l1b 2 months ago

    Remember watching this 4 years ago and not understanding anything. We've come a long way.

  • @TIM6266
    @TIM6266 2 years ago

    This is my master's degree savior.

  • @ahmedgamberli2250
    @ahmedgamberli2250 2 years ago +2

    Thanks for making this tutorial. I just have a tiny question: why do we set q_next[terminal_batch] = 0.0? The question may be a bit stupid. Sorry for being a newbie :)

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  2 years ago +1

      The terminal state has no future value, because no future rewards follow it.
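
      For anyone curious, a minimal self-contained sketch of why that line matters in the target computation (gamma and the toy numbers here are illustrative, not from the video):

      import torch as T

      def q_targets(q_next, rewards, terminal_batch, gamma=0.99):
          # q_next: (batch, n_actions) values of the next states from the Q-network
          q_next = q_next.clone()
          q_next[terminal_batch] = 0.0                       # terminal states bootstrap nothing
          return rewards + gamma * T.max(q_next, dim=1)[0]   # r + gamma * max_a' Q(s', a')

      q_next = T.tensor([[1.0, 2.0], [3.0, 4.0]])
      rewards = T.tensor([0.5, 1.0])
      terminal = T.tensor([False, True])                     # second transition ended the episode
      print(q_targets(q_next, rewards, terminal))            # tensor([2.4800, 1.0000])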

  • @格瓦拉窃-s9h
    @格瓦拉窃-s9h 1 year ago

    The video is very good, I hope there will be a version with Chinese subtitles

  • @burdescualexandru
    @burdescualexandru 4 years ago +2

    Hey Phil! I'm looking forward to seeing a video where you show us how to define our own environment! All the tutorials around use gym, but I'd like to try reinforcement learning on some personal projects!

  • @sergiosanchez6377
    @sergiosanchez6377 9 months ago

    Hi Phil! Does this code run on the current versions of torch and gym? Thank you for your work on this video!

  • @alisyedj
    @alisyedj 2 years ago

    Thank you, Prof Phil. Very helpful! Can you expand on what target networks do? I was reading the paper "Human-level control through deep reinforcement learning", which talks about a target network. It's not clear what it is or what the advantages of creating it are. Thank you in advance.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  2 years ago

      Target networks help to stabilize training. Using the same network to generate data and evaluate data each time step results in chasing a moving target. The target network changes more slowly, so it's a more stable target.
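
      If it helps, a minimal sketch of the usual hard-update scheme; the names q_eval/q_next and the replace interval are illustrative, roughly following the course code rather than this video (which deliberately skips the target network):

      import torch.nn as nn

      q_eval = nn.Linear(8, 4)        # stand-in for the online network
      q_next = nn.Linear(8, 4)        # stand-in for the target network
      replace_every = 1000

      def maybe_replace_target(learn_step):
          # hard update: copy the online weights into the target network every N learn steps
          if learn_step % replace_every == 0:
              q_next.load_state_dict(q_eval.state_dict())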

  • @sounakmojumder5689
    @sounakmojumder5689 6 months ago

    Hi, thank you. I just have a request: could you do this in Colab? The model loading and saving part in Colab is a bit messy, and you could guide us through it.

  • @yigitsevim7741
    @yigitsevim7741 1 year ago

    Great tutorial, thanks. As a small criticism, please move slightly away from the microphone when coughing.

  • @noamabadi6482
    @noamabadi6482 2 years ago

    Hi! Why do you use Q_eval.forward(state) instead of Q_eval(state)? I read that it's not good because the hooks aren't deployed, although I have no clue what hooks are.
    Thanks for the tutorial!

  • @walterjonathan8947
    @walterjonathan8947 2 months ago

    Hello Phil, I could not find the repo; please point me to where I can find it.

  • @pratheeps3972
    @pratheeps3972 4 years ago +2

    Amazing, and perfect timing too. I was looking at your older code for my project and you just gave us the better version. My only issue is that my environment returns a matrix (image). How do I modify your code to get it to work?

    • @chunchunmaru3644
      @chunchunmaru3644 3 years ago

      Make the output the shape of an image

    • @juleswombat5309
      @juleswombat5309 3 years ago +1

      Sounds as though you need PyTorch convolutional layers at the front end of the Q-network if you have image- or video-based inputs. I suspect you may need to stack a few observations together if you expect to detect motion from video.
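
      For reference, a minimal sketch of such a convolutional front end; the layer sizes follow the classic DQN paper, and input_dims like (4, 84, 84) for four stacked grayscale frames is an assumption, not something from this video:

      import torch as T
      import torch.nn as nn

      class ConvDQN(nn.Module):
          def __init__(self, input_dims, n_actions):
              super().__init__()
              c, h, w = input_dims                          # e.g. (4, 84, 84): 4 stacked frames
              self.conv = nn.Sequential(
                  nn.Conv2d(c, 32, kernel_size=8, stride=4), nn.ReLU(),
                  nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                  nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
              )
              with T.no_grad():                             # infer the flattened conv output size once
                  conv_out = self.conv(T.zeros(1, c, h, w)).numel()
              self.fc = nn.Sequential(nn.Linear(conv_out, 512), nn.ReLU(),
                                      nn.Linear(512, n_actions))

          def forward(self, state):
              x = self.conv(state)                          # expects (batch, channels, h, w)
              return self.fc(x.view(x.size(0), -1))         # flatten before the dense layers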

  • @padraopv
    @padraopv 2 years ago

    Thank you for this amazing content, Phil!

  • @RabeeQasem
    @RabeeQasem 3 years ago +1

    Is there a possibility of a tutorial on multi-agent DQN?
    I know there is a tutorial on A3C, but in some cases DQN is more suitable for grid-world environments than A3C.

  • @abolfazlzakeri6822
    @abolfazlzakeri6822 3 years ago +1

    Very well. Thank you.

  • @haneulkim4902
    @haneulkim4902 3 years ago

    Thanks Phil for an amazing tutorial!

  • @jjschnyder
    @jjschnyder 3 years ago

    Very nice tutorial. Why do you make a memory array for every element (state, new state, reward, etc.)? Couldn't you just make one overall memory array and store named tuples in the form (state, action, reward, new_state, done)?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago

      Yup, that's another way to do it. I use the named arrays because it's easier (for me) to keep track of where everything is stored.

  • @trenvert123
    @trenvert123 2 years ago

    Thank you for this tutorial!

  • @gabrielvalentim197
    @gabrielvalentim197 1 year ago

    Hey Phil, how can I solve local-minimum problems in PPO?
    I'm trying to solve Lunar Lander with a PPO agent (with and without an entropy bonus), but my agent gets stuck in a local minimum.
    I really appreciate your videos and I'm using them to improve my skills!!
    Thanks!!

  • @shashisuman8302
    @shashisuman8302 4 years ago +10

    Please don't tell people "you don't need any exposure to deep learning etc." This is why people jump from project to project without understanding, as they get excited.

    • @kontra_21
      @kontra_21 4 years ago

      In fairness, you don't need exposure to deep learning in order to follow this tutorial. However, I can agree it may have been a little misleading, as people may have assumed this was a top-down, easy-to-digest intro video where it would all be explained in simple terms.

  • @ahmetfurkanaknc8959
    @ahmetfurkanaknc8959 3 years ago

    Thanks, excellent tutorial!

  • @0hunnaa74
    @0hunnaa74 2 years ago

    Is it working??? The result is different from yours:
    I got an average of around -300 to -500.
    Does it run well for other people?

  • @masoncoles402
    @masoncoles402 3 years ago

    Hey, how would I go about saving/loading this model? I adapted your network for a different game.

  • @2ndgenfsdbetatester315
    @2ndgenfsdbetatester315 3 years ago

    life-saving video

  • @mathmo
    @mathmo 4 years ago

    Hi Phil, any reason you are using the forward() method on your neural net instead of calling it directly as Q_eval(), i.e. using __call__()? I believe calling forward() directly is generally unsafe, since there's potentially some necessary magic involving hooks going on under the surface that you might miss.

  • @9841580948
    @9841580948 4 years ago +1

    How can we save the deep Q model after full episodes of training? Thank you

  • @rahuldhanasiri
    @rahuldhanasiri 4 years ago +1

    Thank you Dr. Phil for an amazing video. When I try to run this on Colab, I get this error: "expected scalar type Float but found Double" at either line 18 or line 23 of main**.py. I am trying it on the CartPole environment and I have also tried changing the observation (line 16) to float32, but it didn't work.

    • @SenselessTalk
      @SenselessTalk 3 years ago +1

      BTW, to answer this: cast the state to float32 inside forward() (this assumes import torch; with the tutorial's import torch as T, use T.float32 instead):

      def forward(self, state):
          state = state.to(torch.float32)   # the error comes from passing float64 (Double) tensors
          x = F.relu(self.fc1(state))
          x = F.relu(self.fc2(x))
          actions = self.fc3(x)
          return actions

  • @ImDadidu
    @ImDadidu 3 years ago

    Great video! Helped me a lot with my bachelor thesis. I'm working on a private project now where the agent needs to predict an x_action between -1.0 and 1.0 and a y_action between -1.0 and 1.0. How can I manage the action indices in the learn() method if I have multiple floats that describe one action? Or do I need a completely different model for that? Thanks in advance :)

  • @CustomDabber360
    @CustomDabber360 2 years ago

    Amazing! I love your video.

  • @thelaconicguy
    @thelaconicguy 4 years ago +2

    Hey Phil! You are a class apart from others in explaining all these topics. I have a request for you: since reinforcement learning takes a lot of time when applied to real-world problems, wouldn't it be good to move your videos toward techniques like imitation learning, GANs, etc.?

    • @曹晨-s9s
      @曹晨-s9s 4 years ago

      Thanks a lot Phil, I am a big fan of yours, by the way.
      Can you make some videos about PPO and imitation learning?

  • @jasonpeloquin9950
    @jasonpeloquin9950 1 year ago

    This video is very helpful. Did something change with the store_transition function? I am getting an array mismatch error saying the requested array would exceed the maximum number of dimensions of 1.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 year ago

      If you're using the latest version of gym, the API has changed. reset() returns observation and info, and the step function returns observation, reward, done, truncated, info.
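
      A minimal sketch of the loop under the newer API (assuming gym >= 0.26; LunarLander-v2 is the environment from the video and needs box2d installed):

      import gym

      env = gym.make('LunarLander-v2')
      observation, info = env.reset()                  # reset() now also returns an info dict
      done = False
      while not done:
          action = env.action_space.sample()           # stand-in for agent.choose_action(observation)
          observation, reward, terminated, truncated, info = env.step(action)
          done = terminated or truncated               # combine the two termination flags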

    • @jasonpeloquin9950
      @jasonpeloquin9950 1 year ago

      Ah, can you just take the first element of the returned tuple as the observation now with the new API? Also, I just bought your course; this tutorial was very helpful.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 year ago

      Yup, you can discard the debug info.

    • @saifal-wahaibi6448
      @saifal-wahaibi6448 1 year ago

      Hey, how did you resolve the error?

    • @jasonpeloquin9950
      @jasonpeloquin9950 1 year ago

      @@saifal-wahaibi6448 you can just take the first element of that output. I can’t remember if I did it by indexing or doing .item

  • @happyduck70
    @happyduck70 2 years ago

    A question: is it really necessary to make terminal_batch a tensor? Since you just zero the q-values for terminal states in q_next, you could also use an np.array, is that correct?

  • @mickpress6718
    @mickpress6718 4 years ago

    Hi Phil. Just found this channel, nice :) I may be wrong, but I think there may be a problem in the learn process: mem_counter is never reset, so once it hits the batch size it will learn every time the learn function is called.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      Nope, functioning as intended.

    • @kontra_21
      @kontra_21 4 years ago +1

      That is intended. As he explains in the course, this is because at first there is no information in the state memories, since they have just been initialized. So the agent needs to run through at least X steps (where X is your batch size) before it can start to properly learn. After that it's never supposed to stop learning :)
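
      A small self-contained illustration of that gate; the names mem_cntr, mem_size and batch_size mirror the tutorial's agent, but this toy class is not the actual implementation:

      import numpy as np

      class ReplayBufferGate:
          """Toy illustration: learning only starts once one full batch has been stored."""
          def __init__(self, mem_size=100_000, batch_size=64):
              self.mem_size, self.batch_size, self.mem_cntr = mem_size, batch_size, 0

          def store(self):
              self.mem_cntr += 1                            # never reset; only ever grows

          def can_learn(self):
              return self.mem_cntr >= self.batch_size       # the check at the top of learn()

          def sample_indices(self):
              max_mem = min(self.mem_cntr, self.mem_size)   # never sample unfilled slots
              return np.random.choice(max_mem, self.batch_size, replace=False)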

  • @bradduy7329
    @bradduy7329 3 years ago

    Can you explain why we don't need to call the forward function of DeepQNetwork ourselves?
    E.g., we define def forward() but never call forward() directly.

    • @marcoss147
      @marcoss147 3 years ago

      PyTorch takes care of calling the function. If you name it anything other than forward it won't work. You should check the PyTorch docs if you want to learn more.
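
      A minimal sketch of what that looks like in practice, using a toy module rather than the tutorial's network:

      import torch as T
      import torch.nn as nn

      class TinyNet(nn.Module):
          def __init__(self):
              super().__init__()
              self.fc = nn.Linear(4, 2)

          def forward(self, x):               # nn.Module.__call__ dispatches to this method
              return self.fc(x)

      net = TinyNet()
      state = T.rand(1, 4)
      out = net(state)                        # preferred: runs hooks, then forward()
      same = net.forward(state)               # works, but bypasses any registered hooks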

  • @Salehalanazi-7
    @Salehalanazi-7 4 years ago +1

    Genius. Appreciate you 💜

  • @kontra_21
    @kontra_21 4 years ago +5

    I really appreciate this simple agent walkthrough. I find it easy to digest compared to other courses I've seen, and it doesn't try to explain the math behind it TOO much, which for novices is pretty nice.
    My concern, though, is that because our agent is learning every step of every episode, it is also decaying epsilon every step. This leads to a much more rapid and unpredictable descent of epsilon (since each episode has a varying number of steps) over the lifetime of the agent compared to other agents I have seen (full decay by episode 15-25).
    Is this intentional? If so, could you elaborate on why we would want epsilon to be fully decayed within 5% of the agent's training time?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago +2

      Good question. It turns out that the epsilon decay schedule isn't super critical to learning, at least in my experience. You can get away with a rapid decay as long as epsilon is left sufficiently large to allow exploration. If it were going all the way to zero (which you should never do unless you want to evaluate performance), then such an aggressive schedule would be a problem for sure.
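
      For reference, a minimal sketch of the linear decay with a floor; the hyperparameter values are the ones discussed in this thread, and the function name is illustrative:

      eps, eps_min, eps_dec = 1.0, 0.01, 5e-4

      def decrement_epsilon(eps):
          # linear decay applied on every learn() call, clipped at the exploration floor
          return max(eps - eps_dec, eps_min)

      for step in range(3000):
          eps = decrement_epsilon(eps)
      print(eps)    # hits 0.01 after roughly 2000 learn steps and stays there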

  • @mehuljan26
    @mehuljan26 3 years ago

    Love your videos. I have a question though: if I want to implement the same code on games with pixels as the observation space, how do I do that? I am getting multiple errors while trying to implement breakout-V2.

  • @billallen9251
    @billallen9251 1 year ago

    I followed and built the TensorFlow 2 version of this yesterday and it ran great. I haven't been able to get the PyTorch version to ever get above 0. I've scoured the code looking for bugs and I've tried every combination of hyperparameters. Has something changed in PyTorch that needs to be reflected in this code? My version is 1.13.1.

  • @anthonysu71
    @anthonysu71 4 years ago +2

    Hi, Dr. Phil. Great work on the deep Q-network implementation and demo. I have been following your tutorials for a while. I am currently building a DQN for a "multi-agent" collection, meaning there is more than one agent in the system but we consider them all as one collection. Correspondingly, the state (agent1, agent2, ...) and action (action1, action2, ...) are collections used to describe this system. But the trick is we don't know the number of agents for sure, which gives me a hard time defining n_actions (if 1 agent has 8 actions, 2 would have 64). Does the DQN framework still apply here? If it does, could you give me some suggestions about how to modify this framework? Thanks in advance!!!

  • @FoxGameing148
    @FoxGameing148 4 years ago

    Thanks for the help.

  • @jose-alberto-salazar-jimenez
    @jose-alberto-salazar-jimenez 6 months ago

    I have a question... Say one trains a model and saves its model state for later use... How would one go about loading the model state and testing the agent? I've tried coding something (following what I've found on the internet: in a nutshell, loading the model state, switching it to eval mode, then, with torch no_grad, selecting the actions greedily). During training it does pretty well by the end (learning was expected), but when I try testing (for instance, to show others its performance), it performs horribly... Can anybody help me?
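
    For anyone stuck on the same thing, a minimal self-contained sketch of the standard PyTorch save/load pattern; the network shape and filename are illustrative, not taken from the video:

    import torch as T
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 4))   # stand-in for Q_eval

    T.save(net.state_dict(), 'dqn_lunar_lander.pt')      # after training

    net.load_state_dict(T.load('dqn_lunar_lander.pt'))   # later, for evaluation
    net.eval()                                           # switch to eval mode
    with T.no_grad():
        state = T.rand(1, 8)                             # stand-in for an observation batch
        action = T.argmax(net(state), dim=1).item()      # greedy action, i.e. epsilon = 0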

  • @miriamramstudio3982
    @miriamramstudio3982 4 years ago

    Hi Phil, is it correct that epsilon already reaches the eps_min of 0.01 after only 11 episodes? Does that mean we have almost no exploration anymore after 11 episodes?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago +1

      Mostly correct. Only 1% of actions will be exploratory but that's sufficient for learning.
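
      A minimal sketch of the epsilon-greedy choice this refers to; the function signature is illustrative, only roughly following the tutorial's Agent.choose_action:

      import numpy as np
      import torch as T

      def choose_action(q_net, observation, epsilon, action_space, device='cpu'):
          # with probability epsilon explore, otherwise act greedily w.r.t. the Q-network
          if np.random.random() < epsilon:
              return np.random.choice(action_space)        # action_space is e.g. [0, 1, 2, 3]
          state = T.tensor(np.array([observation]), dtype=T.float32).to(device)
          with T.no_grad():
              actions = q_net(state)
          return T.argmax(actions).item()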

  • @mT4945
    @mT4945 4 years ago

    Hi Phil,
    I just found your channel and I really like your content.
    Do you think reinforcement learning has a future compared to text mining and image recognition?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      I think we'll see more applications of RL to those other fields. None of them will get us close to AGI.

  • @andreamassacci7942
    @andreamassacci7942 4 years ago

    Nice video. Well explained.

  • @hackathonhacks4119
    @hackathonhacks4119 3 years ago

    ValueError: maximum supported dimension for an ndarray is 32, found 10000 ... after writing all the code from here. What might be the issue?

  • @hossein_haeri
    @hossein_haeri 3 years ago

    Why did you set the epsilon to 1?

  • @qhieu195
    @qhieu195 4 years ago

    Great tutorial!
    Can you make a video that builds a DQN from scratch using Numpy?

  • @nonago725
    @nonago725 10 months ago

    The line "self.state_memory[index] = state" in the store_transition() function is giving "ValueError: setting an array element with a sequence. The requested array would exceed the maximum number of dimension of 1."
    My code didn't work, so I copy-and-pasted your code, and it still gives the same error. Why is this?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  10 months ago

      Because the latest version of gym changed the interface. Reset now returns observation and info, and step returns observation, reward, done, truncated, info.

    • @nonago725
      @nonago725 10 months ago

      @@MachineLearningwithPhil Ah, okay. I changed the env line to "observation, _ = env.reset()" and everything works now. Thank you.

  • @MrEvilyogurt
    @MrEvilyogurt 3 years ago

    Does anyone have issues trying to load checkpoints after training? When I load the checkpoints, my graph doesn't plot properly; it keeps a score of -21 across all episodes.

  • @ΧρήστοςΠαλάσκας-π4ν
    @ΧρήστοςΠαλάσκας-π4ν 7 months ago

    Nice!

  • @nandans2506
    @nandans2506 4 years ago

    Great content

  • @yemiyesufu5745
    @yemiyesufu5745 4 years ago +2

    Is the Udemy course done with PyTorch or TensorFlow?

  • @Penguin134
    @Penguin134 4 years ago

    How did you know to use [8] as input dims?

  • @n00bxl71
    @n00bxl71 1 year ago

    I tried implementing this exactly, but it just gets worse and worse. It's hovering at around a -500 average score; it seems to just press as many buttons as possible and stay up in the air as soon as epsilon reaches its minimum. Any thoughts?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  1 year ago

      Are you decaying epsilon over time?

    • @n00bxl71
      @n00bxl71 1 year ago

      Not entirely sure what you were trying to say, but yes, epsilon is decreasing over time.

    • @n00bxl71
      @n00bxl71 1 year ago

      Could you tell me what version each library is supposed to be at, so that I can better recreate your setup?

  • @mgr1282
    @mgr1282 4 years ago

    Hi Mr. Phil, I have some issues with your code from the previous video with TF2. I used it for gym's CartPole-v0 and FrozenLake-v0; for CartPole it did very well, but for FrozenLake it was very, very weak. I don't know why.
    BTW, in your code, in the body of the build_dqn function, you didn't use input_dims; why?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      Regarding the input dims, they're inferred by Keras.
      Define poor performance for FrozenLake? In my course we get a 70% win rate using regular Q-learning.

    • @mgr1282
      @mgr1282 4 years ago

      @@MachineLearningwithPhil In which of your courses? I got a 70% win rate without a neural network. I expected much more from your TF2 code in the previous video but got under 10%. It was great for CartPole.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      Why do we use neural networks? What are their use cases and limitations?

    • @mgr1282
      @mgr1282 4 years ago

      @@MachineLearningwithPhil I don't know exactly; I'm a beginner in reinforcement learning. I expected it could help the agent learn better. Deep neural networks need a lot of data; I know that is one of their limitations.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago +1

      Neural nets are designed to work with large or continuous state spaces. They don't handle the small discrete ones very well. Tabular Q-learning is far better suited to an environment like FrozenLake.
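
      For contrast, a minimal sketch of the tabular Q-learning update that works well on FrozenLake; the hyperparameters are illustrative:

      import numpy as np

      n_states, n_actions = 16, 4                 # FrozenLake-v0's 4x4 grid, 4 moves
      Q = np.zeros((n_states, n_actions))
      alpha, gamma = 0.1, 0.99

      def q_update(s, a, r, s_, done):
          # classic tabular Q-learning: no function approximation at all
          target = r + gamma * np.max(Q[s_]) * (not done)
          Q[s, a] += alpha * (target - Q[s, a])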

  • @IsaacPFranco
    @IsaacPFranco 4 years ago

    Wondering how you got PyTorch to recognize np.bool for self.terminal_memory; it brought up an error for me. I had to change the dtype to np.uint8.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago +1

      Older versions of PyTorch used np.uint8. The newer (1.4) version requires np.bool and throws an error with np.uint8

  • @spinity8468
    @spinity8468 4 years ago

    I thought Q and Q* used different neural networks, but that doesn't seem to be the case here. Am I wrong?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      I omit the target network in this tutorial, hence the "simple" part of the title. It's intended to be the simplest implementation that actually works in a non-trivial sense.

    • @spinity8468
      @spinity8468 4 years ago

      @@MachineLearningwithPhil You did a nice job! I am wondering if you have a similar video using two different networks for Q and Q*. Do you have such thing?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      ua-cam.com/video/a5XbO5Qgy5w/v-deo.html

    • @spinity8468
      @spinity8468 4 years ago

      @@MachineLearningwithPhil I am not familiar at all with Keras or TensorFlow. Do you have the equivalent in PyTorch?

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  4 years ago

      If you check out my github (linked in description), the repo for my course is there. You can see the PyTorch equivalent.

  • @alexandrefournier-ahizoune8098
    @alexandrefournier-ahizoune8098 2 years ago

    What does "fc1" stand for?

  • @haneulkim4902
    @haneulkim4902 3 years ago

    eps_dec = 5e-4, and each time learning happens it subtracts eps_dec from the current epsilon, so starting from 1 it should output epsilon 1, 0.9995, 0.9990, 0.9985, etc. This is not what happens when I run main_py for lunar_lander. Why is that? It shrinks as follows: 0.99, 0.95, 0.89, 0.84, etc. It seems to decrease by about 0.05.

    • @MachineLearningwithPhil
      @MachineLearningwithPhil  3 years ago +1

      The decrement happens each time step; the print is at the end of every episode.

    • @haneulkim4902
      @haneulkim4902 3 years ago

      @@MachineLearningwithPhil Oh hahah my bad, thanks Phil!

  • @SalvatorePellitteri
    @SalvatorePellitteri 4 years ago +1

    Next time use font size 22 at least.

  • @abrahamloha3050
    @abrahamloha3050 2 years ago

    best

  • @patrickphillips7009
    @patrickphillips7009 4 years ago

    At 33:54 "is our children learn..., is our agent learning" funny

  • @shivg2519
    @shivg2519 4 years ago

    nice

  • @kutilkol
    @kutilkol 2 years ago +1

    Dude, start using some IDE from this millennium, omg.

  • @emanuelepapucci59
    @emanuelepapucci59 2 years ago

    Finally, here I see the plotLearning function for the first time... God... I don't know how many videos I watched without knowing what that function was or why I couldn't use it... Now I finally know: you wrote it. Next time, please ALWAYS put a reference link under your videos for functions you use that aren't part of the packages; otherwise it makes no sense to follow the tutorial. I'm saying this for next time, because I'm a beginner and I can't tell that a function isn't part of a package unless you explain it...

  • @abhijiths2918
    @abhijiths2918 3 years ago +6

    Good tutorial. But man, if you could just open your mouth when you speak! I had to enable subtitles just to understand what you're saying, and half the time the subtitles were wrong because they couldn't understand what you were saying either!