Deep Q-Learning/Deep Q-Network (DQN) Explained | Python Pytorch Deep Reinforcement Learning
- Published 25 Jul 2024
- This tutorial contains step by step explanation, code walkthru, and demo of how Deep Q-Learning (DQL) works. We'll use DQL to solve the very simple Gymnasium FrozenLake-v1 Reinforcement Learning environment. We'll cover the differences between Q-Learning vs DQL, the Epsilon-Greedy Policy, the Policy Deep Q-Network (DQN), the Target DQN, and Experience Replay. After this video, you will understand DQL.
Want more videos like this? Support me here: www.buymeacoffee.com/johnnycode
GitHub Repo: github.com/johnnycode8/gym_so...
Part 2 - Add Convolution Layers to DQN: • Get Started with Convo...
Reinforcement Learning Playlist: • Gymnasium (Deep) Reinf...
Resources mentioned in video:
How to Solve FrozenLake-v1 with Q-Learning: • How to Use Q-Learning ...
Need help installing the Gymnasium library? • Install Gymnasium (Ope...
Solve Neural Network in Python and by hand: • How to Calculate Loss,...
00:00 Video Content
01:09 Frozen Lake Environment
02:16 Why Reinforcement Learning?
03:12 Epsilon-Greedy Policy
03:55 Q-Table vs Deep Q-Network
06:51 Training the Q-Table
10:10 Training the Deep Q-Network
14:49 Experience Replay
16:03 Deep Q-Learning Code Walkthru
29:49 Run Training Code & Demo
Please like and subscribe if this video is helpful for you 😀
Check out my DQN PyTorch for Beginners series where I explain DQN in much greater detail and show the whole process of implementing DQN to train Flappy Bird: ua-cam.com/video/arR7KzlYs4w/v-deo.html
Subscribed. You have really understood the concept of DQN. I was trying to implement DQN for a hangman game (Guessing characters to finally guess the word in less than 6 attempts), and your explanation helped drastically.
Although I need your opinion regarding the input size and dimensions for the hangman game, since in this Frozen Lake game the grid is predefined.
What could we do in the case of a word guessing game where each word has a different length? Thanks for the help in advance.
Hi, not sure if this will work, but try this:
- Use a vector of size 26 to represent the guessed/unguessed letters: 0 = available to use for guessing, 1 = correct guess, -1 = wrong guess.
- As you'd mentioned, words are variable length, but maybe set a fixed max length of say 15, so use a vector of size 15 to represent the word: 0 = unrevealed letter, 1 = revealed letter, -1 = unused position.
Concatenate the 2 vectors into 1 for the input layer. Try training with really short words and a fixed max length of maybe 5, so you can quickly see if it works or not.
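The encoding suggested above could be sketched like this (a rough illustration, assuming a max word length of 15; the function and variable names are made up for this example, not from the video):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
MAX_LEN = 15  # assumed fixed maximum word length

def encode_state(word, revealed, correct_guesses, wrong_guesses):
    # Letter vector (size 26): 0 = available, 1 = correct guess, -1 = wrong guess
    letters = np.zeros(26, dtype=np.float32)
    for c in correct_guesses:
        letters[ALPHABET.index(c)] = 1.0
    for c in wrong_guesses:
        letters[ALPHABET.index(c)] = -1.0
    # Word vector (size 15): 0 = unrevealed, 1 = revealed, -1 = unused position
    positions = np.full(MAX_LEN, -1.0, dtype=np.float32)
    for i in range(len(word)):
        positions[i] = 1.0 if revealed[i] else 0.0
    # Concatenate the 2 vectors into 1 for the input layer (26 + 15 = 41 nodes)
    return np.concatenate([letters, positions])

state = encode_state("apple", [True, False, False, False, True],
                     {"a", "e"}, {"z"})
print(state.shape)  # (41,)
```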
@@johnnycode Thanks man, the concatenation advice worked. You are really good.
@@hrishikeshh Great job getting your game to train! 🎉
The best teachers are those who teach a difficult lesson simply. thank you
Johnny, you explained what I've been trying to wrap my head around for 9 months in a few minutes. Keep up the good work.
Thank you for making this video. Your explanations were clear👍 , and I learned a lot. Also, I find your voice very pleasant to listen to.
Thanks for this tutorial. It helped me to understand DQN .
You are a great teacher, thank you so much and Merry Christmas 2023
Wonderful Explanation!
These videos on the new Gymnasium version of Gym are great. ❤ Could you do a video about the BipedalWalker environment?
This was a really well explained example
awesome video!
Excellent!
Thanks again for your good work to help me understand reinforcement learning better.
You’re welcome, thanks for all the donations😊😊😊
thank you very much for your reply
Thank you so much. Can you please implement an environment with DQN that shows the forgetting problem?
the goat
Thank you for making this video. May I ask a question about the Q-network: why did you set the input size for the network to 16 inputs at 5:19, rather than 1 input that represents the state index only?
I think 1 input is enough
Hi, good observation. You can change the input to 1 node of 0-15, rather than 16 nodes of 0 or 1, and the training will work.
Currently, the trained Q-values will not work if we were to reconfigure the locations of the ice holes. That is because we are not passing in the map configuration to the network and reconfiguring the map during training. I was thinking of encoding the map configuration into the 16 nodes, but I left that out of the video to keep things simple. I hope this answers why I had 16 nodes instead of 1.
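The 16-node input described here could be sketched as a one-hot encoding of the agent's position on the 4x4 grid (a minimal illustration; the function name is made up for this example):

```python
import torch

def state_to_input(state: int, num_states: int = 16) -> torch.Tensor:
    # One node per grid square: only the node for the current state is 1
    x = torch.zeros(num_states)
    x[state] = 1.0
    return x

print(state_to_input(5))  # tensor with a 1.0 at index 5, zeros elsewhere
```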
@@johnnycode Thank you for this quick reply, which I wasn't expecting.
This makes sense to me
Hey Johnny! Thanks so much for these videos! I have a question, is it possible to apply this algorithm to a continuous action space? For example, select a number in a range between [0, 120] as an action, or should I investigate other algorithms?
Hi, DQN only works on discrete actions. Try a policy gradient type algorithm. My other video talks about choosing an algorithm: ua-cam.com/video/2AFl-iWGQzc/v-deo.html
Sorry, may I ask how I can find the max_step used in the training of every episode? How do I know that the max number of actions is 200?
You actually have to add a few lines of code to enable the max step truncation. Like this:
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
env = gym.make("FrozenLake-v1", map_name="8x8")
env = TimeLimit(env, max_episode_steps=200)
@@johnnycode Thank you very much! Sorry, may I ask another question? Suppose my agent has a health state of 16, and every step costs one point of health. How can I let the agent know this? I mean, how do I make the agent also consider the health state when training the DQN? Can you give me some thoughts? Sorry, I'm a newbie, and this may be a stupid question. Thanks!
You are talking about changing the reward/penalty scheme; that is something you have to change in the environment. You have to make a copy of the FrozenLake environment and then make your code changes. If you are on Windows, you can find the file here: C:\Users\\.conda\envs\gymenv\Lib\site-packages\gymnasium\envs\toy_text\ — look for frozen_lake.py
For example, I made some visual changes in the FrozenLake environment in the following video, however, I did not modify the reward scheme: ua-cam.com/video/1W_LOB-0IEY/v-deo.html
Thanks for the video! Can you make an example with a Battleship game? I'm trying, but the action (e.g. position 12) is the same as the new state (12) 😢
I actually never played Battleship (I'm assuming the board game?) before :D
Do you have a link to the environment?
@@johnnycode I don't have the environment finished yet
Thank you for telling me how to create the environment in the first place, since I would like to create an environment like your 4x4 but with other images.
How do I start learning reinforcement learning? I know pandas, NumPy, matplotlib, and basic ML algorithms.
Great video, thanks. I am from a cyber security background. Do you have any idea about the Network Attack Simulator (NASim), which also uses Deep Q-Learning and OpenAI Gym? If you don't, can you guide me on where to find tutorials on it? I have checked YouTube for weeks but couldn't find any. THANKS
Are you referring to networkattacksimulator.readthedocs.io/ ?
Did you try the tutorial in the documentation? What are you looking to do?
Great video! Maybe a next environment to try is Mario?
Thanks, I’ll consider it.
Any chance you can show us how to use Keras and RL on Tetris?
Thanks for the interest, but I'm no RL expert and totally not qualified to give private lessons on this subject. I'm just learning myself and making the material easier to understand for others. As for Tetris, I have not tried it but may attempt it in the future.
I don't quite understand: if you change the positions of the holes, the trained model will no longer be able to find the reward, right? What is the purpose of Q-learning then?
Q-Learning provides a general way to find the "most-likely-to-succeed" set of actions to the goal. You are correct that the trained model only works on a specific map. In order for the model to solve (almost) any map, the agent has to be trained on as many map layouts as possible. The input to the neural network will probably need to include the map layout.
@@johnnycode Do you know other AI systems other than QLearning? I wouldn't like to pass on the layout since that would be like 'cheating' (talking about the battle ships board game for example)
@TutorialesHTML5 I'm not sure what other learning algorithm would work for Battleship, but Deep Q-Learning should work. Your "input" to the neural network would be your Target Grid, i.e. the shots fired/missed/hit. When there is a hit, the model should guess that the next highest chance of hitting is one of the adjacent squares. This video is for understanding the underlying algorithm, you might want to use a Reinforcement Library like what I show in this video: ua-cam.com/video/OqvXHi_QtT0/v-deo.html
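The Target Grid input suggested above might be encoded along these lines (a hypothetical sketch; the 10x10 board size, function name, and value scheme are assumptions for illustration, not from the video):

```python
import numpy as np

GRID = 10  # assumed standard Battleship board size

def encode_target_grid(hits, misses):
    # 0 = untried square, 1 = hit, -1 = miss
    grid = np.zeros((GRID, GRID), dtype=np.float32)
    for r, c in hits:
        grid[r, c] = 1.0
    for r, c in misses:
        grid[r, c] = -1.0
    # Flatten to a 100-element vector for the network's input layer
    return grid.flatten()

x = encode_target_grid(hits=[(3, 4)], misses=[(0, 0), (9, 9)])
print(x.shape)  # (100,)
```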
@@johnnycode Yes, but every game changes the boats' positions... will it work? If you like, make a video. 😂😊
I might give it a shot, but no guarantees 😁
Just curious how you learned all of this? Did you just read the documentation or watch other videos?
These 2 resources were helpful to me:
huggingface.co/learn/deep-rl-course/unit3/deep-q-algorithm
pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
for i in range(1000):
    print("thankyou")
😁
Hi Johnny...Do you know where I can find an example of applying a DQN to the Taxi-V3 environment?
The Taxi env can be solved with regular q-learning. If you take my code from the Frozen Lake video and swap in Taxi, it should work.
@@johnnycode Hi Johnny... I made it work with Q-Learning; however, I was wanting to replace the Q-table with a simple DQN. It seems like this would be possible. I tried searching and even asked ChatGPT for help but cannot quite get it to work.
How did you encode the input to the policy/target network?
@@johnnycode Can I send you my code? It is a very short script that attempts to use a CNN to solve the Taxi-V3 problem.
Sorry, I can’t review your code. If you have specific questions, I can try to answer them.
Hi there. Thanks for the video guide. Super interesting and useful for better understanding the topic. I'm trying to implement tabular Q-learning and Deep Q-Learning in an Atari environment (specifically Pitfall), but I'm struggling to understand how to handle the number of possible states. I must use the RAM observation space, but in this observation space a state is represented by an array of 128 cells, each of type uint8 with values that range from 0 to 255. So I cannot know a priori the exact number of states, since changing just one cell's value results in a different new state. Have you got any suggestions, or do you know some guide to better understand how to manage this environment?
Tabular Q-learning cannot solve Atari games. You have to use Deep Q-Learning (DQN) or another type of advanced algorithm.
For Pitfall, you should use RGB or grayscale observations instead of RAM. Watch my video that explains some basics of using DQN on Atari games: ua-cam.com/video/qKePPepISiA/v-deo.html
However, Pitfall is probably too difficult to start with. You should try training on non-Atari environments first. For example, my series on training Flappy Bird: ua-cam.com/video/arR7KzlYs4w/v-deo.html
@@johnnycode I wish I could change to another environment, but I must do this because it is a project for university, so I need to stick with this one.
For tabular, I thought the same thing, but my professor suggested that I could do it using a map instead of a matrix, which accepts arrays or tuples as the index, and whenever there is a state that was not initialized before, initialize it to the default.
As for DQN, I will happily check out your video to see if I can figure this out. Thank you.
And RAM is another condition I must maintain on the observation space.
I don't mean to challenge your professor, but it is impossible to solve Pitfall with tabular Q-learning. With 128 cells that can each take 256 values, there are 256 to the power of 128 possible states, which no computer memory can hold. You need to use a neural network-based algorithm like DQN.
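For scale, the size of that RAM state space can be checked directly, since the observation is 128 cells that each hold a byte (0-255):

```python
# 128 cells, 256 possible values each: 256 ** 128 = 2 ** 1024 distinct states
num_states = 256 ** 128
assert num_states == 2 ** 1024
print(len(str(num_states)))  # 309 decimal digits -- far too many for any table
```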
@@johnnycode I agree with you. I had the same thought about it, and I even spent the last two days trying to train it with the tabular approach. I managed to make it work (in terms of the structure of the table and writing the Q-values), but of course I had just some episodes in which it maintained a reward of 0 and a lot of other episodes where the reward was negative (and it could never collect a single treasure).
So I don't know; maybe he wants me to do it anyway and demonstrate that it can't be implemented like that. I will ask for a meeting with him and see what comes out.
Thank you so much for your time and your considerations. When I finish with the tabular version, I will start the DQL with a CNN. If I have some doubts, would you mind if I still ask you some things? I don't mean to disturb you further, but I found this discussion and your videos on the topic really useful for better approaching the problem.
Sir, I have one doubt.
We are training the policy network first and then copying it as the target network.
Then again we let our agent go through the policy network and update it based on the target network.
But you also mentioned the target network uses the DQN formula.
I am totally confused, sir. Can you give the steps crisply?
My video series on implementing DQN to train Flappy Bird has much more detailed explanations of the end-to-end process, check it out: ua-cam.com/video/arR7KzlYs4w/v-deo.html
@@johnnycode ok sir thanks a lot
@@johnnycode Sir, can I use the Deep Q-Network to generate the pixel positions of stones in a Snake game, where the stones act as obstacles?
Please guide me, sir.
@World-Of-Mr-Motivater See if this video answers your question: ua-cam.com/video/AoGRjPt-vms/v-deo.html
Please give me the source code of this DQN algorithm, and thank you for this explanation.
github.com/johnnycode8/gym_solutions
@@johnnycode Thank you very much for your reply. I work on Colab (Python online), since I have hardware constraints, and I don't understand how to use the source code you sent me to build other implementations on top of it. Thank you for explaining how to do this, perhaps in a short video. Thanks a million once again.
Here are some suggestions:
How to modify this code for MountainCar: ua-cam.com/video/oceguqZxjn4/v-deo.html
You might be interest in using a library like Stable Baselines3: ua-cam.com/video/OqvXHi_QtT0/v-deo.html
how can we solve the problem using a single DQN instead of 2?
You can use the same policy network in place of the target network. Training results may be worse than using 2 networks.
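A rough sketch of that single-network variant when computing the TD target (the network shape, names, and discount factor here are illustrative, not from the video):

```python
import torch
import torch.nn as nn

# Single-network DQN: the policy network also provides the bootstrap target
policy_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
gamma = 0.9  # discount factor

def td_target(reward: float, next_state: torch.Tensor, terminated: bool):
    if terminated:
        return torch.tensor(reward)  # no future reward after a terminal state
    with torch.no_grad():  # don't backpropagate through the target estimate
        return reward + gamma * policy_net(next_state).max()

print(td_target(1.0, torch.zeros(16), terminated=True))  # tensor(1.)
```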
And which is the best solution between the single DQN and the Q-learning method?
@@ElisaFerrari-q5i DQN is the advanced version of Q-learning.