I've been programming board game engines for 25 years and I've followed the development of CNNs to play go quite closely. This video is a really good description of the AlphaGo Zero paper, with very clear explanations. Well, the explanation of MCTS was completely wrong, but other than that this video was great. I'll make sure to check out more from this channel.
This is a valuable explanation; this channel is a great discovery
noranta4 thanks man, just started two weeks ago ;) More vids coming up :p
7:24 Your explanation of MCTS is not correct. For one instance of simulation: it picks the top move recommended by the network (greedy) most of the time, with random moves some of the time (epsilon). Then it walks into that move and repeats the same. It does this to completion. Then it backs up and keeps track of the win vs. visit ratio for each state, as shown in the picture. It repeats this whole process 1600 times. As it performs these walkthroughs it trains the networks and updates the values, so eventually, the more often you see a state, the more it will statistically converge to the optimal value. MCTS runs to completion; it's not a depth-pruning algorithm. Temporal Difference stops somewhere in the middle; that was not used in AGZ. The MCTS algorithm is discussed by David Silver in his lecture #8 towards the end.
I checked the paper and you are indeed correct! The MCTS they use doesn't always play out every thread until the very end of the game (they use value thresholds for early stopping), but I did misinterpret the meaning of the '1600 simulations', thanks for pointing this out!
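For readers trying to picture the loop described in this exchange, here is a rough Python sketch of plain MCTS with epsilon-greedy move selection, playouts to the end of the game, and win/visit bookkeeping. The `game` and `policy_net` objects are hypothetical stand-ins for illustration, not DeepMind's code.

```python
import random
from collections import defaultdict

wins = defaultdict(float)    # state -> accumulated results
visits = defaultdict(int)    # state -> number of times visited

def simulate(game, policy_net, epsilon=0.1):
    """One walkthrough: play to the end, then back up the result along the path."""
    path = []
    while not game.is_terminal():
        path.append(game.state())
        moves = game.legal_moves()
        if random.random() < epsilon:
            move = random.choice(moves)                                   # explore
        else:
            move = max(moves, key=lambda m: policy_net(game.state(), m))  # greedy
        game.play(move)
    result = game.winner()          # e.g. +1 for a win, -1 for a loss
    for state in path:              # keep win vs. visit statistics per state
        visits[state] += 1
        wins[state] += result
    return result

def run_mcts(game, policy_net, n_simulations=1600):
    for _ in range(n_simulations):
        simulate(game.copy(), policy_net)
```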
Do I get it right: with an average depth of ~300 moves and no early stopping, that would be ~1600*300 network queries just for the first move?
So it plays out a game, and if it wins, the moves it picked are trained towards 1 and the value towards 1? And if it loses, everything towards 0?
AGZ doesn't really use classic MCTS, as it doesn't use rollouts; it doesn't play the game out to the end.
So I am going down the search tree (based on an algorithm that takes exploration and other things into account), until I reach a leaf node.
I put the leaf node position into my neural net and get a policy and an evaluation as a result. The policy adds new leaf nodes to my current position
and the value function gives me an evaluation of my current position.
Is this correct?
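A rough Python sketch of the descend / evaluate / expand / back-up cycle described in this question (AlphaGo Zero-style, no rollout); the `game` and `net` interfaces here are made up for illustration, not the actual implementation.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy head
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # move -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    """Pick the child maximizing Q + U, where U favours high-prior, rarely visited moves."""
    total = sum(child.visit_count for child in node.children.values())
    def score(item):
        move, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visit_count)
        return child.value() + u
    return max(node.children.items(), key=score)

def simulate(root, game, net):
    node, path = root, [root]
    while node.children:                       # 1. descend until a leaf node
        move, node = select_child(node)
        game.play(move)
        path.append(node)
    policy, value = net(game.state())          # 2. evaluate the leaf position
    for move, p in policy.items():             # 3. expand: new children with priors
        node.children[move] = Node(prior=p)
    for n in reversed(path):                   # 4. back up the value estimate
        n.visit_count += 1
        n.value_sum += value
        value = -value                         # flip perspective each ply
```

Starting from `root = Node(prior=1.0)`, calling `simulate` many times on copies of the position grows the tree, and the actual move is then chosen from the root's visit counts.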
We humans run simulations in our heads all the time, because sometimes simple intuitions are not enough... So I guess it isn't surprising that including Monte Carlo Tree Search always drastically improves performance, no matter how good the value function estimates are, even with the help of deep learning... The question is how to search more efficiently, and also how to build an efficient model...
Your explanation skills are fantastic! I like how you have an outline at the beginning of the video; a very simple thing, yet very effective when it comes to teaching a subject, and yet so few educational videos do that.
If I had to figure out the paper by myself, it would have taken me ~2x longer.
Subscribed.
Thank you! This is one of the clearest and most concise explanations of any paper I've found thus far.
Clearest and most informative video I've seen on AlphaGo. Thanks!
Excellent explanation, thanks!! I'm going to make my own 9x9 AlphaGo Zero version
I couldn't find it on your GitHub, friend
Best explanation I've found of AlphaGo Zero
You explained technical stuff very clearly. Thanks Arxiv Insights
Awesome explanation! (And your greenscreen work looks great!)
Thank you for taking the time to explain it so well. Still difficult for me since I'm not familiar with the matter yet, but you really did a good job of showing it clearly!
It's very clear, thank you! I can't wait to discover the other videos :)
Fantastic explanation! Few people balance simplicity with thoroughness as well as you do.
That's the goal indeed, thx for the feedback :)
This is the best video regarding the AlphaGo paper. Just amazing!!!
the transition 'dhkk' hits hard
Xander, you look like "The Hoff" (David Hasselhoff) and that's a great look!
Germans love David Hasselhoff
can't wait for this thing to perform in SC2
So what did you think?
This is cool, but after the third random jumpscare sound I couldn't pay attention to what you were saying--all I could think about was when the next one would be. Gave up halfway through since it was stressing me out
This helps a lot for those who need insights into machine learning trends :)
Thank you, finally I found a good video on this paper.
that's some giga chad jaw u have there
Brilliant - thanks for this! Really enjoyed watching and I think it takes away all the right information from the paper.
Just a quick point: is there any chance you could quieten down the background music for your next video? It was slightly distracting and I think it detracted a bit from your great explanation!
Merry Christmas!
Thanks a lot, great to hear :) And for the background music: I got the same feedback from a few different people! This was my first video (every other video you'll find on my channel has this fixed) :p
Damn great video! Carry on! Makes it very easy to get into these advanced subjects :)
Excellent video
Wow, I see some mistakes, and I haven't watched too many of your videos yet, but I find this channel is definitely underrated
Great summary of the paper! Thank you :)
You should open this up to community captioning
Done! Good suggestion :)
Excellent video, thanks for making it!
Thank you for this great explanation!
Excellent explanation, thanks
Thank you very much for the effort to educate the ignorant about the mechanisms behind AlphaGo, which many take for a "beast", a "terminator-like machine"... To simplify, I would say the Go champion effectively played against 50000 professional Go players -> no chance to win at all, just as Kasparov failed to win against 50000 human amateur players in the last decade. For me it is the massively parallel processes and recursive functions that beat the champion; technically beaten, this is definitely NOT intelligence but MASSIVELY PARALLEL processes clustered on thousands of CPUs and GPUs (floating point operations).
GREAT WORK
Excellent Presentation
Thank you for this valuable explanation!
I just want to request that you highlight more clearly the parts of the images you are talking about, e.g. in the diagrams you show. This would make it easier to follow!
Great explanation. Thank you!
Wow, this is really a great explanation
Loving your channel!
10 Nov 2022 - I just discovered this video and your channel. Fantastic explanation of what is, granted, a difficult subject to even tackle. Did you mention what kind of computer hardware the newest AlphaGo system uses? I assume it's a mainframe of some type. Also, I wonder if the system can decide in advance to play a peaceful game or a highly combative game? I have seen games where there were very few prisoners taken off the board, otherwise called a peaceful game. Still, there is a winner nonetheless. Anyway, bravo for an excellent video.
The part I don't understand is how they dispense with rollouts in MCTS. It seems like this is the only way to get a ground-truth value (by reaching a terminal state) which can then be propagated back up the chain. If you reach a non-terminal state, you're backpropagating a value from the value network, which won't have useful values until it can be trained on useful values from the tree search. It seems like it's pulling itself up by its bootstraps. Is it the case that the true values come from the odd time that a simulation reaches a terminal state? Or am I missing something fundamental?
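For what it's worth, in the paper the value head's training targets are not bootstrapped from the search: they are the final outcomes z of finished self-play games, while the search itself backs up network value estimates instead of rollout results. Below is a minimal sketch of how those training tuples could be assembled; the function name and outcome encoding are assumptions, not the paper's code.

```python
def make_training_examples(positions, search_policies, winner):
    """One finished self-play game -> (state, pi, z) training tuples.

    positions[i] / search_policies[i]: board state and MCTS visit-count
    distribution at move i; winner: +1 if black won, -1 if white won (assumed).
    """
    examples = []
    for i, (state, pi) in enumerate(zip(positions, search_policies)):
        to_play = 1 if i % 2 == 0 else -1   # black moves on even plies here
        z = winner * to_play                # final outcome from the mover's view
        examples.append((state, pi, z))     # policy head -> pi, value head -> z
    return examples
```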
maaaan you are doing a great work keep up
The history (extra 7 layers) is also used to identify ko (kind of similar to threefold repetition in Chess)
7:27 You said "certain depth"; did you mean "certain width"? Btw, I'd say this is one of the very best channels on the DL topic I've ever seen! Thanks so much!
How is the value representation trained?
Thanks for the great explanation! I'm still wondering how AlphaGo Zero learns that certain moves are obviously bad, like playing in the corner for example, without playing a game to the end?
did u guys teach alphago how to beat security systems yet? and take over the stock market and all nuclear launch codes?
Does anybody know if the shape of the output layer changes for every phase of the game? In the video, he explains that the network produces a probability distribution over possible moves, and the number of possible moves is dynamic. Does that mean the output layer's dimension is also dynamic? If so, how is it achieved? Can anyone help me understand? Thanks!
No, the output layer shape is static. You need to zero out the illegal moves from the output and then renormalize the probabilities to sum to 1.
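A minimal sketch of that masking step in NumPy (shapes and names are illustrative; 361 board points plus a pass move):

```python
import numpy as np

def masked_policy(raw_policy, legal_mask):
    """raw_policy: (362,) network output; legal_mask: (362,) bools, True = legal."""
    p = np.where(legal_mask, raw_policy, 0.0)   # zero out the illegal moves
    total = p.sum()
    if total > 0:
        return p / total                        # renormalize to sum to 1
    return legal_mask / legal_mask.sum()        # fallback: uniform over legal moves
```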
I am wondering what the output "policy vector" is like in the neural network
distracting music
I think it's just fine
I could barely notice it
Do AlphaFold 2 when the paper comes out :)
A little question on the AlphaGo Zero MCTS. The Monte Carlo aspect of the AlphaGo Zero MCTS seems to be gone, AFAIK. I can't see random numbers or random choices in that MCTS; it seems to be replaced by the CNN calculating the probability of a board position leading to victory. What's your take on it?
yo this video is dope. it's super fire. Just letting u know i'm a dan player
I want to know more about this, I wish you were my school teacher
11:39 Two of the four "very popular moves that have stood for thousands of years" were dismissed by AlphaGo after 50 hours of training.
Thanks for the impressive explanation, where can I find the source code?
Danger Will Robinson!!!
I'm confused, as your explanation contradicts the points you mentioned in the introduction
So not really AI, just number crunching using statistical analysis (Monte Carlo tree).
Well, it uses deep neural nets (value estimate + policy net) + self-play training (Reinforcement Learning) to make the Monte Carlo Tree Search tractable on the exponentially scaling Go search space. So yes it's number crunching, but that's what AI is all about...
Very nice video! Put the microphone closer to you so we don't have that annoying reverb, please
Interesting video :-)
What if instead of self-training, the AI is trained with the data from matches of a previously trained AlphaZero AI?
You could use that to speed up training in the beginning for version 2.0, but eventually performance will saturate and you won't do better. And if you're building a version 2.0 you're hoping to do better than 1.0, so bootstrapping on gameplay that is worse than what you want to achieve doesn't really make sense. Similarly, AlphaGo Zero got better than AlphaGo by NOT bootstrapping on human games...
Ok. So how do you play and what is the idea of it?
Keep on! It's great.
Is it hard to implement this algorithm by myself? Could I create a superhuman Go player on, say, a 7x7 board with just my laptop?
How big could I make the board using just a normal laptop?
There are a ton of open-source implementations on GitHub: github.com/topics/alphago-zero, but I know that many people are having issues reproducing the full strength of DeepMind's version. I don't know if the 'interesting game mechanics' of Go also emerge on a small board like 7x7, but I would guess that you can definitely train a decent model on a laptop for such a small game state. Additionally, you could also apply the algorithm to chess, which has a much smaller branching factor so it's easier to train, although again I think that in order to get decent results you would have to throw some cloud computing power into the mix :)
well, basically like humans acquire skills ... from scratch.
cool
so really it's just one big ass flow chart, if this then that
It's too difficult to explain how great AlphaGo is to people who don't know how to play weiqi.
my brain feels sexually abused after watching this video...
🤣🤣
Go play with yourself, so you can learn Go! XD
7:37 "... to play about 1600 simulations for every single board evaluation ... " I have a question. How do they do this? Even if it's not 19*19*17, let's say it's just 19*19*2 around 700, there would be (2**700) "board evaluation"s (maybe less if there are some illegal board states, but not much less). How could they even just play one simulation for so many "board evaluation"s? Guess I'm missing something...
To build out the search tree, potential actions are sampled from the policy network (both for white and black moves) (so this hugely constrains the tree rollout to the most-likely / best moves). And then they also do pruning according to the value network, so whenever a part of the search tree results in a very low chance of winning (below some threshold according to the value network) it is discarded and not explored further. Combining these two approaches they build out the search tree and finally decide what move to play at the top of the tree by looking at the average value estimates for each of the tree's branches.
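A toy illustration of the scheme this reply describes: expand only the top policy moves, discard branches whose estimated winning chance for the root player falls below a threshold, and pick the root move with the best average value. `net(state)` is assumed to return a (move -> prior) dict plus a win probability for the side to move; this is a simplification for intuition, not the paper's actual PUCT-based search.

```python
def choose_move(game, net, depth=3, top_k=5, threshold=0.05):
    def expand(state, root_to_move, d):
        policy, win_prob = net(state)
        root_prob = win_prob if root_to_move else 1.0 - win_prob  # root player's view
        if d == 0 or root_prob < threshold:    # stop: depth limit, or prune bad branch
            return root_prob
        moves = sorted(policy, key=policy.get, reverse=True)[:top_k]
        return sum(expand(game.next_state(state, m), not root_to_move, d - 1)
                   for m in moves) / len(moves)

    root_policy, _ = net(game.state())
    candidates = sorted(root_policy, key=root_policy.get, reverse=True)[:top_k]
    scores = {m: expand(game.next_state(game.state(), m), False, depth - 1)
              for m in candidates}
    return max(scores, key=scores.get)         # move with the best average estimate
```

Even with top_k = 5 and a few plies of depth, this visits on the order of hundreds of positions rather than anything close to 2**700, which is why a modest number of network-guided simulations per move can be enough.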
Good info but you should consider losing those annoying and jarring scene transition guitar strums+kick drum sounds. They detract very much from the presentation.
dougdevine27 haha very true, this was my first video :p I removed them in all my other content ;) Unfortunately, once uploaded, YouTube doesn't let you change anything anymore..
*Guitar Noise*
Nice video. But your sound effects and music are VERY loud. Maybe normalize a bit?
Anybody know wtf they use a move history for? Aside from obfuscating learning and computation many times over? Seems like nonsense.
Ihor Menshykov yeah, had the same thought at first. Apparently including the history lets the network learn a form of attention over the important/active parts of the game. But I agree that theoretically it shouldn't really be necessary... See the Reddit Q&A for more details!
You're...gwern? What?
exactly! Is he?
Epic👌👌👌👌
😊
+1 sub
🔥🧠🔥
Get rid of that sound effect. It’s weird and jarring. Do a graphical transition instead.
So big spreadsheet and not AI - got it.
You explain shi@ too complicated
i am not getting a single thing out of this
sry, badly made