This is brilliant, as a physicist turned pseudo-statistician, the physics-motivated approach of MCMC makes more sense to me than the statistical approach. Fun story: despite bearing his name, Nicholas Metropolis didn't contribute at all to the algorithm. The concept was originally developed by physicist Marshall Rosenbluth at Los Alamos for studying thermonuclear explosions. His wife Arianna Rosenbluth (also a physicist) turned the concept into code. The Rosenbluths are the true pioneers. They struggled with a consistent way to pick an appropriate acceptance function. One day having dinner with Edward Teller and his wife Augusta Teller, they described the problem, the implementation, and the challenge. The Tellers being a couple of physicist+mathematician suggested the obvious choice for a physicist: simply use the Boltzmann distribution. And the first MCMC algorithm was born. What did Metropolis do? He had the keys of the computer lab, and the alphabetical order used for naming authors made his last name the name of the algorithm. I personally call it the Rosenbluth algorithm.
Other fun fact: they were actually inspired by the Buffon needle problem where one can determine pi by throwing needles on the ground. Ps: funily enough it actually does not work directly on a computer because we need pi to generate the needle orientation (there is a way to go over that though)
That "coming clean" bit really highlights the difference between theory and reality; to operationalize Metropolis-Hastings requires a lot of extra work.
I really respect the way the 'like' and 'sub' requests weren't explicit. It was made clear they were also important and prioritized them by making their consideration a natural part of the exposition. Bravo.
Not a statistician, but a machine learning person -- one of the more fascinating things about this algorithm to me is you can use it to explore a solution space by turning it on its head and inventing a probability distribution that makes better solutions more likely. It's not super widely used but it has applications. Important work that deserves better visibility!
That’s cool! This channel’s eventually going to have to get into machine learning at some point, and it sounds like there’s some cool stuff I’ve never heard of
That’s cool! This channel’s eventually going to have to get into machine learning at some point, and it sounds like there’s some cool stuff I’ve never heard of
That was the best explanation of MCMC I have heard. I'm going to listen to it ten more times and see I start to understand. This stuff is really hard for people with no stats or higher math background, but it's fascinating and fun to learn, however slowly
Happy to see you covering this topic. MCMC is what made Bayesian statistics viable beyond conjugate priors and simple distributions. There's a lot more to dig into including Hamiltonian Monte Carlo (what NUTS is based on) and Gibbs sampling, as well as convergence metrics. Bayesian Data Analysis 3rd ed. (Gelman, Carlin, Stern, Dunson, Vehtari, & Rubin) is the go to book for this.
@@mickaiba71 I'd say it's targeted towards someone with a Master's. There's a lot more besides just MCMC. It's really meant to be used more like a reference than something you read cover to cover. "Statistical Rethinking" by McElreath gives one of the best introductions to Bayesian Statistics out there. There's a few chapters on MCMC in that one as well.
@@mickaiba71 you can also tackle from the point of view of some of its applications. I had first learned of MH algorithm through learning about probabilistic graphical models. Bayesian Reasoning and Machine Learning by David Barber did a great job for me introducing and connecting these topics together
I've never seen the acceptance rate expression explained/derived so intuitively - great job! By the way, to get your RWM to work, judging by the traceplots it seems that you would have needed to add some sort of pre-conditioning to change the variance of the proposal distribution for each dimension of your target separately, since alpha and beta seem to be on a different "scale" to the p variables, which is why finding a common "h" that allowed the chain to mix fast enough in all directions without getting stuck would be borderline impossible. RWM (even with a Gaussian proposal rather than uniform) tends to struggle with hierarchical models and high-dimensional models more generally because of this idea, but thankfully we have things like STAN using HMC. Would love to see an explainer on NUTS/HMC in your explanatory style if you ever feel like it!
yeah, I saw someone else implement a hierarchical beta-binomial, but they chose to update alpha and beta separately. Based on what you said, I can definitely see why now. At least it was a good learning experience on my part 😅
Thank you for a stimulating video on an important statistical method. The hand-written equation appearing on-screen from 0:10 to 0:22 and briefly captioned “Despite how it looks, this is just an average,” is a fundamental building block of thermodynamics. It was at the core of my doctoral dissertation project in polymer physical chemistry in the 1960s. A generation earlier the average geometric properties of a polymer such as its end-to-end distance or radius of gyration (average distance of component atoms from center of mass) were simulated with pencil and paper as a Brownian motion type random walk with a fixed step length representing the distance between atoms along the connected chain, often constrained to a 2- or 3-dimensional spatial lattice. Their mathematical properties were delineated by Einstein and others. However, a Brownian particle can visit the same lattice point indefinitely, while two real atoms cannot occupy the same point in space. This gave rise to the concept of the self-avoiding random walk, a much more difficult mathematical problem. The 1950s saw the proliferation of electronic computers like the Illiac, that made it possible to simulate more realistic average polymer properties with self-avoiding walks, but it was quickly found that the proportion of walks without self-intersections decreased rapidly as the chain length (no. of steps) increased. A few clever biasing techniques involving re-starting a self-intersecting walk improved things slightly but these were only minor tweaks. In the 1960s Moti Lal, a chemist at Unilever Labs in the UK, became aware of Metropolis’s paper and applied it to the polymer problem. However, his available computing power (IBM 360/50) confined his polymers to 30 monomer units on a 2-dimensional spatial lattice. As a graduate student at NYU in 1968 I had access to a Control Data Corp. CDC 6600 Cray Supercomputer at the Courant Institute of Mathematical Sciences (they let us chemists use their facilities). I used the Metropolis-Lal method to generate polymers with free rotation about their backbone bonds, i.e., not restricted to an artificial lattice. I used your hand-written equation with Boltzmann weighting exp(-E/kT) to generate average geometric and thermodynamic properties from thousands of Monte Carlo samples. The Metropolis importance sampling algorithm generated samples from more “important” regions of polymer conformation space so it took fewer samples and far less computer time to get stable averages. I could even generate numerical distributions of end-to-end distances of polymers with sufficient accuracy to discriminate among several distribution functions proposed by theoreticians.
Awesome job at explaining the Metropolis-Hastings algorithm! For the hyperparameter tuning problem of finding the best half-width h of your proposal distribution there is the classic paper "Weak convergence and optimal scaling of random walk Metropolis algorithms" by A Gelman, WR Gilks, GO Roberts from 1997, where they assume a Gaussian proposal distribution and in the limit of a high dimensional parameter space, the optimal acceptance ratio comes out at ~0.23. So you can tune your h so that the average acceptance ratio is about 1/4 as a good heuristic to start with. It can get more complicated if the different parameter dimensions have vastly different optimal standard deviations for the gaussian proposal distribution and there are lots of elaborate papers about the topic that came afterwards, but it's a place to start and a nice simple heuristic.
There's a cool application of this underlying principle with markov chains in that you can coordinate uncoordinated agents without having to track or communicate with them in real time. Taking your Portland Example, you would get your 100 people stationed at each activity in roughly the correct proportions by giving them a list of the other activities and the probability that they should stay put or go to each other place. They'll all shuffle around each hour, but they'll adhere closely to the target. This has applications for things like, say, a swarm of Drones spelling out a word in the sky. Sure, you could calculate a specific 3d coordinate for each specific drone and transmit that info to each of them 1 by 1 every time you want to change the word. Or you could just broadcast the 'pixel' locations you want the drones to occupy, and how densely filled each pixel is, and let them individually, randomly pick which space to occupy. Especially if you let them constantly cycle their positions, it'll do a good job with the task.
Great video! At 6:35 it was said that one can take a continuous function and divide it by it’s integral to get a probability distribution. I think we are implicitly taking two more assumptions here to make it work: 1. The function is non-negative (which we can usually assume if it has a minimum, as we can just subtract the minimum to make it non-negative (a function is a pdf if it is non-negative and it integrates to 1)). But more importantly 2. The integral of the function needs to be a finite constant, dividing by infinity makes little sense. This is a given if the state space is bounded, but if it’s the real number line, it’s not. (The integral over the reals of exp(x) is divergent.)
Yes, but that seems to me like another way to say the function you're trying to convert to a pdf simply ... isn't suitable as one. To me it indicates that in deriving a function that you think should represent a probability space in some way - if you end up with negative numbers, or the area under the curve diverges to infinity - something has gone very wrong with your thinking or modeling. Possibly as simple as neglecting to specify that your actual problem domain is not infinite but has a known, fixed minimum and maximum.
@@ps.2 Yes, I just wanted to add this because - as we are in mathematics here - it's good to be precise about the assumptions we are making. If the continuous function is non-negative and its integral over the sample space Ⲭ exists then one can always do this normalisation (if I didn't miss anything). Otherwise, we'd need to make some adjustments (which will always be possible if Ⲭ is bounded). (afaik)
i didn’t really understand what was going on in the last part of the video with your hierarchical model. what should i search for to learn about that stuff?
I couldn’t get it to run with Metropolis’ original implementation, so I had to use another algorithm to do the analysis. I used Stan, which implements what’s called the algorithm called the No U-Turn Sampler. It’s an advanced descendent of Metropolis
Nice explanation of the algorithm! Looking at your code for the acceptance ratio, it appears you are computing the ratio of the log_posterior’s instead of computing their difference and exponentiating. Maybe adjusting that would improve performance? I worked on a project for which I implemented an MCMC sampler and also found it was difficult to tune it properly. One thing that helped was to use the adaptive Metropolis Hastings Algorithm from (Haario et al., 2001) so that the proposal distribution, a multivariate normal, is updated at each iteration using the covariance matrix of the previous draws. The variance term for each parameter should eventually settle at an appropriate value automatically without you having to play around with each one, which is nice when you’re dealing with a lot of them. I also found I needed a burn-in period of about 20,000 draws and about 70,000 total draws to get stable inference, so just running your sampler for longer may improve things. The computational burden is definitely a drawback of this algorithm. Thank goodness for Stan!
I like your style of explaining. You could maybe slow down a little during the math parts but pairing it with core made it understandable for me again as I am a software engineer by trade.
I heard Portland is pretty nice this time of year. I even think they did a show about Portland called Portlandia. Also based on what I've heard the "drinking craft brew" percentage should be increased by roughly 50 points.
I’m a student working with estimation of covariance data, this video helped me take a step further into applied Bayesian statistics. It had just the right amount of information for me. Very thankful! If you’re familiar with it, are there researchers linking Bayesian statistics with copulas ? (Since you seem very eclectic on statistics I would gladly push for a video on copulas ^^)
I’m just starting to get into copulas myself! Still starting out, but I’ve been working through the copula textbook by Czado, if that helps any. I do recall a small section being dedicated to Bayesian stuff, so it might be worth it to check it out
havent watched the entire thing but it reminds me of that thing where you can survey everyone a random thing (eg: how many jelly beans in a jar) and you'll get very close to the answer
10:51 : is the setting it to 1 on the right side, the min(1, •) part of the eventual formula obtained at 11:27 ? So, it will only get capped at 1 if it is the one that ordinarily wouldn’t be likely enough, and this is fine because of it being handled by the other side in that case?
Yea you have it about right. You might see other versions of the algorithm that don’t calculate a minimum and just the ratio. It’ll still be the same algorithm in the end so long as you generate an independent standard uniform.
So in context of solving the rendering equation there fairly recently were a couple of papers that upgraded Multiple Importance Sampling to Continuous Multiple Importance Sampling (which is often intractable), Stochastic Multiple Importance Sampling (which makes CMIS practical at the cost of, I think, higher variance) and Marginal Multiple Importance Sampling (which further broadens the scope of distributions it can handle) Ever since I heard those, knowing MCMC has this tradeoff involving the half-width parameter h to direct how well things are explored, I wondered if any of those techniques could be used to basically Multiple Importance Sample over all choices of h. Do you happen to know whether this is feasible?
I don't understand why the detailed balance is so strict(10:00), shouldn't the balance rule support the global static distribution instead of the local one? i.e. Maybe X goes to Y but Y doesn't go directly to X to balance it out, instead they go to Z for a few mins then to X and sit there for as long as needed to balance the global distribution.
In doing the research for the video, I saw that there are MCMC implementations that do use the weaker balance condition instead of detailed balance. But, I’m not sure as to why one would prefer one assumption over the other
But if we’re using Metropolis for importance sampling for MC estimation, we have to weight each contribution in the sum by it’s 1/probability, which means we still need to estimate the normalizing constant???
Statistics is so hard... I appreciate your effort and the quality of your work but I'm definitely not ready for this content. Still subscribed and liked though. I'll try checking your channel for simpler videos I might be able to follow. Statistics is tremendously useful to understand our world but I'm still a Calculus boy who enjoys vectors.
5:32 Does this mean we are not assuming independent values of the samples? or I am thinking too much like frequentist and you don't need to keep these assumptions in mind when using bayesian statistics? and/or these random number generation algorythm do not have any effect on the assumptions on the data?
Yes, in a Markov chain the samples aren't independent. The magic of Markov chains is that, despite the dependent samples, you can accurately estimate properties of the target distribution (given assumptions, etc.).
the most example of markov chain is weather. today's weather depends on yesterday. but it doesn't depend on days ago. because we already considered previous effects by the weather of yesterday. likewise this some various is related to its previous state then we can use markov chain. So it's totally different concept with the bayesian of course it could possible their stochastic distribution follows bayesian but im not studied this deeply so im not sure that
I would like to note that the metropolis algorithm is a bit different than the metropolis hastings algorithm, it is a more general case and has a different acceptance function
Dang! I got accepted to present at JSM but currently interning at NASA. I look forward to seeing you there next year in Nashville! Fantastic as always by the way, love your work!
lol I was partially convinced that no one actually looks for those. You can find it in this Github repository in the "metropolis.Rmd" file: github.com/very-normal/explained
a distribution that doesn’t change over time. For example, if I’m losing weight, the mean weight I’d see on my scale would change over time, so it wouldn’t be stationary. If I’m maintaining, then it would be stationary
Possible error: Your list of Portland activities doesn't include "going to Powell's Books". But maybe that's not something locals do, only tourists. But maybe that's not something tourists do either, only really nerdy tourists.
How are the population units initialised? You couldn't have every single person you ask have an identical preference, so how does the algorithm decide some people prefer coffee to hiking, wheras aothers prefer hiking to coffee?
My population is Portland locals, but I also I made a pretty strong assumption that all Portland locals would have the same preference distribution. If I assume this, then any local I ask would generally give me the same type of answers. Of course this isn’t really a reasonable assumption, just something to further the metaphor
Great video! I love the way you are explaining. Somehow hierarchical Bayesian models always confuse me. The choice of the model structure and the choice of the distributions often seems arbitrary to me. Do you have a good reference (or did you already make a video) that explains this? Keep up the good work (and check out my SoME video if you like 😁)
The heart of the Metropolis algorithm---the accept/reject step---is still widely used in physics simulations today, though typically we do not make such simple proposals. Hamiltonian Monte Carlo (also invented by physicists!), for example, still includes a Metropolis accept/reject test to guarantee that the generated distribution matches the target.
Some parts of the video were really good, but at some point you keep introducing new ideas before I had the chance to understand the previous ideas. When you moved on to implementing the algorithm on your like/view ratio I thought this would be an example that will help me understand what you discussed previously, but all of a sudden you started introduction priors and hyper priors and I just completely lost you. I feel like I almost learned something very interesting, and at the last moment I was left confused...
@@very-normal hey, all the other comments here are very positive, don't feel bad because of my opinion. I wrote my comment because I hope you'll pay more attention next time to what a novice might get from a video, but you tackled a specialized subject, and it's also fair to make content for people with a background
oh no sorry, I definitely took your comment to heart, the video’s not perfect. I saw what you were talking about and agreed it’s not the smoothest transition. I appreciate you taking the time to tell me
I don't understand, if this algorithm needs to already know the probability distribution to work then you don't really gain any new information from it. All it does is create fake data?
Even if you know the form of the distribution, it doesn’t mean you can derive anything useful from it, like an expectation or a credible interval. In easy problems you can do this straight from the PDF, but with harder ones you can’t do this. With MCMC, you generate a sample that you know comes from it, and then use it to calculate said expectations. It’s not “fake” data
I just wish you didnt use so much mathematical notation and used normal words. I majored in math and I still dont pick up the meaning that quickly. I dont find it difficult at all to translate concepts to normal language, but maybe I'm unusually gifted at it
@@very-normalSyun in Hillsboro to the West. Best sushi outside of Japan. Powell's book store. Le Pigeon on Burnside for slightly fancy food. Elephant deli for food stuff.
Your metropolis algorithm had drift because the hyperprior creates a prior distribution that is overparameterized. Despite saying the prior needs a hyper prior, it in fact does not. If you really wanted one, it would have been better fix the value of alpha+beta, rather than placing a hyper prior on it.
until the sample it was accesible - then you lost me; too many assumptions (seemingly not related to the actual metropolis algorithm) and too quick succession. a way simpler sample would be better for pedagogical reasons.
I dont like your system for finding the average likes. It will likely be dependent on how many views you have, or how many non-subscribers view. The model should be dependent on total views
lol no one will really care if a model with data from a statistics UA-cam channel is misspecified; there are lots of different ways to approach the modeling and I chose the one where I could use the data I had
I felt overwhelmed by this video. Too much information coming too quickly. I guess I’m not your target audience. By the end of it, I was wondering how it could be used in a real world example.
Sorry about that, the algorithm was also relatively new to me so I didn’t have the time to refine it further. I appreciate you gave some of your time to it tho You could get a more concrete real life example by replacing the videos in my examples with different groups of people who are taking a drug to cure a disease. What we’d be interested in is seeing how well the drug works overall. This type of model suits a Bayesian analysis, but it wouldn’t be possible to do it before algorithm like Metropolis.
Good there’s a million channels full of pop science already And if you genuinely had trouble with this video then pop science really has deluded the every man into thinking he’s actually intelligent
@@hypebeastuchiha9229 I feel that you seem to be spending energy to be condescending. Did I miss something? I think you might want to ask yourself, ‘how many friends do I have?’ Then maybe read a book on the subject. Feel free to get the last word in.
I did like and subscribe, but you were talking about it, so you should remove this video from future analysis and meta-analysis, but definitely include it for meta-meta-analysis
This is brilliant, as a physicist turned pseudo-statistician, the physics-motivated approach of MCMC makes more sense to me than the statistical approach.
Fun story: despite bearing his name, Nicholas Metropolis didn't contribute at all to the algorithm. The concept was originally developed by physicist Marshall Rosenbluth at Los Alamos for studying thermonuclear explosions. His wife Arianna Rosenbluth (also a physicist) turned the concept into code. The Rosenbluths are the true pioneers. They struggled with a consistent way to pick an appropriate acceptance function. One day having dinner with Edward Teller and his wife Augusta Teller, they described the problem, the implementation, and the challenge. The Tellers being a couple of physicist+mathematician suggested the obvious choice for a physicist: simply use the Boltzmann distribution. And the first MCMC algorithm was born. What did Metropolis do? He had the keys of the computer lab, and the alphabetical order used for naming authors made his last name the name of the algorithm. I personally call it the Rosenbluth algorithm.
JORGE!!! ❤
@@brian.westersauce 😮
"Having the keys" is important!
Other fun fact: they were actually inspired by the Buffon needle problem where one can determine pi by throwing needles on the ground.
Ps: funily enough it actually does not work directly on a computer because we need pi to generate the needle orientation (there is a way to go over that though)
That "coming clean" bit really highlights the difference between theory and reality; to operationalize Metropolis-Hastings requires a lot of extra work.
I really respect the way the 'like' and 'sub' requests weren't explicit. It was made clear they were also important and prioritized them by making their consideration a natural part of the exposition. Bravo.
Not a statistician, but a machine learning person -- one of the more fascinating things about this algorithm to me is you can use it to explore a solution space by turning it on its head and inventing a probability distribution that makes better solutions more likely. It's not super widely used but it has applications. Important work that deserves better visibility!
That’s cool! This channel’s eventually going to have to get into machine learning at some point, and it sounds like there’s some cool stuff I’ve never heard of
That’s cool! This channel’s eventually going to have to get into machine learning at some point, and it sounds like there’s some cool stuff I’ve never heard of
Sounds very interesting. When would you choose to use metropolis over other optimisation techniques like simulated annealing?
That was the best explanation of MCMC I have heard. I'm going to listen to it ten more times and see I start to understand. This stuff is really hard for people with no stats or higher math background, but it's fascinating and fun to learn, however slowly
Happy to see you covering this topic. MCMC is what made Bayesian statistics viable beyond conjugate priors and simple distributions. There's a lot more to dig into including Hamiltonian Monte Carlo (what NUTS is based on) and Gibbs sampling, as well as convergence metrics. Bayesian Data Analysis 3rd ed. (Gelman, Carlin, Stern, Dunson, Vehtari, & Rubin) is the go to book for this.
Is that book readable for someone who is just starting out in that area? Cause this got me hooked
@@mickaiba71 I'd say it's targeted towards someone with a Master's. There's a lot more besides just MCMC. It's really meant to be used more like a reference than something you read cover to cover. "Statistical Rethinking" by McElreath gives one of the best introductions to Bayesian Statistics out there. There's a few chapters on MCMC in that one as well.
@@berjonah110thanks a bunch, I’ll check it out!
@@mickaiba71 you can also tackle from the point of view of some of its applications. I had first learned of MH algorithm through learning about probabilistic graphical models. Bayesian Reasoning and Machine Learning by David Barber did a great job for me introducing and connecting these topics together
@@cameronkhanpour3002
Yea actually I am taking a course in statistical modelling right now 🙂↕️
Thank you also for the recommendation
I've never seen the acceptance rate expression explained/derived so intuitively - great job!
By the way, to get your RWM to work, judging by the traceplots it seems that you would have needed to add some sort of pre-conditioning to change the variance of the proposal distribution for each dimension of your target separately, since alpha and beta seem to be on a different "scale" to the p variables, which is why finding a common "h" that allowed the chain to mix fast enough in all directions without getting stuck would be borderline impossible. RWM (even with a Gaussian proposal rather than uniform) tends to struggle with hierarchical models and high-dimensional models more generally because of this idea, but thankfully we have things like STAN using HMC. Would love to see an explainer on NUTS/HMC in your explanatory style if you ever feel like it!
yeah, I saw someone else implement a hierarchical beta-binomial, but they chose to update alpha and beta separately. Based on what you said, I can definitely see why now. At least it was a good learning experience on my part 😅
@@very-normal Learning the "hard way" is almost always the most valuable in that regard :)
BRILLIANT EXPLANATION! FINALLY I UNDERSTOOD WHERE THE INTEGRATION WENT!!! THANK YOU :)
The audio quality on your videos is progressively better. Keep up the good work!
Thank you for a stimulating video on an important statistical method. The hand-written equation appearing on-screen from 0:10 to 0:22 and briefly captioned “Despite how it looks, this is just an average,” is a fundamental building block of thermodynamics. It was at the core of my doctoral dissertation project in polymer physical chemistry in the 1960s. A generation earlier the average geometric properties of a polymer such as its end-to-end distance or radius of gyration (average distance of component atoms from center of mass) were simulated with pencil and paper as a Brownian motion type random walk with a fixed step length representing the distance between atoms along the connected chain, often constrained to a 2- or 3-dimensional spatial lattice. Their mathematical properties were delineated by Einstein and others. However, a Brownian particle can visit the same lattice point indefinitely, while two real atoms cannot occupy the same point in space. This gave rise to the concept of the self-avoiding random walk, a much more difficult mathematical problem. The 1950s saw the proliferation of electronic computers like the Illiac, that made it possible to simulate more realistic average polymer properties with self-avoiding walks, but it was quickly found that the proportion of walks without self-intersections decreased rapidly as the chain length (no. of steps) increased. A few clever biasing techniques involving re-starting a self-intersecting walk improved things slightly but these were only minor tweaks.
In the 1960s Moti Lal, a chemist at Unilever Labs in the UK, became aware of Metropolis’s paper and applied it to the polymer problem. However, his available computing power (IBM 360/50) confined his polymers to 30 monomer units on a 2-dimensional spatial lattice. As a graduate student at NYU in 1968 I had access to a Control Data Corp. CDC 6600 Cray Supercomputer at the Courant Institute of Mathematical Sciences (they let us chemists use their facilities). I used the Metropolis-Lal method to generate polymers with free rotation about their backbone bonds, i.e., not restricted to an artificial lattice. I used your hand-written equation with Boltzmann weighting exp(-E/kT) to generate average geometric and thermodynamic properties from thousands of Monte Carlo samples. The Metropolis importance sampling algorithm generated samples from more “important” regions of polymer conformation space so it took fewer samples and far less computer time to get stable averages. I could even generate numerical distributions of end-to-end distances of polymers with sufficient accuracy to discriminate among several distribution functions proposed by theoreticians.
"Eniac"... (not Illiac).
Regards.
Finally the MCMC video! I like your explanation of the acceptance function and the comind clean section
You explain everything so clearly, It's crazy
Awesome job at explaining the Metropolis-Hastings algorithm! For the hyperparameter tuning problem of finding the best half-width h of your proposal distribution there is the classic paper "Weak convergence and optimal scaling of random walk Metropolis algorithms" by A Gelman, WR Gilks, GO Roberts from 1997, where they assume a Gaussian proposal distribution and in the limit of a high dimensional parameter space, the optimal acceptance ratio comes out at ~0.23. So you can tune your h so that the average acceptance ratio is about 1/4 as a good heuristic to start with. It can get more complicated if the different parameter dimensions have vastly different optimal standard deviations for the gaussian proposal distribution and there are lots of elaborate papers about the topic that came afterwards, but it's a place to start and a nice simple heuristic.
👏👏👏
excellent lecture.
Thank you, Doctor, your scientists and your colleagues.
and
With luck and more power to you.
hoping for more videos.
There's a cool application of this underlying principle with markov chains in that you can coordinate uncoordinated agents without having to track or communicate with them in real time.
Taking your Portland Example, you would get your 100 people stationed at each activity in roughly the correct proportions by giving them a list of the other activities and the probability that they should stay put or go to each other place. They'll all shuffle around each hour, but they'll adhere closely to the target.
This has applications for things like, say, a swarm of Drones spelling out a word in the sky. Sure, you could calculate a specific 3d coordinate for each specific drone and transmit that info to each of them 1 by 1 every time you want to change the word. Or you could just broadcast the 'pixel' locations you want the drones to occupy, and how densely filled each pixel is, and let them individually, randomly pick which space to occupy.
Especially if you let them constantly cycle their positions, it'll do a good job with the task.
Dude, I love this. Keep doing what your doing.👍
Thank you for the video! Extremely insightful and well done. Keep up the good work!
Great video!
At 6:35 it was said that one can take a continuous function and divide it by it’s integral to get a probability distribution.
I think we are implicitly taking two more assumptions here to make it work:
1. The function is non-negative (which we can usually assume if it has a minimum, as we can just subtract the minimum to make it non-negative (a function is a pdf if it is non-negative and it integrates to 1)).
But more importantly
2. The integral of the function needs to be a finite constant, dividing by infinity makes little sense. This is a given if the state space is bounded, but if it’s the real number line, it’s not. (The integral over the reals of exp(x) is divergent.)
Yes, but that seems to me like another way to say the function you're trying to convert to a pdf simply ... isn't suitable as one. To me it indicates that in deriving a function that you think should represent a probability space in some way - if you end up with negative numbers, or the area under the curve diverges to infinity - something has gone very wrong with your thinking or modeling. Possibly as simple as neglecting to specify that your actual problem domain is not infinite but has a known, fixed minimum and maximum.
@@ps.2 Yes, I just wanted to add this because - as we are in mathematics here - it's good to be precise about the assumptions we are making.
If the continuous function is non-negative and its integral over the sample space Ⲭ exists then one can always do this normalisation (if I didn't miss anything).
Otherwise, we'd need to make some adjustments (which will always be possible if Ⲭ is bounded). (afaik)
i didn’t really understand what was going on in the last part of the video with your hierarchical model. what should i search for to learn about that stuff?
I couldn’t get it to run with Metropolis’ original implementation, so I had to use another algorithm to do the analysis. I used Stan, which implements what’s called the algorithm called the No U-Turn Sampler. It’s an advanced descendent of Metropolis
Grest video, I hope see the explanation of Gibs Sampling 🙌🏻
Nice explanation of the algorithm!
Looking at your code for the acceptance ratio, it appears you are computing the ratio of the log_posterior’s instead of computing their difference and exponentiating. Maybe adjusting that would improve performance?
I worked on a project for which I implemented an MCMC sampler and also found it was difficult to tune it properly. One thing that helped was to use the adaptive Metropolis Hastings Algorithm from (Haario et al., 2001) so that the proposal distribution, a multivariate normal, is updated at each iteration using the covariance matrix of the previous draws. The variance term for each parameter should eventually settle at an appropriate value automatically without you having to play around with each one, which is nice when you’re dealing with a lot of them. I also found I needed a burn-in period of about 20,000 draws and about 70,000 total draws to get stable inference, so just running your sampler for longer may improve things. The computational burden is definitely a drawback of this algorithm. Thank goodness for Stan!
I like your style of explaining. You could maybe slow down a little during the math parts but pairing it with core made it understandable for me again as I am a software engineer by trade.
At 6:34: your f(X) must be non negative. Also, care must be taken about how f(X) behaves at infinity (see heavy tail distributions).
I heard Portland is pretty nice this time of year. I even think they did a show about Portland called Portlandia. Also based on what I've heard the "drinking craft brew" percentage should be increased by roughly 50 points.
This algorithm is amazing AF. I'm gonna use it in the travel example for sure
I’m a student working with estimation of covariance data, this video helped me take a step further into applied Bayesian statistics. It had just the right amount of information for me. Very thankful!
If you’re familiar with it, are there researchers linking Bayesian statistics with copulas ?
(Since you seem very eclectic on statistics I would gladly push for a video on copulas ^^)
I’m just starting to get into copulas myself! Still starting out, but I’ve been working through the copula textbook by Czado, if that helps any. I do recall a small section being dedicated to Bayesian stuff, so it might be worth it to check it out
havent watched the entire thing but it reminds me of that thing where you can survey everyone a random thing (eg: how many jelly beans in a jar) and you'll get very close to the answer
Great video. I really like the walkthorugh of the technicals details. A nearby topic could be the Gibbs-sampling? :)
7:08 - 7:17 is top quality
10:51 : is the setting it to 1 on the right side, the min(1, •) part of the eventual formula obtained at 11:27 ?
So, it will only get capped at 1 if it is the one that ordinarily wouldn’t be likely enough, and this is fine because of it being handled by the other side in that case?
Yea you have it about right. You might see other versions of the algorithm that don’t calculate a minimum and just the ratio. It’ll still be the same algorithm in the end so long as you generate an independent standard uniform.
So in context of solving the rendering equation there fairly recently were a couple of papers that upgraded Multiple Importance Sampling to Continuous Multiple Importance Sampling (which is often intractable), Stochastic Multiple Importance Sampling (which makes CMIS practical at the cost of, I think, higher variance) and Marginal Multiple Importance Sampling (which further broadens the scope of distributions it can handle)
Ever since I heard those, knowing MCMC has this tradeoff involving the half-width parameter h to direct how well things are explored, I wondered if any of those techniques could be used to basically Multiple Importance Sample over all choices of h. Do you happen to know whether this is feasible?
I like how at the end you first rejected the transition from "That's it for this one." to the next line.
totally planned on my part
What does lowercase yi represent?
I don't understand why the detailed balance is so strict(10:00), shouldn't the balance rule support the global static distribution instead of the local one?
i.e. Maybe X goes to Y but Y doesn't go directly to X to balance it out, instead they go to Z for a few mins then to X and sit there for as long as needed to balance the global distribution.
In doing the research for the video, I saw that there are MCMC implementations that do use the weaker balance condition instead of detailed balance. But, I’m not sure as to why one would prefer one assumption over the other
But if we’re using Metropolis for importance sampling for MC estimation, we have to weight each contribution in the sum by it’s 1/probability, which means we still need to estimate the normalizing constant???
Very cool Fact to learn GREAT VIDEO! Liked and subscribed ❤
Next gibbs sampling, exponential families
Statistics is so hard... I appreciate your effort and the quality of your work but I'm definitely not ready for this content. Still subscribed and liked though. I'll try checking your channel for simpler videos I might be able to follow. Statistics is tremendously useful to understand our world but I'm still a Calculus boy who enjoys vectors.
5:32 Does this mean we are not assuming independent values of the samples? or I am thinking too much like frequentist and you don't need to keep these assumptions in mind when using bayesian statistics? and/or these random number generation algorythm do not have any effect on the assumptions on the data?
Yes, in a Markov chain the samples aren't independent. The magic of Markov chains is that, despite the dependent samples, you can accurately estimate properties of the target distribution (given assumptions, etc.).
Oh, and this isn't really anything to do with frequentist vs. Bayesian viewpoints.
the most example of markov chain is weather. today's weather depends on yesterday. but it doesn't depend on days ago. because we already considered previous effects by the weather of yesterday. likewise this some various is related to its previous state then we can use markov chain. So it's totally different concept with the bayesian
of course it could possible their stochastic distribution follows bayesian but im not studied this deeply so im not sure that
@@AubreyBarnard thanks
@@minjae92 thanks
He can't keep getting away with it! (making these cool videos)
its sort of like the way u can break down signals into sin and cos waves u can break samples into well behaved distributions
I would like to note that the metropolis algorithm is a bit different than the metropolis hastings algorithm, it is a more general case and has a different acceptance function
oh!! please make a Markov's chains video next!. Some ppl doesn't know that they even exist.
I have no clue what is happening
this was me two weeks ago
Dang! I got accepted to present at JSM but currently interning at NASA. I look forward to seeing you there next year in Nashville! Fantastic as always by the way, love your work!
the acceptance function looks more like a "likelihood" than a "probability"
Just accept that “guaranteed” means “most likely”, in any context. Yes, that can be a huge struggle.
Fantastic Spongebob reference!
The code and data are not in the description?
lol I was partially convinced that no one actually looks for those. You can find it in this Github repository in the "metropolis.Rmd" file: github.com/very-normal/explained
I hope you got a chance to go to Powell's!
powells was amazing, i got lost
whats a stationary distribution
a distribution that doesn’t change over time. For example, if I’m losing weight, the mean weight I’d see on my scale would change over time, so it wouldn’t be stationary. If I’m maintaining, then it would be stationary
Have you tried brms as a frontend for stan??? It makes it MUCH easier, you should try it out!!
YES! I try to use brms when I can, but this was thankfully easy enough to implement
STAN MENTIONED !!!!
Nice video. You should explain NUTS next.
Possible error: Your list of Portland activities doesn't include "going to Powell's Books".
But maybe that's not something locals do, only tourists.
But maybe that's not something tourists do either, only really nerdy tourists.
Or really nerdy locals?
I’ll check if there are any nerdy tourists here and report back
How are the population units initialised? You couldn't have every single person you ask have an identical preference, so how does the algorithm decide some people prefer coffee to hiking, wheras aothers prefer hiking to coffee?
My population is Portland locals, but I also I made a pretty strong assumption that all Portland locals would have the same preference distribution. If I assume this, then any local I ask would generally give me the same type of answers. Of course this isn’t really a reasonable assumption, just something to further the metaphor
Nice 🎉
7:10 missing ( literally unwatchable. Joking great video
you caught me, im a fraud
Great video! I love the way you are explaining. Somehow hierarchical Bayesian models always confuse me. The choice of the model structure and the choice of the distributions often seems arbitrary to me. Do you have a good reference (or did you already make a video) that explains this? Keep up the good work (and check out my SoME video if you like 😁)
The heart of the Metropolis algorithm---the accept/reject step---is still widely used in physics simulations today, though typically we do not make such simple proposals. Hamiltonian Monte Carlo (also invented by physicists!), for example, still includes a Metropolis accept/reject test to guarantee that the generated distribution matches the target.
Thnx
Is it just me or y’all hear “halfwit” every time he said “half-width” 😂😂😂
lol you ain’t wrong
EM algo plz
this is my current nemesis 💀
Statistics: the mathematics of eventually.
Some parts of the video were really good, but at some point you keep introducing new ideas before I had the chance to understand the previous ideas.
When you moved on to implementing the algorithm on your like/view ratio I thought this would be an example that will help me understand what you discussed previously, but all of a sudden you started introduction priors and hyper priors and I just completely lost you.
I feel like I almost learned something very interesting, and at the last moment I was left confused...
rip
@@very-normal hey, all the other comments here are very positive, don't feel bad because of my opinion.
I wrote my comment because I hope you'll pay more attention next time to what a novice might get from a video, but you tackled a specialized subject, and it's also fair to make content for people with a background
oh no sorry, I definitely took your comment to heart, the video’s not perfect. I saw what you were talking about and agreed it’s not the smoothest transition. I appreciate you taking the time to tell me
An Edward Teller paper. 😀
I don't understand, if this algorithm needs to already know the probability distribution to work then you don't really gain any new information from it. All it does is create fake data?
Even if you know the form of the distribution, it doesn’t mean you can derive anything useful from it, like an expectation or a credible interval. In easy problems you can do this straight from the PDF, but with harder ones you can’t do this. With MCMC, you generate a sample that you know comes from it, and then use it to calculate said expectations. It’s not “fake” data
I just wish you didnt use so much mathematical notation and used normal words. I majored in math and I still dont pick up the meaning that quickly. I dont find it difficult at all to translate concepts to normal language, but maybe I'm unusually gifted at it
you’d probably be better than me tbh, you should use your talent with SoME! We need more good math content in yt
I'm a local in Portland. I nevee go hiking. I drink a lot of coffee though.
do you have a few go-tos you like?
@@very-normalSyun in Hillsboro to the West. Best sushi outside of Japan. Powell's book store. Le Pigeon on Burnside for slightly fancy food. Elephant deli for food stuff.
Am Portlander. Can Agree. Drink Coffee.
Don't ask the bartender!
Your metropolis algorithm had drift because the hyperprior creates a prior distribution that is overparameterized. Despite saying the prior needs a hyper prior, it in fact does not. If you really wanted one, it would have been better fix the value of alpha+beta, rather than placing a hyper prior on it.
it worked fine with Stan tho lol
did it tho lol
So you're saying I can roll a RNG that tells me to study 90% of the time and then I'll stop procrastinating?
the logic is ironclad
15:37 so your are telling me you NUTTED your data?
that’s the technical term for it, yes
❤❤❤❤❤
alpha and beta too big, try between 0.1 and 10
until the sample it was accesible - then you lost me; too many assumptions (seemingly not related to the actual metropolis algorithm) and too quick succession. a way simpler sample would be better for pedagogical reasons.
I dont like your system for finding the average likes. It will likely be dependent on how many views you have, or how many non-subscribers view. The model should be dependent on total views
lol no one will really care if a model with data from a statistics UA-cam channel is misspecified; there are lots of different ways to approach the modeling and I chose the one where I could use the data I had
Can you explain the Deez NUTS algo next? I'm trying to understand testicular torsion
Bruhhh
63
Lots of drugs in Portland
hi
hey
I felt overwhelmed by this video. Too much information coming too quickly. I guess I’m not your target audience. By the end of it, I was wondering how it could be used in a real world example.
Sorry about that, the algorithm was also relatively new to me so I didn’t have the time to refine it further. I appreciate you gave some of your time to it tho
You could get a more concrete real life example by replacing the videos in my examples with different groups of people who are taking a drug to cure a disease. What we’d be interested in is seeing how well the drug works overall. This type of model suits a Bayesian analysis, but it wouldn’t be possible to do it before algorithm like Metropolis.
I like the feeling of drowning with information not understanding most of it like a masochistic tendency.
@@very-normalI think you should link people to a video on bayesian statistics at the start of such a video
Good there’s a million channels full of pop science already
And if you genuinely had trouble with this video then pop science really has deluded the every man into thinking he’s actually intelligent
@@hypebeastuchiha9229 I feel that you seem to be spending energy to be condescending. Did I miss something? I think you might want to ask yourself, ‘how many friends do I have?’ Then maybe read a book on the subject. Feel free to get the last word in.
Props do t makes sense, more like 80% doing nothing 😊
far too many (hyper-)parameters to work IRL. overfitting nightmare.
lol how many parameters would you recommend
I did like and subscribe, but you were talking about it, so you should remove this video from future analysis and meta-analysis, but definitely include it for meta-meta-analysis
will definitely consider it for the metameta-analysis