Crash Course on Monte Carlo Simulation
- Published Feb 7, 2025
- 5 years of statistical trial and error summarized in 30 minutes.
If you want the code, let me know in the comments
OTHER CHANNEL LINKS
🗞️ Substack - verynormal.sub...
☕ Buy me a Ko-fi! - ko-fi.com/very...
🏪 My digital products - very-normal.se...
Code for this video: github.com/very-normal/videos
ADDENDUM:
- At 14:14, I'm showing the wrong code to compile all of the results. The correct code can be found in the Github link.
Thank you very much. I asked ChatGPT to translate the R code into Python, since I can read R but need a lot of help writing it. Fun fact: the Python code (3.12.2) with the same for loop was 10 times slower. Converting to full vectorisation with [from scipy.stats import ttest_ind] made it 4 times faster than R.
Can’t believe this content is free 😭, thanks so much for this kind of valuable content.
same 💖
🫡
Great video. Seeing code was really a turning point for me in understanding statistics, making the formulas and charts in textbooks more tangible. Now seeing the differences at 27:28 makes me want to dive deeper into the tests to see what's causing these differences.
One thing to keep in mind about saving the results is that disk operations are orders of magnitude slower than ordinary computation. Depending on the complexity of the calculation, you might be better off saving an intermediate result, like the chunked mean or something similar.
Not sure if this is possible in R but you can also write the results in a non-blocking way (in a separate dedicated worker thread for example).
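If it helps, a rough sketch of batching in R (the numbers, the t-test, and the file name here are made up for illustration, not the video's actual code):
  n_sims <- 10000
  results <- vector("list", n_sims)
  for (i in seq_len(n_sims)) {
    x <- rnorm(27)
    y <- rnorm(27, mean = 0.5)
    results[[i]] <- data.frame(sim = i, p = t.test(x, y)$p.value)  # keep each result in memory
  }
  saveRDS(do.call(rbind, results), "chunk_01.rds")  # one disk write instead of n_sims writes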
Fantastic exposition. You have inspired me to spend more time on Monte Carlo simulations.
I've been trying to teach myself simulation-based calibration for a while now and this video is honestly the exact tutorial I needed. Not having to figure all this out on my own saved me so much time (and pain). Can't thank you enough!
This is a superb video… some more R programming content would be marvellous!!!
Amazing man! Keep it up, this is just the kind of videos that people like me need to get into the field.
Waited for soooo long for a good MC explanation video (series). YT is surprisingly poor on this topic. Thank you!
Hi Christian, great video! I love your explanations and have been following for a little while. In this implementation, I couldn't help but feel like there are some significant optimizations / tool choices that would drastically improve your runtime speed here.
Firstly, I think writing each result to a separate file is going to lead to significant I/O overhead that likely isn't trivial relative to the overall time.
Another idea is switching to something like JAX in Python, where you can easily compile and parallelize your code. I took a quick crack at reimplementing the array of tests you're doing here and on my personal computer was able to run this in (2.5s compile time + 3.0s runtime) 5.5 seconds with 1e6 samples per permutation of shift, distribution, and test type.
I know JAX can be a little daunting at first, but I highly recommend getting over the initial hump, since you can really scale up your analysis!
I can definitely see the I/O getting in the way with such a large number of simulations. I even feel it when I need to read everything back in and compile it for making graphs/tables 💀 I haven't heard of JAX before, but I think it'd be worth looking into. Thanks for the recommendation, I appreciate you taking the time out to tell me
@@very-normal JAX is a machine learning framework created by Google (was used to train Gemini). It has increasing use in probabilistic programming, which is what I use it for in my research. Can’t recommend it enough.
Would you be so kind to share your code, or point us to helpful guidance to convert the code to JAX?
do you use JAX with GPU?
@@ElJuanchoCarrancho No, not yet. I'd heard of JAX, but wasn't forced to jump over the learning hurdle.
I've been diving into Rand Wilcox's robust estimation textbook and it's been a really compelling statistical book. It presents the argument that we care most about the core of the data, so we can use trimmed means and winsorized variances (for those that don't know those terms, you cut the outer percentiles for the mean, and for the variance, you set the outer percentiles to the new min and max from trimming). This seems to allow the robustness of rank statistics while maintaining the interpretability of regular statistics. I've worked with it on some of my data and it seems to capture similar patterns as the rank statistics but I can compute effect sizes and communicate it more effectively.
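For anyone who wants to try it, a minimal R sketch (the 20% trim level and the win_var helper are my own illustration, not code from Wilcox's book):
  x <- c(rnorm(27), 50)                 # a sample with one wild outlier
  mean(x)                               # ordinary mean gets dragged toward the outlier
  mean(x, trim = 0.2)                   # 20% trimmed mean: drop the outer 20% on each side
  win_var <- function(x, tr = 0.2) {    # winsorized variance: clamp the tails to the trim cutpoints
    q <- quantile(x, c(tr, 1 - tr))
    var(pmin(pmax(x, q[1]), q[2]))
  }
  win_var(x)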
Also, DO NOT RUN COMPUTATIONS ON THE LOGIN NODE. That is the ONE WAY you can break a cluster. I cannot emphasize that enough xD The amount of times I've had to postpone work because some person somewhere in the country ran a "10 minute job" on a login node....
Also, I know there are tools to help write SLURM files for you; just remember your own cluster will have its own limits. It's also good practice to request the minimum CPU/memory necessary (usually the two are tied, so you may as well use all the memory you are allowed), since these are shared systems.
And I will testify to the value of clusters. I took an analysis that would have taken 6 months on a high end consumer-grade desktop and shrunk it to a day and a half using cluster resources. If you are doing research/big data anything, this is a critical skill to your success.
Awesome video! As someone else who has had to self-learn Slurm, your explanation was great. I think if you focus a video more on Slurm and using the code then you'd have a pretty successful video on your hands!
Would be happy to help out with it as well.
Fantastic stuff! I wish this had been available 7 years ago when I did my PhD haha!
Level 3 actually gets some computer science involved, but I think that should be the case for anyone working with data these days. Super funny vim mention btw, and thanks for the great content!
Great video, I can now see where the power of Bayesian statistics comes from, allowing you to run a simulation from a single model.
Thanks! But just as a heads up, none of the methodology here is Bayesian 😅
@very-normal that's true, but the underlying concept is widely used in Bayesian models (unless you're working with a sequential Gibbs sampler).
Python has common sampling distributions in the "random" module of the standard library. Numpy/Scipy have a much wider selection of distributions though and will usually be faster if you know how to use them.
I created an R package, written in C++, where I (tried to) unify all the probability distributions, so you can calculate the density, CDF, and quantiles, and do random number generation. I wanted to let end users create their own random variables, which is useful for probability theory, but that isn't added yet. I can't release it publicly yet because it isn't optimized and I don't know how to run the code on a GPU for now.
Edit: This is only a quarter true. I've only done the RNG part so far.
You can simulate any distribution if you "know how" just as you can do it more efficiently. The main problem for me with Python is the lack of backwards compatibility. It is unnecessarily complicated to debug old code that used to work unless "you know what you are doing".
Fantastic video as always. The subject is fascinating, the pedagogy is top tier, and the overall quality is excellent. My only constructive criticism: it's a bit unfortunate not to discuss the possibility of deriving uncertainty estimates when doing MC simulation; imo it's one of its biggest advantages.
That’s true, I should’ve brought in a standard error estimate, that would’ve been relatively easy to do. Thanks for pointing it out
Amazing video. Thanks a million!
I really love your videos. If there is anything that I can do to help you produce them, I would love to do so (for example translation, since I am German, or writing short code chunks for viewers to download and play with).
Thank you for a great video
This content is amazing, like incredible amazing
Thanks!
You should try Julia. I suspect you’d love it for this kind of work.
Great video! Love your stuff!
When you're displaying code blocks, and transitioning between them, could you not morph the entire block? It makes it impossible to track what's being changed.
I love the way you talk about these things!
Thanks! Yeah, I’m not a big fan of how manim handles the code changes either. I’ll try to look into this more or get better at an alternative way of showing code!
Great video, thanks. I really like the aesthetics of your channel. It’s clean and tidy.
Can you recommend some resources to get a deep understanding of statistics? I have no background in mathematics, but I'm self-studying, because I feel like intuition alone is not enough to excel at the subject. I would like to self-study statistics from scratch as well, and do a lot of problems and calculations myself. Like a good entry-level textbook, or an online resource. Thanks!
To answer your recommendation question, I recommend Fundamentals of Biostatistics by Rosner. I feel that biostat is convenient because the examples are easier to grasp, so that should serve you well. I believe you'll be able to find it online, so you don't need to worry about the edition.
To be honest, I feel that a deep understanding on statistics does require some form of mathematical background. There are many areas in statistics where you’ll wonder why something is the way it is or why advice like “sample size of 30” exists, and it almost always boils down to a mathematical detail.
That being said, I think it's still perfectly acceptable to be adept at models that are commonly used in the knowledge area you work in. Understanding the math is not necessarily the same as the ability to translate what different models mean in different situations, and this is often the skill that people need more.
Very nice!
Great video bro! Fuck yeahh i love this so much
If speed were a major concern, you could use matrix algebra to perform all of these tests much more quickly, though you would need to write your own versions of t.test() and wilcox.test(). This is generally going to be more performant than immediately resorting to parallelization. And this may be more of a personal preference, but you could use switch() instead of the separate if statements for the hypothesis_test dispatch (see the sketch below).
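A minimal sketch of what that dispatch could look like (the hypothesis_test name and its arguments are hypothetical, not the video's code):
  hypothesis_test <- function(test, x, y) {
    switch(test,
      student = t.test(x, y, var.equal = TRUE),
      welch   = t.test(x, y, var.equal = FALSE),
      wilcox  = wilcox.test(x, y)
    )
  }
  hypothesis_test("welch", rnorm(27), rnorm(27, mean = 0.5))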
There is also the dqrng package for fast random sampling and generating values from probability distributions, if that were to become a bottleneck.
@@azure-hawk No Cauchy distribution, though. But yeah, dqrng is fast because it supports PCG64 and Xoroshiro128++.
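If anyone wants to try it, a quick sketch of swapping it in (the sample size here is arbitrary):
  library(dqrng)
  dqset.seed(42)
  x <- dqrnorm(1e6)    # fast drop-in for rnorm()
  y <- rcauchy(1e6)    # heavy-tailed draws still have to come from base R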
By the way, at 14:14, I do not think that code shows how to read in the results of the simulations. Rather, it shows the code that runs 10,000 simulations of the non-parallelized kind.
Oops you’re right, that’s my bad. I’ll add an addendum to the comments when I get back to my desktop. Thanks for the heads up!
"iykyk" \*shows sea lion central where ive walked to from my dorms\* oh hey, turns out we go to the same school!
🫡
Thanks
Monte Carlo Simulation Situation is Crazy
We have assumed different statistical models. One is the one that generates the data, and the others are the ones assumed by the hypothesis tests. However, the Monte Carlo simulation evaluates only the statistical models of the hypothesis tests, right?
Yes, that's right! Just the hypothesis tests.
You can embed your R script in the submit script using heredocs, assuming you're using bash.
How do you make the animations? Which software are you using? Really nice visualizations!
Thanks! I use manim to make most of the math-based animations and for the code transitions
@@very-normal Beautiful, may I know how long it took you to make this video?
Ooof this one took maybe 10 hours to edit? It’s longer than my usual videos
@ Thank you again for your contribution to the world!
I would love to have the code
A very interesting video! I can't help but wonder, though, whether it would be possible to run the third simulation on a laptop in a reasonable amount of time. Maybe I'm missing something, but shouldn't it be 180 000 * 27 * 2 samples of the distributions and then some math on those from which the final result can be derived? The number of calculations will probably be in the hundreds of millions, but that shouldn't be too bad if I imagine writing some multithreaded C++. Might take a couple of seconds. Am I dreaming?
You’re right! It’s totally possible to still do the third one on a personal machine.
The guiding example I chose here isn’t that good for demonstrating how much of a speed up you can get from the HPC, but the essential structure of the simulation should be about the same. That’s what I was hoping to get across and that people could twist it to their needs
@very-normal Yep. Of course there will be a point where it's no longer possible or practical to run an experiment locally.
I think there's an editing hiccup just before 5:07
Is changing the shift and other parameters called a sensitivity analysis?
I hadn’t thought about that, but you’re right, a sensitivity analysis is a very similar idea to using different parameters in a simulation study.
The only nuance I can think of is that a sensitivity analysis is more about doing the same analysis/model on different subsets of data that's already been collected to see how it would affect results.
On the other hand, a simulation study is usually done with the purpose of planning an analysis, before any data is collected.
This is great and all, but view count distributions are lognormal
Hmmm... I think it's great
i want code
what's the magic word
nah, it's R
Bro I think you are wrong, it's 2 sigma per side at 0.95 confidence
Oops yeah you're right, thanks for catching that lol. I'll add an addendum for this.
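For anyone following along, the quick check in R:
  qnorm(0.975)   # ~1.96, i.e. roughly 2 standard deviations on each side for 95% coverage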
500th like.
Really cool video hahaha
I don't understand the conclusion. Could you rephrase it? 27:07
The t-tests work better than the nonparametric test when all the assumptions needed to support the t-test are met. Furthermore, there's not much of a difference between Student's and Welch's t-test. But this study shows that their power is severely decreased in the presence of outliers, which are a natural occurrence on YouTube. In either case of data-generating distribution, power is higher when the two groups are more different, since it's much easier to detect the difference in their means/typical values.
Since the power of the nonparametric test is not that much lower even in the Normal setting, it's my opinion that it's a safer test to use if I want to analyze this type of data in the future.
@@very-normal The equal-variance (Student) t-test is only going to outperform Welch's when the sample sizes are really small, which is where you get the benefit of pooling the variances. If you run these simulations with n=3 and data coming from normal distributions, that's where the pooled test will really shine compared to the others. With n=27, you'll be able to estimate each variance separately very well anyway, so you don't really get an advantage from pooling, which is why the power is the same. Making sample size (in addition to shift) a variable in the simulation would let you see when the pooled test outperforms the unpooled test. Adding various degrees of heteroskedasticity would also probably highlight the reverse cases where the unpooled test does better.
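A rough sketch of that comparison (n = 3 per group and the shift of 1 are arbitrary choices, not from the video):
  set.seed(1)
  B <- 5000
  reject <- replicate(B, {
    x <- rnorm(3)
    y <- rnorm(3, mean = 1)
    c(pooled = t.test(x, y, var.equal = TRUE)$p.value < 0.05,
      welch  = t.test(x, y, var.equal = FALSE)$p.value < 0.05)
  })
  rowMeans(reject)   # empirical power of each test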
Too much work mate, just run a permutation test and call it a day
Too bad this is done using R…
R is most commonly used among actual statisticians.