Crash Course on Monte Carlo Simulation

  • Published Feb 7, 2025
  • 5 years of statistical trial and error summarized in 30 minutes.
    If you want the code, let me know in the comments
    OTHER CHANNEL LINKS
    🗞️ Substack - verynormal.sub...
    ☕ Buy me a Ko-fi! - ko-fi.com/very...
    🏪 My digital products - very-normal.se...

COMMENTS • 84

  • @very-normal
    @very-normal  5 days ago +6

    Code for this video: github.com/very-normal/videos
    ADDENDUM:
    - At 14:14, I'm showing the wrong code to compile all of the results. The correct code can be found in the Github link.

    • @bokehbeauty
      @bokehbeauty 4 days ago

      Thank you very much. I asked ChatGPT to translate the R code into Python, since I understand R but need a lot of help writing it. Fun fact: the Python code (3.12.2) with the same for loop was 10 times slower. Converting it to full vectorisation with [from scipy.stats import ttest_ind] made it 4 times faster than R.

  • @esosaagbongiator8870
    @esosaagbongiator8870 8 days ago +45

    Can’t believe this content is free 😭, thanks so much for this kind of valuable content.

  • @d_b_
    @d_b_ 8 days ago +14

    Great video. Seeing code was really a turning point for me in understanding statistics, making the formulas and charts in textbooks more tangible. Now seeing the differences at 27:28 makes me want to dive deeper into the tests to see what's causing these differences.

  • @berjonah110
    @berjonah110 7 days ago +13

    One thing to keep in mind about saving the results is that disk operations are orders of magnitude slower than ordinary computation. Depending on the complexity of the calculation, you might be better off saving an intermediate result, like the chunked mean or something similar.
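
    A minimal R sketch of that idea (object names here are illustrative, not the video's code): accumulate per-iteration results in memory and write one combined file at the end, instead of one file per iteration.

      # hypothetical sketch: aggregate in memory, write to disk once
      n_sims <- 1000
      results <- vector("list", n_sims)
      for (i in seq_len(n_sims)) {
        x <- rnorm(27)
        y <- rnorm(27, mean = 0.5)
        results[[i]] <- data.frame(sim = i, p_value = t.test(x, y)$p.value)
      }
      combined <- do.call(rbind, results)
      saveRDS(combined, "results_chunk.rds")  # one disk write instead of n_sims writes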

    • @senhalil
      @senhalil 5 days ago

      Not sure if this is possible in R but you can also write the results in a non-blocking way (in a separate dedicated worker thread for example).

  • @ziggle314
    @ziggle314 8 days ago +6

    Fantastic exposition. You have inspired me to spend more time on Monte Carlo simulations.

  • @xavierlarochelle2742
    @xavierlarochelle2742 3 days ago

    I've been trying to teach myself simulation-based calibration for a while now and this video is honestly the exact tutorial I needed. Not having to figure all this out on my own saved me so much time (and pain). Can't thank you enough!

  • @weitzun9691
    @weitzun9691 3 days ago +1

    This is a super, perfect video… please make some more on R programming, marvellous!!!

  • @Ocelus379
    @Ocelus379 6 days ago

    Amazing, man! Keep it up, this is just the kind of video that people like me need to get into the field.

  • @olipito
    @olipito 7 days ago

    Waited soooo long for a good MC explanation video (series); YT is surprisingly poor on this topic. Thank you!

  • @ahiser1117
    @ahiser1117 7 days ago +2

    Hi Christian, great video! I love your explanations and have been following for a little while. In this implementation, I couldn't help but feel like there are some significant optimizations / tool choices that would drastically improve your runtime speed here.
    Firstly, I think writing each result to a separate file is going to lead to significant I/O overhead that is likely not trivial relative to the overall runtime.
    Another idea is switching to something like JAX in Python, where you can easily compile and parallelize your code. I took a quick crack at reimplementing the array of tests you're doing here and on my personal computer was able to run this in (2.5s compile time + 3.0s runtime) 5.5 seconds with 1e6 samples per permutation of shift, distribution, and test type.
    I know JAX can be a little daunting at first, but I highly recommend getting over the initial hump, since you can really scale up your analysis!

    • @very-normal
      @very-normal  7 days ago +2

      I can definitely see the I/O getting in the way with such a large number of simulations. I even feel it when I need to read everything back in and compile it for making graphs/tables 💀 I haven’t heard of JAX before, but I think it’d be worth looking into. Thanks for the recommendation, I appreciate you taking the time out to tell me

    • @ahiser1117
      @ahiser1117 7 days ago

      @very-normal JAX is a machine learning framework created by Google (it was used to train Gemini). It is seeing increasing use in probabilistic programming, which is what I use it for in my research. Can’t recommend it enough.

    • @bokehbeauty
      @bokehbeauty 6 days ago

      Would you be so kind as to share your code, or point us to helpful guidance for converting the code to JAX?

    • @ElJuanchoCarrancho
      @ElJuanchoCarrancho 6 days ago

      Do you use JAX with a GPU?

    • @bokehbeauty
      @bokehbeauty 6 days ago +1

      @ElJuanchoCarrancho No, not yet. I’ve heard of JAX, but haven’t been forced over the learning hurdle yet.

  • @jeffreychandler8418
    @jeffreychandler8418 8 days ago +1

    I've been diving into Rand Wilcox's robust estimation textbook and it's been a really compelling statistical book. It presents the argument that we care most about the core of the data, so we can use trimmed means and winsorized variances (for those who don't know those terms: you cut the outer percentiles for the mean, and for the variance you set the outer percentiles to the new min and max from trimming). This seems to allow the robustness of rank statistics while maintaining the interpretability of regular statistics. I've worked with it on some of my data and it seems to capture similar patterns to the rank statistics, but I can compute effect sizes and communicate them more effectively.
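
    A small base-R sketch of those two ideas, using a 20% trim as an example (the data vector is made up):

      x <- c(2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 4.0, 42)  # one big outlier
      trim <- 0.2

      # trimmed mean: drop the lowest and highest 20% of values before averaging
      mean(x, trim = trim)

      # winsorized variance: clamp the tails to the trimming cutpoints, then take the variance
      lo <- quantile(x, trim)
      hi <- quantile(x, 1 - trim)
      x_w <- pmin(pmax(x, lo), hi)
      var(x_w)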

    • @jeffreychandler8418
      @jeffreychandler8418 8 days ago

      Also, DO NOT RUN COMPUTATIONS ON THE LOGIN NODE. That is the ONE WAY you can break a cluster. I cannot emphasize that enough xD The number of times I've had to postpone work because some person somewhere in the country ran a "10 minute job" on a login node....
      Also, I know there are tools to help write SLURM files for you; just remember your own cluster will have its own limits. It's also good practice to request the minimum CPU/memory necessary (usually the two are tied, so you may as well use all the memory you're allowed), since these are shared systems.
      And I will testify to the value of clusters. I took an analysis that would have taken 6 months on a high-end consumer-grade desktop and shrunk it to a day and a half using cluster resources. If you are doing research or anything big-data, this is a critical skill for your success.

  • @shane_0_mack962
    @shane_0_mack962 3 days ago

    Awesome video! As someone else who has had to self-learn Slurm, your explanation was great. I think if you focus a video more on Slurm and using the code then you'd have a pretty successful video on your hands!
    Would be happy to help out with it as well.

  • @Gigano
    @Gigano 8 days ago +4

    Fantastic stuff! I wish this had been available 7 years ago when I did my PhD haha!

  • @paper_cut9457
    @paper_cut9457 5 days ago

    Level 3 actually gets some computer science involved, but I think that should be the case for anyone working with data these days. Super funny vim mention btw, and thanks for the great content!

  • @ElJuanchoCarrancho
    @ElJuanchoCarrancho 6 days ago

    Great video, I can now see where the power of Bayesian statistics comes from, allowing you to compute a simulation from a single model.

    • @very-normal
      @very-normal  6 days ago +1

      Thanks! But just as a heads up, none of the methodology here is Bayesian 😅

    • @ElJuanchoCarrancho
      @ElJuanchoCarrancho 6 days ago

      @very-normal that's true, but the underlying concept is widely used in Bayesian models (unless you're working with a sequential Gibbs sampler).

  • @sethsims7414
    @sethsims7414 8 days ago +4

    Python has common sampling distributions in the "random" module of the standard library. Numpy/Scipy have a much wider selection of distributions though and will usually be faster if you know how to use them.

    • @joshstat8114
      @joshstat8114 5 days ago +1

      I created an R package, written in C++, where I (tried to) unify all the probability distributions, so you can calculate the density, CDF, and quantiles, and do random number generation. I wanted to let end users create their own random variables, which would be useful for probability theory, but that isn't added yet. I can't release it publicly yet because it isn't optimized and I don't know how to run the code on a GPU for now.
      Edit: This is only about a quarter true. I've only done the RNG part so far.

    • @pastramiking
      @pastramiking 3 hours ago

      You can simulate any distribution if you "know how", just as you can do it more efficiently. The main problem for me with Python is the lack of backwards compatibility. It is unnecessarily complicated to debug old code that used to work, unless "you know what you are doing".

  • @7embersVeryOwn
    @7embersVeryOwn 6 days ago

    Fantastic video as always. The subject is fascinating, the pedagogy is top tier, and the overall quality is excellent. My only constructive criticism: it's a bit unfortunate not to discuss the possibility of deriving uncertainty estimates when doing MC simulation; imo it's one of its biggest advantages.

    • @very-normal
      @very-normal  6 days ago +1

      That’s true, I should’ve brought in a standard error estimate; that would’ve been relatively easy to do. Thanks for pointing it out
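
      Since each power estimate is just a proportion of rejections, a rough sketch of what that could look like (sample size, shift, and names here are placeholders):

        n_sim <- 10000
        rejected <- replicate(n_sim, t.test(rnorm(27), rnorm(27, mean = 0.5))$p.value < 0.05)
        p_hat <- mean(rejected)                      # estimated power
        se_hat <- sqrt(p_hat * (1 - p_hat) / n_sim)  # Monte Carlo standard error
        c(estimate = p_hat, lower = p_hat - 1.96 * se_hat, upper = p_hat + 1.96 * se_hat)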

  • @realshmel
    @realshmel 7 days ago

    Amazing video. Thanks a million!

  • @BOKRalavac
    @BOKRalavac 7 days ago +1

    I really love your videos. If there is anything I can do to help you produce them, I would love to do so (for example translation, since I am German, or writing short code chunks for viewers to download and play with).

  • @Terracotta-warriors_Sea
    @Terracotta-warriors_Sea 8 days ago +1

    Thank you for a great video

  • @julioccorderoc
    @julioccorderoc 7 days ago

    This content is amazing, like incredibly amazing.
    Thanks!

  • @6PrettiestPics
    @6PrettiestPics 6 days ago

    You should try Julia. I suspect you’d love it for this kind of work.

  • @therealawesomequest
    @therealawesomequest 8 days ago

    Great video! Love your stuff!
    When you're displaying code blocks, and transitioning between them, could you not morph the entire block? It makes it impossible to track what's being changed.
    I love the way you talk about these things!

    • @very-normal
      @very-normal  8 days ago +1

      Thanks! Yeah, I’m not a big fan of how manim handles the code changes either. I’ll try to look into this more or get better at an alternative way of showing code!

  • @lex494
    @lex494 7 days ago

    Great video, thanks. I really like the aesthetics of your channel. It’s clean and tidy.
    Can you recommend some resources to get a deep understanding of statistics? I have no background in mathematics, but I'm self-studying, because I feel like intuition alone is not enough to excel at the subject. I would like to self-study statistics from scratch as well, and do a lot of problems and calculations myself. Like a good entry-level textbook, or an online resource. Thanks!

    • @very-normal
      @very-normal  7 days ago

      To answer your recommendation question, I recommend Fundamentals of Biostatistics by Rosner. I feel that biostat is convenient because the examples are easier to grasp, so that should serve you well. I believe you’ll be able to find it online, so you don’t need to worry about the edition.
      To be honest, I feel that a deep understanding of statistics does require some form of mathematical background. There are many areas in statistics where you’ll wonder why something is the way it is or why advice like “sample size of 30” exists, and it almost always boils down to a mathematical detail.
      That being said, I think it’s still perfectly acceptable to be adept at the models that are commonly used in the knowledge area you work in. Understanding the math is not necessarily the same as the ability to translate what different models mean in different situations, and that is often the skill people need more.

  • @danvartan
    @danvartan 21 hours ago

    Very nice!

  • @panizzutti
    @panizzutti 6 days ago

    Great video bro! Fuck yeahh I love this so much

  • @azure-hawk
    @azure-hawk 7 days ago

    If speed were a major concern, you could use matrix algebra to perform all of these tests much more quickly, though you would need to write your own versions of t.test() and wilcox.test(). This is generally going to be more performant than immediately resorting to parallelization. And this may be more of a personal preference, but you could use switch() instead of the separate if statements, like so (replace ellipses with function arguments): hypothesis_test
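
    A sketch of a switch()-based dispatcher along those lines (the function and argument names are illustrative, not the video's actual code):

      hypothesis_test <- function(test, x, y) {
        switch(test,
          student = t.test(x, y, var.equal = TRUE),
          welch   = t.test(x, y, var.equal = FALSE),
          wilcox  = wilcox.test(x, y),
          stop("unknown test: ", test)  # default branch
        )
      }

      hypothesis_test("welch", rnorm(27), rnorm(27, mean = 0.5))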

    • @azure-hawk
      @azure-hawk 7 days ago

      There is also the dqrng package for fast random sampling and generating values from probability distributions, if that were to become a bottleneck.

    • @joshstat8114
      @joshstat8114 4 days ago

      @azure-hawk But no Cauchy distribution, though. But yeah, dqrng is fast because it supports PCG64 and Xoroshiro128++.

  • @Gigano
    @Gigano 8 days ago +2

    By the way, at 14:14, I do not think that code shows how to read in the results of the simulations. Rather, it shows the code that runs 10,000 simulations of the non-parallelized kind.

    • @very-normal
      @very-normal  8 days ago +5

      Oops you’re right, that’s my bad. I’ll add an addendum to the comments when I get back to my desktop. Thanks for the heads up!

  • @stanleymines
    @stanleymines 3 days ago

    "iykyk" *shows sea lion central where I've walked to from my dorms* oh hey, turns out we go to the same school!

  • @nj-bz8pv
    @nj-bz8pv 8 days ago

    Thanks

  • @mrgomelonsolaris
    @mrgomelonsolaris 5 days ago

    Monte Carlo Simulation Situation is Crazy

  • @creepyghoul2163
    @creepyghoul2163 7 days ago

    We have assumed different statistical models: one generates the data, and the others are the ones assumed by the hypothesis tests. However, the Monte Carlo simulation evaluates only the statistical models of the hypothesis tests, right?

    • @very-normal
      @very-normal  7 days ago +1

      Yes, that's right! Just the hypothesis tests

  • @sethsims7414
    @sethsims7414 7 days ago

    You can embed your R script in the submit script using heredocs, assuming you're using bash.

  • @jiahao2709
    @jiahao2709 4 days ago

    How do you make the animations? Which software are you using? Really nice visualization!

    • @very-normal
      @very-normal  4 days ago

      Thanks! I use manim to make most of the math-based animations and for the code transitions

    • @jiahao2709
      @jiahao2709 4 days ago

      @very-normal Beautiful, may I know how long it took you to make this video?

    • @very-normal
      @very-normal  4 days ago +1

      Ooof this one took maybe 10 hours to edit? It’s longer than my usual videos

    • @jiahao2709
      @jiahao2709 4 days ago

      @ Thank you again for your contribution to the world!

  • @LionelHutz
    @LionelHutz 13 hours ago

    I would love to have the code

  • @onnio7998
    @onnio7998 7 days ago

    A very interesting video! I can't help but wonder, though, whether it would be possible to run the third simulation on a laptop in a reasonable amount of time. Maybe I'm missing something, but shouldn't it be 180,000 * 27 * 2 samples from the distributions, and then some math on those from which the final result can be derived? The number of calculations will probably be in the hundreds of millions, but that shouldn't be too bad if I imagine writing some multithreaded C++. Might take a couple of seconds. Am I dreaming?

    • @very-normal
      @very-normal  7 days ago

      You’re right! It’s totally possible to still do the third one on a personal machine.
      The guiding example I chose here isn’t that good for demonstrating how much of a speed-up you can get from the HPC, but the essential structure of the simulation should be about the same. That’s what I was hoping to get across, and that people could twist it to their needs

    • @onnio7998
      @onnio7998 7 days ago

      @very-normal Yep. Of course there will be a point where it's no longer possible or practical to run an experiment locally.

  • @toofardoug2188
    @toofardoug2188 6 days ago

    I think there's an editing hiccup just before 5:07

  • @creepyghoul2163
    @creepyghoul2163 7 days ago

    Is changing the shift and the other parameters what's called a sensitivity analysis?

    • @very-normal
      @very-normal  7 days ago +1

      I hadn’t thought about that, but you’re right, a sensitivity analysis is a very similar idea to using different parameters in a simulation study.
      The only nuance I can think of is that a sensitivity analysis is more about running the same analysis/model on different subsets of data that have already been collected, to see how that would affect the results.
      On the other hand, a simulation study is more often done with the purpose of planning an analysis, before any data is collected

  • @KilgoreTroutAsf
    @KilgoreTroutAsf 1 day ago

    This is great and all, but view count distributions are lognormal

  • @khoipham2413
    @khoipham2413 5 days ago

    Hmmm... I think it's great

  • @MeghrajYashwantkar
    @MeghrajYashwantkar 6 days ago +1

    I want the code

  • @PRiKoL1ST1
    @PRiKoL1ST1 5 days ago

    nah, it's R

  • @stefano8936
    @stefano8936 4 days ago

    Bro I think you are wrong, it's 2 sigma per side at 0.95 confidence

    • @very-normal
      @very-normal  4 days ago +1

      Oops, yeah you’re right, thanks for catching that lol. I’ll add an addendum for this

  • @sitrakaforler8696
    @sitrakaforler8696 7 days ago

    500th like.
    Really cool video hahaha

  • @TM-fk4gt
    @TM-fk4gt 8 days ago +2

    I don't understand the conclusion. Could you rephrase it? 27:07

    • @very-normal
      @very-normal  8 days ago +3

      The t-tests work better than the nonparametric test when all the assumptions needed to support the t-test are met, and there’s not much of a difference between Student’s and Welch’s t-test. But this study shows that the t-test’s power is severely decreased in the presence of outliers, which are a natural occurrence in YouTube data. In either case of data-generating distribution, power is higher when the two groups are more different, since it’s much easier to detect the difference in their means/typical values.
      Since the power of the nonparametric test is not that much lower even in the Normal setting, it’s my opinion that it’s a safer hypothesis test to use if I want to analyze this type of data in the future.
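
      As a rough illustration of that comparison, a stripped-down version of this kind of power simulation (sample size, shift, and distributions here are placeholders, not the exact settings from the video):

        estimate_power <- function(rdist, n = 27, shift = 0.5, n_sim = 5000) {
          reject <- replicate(n_sim, {
            x <- rdist(n)
            y <- rdist(n) + shift
            c(welch  = t.test(x, y)$p.value < 0.05,
              wilcox = wilcox.test(x, y)$p.value < 0.05)
          })
          rowMeans(reject)  # proportion of rejections = estimated power per test
        }

        estimate_power(rnorm)                             # well-behaved Normal data
        estimate_power(function(n) rlnorm(n, sdlog = 1))  # skewed, outlier-prone data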

    • @statswithbrian
      @statswithbrian 8 days ago

      @very-normal The equal-variance (Student) t-test is only going to outperform Welch's when the sample sizes are really small, which is where you get the benefit of pooling the variances. If you run these simulations with n=3 and data coming from normal distributions, that's where the pooled test will really shine compared to the others. With n=27, you'll be able to estimate each variance separately very well anyway, so you don't really get an advantage from pooling, which is why the power is the same. Making sample size a variable in the simulation (in addition to the shift) would let you see when the pooled test outperforms the unpooled test. Adding various degrees of heteroskedasticity would also probably highlight the reverse cases where the unpooled test does better.
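
      A quick way to see the small-sample case for yourself; a sketch with n = 3 per group and a unit shift (these settings are just for illustration):

        n_sim <- 10000
        reject <- replicate(n_sim, {
          x <- rnorm(3)
          y <- rnorm(3, mean = 1)
          c(pooled = t.test(x, y, var.equal = TRUE)$p.value < 0.05,
            welch  = t.test(x, y, var.equal = FALSE)$p.value < 0.05)
        })
        rowMeans(reject)  # estimated power of the pooled vs Welch t-test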

  • @Aldotronix
    @Aldotronix 6 days ago

    Too much work mate, just run a permutation test and call it a day

  • @andyvirus2300
    @andyvirus2300 6 days ago +2

    Too bad this is done using R…

    • @pastramiking
      @pastramiking 3 hours ago

      R is most commonly used among actual statisticians.