UMAP Dimension Reduction, Main Ideas!!!

  • Published Jun 17, 2024
  • UMAP is one of the most popular dimension reduction algorithms, and this StatQuest walks you through UMAP one step at a time so that you will have a solid understanding of how UMAP works.
    NOTE: This StatQuest is based on the original UMAP manuscript...
    arxiv.org/pdf/1802.03426.pdf
    ...specifically Appendix C, From t-SNE to UMAP, which is also here...
    jlmelville.github.io/uwot/uma...
    ...and the UMAP user documentation...
    umap-learn.readthedocs.io/en/...
    For a complete index of all the StatQuest videos, check out...
    app.learney.me/maps/StatQuest
    ...or...
    statquest.org/video-index/
    If you'd like to support StatQuest, please consider...
    Buying my book, The StatQuest Illustrated Guide to Machine Learning:
    PDF - statquest.gumroad.com/l/wvtmc
    Paperback - www.amazon.com/dp/B09ZCKR4H6
    Kindle eBook - www.amazon.com/dp/B09ZG79HXC
    Patreon: / statquest
    ...or...
    YouTube Membership: / @statquest
    ...a cool StatQuest t-shirt or sweatshirt:
    shop.spreadshirt.com/statques...
    ...buying one or two of my songs (or go large and get a whole album!)
    joshuastarmer.bandcamp.com/
    ...or just donating to StatQuest!
    www.paypal.me/statquest
    Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:
    / joshuastarmer
    0:00 Awesome song and introduction
    1:07 Motivation for UMAP
    2:55 UMAP main ideas
    5:22 Calculating high-dimensional similarity scores
    10:41 Getting started with the low-dimensional graph
    12:37 Calculating low-dimensional similarity scores and moving points
    15:49 UMAP vs t-SNE
    #StatQuest #UMAP #DimensionReduction

COMMENTS • 163

  • @statquest
    @statquest  2 years ago +5

    To learn more about Lightning: github.com/PyTorchLightning/pytorch-lightning
    To learn more about Grid: www.grid.ai/
    Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/

  • @markmalkowski3695
    @markmalkowski3695 2 years ago +7

    This is awesome, thanks for explaining UMAP so well, and clearly explaining when to use it! Love the topics you're covering

  • @aiexplainai2
    @aiexplainai2 2 years ago +6

    I can't express how much this channel helped me - so clearly explained!!

    • @statquest
      @statquest  2 years ago

      Thank you very much! :)

  • @EthanSalter3
    @EthanSalter3 2 years ago +38

    This is such perfect timing, I'm supposed to learn and perform a UMAP reduction tomorrow. Thank you!

    • @statquest
      @statquest  2 years ago +5

      BAM! :)

    • @Dominus_Ryder
      @Dominus_Ryder 2 years ago +3

      You should buy a couple of songs to really show your appreciation!

  • @offswitcher3159
    @offswitcher3159 2 years ago +1

    Great video, thank you! You have been with me since my first semester, and I am so happy to see a video by you on a topic that is relevant to me

  • @akashkewar
    @akashkewar 2 years ago +2

    Not sure if I can hold my breath long enough before the video starts. Amazing work!! @StatQuest

  • @terezamiklosova104
    @terezamiklosova104 2 years ago +10

    I really appreciated the UMAP vs t-SNE part. Thanks for the video! Really helpful when one tries to get the main idea behind all the math :)

    • @statquest
      @statquest  2 years ago

      Thank you very much! :)

    • @smallnon-codingrnabioinfor3792
      @smallnon-codingrnabioinfor3792 1 year ago

      I totally agree! The part starting at 16:10 is worth looking back at! Thanks a lot for this great and simple explanation!

  • @kennethm.4998
    @kennethm.4998 2 years ago +2

    Dude... Dude... You have a gift for explaining stats. Superb.

  • @dexterdev
    @dexterdev 2 years ago +2

    I was waiting for this. Thank you. Best dimensionally reduced visual explanation out there.

    • @statquest
      @statquest  2 years ago +1

      Thank you very much! :)

  • @evatosco-herrera8978
    @evatosco-herrera8978 1 year ago +13

    I just found this channel. I'm currently doing my PhD in Bioinformatics and this is helping me immensely to save a lot of time and to learn new methods faster and better (I have a graphical brain so :/) Thank you so much for this!!

    • @statquest
      @statquest  1 year ago +1

      Good luck with your PhD! :)

  • @codewithbrogs3809
    @codewithbrogs3809 2 months ago +3

    After three days of coming back to this video, I think I finally got it... Thanks Josh. When I'm in a place to support, I will

  • @JulietNovember9
    @JulietNovember9 2 years ago +2

    New StatQuest always gets me amped. High yield, low drag material!!!

  • @user-hg4jk2q
    @user-hg4jk2q 1 month ago +1

    This will help me greatly for my MS project.

  • @abramcadabros1755
    @abramcadabros1755 2 years ago +3

    Wowie, I can finally learn what UMAP stands for and how it reduces dimensionality AFTER I analysed my scRNA-seq data with its help!

  • @saberkazeminasab6142
    @saberkazeminasab6142 1 year ago +1

    Thanks so much for the great presentation!

  • @agentgunnso
    @agentgunnso 1 month ago +1

    Thank you so much!!! Love the sound effects and the jokes

    • @statquest
      @statquest  1 month ago +1

      Glad you like them!

  • @MegaNightdude
    @MegaNightdude 2 years ago +1

    Great content. As always!

  • @dataanalyticswithmichael8931
    @dataanalyticswithmichael8931 2 years ago +1

    Nice explanation, I want to use this as a reference for my projects

  • @brucewayne6744
    @brucewayne6744 2 years ago +3

    Amazing video!! Hope there is a statquest on ICA coming soon :)

  • @shubhamtalks9718
    @shubhamtalks9718 2 years ago +2

    Yayy. I was waiting for it.

  • @92marjoh
    @92marjoh 2 years ago

    Hey Josh,
    Your videos have made my learning curve exponential and I truly appreciate the videos you make! I wonder, have you ever considered making a video about Bayesian target encoding (and other smart categorical encoders)?

    • @statquest
      @statquest  2 years ago

      I'll keep that in mind.

  • @floopybits8037
    @floopybits8037 2 years ago +1

    Thank you so much for this video

  • @VCC1316
    @VCC1316 2 years ago +1

    I'd love to see a cross-over episode between StatQuest and Casually Explained.
    Big bada-bam.

  • @danli1863
    @danli1863 2 years ago +1

    I must say this channel is amazing! I must say this channel is amazing! I must say this channel is amazing!
    Important things 3 times. :)

  • @Pedritox0953
    @Pedritox0953 1 year ago +1

    Great video!

  • @kiranchowdary8100
    @kiranchowdary8100 2 years ago +1

    ROCKINGGGG!!!! As always.

  • @samuelivannoya267
    @samuelivannoya267 2 years ago +1

    You are amazing!! Thanks!!!

  • @rajanalexander4949
    @rajanalexander4949 2 months ago +1

    Great video; especially liked the echo on the full exposition of 'UMAP' 😂

  • @siphosakhemkhwanazi6042
    @siphosakhemkhwanazi6042 1 month ago +1

    The intro made me subscribe 😂😂

  • @meenak722
    @meenak722 8 months ago +1

    Thank you very much!

  • @RelaxingSerbian
    @RelaxingSerbian 2 years ago +1

    Your little intros are so silly and charming! ^_^

  • @THEMATT222
    @THEMATT222 2 years ago +3

    New video!!!! Very Noice 👍

  • @veronicacastaneda6274
    @veronicacastaneda6274 2 years ago +3

    Hey! I love your videos! Can you do one on Weighted correlation network analysis? I share your videos with my friends and we want to learn about it :)

    • @statquest
      @statquest  2 years ago +1

      I'll keep that in mind.

  • @junaidbutt3000
    @junaidbutt3000 2 years ago +3

    Hey Josh,
    Great work as always, this StatQuest came at a great time for me because I've been looking into UMAP myself. I had a few questions; apologies if they're covered in the mathematical details video:
    1. Is there an additional constraint for the curve used to compute the high dimensional similarity score to make the scores what they are? In the example where you computed the distance of points B and C relative to A, you had 1.0 and 0.6. This is because the scores must sum to 1.6. But why not 1.3 and 0.3 or 1.59 and 0.01? Is there an additional consideration which locks them to be 1.0 and 0.6?
    2. Will there be an explanation about spectral embedding? This may be outside of the scope of the video but I thought I'd ask!
    3. Could you please check my understanding for what is happening when we move point D closer to point E? The discussion starts at 14:48 in the video. As I understand it, moving D closer to E (we want this) also moves D closer to C (we don't want this). So we compute a tradeoff and find that the cost of moving D closer to C is lower than the benefit of moving D closer to E. Therefore we move D to E. Is this correct? If so, is there an equation or rule that allows us to quantify this such that we can determine the exact distance to move D closer to E?
    I suspect that most of these will be included in the mathematical details follow up video but I thought I'd ask just in case they aren't.

    • @statquest
      @statquest  2 years ago +1

      1) You'll see the answer to this in the follow up video. However, to give you a head start - the similarity score for the closest point is always 1, and this limits what the score for the second point can be (since we only have 2 points as neighbors).
      2) Unfortunately I'm not going to dive into spectral embedding (not yet at least!)
      3) Your understanding is correct and you'll see the equation that makes this work in the follow up video (which will be available very soon!)
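
      To make reply 1) concrete, here is a minimal Python sketch of the idea (my own illustration, not code from the video or from the umap-learn library): the distance to the nearest neighbor is subtracted before taking the exponential, so the closest point always scores exp(0) = 1, and a scale sigma is then found by binary search so that the scores add up to log2(number of neighbors), which is log2(3) ≈ 1.6 when the point itself is counted along with B and C.

      import numpy as np

      def high_dim_scores(neighbor_distances, n_iter=100):
          """Simplified UMAP-style similarity scores from one point to its nearest neighbors."""
          d = np.asarray(neighbor_distances, dtype=float)
          rho = d.min()                       # distance to the closest neighbor
          target = np.log2(len(d) + 1)        # +1: the point itself is counted as a neighbor here
          lo, hi = 1e-6, 1e6
          for _ in range(n_iter):             # binary search for the scale sigma
              sigma = (lo + hi) / 2.0
              scores = np.exp(-(d - rho) / sigma)
              if scores.sum() > target:
                  hi = sigma                  # sum too high, shrink sigma
              else:
                  lo = sigma                  # sum too low, grow sigma
          return scores

      # Two neighbors, roughly like B and C relative to A in the question above:
      print(high_dim_scores([1.0, 2.5]))      # ~[1.0, 0.59], summing to ~log2(3) = 1.6

      This is also why combinations like 1.3 and 0.3 or 1.59 and 0.01 cannot happen: subtracting the nearest-neighbor distance pins the closest score at 1, and sigma only controls how quickly the remaining scores fall off.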

  • @user-mv3im2fi4f
    @user-mv3im2fi4f 1 month ago

    He explains things as if I had no brain at all.
    That's the only way I could understand it, thank you!

  • @Murphyalex
    @Murphyalex 2 years ago +1

    Great video (as always). You might want to calm it down with the BAMs though. It used to be quirky and fun but having them literally every minute or two is a bit much and forced. Your video creation skills are seriously awesome. I wish I had even half your skills at making these concepts accessible for the YT audience. 👏

  • @nbent4607
    @nbent4607 4 months ago +1

    Thank you!!

  • @cytfvvytfvyggvryd
    @cytfvvytfvyggvryd 2 years ago

    Thank you for your terrific video! If you have time, could you make a video about densMAP? Again, I appreciate your wonderful work! Thank you!

    • @statquest
      @statquest  2 years ago

      I'll keep that in mind.

  • @paulclarke4548
    @paulclarke4548 2 years ago +1

    Great video! Thank you!! Do you have any plans to clearly explain Generative Topographic Mapping (GTM)? I'd love that!

    • @statquest
      @statquest  2 years ago

      Not right now, but I'll keep it in mind.

  • @sumangare1804
    @sumangare1804 1 year ago +1

    Thank you for the explanation! If possible, could you do a video on the HDBSCAN algorithm?

  • @abdoualgerian5396
    @abdoualgerian5396 11 months ago

    With this amazing way of explaining, please consider doing a Deep TDA quest, starting with the paraparapepapara funny thing instead of the songs

  • @whitelady1063
    @whitelady1063 2 years ago +1

    Best comment section on YouTube
    Also, now I get why people at the office won't stop praising you
    BAM!

  • @jatin1995
    @jatin1995 2 years ago +1

    Perfect!

  • @ashfaqueazad3897
    @ashfaqueazad3897 2 years ago

    It would be great if you did some videos on sparse data, if you get the time. Would love it. Thanks.

    • @statquest
      @statquest  2 years ago

      I'll keep that in mind.

  • @Friedrich713
    @Friedrich713 2 years ago +1

    Great quest, Josh! First time I noticed the fuzzy parts on the circles and arrows. What tool are you using to make the slides? Looks damn fine!

    • @statquest
      @statquest  2 years ago

      Thanks! I draw everything in Keynote.

  • @emiyake
    @emiyake 3 months ago +1

    A PaCMAP dimension reduction explanation video would be very appreciated!

    • @statquest
      @statquest  3 months ago +1

      I'll keep that in mind.

  • @AU-hs6zw
    @AU-hs6zw 1 year ago +1

    Thanks!

  • @AkashKumar-qe5jk
    @AkashKumar-qe5jk 2 years ago

    Great video!!!
    One query: what characteristics of the features/dataset would we be analyzing when we choose a smaller number of neighbors? Same question for larger values.

    • @statquest
      @statquest  2 years ago

      The number of nearest neighbors we use does not affect how the features are used. The features are all used equally no matter what.

  • @davidhodson6680
    @davidhodson6680 1 year ago +1

    Adding a comment for the cheery ukulele song at the start; I like it.

  • @grace6228j
    @grace6228j 2 years ago +1

    Thanks for your amazing video! I am a little bit confused: it seems that UMAP is able to do clustering (based on the similarity scores) and dimensionality reduction for visualization at the same time, so why do researchers usually only use UMAP for visualization?

    • @statquest
      @statquest  2 years ago +1

      That's a great question. I guess the big difference between UMAP and a clustering algorithm is that usually a clustering algorithm gives you a metric to determine how good or bad the clustering is. For example, with k-means clustering, we can compare the total variation in the data for each value of 'k'. In contrast, I'm not sure we can do that with UMAP.
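
      A short, hedged sketch of that difference (assuming scikit-learn and the umap-learn package are installed; the iris data is just a stand-in): k-means reports a within-cluster variation score, inertia_ in scikit-learn, that can be compared across values of k, while UMAP only hands back low-dimensional coordinates for plotting.

      import umap
      from sklearn.cluster import KMeans
      from sklearn.datasets import load_iris

      X = load_iris().data

      # Clustering: a quality metric we can compare for each value of k (the "elbow" idea).
      for k in (2, 3, 4, 5):
          km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
          print(k, round(km.inertia_, 1))

      # UMAP: 2-D coordinates for a picture, but no built-in "how good is this?" score.
      embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
      print(embedding.shape)                  # (150, 2)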

  • @LazzaroMan
    @LazzaroMan 1 year ago +1

    Love you

  • @lamourpaspourmoi
    @lamourpaspourmoi 1 year ago

    Thank you! Could you do one with self organizing maps?

  • @wlyang8787
    @wlyang8787 1 year ago

    Hi Josh, would you please make a video about DiffusionMap? Thank you very much!

  • @andreamanfron3199
    @andreamanfron3199 2 years ago +1

    I just love you

  • @cssensei610
    @cssensei610 2 years ago +1

    Can you cover Locality Sensitive Hashing, and do a clustering implementation in PySpark?

    • @statquest
      @statquest  2 years ago

      I'll keep that in mind.

  • @Dominus_Ryder
    @Dominus_Ryder 2 years ago

    StatQuest please do a UMAP tutorial in R next!

    • @statquest
      @statquest  2 years ago

      I'll keep that in mind. However, I'm doing the mathematical details next.

  • @leamon9024
    @leamon9024 2 years ago

    Hello sir, would you cover a dimension reduction technique which uses hierarchical or k-means clustering if possible?
    Thanks in advance.

    • @statquest
      @statquest  2 years ago

      I'll keep that in mind.

  • @Chattepliee
    @Chattepliee 2 years ago

    I've read that UMAP is better at preserving inter-cluster distance information relative to t-SNE. What do you think? Is it reasonable to infer relationships between clusters on a UMAP graph? I try to avoid doing so with t-SNE.

    • @statquest
      @statquest  2 years ago +1

      To be honest, it probably depends on how you configure the n_neighbors parameter. However, to get a better sense of the differences (and similarities) between UMAP and t-SNE, see the follow up video: ua-cam.com/video/jth4kEvJ3P8/v-deo.html

    • @samggfr
      @samggfr 1 year ago

      Concerning distance information, initialization and parameters are important. Read "The art of using t-SNE for single-cell transcriptomics" pubmed.ncbi.nlm.nih.gov/31780648/ and "Initialization is critical for preserving global data structure in both t-SNE and UMAP" dkobak.github.io/pdfs/kobak2021initialization.pdf
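
      For anyone who wants to experiment with the two knobs discussed in this thread, here is a small sketch using the umap-learn package (the parameter names are real, the random data is only a placeholder): small n_neighbors values emphasize local structure, larger values pull in more global structure, and init switches between the default spectral layout and a t-SNE-style random start.

      import numpy as np
      import umap

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 20))          # placeholder for your high-dimensional data

      for n in (5, 15, 100):
          emb = umap.UMAP(n_neighbors=n, min_dist=0.1, init="spectral",
                          random_state=42).fit_transform(X)
          print(n, emb.shape)                 # compare the resulting layouts in scatter plots

      # A random starting layout instead of the spectral one, closer to a typical t-SNE run:
      emb_random = umap.UMAP(init="random", random_state=42).fit_transform(X)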

  • @flc4eva
    @flc4eva 2 years ago

    I might have missed this, but how does UMAP initialize the low-dimensional graph? Is it randomized, as in t-SNE?

    • @statquest
      @statquest  2 years ago

      This is answered at 16:43

  • @ranjit9427
    @ranjit9427 2 years ago +2

    Can you make some videos on recommender systems??

    • @4wanys
      @4wanys 2 years ago

      A complete list for recommender systems:
      ua-cam.com/play/PLsugXK9b1w1nlDH0rbxIufJLeC3MsbRaa.html

    • @statquest
      @statquest  2 years ago +1

      I hope to soon!

  • @hiankun
    @hiankun 2 years ago +1

    The big picture is ❤️
    😃

  • @ammararazzaq132
    @ammararazzaq132 1 year ago

    As PCA requires correlation between features to find new principal components, does the UMAP approach require correlation between features to project the data onto a lower dimensional space?

    • @statquest
      @statquest  1 year ago

      no

    • @ammararazzaq132
      @ammararazzaq132 1 year ago

      @@statquest So we can still see clusters even when data is not correlated?

    • @statquest
      @statquest  1 year ago

      @@ammararazzaq132 That I don't know. All I know is that UMAP does not assume correlations.

    • @ammararazzaq132
      @ammararazzaq132 1 year ago

      @@statquest Okay, thank you. I will look into it a bit more.

  • @user-of6ev3ej8z
    @user-of6ev3ej8z 2 years ago

    I have a question. After moving d closer to e, do we still consider moving d to c? Or, would c be moved to d? The direction in the video confuses me.

    • @statquest
      @statquest  2 years ago

      When we move 'd', we consider both 'e' and 'c' at the same time. In this case, moving 'd' closer to 'e' and closer to 'c' will increase the neighbor score for 'e' a lot but only increase the score for 'c' a little, so we will move 'd'. For details, see: ua-cam.com/video/jth4kEvJ3P8/v-deo.html
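
      A toy version of that trade-off, under my own simplifying assumptions (the distances are made up, and I use a = b = 1 in UMAP's low-dimensional similarity curve, 1 / (1 + a * distance^(2b)), rather than the fitted defaults; the real algorithm optimizes a cost function with gradients, which the follow-up video covers):

      def low_dim_score(dist, a=1.0, b=1.0):
          # UMAP-style low-dimensional similarity: nearby points score near 1, far points near 0.
          return 1.0 / (1.0 + a * dist ** (2 * b))

      # 'd' starts 3.0 units from 'e' (a neighbor we want close) and 6.0 units from 'c'
      # (not a neighbor, so we want its score to stay low). A nudge toward 'e' puts 'd'
      # 2.0 units from 'e' but also 5.5 units from 'c'.
      benefit = low_dim_score(2.0) - low_dim_score(3.0)          # e's score rises a lot
      cost = low_dim_score(5.5) - low_dim_score(6.0)             # c's score rises only a little
      print(round(benefit, 3), round(cost, 3), benefit > cost)   # 0.1 0.005 True, so make the move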

  • @mericknal8752
    @mericknal8752 1 year ago +1

    The echoing UMAP part is amazing 😂

  • @alexlee3511
    @alexlee3511 3 months ago

    Is the complicated dataset you're referring to a dataset that cannot be explained by one or two PCs?

  • @ali-om4uv
    @ali-om4uv 2 years ago

    How does UMAP know which high-dimensional data point belongs to which cluster?

    • @statquest
      @statquest  2 years ago

      The similarity scores.

  • @prashantsharma-sr5dl
    @prashantsharma-sr5dl 3 months ago

    How did the low-dimensional plot come about right after the similarity scores?

    • @statquest
      @statquest  3 months ago

      At 4:14 I talk about how the main idea is that we start with an initial (somewhat random) low dimensional plot that we then optimize based on the high dimensional similarity scores.

  • @joejohnoptimus
    @joejohnoptimus 2 months ago

    How does UMAP identify these initial clusters to begin with?

    • @statquest
      @statquest  2 months ago

      You specify the number of neighbors. I talk about this at various times, but 17:18 would be a good review.

  • @juanete69
    @juanete69 1 year ago

    But how do you "decide" that a cluster is a distant cluster?
    PS: I guess you consider a point as a distant point if it's not among the k neighbors.

    • @statquest
      @statquest  1 year ago

      correct

    • @juanete69
      @juanete69 1 year ago

      @@statquest But do you keep "adding" new points to the cluster if they are within the k neighbors of the next point, and so on?
      Or, in order to define the cluster, do you only consider the k neighbors of the first point?

    • @statquest
      @statquest  1 year ago +1

      @@juanete69 We start with a single point. If it has k neighbors, we call it a cluster and add the neighbors to the cluster. Then, for each neighbor that has k neighbors, we add those neighbors and repeat until the cluster is surrounded by points that have fewer than k neighbors.
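
      A toy illustration of that neighbor-expansion description, just to make the idea concrete (this is not how umap-learn itself defines or labels clusters): build a k-nearest-neighbor graph and let connected neighbors chain together into groups.

      import numpy as np
      from scipy.sparse.csgraph import connected_components
      from sklearn.neighbors import kneighbors_graph

      rng = np.random.default_rng(0)
      X = np.vstack([rng.normal(0, 1, size=(50, 5)),      # one tight blob
                     rng.normal(10, 1, size=(50, 5))])    # a second, distant blob

      graph = kneighbors_graph(X, n_neighbors=5, mode="connectivity")
      n_groups, labels = connected_components(graph, directed=False)
      print(n_groups)   # expected: 2 (neighbors chain together within each blob, but not across)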

  • @gama3181
    @gama3181 2 years ago +1

    Hi-dimensional BAAAMM!

  • @TheEbbemonster
    @TheEbbemonster 2 years ago

    Seems very convoluted compared to K-means or hclust.

    • @statquest
      @statquest  2 years ago

      UMAP uses a weighted clustering method, so that points that are closer together in high-dimensional space will get higher priority to be put close together in the low dimensional space.

  • @AHMADKELIX
    @AHMADKELIX 1 year ago +1

    Permission to learn, sir

  • @sapito169
    @sapito169 2 years ago

    I think he will sing the whole video XD

  • @TJ-hs1qm
    @TJ-hs1qm 2 years ago +1

    auto-like 👍

  • @connorfrankston5548
    @connorfrankston5548 1 year ago

    Thanks, I appreciate the information. However, I think your videos would be easier to watch with a reduction of the "bam" dimension.

  • @dummybro499
    @dummybro499 2 years ago +2

    Don't say bam....!! It's irritating

  • @ScottSummerill
    @ScottSummerill 2 years ago +1

    UMAP is a MESS. No thank you.