How to Create a Viral Song: Spotify Stream Data Analysis with K-Fold, Regression, Feature Importance

  • Published 28 Sep 2024
  • THIS IS A VIDEO FOR MY DEGREE
    (AI GEN)
    Explore a detailed analysis of Spotify's streaming data where we uncover the essential elements that contribute to the viral success of songs. Delve into our comprehensive study to understand the intricate factors shaping music popularity and virality. This video provides a deep dive into our findings, revealing actionable insights for musicians, data enthusiasts, and music industry professionals looking to understand the dynamics behind viral hits on Spotify. Gain valuable knowledge about trends, algorithms, and strategies that impact song virality in today's digital music landscape.
    #DataAnalysis #BigData #ComputerScience #MusicIndustry #ViralSongs #StreamingData

COMMENTS • 114

  • @kofuku5949 3 months ago +57

    Are you doing this for a school/college project?

    • @wigglecollective 3 months ago +45

      this was for uni, but i am researching and learning various ai methods for my personal project, (the video featured on my channel) ua-cam.com/video/CBewV_akO9M/v-deo.html

    • @andreilaiter1233 3 months ago +1

      @wigglecollective is it something that gives the probability of popularity of an audio?

  • @AbhinavXevents 3 months ago +75

    Hey Alexa play regression

  • @maxskoryk1466 3 months ago +31

    One thing that might be interesting to check out is to bin the songs by the year (or longer period) in which they were published. Humanity's cultural preferences change with time, and so do trends. Perhaps binning will let you better identify some prominent song features that were indicative of viral songs during a given "cultural era".

    • @wigglecollective 3 months ago +4

      oh yes im sure we could create some really interesting data visualisations of how genres have changed and branched over the years

    • @XEQUTE 3 months ago +1

      interesting!
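
A minimal sketch of the binning idea from @maxskoryk1466's comment, assuming synthetic data rather than the Kaggle dataset. The arrays `year`, `feature` and `popularity` are made-up stand-ins, and the sign flip between eras is built in on purpose to show how a pooled correlation can hide era-specific effects.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # songs per era (synthetic)

# Hypothetical columns: release year, one audio feature, popularity score
year = np.concatenate([rng.integers(1990, 2000, n), rng.integers(2010, 2020, n)])
feature = rng.uniform(0, 1, 2 * n)
# By construction the feature helps in one era and hurts in the other
slope = np.where(year < 2000, 40.0, -40.0)
popularity = 50 + slope * (feature - 0.5) + rng.normal(0, 5, 2 * n)

decade = (year // 10) * 10  # bin release years into decades

# Correlation over all songs versus correlation inside each decade bin
pooled_r = float(np.corrcoef(feature, popularity)[0, 1])
per_decade_r = {
    int(d): float(np.corrcoef(feature[decade == d], popularity[decade == d])[0, 1])
    for d in np.unique(decade)
}
print("pooled:", round(pooled_r, 2))
print("per decade:", {d: round(r, 2) for d, r in per_decade_r.items()})
```

On real data the bin width (years, decades, or longer) is a judgment call, and small bins give noisy correlations.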

  • @axoid 3 months ago +5

    Music and data! Two of my favourite things. Great analysis. Could you do another video that goes deeper into your process and/or add some links to the description?

    • @wigglecollective 3 months ago +3

      I will be updating this video after my next one with an improved study and comparison!

  • @Djgab04100 3 months ago +18

    it's pretty cool, don't hesitate to go a little further, maybe pca, maybe doing a model by subgenre etc... going more in depth will be really instructive for you

    • @wigglecollective 3 months ago +5

      ty, i am expanding my learning in preparation for an ai trail camera project that can automatically monitor populations of endangered animals. do you have any ideas what i should practice for this?

    • @harryfindlay2089 3 months ago +1

      @wigglecollective CNNs and (mini)batch processing 😘

    • @TheMrN4R3K 3 months ago

      PCA doesn't improve the model's performance.

    • @Djgab04100 3 months ago +1

      @TheMrN4R3K bs, it absolutely can help prevent overfitting with correlated features

    • @Djgab04100 3 months ago

      @wigglecollective deep learning and machine learning with tabular data differ a lot in practice tbh, if you haven't done them try playing with the digits dataset I guess

  • @AlirezaJalouli-qt8gk 3 months ago +5

    awesome project, good job! 👏

  • @Siroitin 3 months ago +3

    A more traditional way would have been to choose a criterion (AIC, BIC, etc.) and compare models with good scores.

  • @anaxstazia 3 months ago +3

    :O Very interesting

  • @the6thelementbeats784 3 months ago +1

    It would be great if you included a github repo or ipynb notebook link, would love to go through the code!

    • @wigglecollective 3 months ago

      available on my other videos - as this is a uni piece im not really allowed to share it :/

  • @aaomms7986 3 months ago +1

    I mean, this kind of vid is so cool

  • @BooleanDisorder 2 months ago

    I'd love a study on how much people like a song that's AI generated when knowing it is versus not knowing it's AI generated. I bet knowing it will remove some amount of enjoyment.

    • @wigglecollective 2 months ago +1

      hahah sounds like a fun experiment maybe will do me vs ai musician who can make a better song

  • @ponjiroo4115 3 months ago +1

    May I know where you got this dataset from, I would like to build this project as well!

    • @wigglecollective 3 months ago

      www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs

    • @ponjiroo4115 3 months ago

      @wigglecollective thank you

  • @ndiphiwekwakweni7973 3 months ago +4

    Very interesting analysis. Where did you get your data set from?

    • @wigglecollective 3 months ago +2

      www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs - kaggle is a great source for usable data sets

    • @ndiphiwekwakweni7973 3 months ago +1

      @wigglecollective Okay thank you

  • @RodrigoCoinCurvo 3 months ago

    Very interesting stuff! Do you have data about the song's release year, when it was popular, and the listener's age? If your data is just popularity among all songs of any year over all time for all ages, then it might be difficult, because there might not be anything that makes a song popular "universally". If you have data about the song's year, when it was popular, and the listener's age, then you could have a higher chance of finding correlation, because then you would have information about what makes a song popular in its context. Or even just knowing when the song was popular might give you the ability to predict what's going to be popular next, e.g. features of popularity in a given year might indicate features of popularity in the following years.

    • @wigglecollective 3 months ago +1

      Hi, it's likely I will revisit this with updates around comments and new knowledge. I'm not sure whether this information is publicly available but we defo could do some web scraping to infer!

  • @muhammadahmednizamani4524 3 months ago

    can you share the data, or attach it in the description if it is open-source?
    Thanks !

    • @wigglecollective 3 months ago

      www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs !

  • @Suhan-zg8ng 2 months ago

    Hey, if it's alright could you share the dataset for this?

  • @ToasterBathPog 8 months ago +1

    THIS WAS SO EPIC AND AWESOME LOVE IT!

  • @danibitt59 3 months ago

    VERY good.

  • @GAZ___ 3 months ago

    Very impressive! but having all these features, maybe Random Forest Regression would be a better model.

  • @PJ-hi1gz 3 months ago +1

    I think removing the non-popular songs might be an error, since it will give you data on what doesn't work, and will balance out all the stuff from the popular songs. My 2 cents, great job overall

    • @theobeevers369 3 months ago

      True, maybe remove songs that were released less than a year ago and leave the rest

    • @wigglecollective 3 months ago

      yes I think it was an oversight to remove completely, maybe focusing on the nan data and studying that will give me better context to create the model

  • @rugonge 3 months ago

    how did you obtain your data? is it available from spotify?

  • @beaverbuoy3011 3 months ago

    Perhaps: Training an ML model on this data for better insight?

  • @daylight8296 3 months ago

    one-hot encoding could be skewing some of the features. if you're applying a scaler to all the other features but not the one-hot encoded features, the weight of the one-hot encoding could skew the model and be the reason for such a high mse

    • @daylight8296 3 months ago +1

      to get around this, you could drop all the one-hot encoded data (and model it separately) or scale all the features after you’ve done your one-hot encoding together which will balance some of the weight of the one-hot features
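
A minimal sketch of the second option in @daylight8296's reply (one-hot encode first, then scale every column together). The `genres` and `tempo` arrays are made-up rows, not taken from the dataset, and plain standardisation is assumed as the scaler.

```python
import numpy as np

# Made-up rows: one categorical column and one numeric column
genres = np.array(["pop", "rap", "rock", "pop", "rap", "pop"])
tempo = np.array([120.0, 95.0, 140.0, 110.0, 88.0, 128.0])

# One-hot encode: one 0/1 column per distinct genre (sorted alphabetically)
categories = np.unique(genres)
one_hot = (genres[:, None] == categories[None, :]).astype(float)

# Stack numeric and one-hot columns, then standardise all of them together
X = np.column_stack([tempo, one_hot])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(categories)
print(X_scaled.round(2))
```

After this step every column, numeric or one-hot, has mean 0 and standard deviation 1, so no group of columns dominates purely because of its scale. Whether that lowers the MSE on the real data is not something this sketch can show.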

  • @antonbordwine 3 months ago +1

    At 6:18 you barely say anything about the residual plot, however it is quite a significant plot overall. It's clear that the linear regression model is not the best here, since the residuals are not exogenous. Different techniques can be used to minimise the effects of the endogenous residuals, you definitely should check them out.

    • @wigglecollective 3 months ago

      I tried a few methods but struggled to make any of them work very well without muddling the data a lot. What sort of things would you recommend? I'll give them a go in my next project :)

    • @antonbordwine 3 months ago +2

      @wigglecollective You can still work with linear regression models, just add more parameters to explain the data. For instance, instead of y = a*x, use y = a*x + b*x^2. Training this linear regression and many other variations will help you understand what kind of parameters are needed, hopefully reducing the error.

    • @wigglecollective 3 months ago +2

      @antonbordwine Ah very interesting, this was not mentioned in my lectures or in any reading material i was using - I will try this approach on my new dataset - going to be looking at yt virality next, hopefully my skill will be improved by then! TY
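
A minimal sketch of @antonbordwine's suggestion (y = a*x versus y = a*x + b*x^2), assuming a synthetic curved relationship. An intercept column is added to both fits, and `fit_mse` is a hypothetical helper name, not something from the video.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
# Synthetic target with a curved relationship (made up for illustration)
y = 1.5 * x + 0.8 * x**2 + rng.normal(0, 0.5, size=200)

def fit_mse(X, y):
    """Ordinary least squares via lstsq; returns coefficients and training MSE."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(np.mean(resid**2))

ones = np.ones_like(x)
X_lin = np.column_stack([ones, x])          # y = c + a*x
X_quad = np.column_stack([ones, x, x**2])   # y = c + a*x + b*x^2

coef_lin, mse_lin = fit_mse(X_lin, y)
coef_quad, mse_quad = fit_mse(X_quad, y)
print("linear MSE:", round(mse_lin, 3))
print("quadratic MSE:", round(mse_quad, 3))
```

The model is still linear in its coefficients, so the same least-squares machinery applies. Training error can only go down when terms are added, so a held-out split or K-fold is the fairer comparison on real data.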

  • @framee1795 3 months ago

    How to get data bro?

    • @wigglecollective 3 months ago

      www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs

  • @davidbellamy3522 3 months ago +1

    PSA: discarding NaN's, 0's and examining correlations among the remaining songs will lead you to very wrong conclusions about causality. This analysis suffers from selection bias and mistaking correlation for causation. If you want to make a claim that "feature X *causes* a song to go viral" it does not suffice to do the analysis you did.
    To learn more, read the book Causal Inference: What If by my ex-PhD advisors at Harvard.

    • @wigglecollective 3 months ago +1

      Yes absolutely, this was one of my conclusions for the write up of this data. I will give the book a read and hopefully my next attempt will be stronger!

    • @wigglecollective 3 months ago +1

      thank you for your comment, it's really important to strive for accurate and strong conclusions when analysing data and it annoys me to no end when newspapers and the like publish studies that are not conclusive or have a forced conclusion

    • @davidbellamy3522 3 months ago +1

      That’s a great attitude, congrats for having that outlook! With that outlook you’ll perform better analyses than 99% of data scientists in the long run.

    • @taha5754 3 months ago

      Can you name the book?

    • @davidbellamy3522 3 months ago

      @taha5754 did you read my comment to the end?? It’s right there.

  • @cameryngallardo 3 months ago

    Lesson learned: just make rap and one day you'll top the Spotify charts

    • @wigglecollective 3 months ago

      quite possibly, famously rappers tend to come from underprivileged backgrounds and historically have been able to break through to virality despite the lack of funding of the genre!

    • @wigglecollective 3 months ago

      the soundcloud generation is a good example of this!

  • @abe3qyli637 3 months ago

    That linear regression means nothing. Need non-parametric statistics to make the topic of virality interesting. Or else it means nothing. Having a viral song is like winning the lottery, aka fat-tailed distributions.

    • @wigglecollective 3 months ago

      ah it was a video for uni we had to show examples of things that werent important for the grade ¯\_(ツ)_/¯

  • @DannyGerst 3 months ago

    Did you release the code somewhere? I found the cluster analysis especially interesting.

  • @BoHorror 3 months ago

    Since it’s a classification task, why not use a CNN instead of a linear regression model?

  • @JeromeRivera-s3p 2 months ago

    I had a similar idea in mind, the difference is that it takes the chords and notes that would make a viral song.

  • @tatomans1982 3 months ago

    awesome project, could you share the code?

  • @ProdByGhost 3 months ago

    all that data and no correlation to the chord/notes.........

  • @adiutasglodeanu9855 3 months ago +3

    maybe try a penalised linear regression, like lasso or elastic net. there may be outliers affecting the linear regression model, and because some of your predictors present paralalism this also poses a problem

    • @stepbro1992 3 months ago

      Is paralalism when you're using highly correlated independent variables?
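
A minimal sketch of the penalised regression @adiutasglodeanu9855 mentions, assuming synthetic data with two nearly identical predictors and one irrelevant one. The lasso is written out as plain cyclic coordinate descent so the shrinkage step is visible; `soft_threshold`, `lasso_coordinate_descent` and the penalty `alpha=0.3` are illustrative choices, not the video's method.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero; exactly zero when |value| <= threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 one coefficient at a time."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back in
            partial_resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1 (highly correlated)
x3 = rng.normal(size=n)               # unrelated to the target
y = 3.0 * x1 + rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise so the penalty treats columns equally
y = y - y.mean()

w = lasso_coordinate_descent(X, y, alpha=0.3)
print(w.round(3))
```

In practice scikit-learn's `Lasso` and `ElasticNet` classes in `sklearn.linear_model` cover the same ground, with the penalty strength usually chosen by cross-validation.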

  • @KhushiSingh-vo9nf 3 months ago

    could u provide us with the code

  • @avgspacelover 3 months ago +6

    loved this

  • @PresencyStudio 3 months ago +1

    Subscribing in hopes you get into more detail later. Maybe you could achieve lower MSE by splitting the data into genres? I'm working on a similar project of my own (but for commercial purposes)

  • @taiga8798 3 months ago +1

    Hey man I really liked the video! Which method did you use to extract the feature importance after linear regression?

  • @gapsongg 3 months ago

    tbh did not learn anything haha, just some graphs. I hope my algorithm doesn't fail again when it comes to clickbait

    • @wigglecollective 3 months ago +1

      IF U WANT A VIDEO THAT IS NOT CLICKBAIT GO TO THIS LINK ua-cam.com/video/CBewV_akO9M/v-deo.html !!!!

    • @wigglecollective 3 months ago +1

      (NOT CLICKBAIT)

  • @lucastefanescu7815 3 months ago +1

    what a video! Awesome stuff man. I haven't even watched the video, but from the title I can infer this one's going to be a banger.

  • @AurL_69 3 months ago +1

    I feel like this video is going to be a banger
    Edit : it was

  • @siddheshd9 3 months ago +1

    do one for Instagram, yt, google seo

  • @zskater1234 3 months ago +2

    Pretty cool! Keep it up

  • @heypbolon1941 3 months ago +1

    Got it! I'ma go and do exactly this

  • @stepbro1992 3 months ago

    Perhaps linear regression was not the best in your case. You’re losing too much info when normalizing the data to fit the model. In any case, good job. Are you doing a master’s or a bachelor’s?

  • @vanyaknyazev9710 3 months ago +2

    Really cool, I've been curious in applying ML for music too

    • @wigglecollective 3 months ago +3

      i think this study left me with more questions than what i went in with, especially about streamability and tiktokability for virality, also there is so much crazy stuff going on in the background in the music industry with nepotism, abusive contracts etc. etc.

    • @stepbro1992 3 months ago

      Some things can’t be quantified or explained. It’s the same thing as success. You can’t predict it.

    • @vanyaknyazev9710 3 months ago

      @wigglecollective couldn't agree more, but if you get creative there's many more use cases beyond prediction with ML

    • @hecticbeatzz5628 3 months ago

      Go for it bro. Low competition for internships compared to other ML subfields. A lot of interviews so far too.

    • @vanyaknyazev9710 3 months ago +1

      @wigglecollective couldn't agree more, but if you get creative there's other use cases of ML beyond prediction here :)

  • @leonardodicaterina7675 3 months ago +1

    very cool! how did you get this dataset?

    • @wigglecollective 3 months ago +2

      you can use spotify api or here www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs

    • @leonardodicaterina7675 3 months ago

      @wigglecollective thank you very much

  • @Russell_Chubb 3 months ago

    This is cool, I've done something similar but with aggregated soundcloud "mix" information.
    I assume you got the data from the Spotify API?

    • @wigglecollective 3 months ago

      oh cool can u send me urs id like to compare :) and yes but its also available on kaggle for ease of access

  • @theobeevers369 3 months ago

    Nice vid, what model did you use?

  • @DeLuxMusicChannel 3 months ago

    Lets all work together on making music more mainstream!

    • @wigglecollective 3 months ago

      bro hasnt even seen my song in this video ua-cam.com/video/CBewV_akO9M/v-deo.html T-T

  • @sederonveyll8409 3 months ago

    I think your problem is that your data isn't normally distributed. Normal distribution is essential for linear regression.
    Also I wonder if it's correct to apply a linear regression model to data with several clusters. It seems to me that linear regression should be applied to each cluster separately.

    • @sederonveyll8409 3 months ago

      Also your correlation plot shows that track_popularity isn't correlated with anything, so there's no point in fitting a regression.
      However, before drawing that conclusion you should compute the partial correlations and then find the significant ones. In the end, use only the features that give a significant partial correlation with track_popularity, if there are any, in your regression model.
      And by significant correlation I mean r > 0, because you don't want negative correlation in your case.

    • @Siroitin 3 months ago

      Mixed-effects models? Or, more broadly, a hierarchical model. Y_{ij} is the score of the jth song in the ith category. We calculate the average score \mu for all songs, then the category-specific random effect \alpha_i. Lastly we calculate the song-specific effect (which is the deviation of the jth song's score from the average for the ith category)

    • @sederonveyll8409 3 months ago

      @Siroitin Thank you for the information. I don't know why I've never heard of this model before.

    • @wigglecollective 3 months ago

      thank you! I will research and try to apply this next time!
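
A minimal sketch of the decomposition @Siroitin describes, Y_{ij} = \mu + \alpha_i + song-specific deviation, using made-up scores for three categories. This computes plain group means only; a fitted mixed-effects model would additionally shrink each \alpha_i toward zero, which is not shown here.

```python
import numpy as np

# Made-up data: category index i for each song and its popularity score
genre = np.array([0, 0, 0, 1, 1, 2, 2, 2])
score = np.array([60.0, 70.0, 65.0, 30.0, 40.0, 80.0, 90.0, 85.0])

mu = score.mean()                                   # average score over all songs
n_genres = genre.max() + 1
genre_mean = np.array([score[genre == i].mean() for i in range(n_genres)])
alpha = genre_mean - mu                             # category-specific effect alpha_i
eps = score - genre_mean[genre]                     # song's deviation from its category average

print("mu:", mu)
print("alpha:", alpha)
print("eps:", eps)
```

Adding the three pieces back together, `mu + alpha[genre] + eps`, reproduces each song's score exactly.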