How to Create a Viral Song: Spotify Stream Data Analysis with K-Fold, Regression, Feature Importance
- Published 28 Sep 2024
- THIS IS A VIDEO FOR MY DEGREE
(AI GEN)
Explore a detailed analysis of Spotify's streaming data where we uncover the essential elements that contribute to the viral success of songs. Delve into our comprehensive study to understand the intricate factors shaping music popularity and virality. This video provides a deep dive into our findings, revealing actionable insights for musicians, data enthusiasts, and music industry professionals looking to understand the dynamics behind viral hits on Spotify. Gain valuable knowledge about trends, algorithms, and strategies that impact song virality in today's digital music landscape.
#DataAnalysis #BigData #ComputerScience #MusicIndustry #ViralSongs #StreamingData
Are you doing this for a school/college project?
this was for uni, but i am researching and learning various ai methods for my personal project, (the video featured on my channel) ua-cam.com/video/CBewV_akO9M/v-deo.html
@@wigglecollective is it something that gives the probability of popularity of an audio?
Hey Alexa play regression
One thing that might be interesting to check out is to bin the songs by the year (or longer period) in which they were published. Humanity's cultural preferences change with time, and so do trends. Perhaps binning will let you better identify some prominent song features that were indicative of viral songs during a given "cultural era".
oh yes im sure we could create some really interesting data visualisations of how genres have changed and branched over the years
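A minimal sketch of the binning idea from this thread, assuming each song has a release year, one audio feature and a popularity score. The data here is synthetic and the function name `correlation_by_era` is hypothetical; it only illustrates how a feature can look uninformative overall yet matter within each era.

```python
import numpy as np

def correlation_by_era(years, feature, popularity, bin_width=10):
    """Pearson correlation between one feature and popularity, per era."""
    years = np.asarray(years)
    feature = np.asarray(feature, dtype=float)
    popularity = np.asarray(popularity, dtype=float)
    eras = (years // bin_width) * bin_width  # e.g. 1994 -> 1990
    result = {}
    for era in np.unique(eras):
        mask = eras == era
        if mask.sum() < 3:  # skip eras with too few songs to correlate
            continue
        result[int(era)] = float(np.corrcoef(feature[mask], popularity[mask])[0, 1])
    return result

# Synthetic stand-in data (assumption): danceability helps popularity
# from 2000 onwards and hurts it before that.
rng = np.random.default_rng(0)
years = rng.integers(1980, 2020, size=400)
danceability = rng.random(400)
noise = rng.normal(0, 0.05, 400)
popularity = np.where(years >= 2000, danceability, 1 - danceability) + noise

era_corr = correlation_by_era(years, danceability, popularity)
for era, r in sorted(era_corr.items()):
    print(f"{era}s: r = {r:+.2f}")
```

On real data the per-era correlations would be far weaker than in this toy example, so it is worth checking how many songs fall into each bin before reading anything into them.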
interesting!
Music and data! Two of my favourite things. Great analysis. Could you do another video that goes deeper into your process and/or add some links to the description?
I will be updating this video after my next one with an improved study and comparison!
it's pretty cool, don't hesitate to go a little further, maybe PCA, maybe a model per subgenre, etc. Going more in depth will be really instructional for you
ty, i am expanding my learning in preparation for an ai trail camera project that can automatically monitor populations of endangered animals. do you have any ideas what i should practice for this?
@@wigglecollective CNNs and (mini)batch processing 😘
PCA doesn't improve the model's performance.
@@TheMrN4R3K bs, it absolutely can help prevent overfitting with correlated features
@@wigglecollective deep learning and machine learning with tabular data differ a lot in practise tbh, if you haven't done them try playing with the digits dataset I guess
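For readers following the PCA discussion, here is a rough sketch of what PCA does to correlated features, written with plain NumPy via the singular value decomposition. The three columns are synthetic stand-ins (two nearly duplicate features and one unrelated one), not the video's data, and whether PCA helps a given model remains an empirical question.

```python
import numpy as np

def pca(X, n_components):
    """Project centered data onto its top principal components."""
    centered = X - X.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    scores = centered @ components[:n_components].T
    return scores, explained[:n_components]

rng = np.random.default_rng(3)
base = rng.normal(size=500)
X = np.column_stack([
    base + rng.normal(0, 0.1, 500),  # stand-in for e.g. energy
    base + rng.normal(0, 0.1, 500),  # stand-in for e.g. loudness (nearly the same signal)
    rng.normal(size=500),            # an unrelated feature
])

scores, explained = pca(X, n_components=2)
print("variance explained by 2 components:", explained.sum())
print("correlation between component scores:",
      np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])
```

The two retained components are uncorrelated by construction, which is the property that removes redundancy between correlated inputs.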
awesome project, good job! 👏
More traditional way would have been to choose a criterion (AIC, BIC, etc...) and compare models with good scores.
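A sketch of the criterion-based comparison suggested here, assuming ordinary least squares with Gaussian errors, so that AIC and BIC can be computed from the residual sum of squares up to an additive constant. The data and the candidate feature sets are made up for illustration.

```python
import numpy as np

def gaussian_aic_bic(X, y):
    """AIC and BIC for an OLS fit with Gaussian errors (up to a shared constant)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the noise variance
    fit_term = n * np.log(rss / n)
    return fit_term + 2 * k, fit_term + k * np.log(n)

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 4))
# Only the first two columns actually drive the target (assumption of this toy setup).
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, n)

candidates = {"x0": [0], "x0+x1": [0, 1], "all four": [0, 1, 2, 3]}
scores = {name: gaussian_aic_bic(X[:, cols], y) for name, cols in candidates.items()}
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (aic, bic) in scores.items():
    print(f"{name:9s} AIC={aic:8.1f} BIC={bic:8.1f}")
print("lowest BIC:", best_by_bic)
```

Lower is better for both criteria; BIC penalises extra parameters more heavily than AIC once the sample is moderately large.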
:O Very interesting
It would be great if you included a github repo or ipynb notebook link, would love to go through the code!
available on my other videos - as this is a uni piece im not really allowed to share it :/
I mean this kind of vid so cool
I'd love a study on how much people like a song that's AI generated when knowing it is versus not knowing it's AI generated. I bet knowing it will remove some amount of enjoyment.
hahah sounds like a fun experiment maybe will do me vs ai musician who can make a better song
May I know where you got this dataset from, I would like to build this project as well!
www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs
@@wigglecollective thank you
Very interesting analysis. Where did you get your data set from?
www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs - kaggle is a great source for usable data sets
@@wigglecollective Okay thank you
Very interesting stuff! Do you have data about the song's release year, when it was popular, and the listener's age? If your data is just popularity among all songs of any year over all time for all ages, then it might be difficult, because there might not be anything that makes a song popular "universally". If you have data about the song's year, when it was popular, and the listener's age, then you could have a higher chance of finding a correlation, because then you would have information about what makes a song popular in its context. Or even just knowing when the song was popular might give you the ability to predict what's going to be popular next, e.g. features of popularity in a given year might indicate features of popularity in the following years.
Hi, it's likely I will revisit this with updates based on the comments and new knowledge. I'm not sure whether this information is publicly available, but we defo could do some web scraping to infer it!
can you share the data, or attach it in the description if it is open-source?
Thanks !
www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs !
Hey, if it's alright could you share the dataset for this?
THIS WAS SO EPIC AND AWESOME LOVE IT!
VERY good.
Very impressive! but having all these features, maybe Random Forest Regression would be a better model.
I think removing the non-popular songs might be an error, since it will give you data on what doesn't work, and will balance out all the stuff from the popular songs. My 2 cents, great job overall
True, maybe removing songs that have been released less than a year and leave the rest
yes I think it was an oversight to remove completely, maybe focusing on the nan data and studying that will give me better context to create the model
how did you obtain your data? is it available from spotify?
Perhaps: Training an ML model on this data for better insight?
one-hot encoding could be skewing some of the features: if you're applying a scaler to all the other features but not the one-hot encoded features, the weight of the one-hot encoding could skew the fit and be the reason for such a high MSE
to get around this, you could drop all the one-hot encoded data (and model it separately) or scale all the features after you’ve done your one-hot encoding together which will balance some of the weight of the one-hot features
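A minimal sketch of the second option (encode first, then scale every column together), using synthetic data and hypothetical helper names. It also compares the training MSE of plain least squares under both scaling choices; for an unpenalised fit with an intercept the two should agree, so any difference from scaling would be expected mainly in penalised models or when comparing coefficient sizes.

```python
import numpy as np

def one_hot(labels):
    """One 0/1 column per category; the first category is dropped as the baseline."""
    categories = sorted(set(labels))
    matrix = np.array([[1.0 if label == c else 0.0 for c in categories]
                       for label in labels])
    return matrix[:, 1:]

def standardize(columns):
    """Rescale each column to zero mean and unit standard deviation."""
    return (columns - columns.mean(axis=0)) / columns.std(axis=0)

def ols_mse(X, y):
    """Training MSE of ordinary least squares with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((y - design @ coef) ** 2))

# Synthetic stand-in data (assumption): genre and tempo drive popularity.
rng = np.random.default_rng(6)
n = 300
genres = rng.choice(["pop", "rap", "rock"], size=n).tolist()
tempo = rng.normal(120, 30, n)
loudness = rng.normal(-7, 3, n)
genre_effect = {"pop": 5.0, "rap": 10.0, "rock": 0.0}
popularity = (np.array([genre_effect[g] for g in genres])
              + 0.05 * tempo + rng.normal(0, 5, n))

numeric = np.column_stack([tempo, loudness])
dummies = one_hot(genres)

scaled_numeric_only = np.column_stack([standardize(numeric), dummies])
scaled_everything = standardize(np.column_stack([numeric, dummies]))

mse_partial = ols_mse(scaled_numeric_only, popularity)
mse_full = ols_mse(scaled_everything, popularity)
print(mse_partial, mse_full)
```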
At 6:18 you barely say anything about the residual plot, even though it is quite a significant plot overall. It's clear that the linear regression model is not the best here, since the residuals are not exogenous. Different techniques can be used to minimise the effects of the endogenous residuals; you definitely should check them out.
I tried a few methods but struggled to make any of them work very well without muddling the data a lot. What sort of things would you recommend? I'll give them a go in my next project :)
@@wigglecollective You can still work with linear regression models just add more parameters to explain the data. For instance instead of y = a*x, use y = a*x + b*x^2. Training this linear regression and many other variations will help to understand what kind of parameters are needed, hopefully reducing the error.
@@antonbordwine Ah very interesting this was not mentioned in my lectures or in any reading material i was using - I will try this approach on my new dataset - going to be looking at yt virality next, hopefully my skill will be improved by then! TY
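The suggestion above (moving from y = a*x to y = a*x + b*x^2) can be sketched with NumPy's least-squares solver. The model stays linear in its parameters; only the design matrix gains a squared column. The data is synthetic, with a known curved relationship, and an intercept is included in both fits as an extra assumption.

```python
import numpy as np

def fit_mse(design, y):
    """Least-squares coefficients and training MSE for a given design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, float(np.mean((y - design @ coef) ** 2))

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 300)
y = 1.5 * x + 0.8 * x**2 + rng.normal(0, 0.5, 300)  # true relationship is curved

linear_design = np.column_stack([np.ones_like(x), x])        # y = c + a*x
quad_design = np.column_stack([np.ones_like(x), x, x**2])    # y = c + a*x + b*x^2

_, mse_linear = fit_mse(linear_design, y)
coef_quad, mse_quad = fit_mse(quad_design, y)

print("MSE, straight line:", mse_linear)
print("MSE, with x^2 term:", mse_quad)
print("estimated b:", coef_quad[2])
```

Plotting the residuals of each fit against x is a quick way to see whether the added term has removed the curved pattern.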
How to get data bro?
www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs
PSA: discarding NaN's, 0's and examining correlations among the remaining songs will lead you to very wrong conclusions about causality. This analysis suffers from selection bias and mistaking correlation for causation. If you want to make a claim that "feature X *causes* a song to go viral" it does not suffice to do the analysis you did.
To learn more, read the book Causal Inference: What If by my ex-PhD advisors at Harvard.
Yes absolutely, this was one of my conclusions for the write up of this data. I will give the book a read and hopefully my next attempt will be stronger!
Thank you for your comment. It's really important to strive for accurate and strong conclusions when analysing data, and it annoys me to no end when newspapers and the like publish studies that are not conclusive or have a forced conclusion
That’s a great attitude, congrats for having that outlook! With that outlook you’ll perform better analyses than 99% of data scientists in the long run.
Can you name the book?
@@taha5754 did you read my comment to the end?? It’s right there.
Lesson learned: just make rap and one day you'll top the Spotify charts
quite possibly, famously rappers tend to come from underprivileged backgrounds and historically have been able to break through to virality despite the lack of funding of the genre!
the soundcloud generation is a good example of this!
That linear regression means nothing. You need non-parametric statistics to make the topic of virality interesting, or else it means nothing. Having a viral song is like winning the lottery, aka fat-tailed distributions.
ah it was a video for uni we had to show examples of things that werent important for the grade ¯\_(ツ)_/¯
Did you release the code somewhere? I found the cluster analysis especially interesting.
Since it’s a classification task, why not use a CNN instead of a linear regression model?
I had a similar idea in mind; the difference is that it takes the chords and notes that would make a viral song.
awesome project, could you share the code?
all that data and no correlation to the chord/notes.........
maybe try a penalised linear regression, like lasso or elastic net. There may be outliers affecting the linear regression model, and some of your predictors present parallelism, which also poses a problem
Is parallelism when you're using highly correlated independent variables?
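To illustrate the penalised regression mentioned in this thread, here is a small lasso fit written from scratch with coordinate descent, so it needs only NumPy. It minimises (1/(2n))·||y − Xw||² + alpha·||w||₁ on synthetic data with two nearly identical predictors and one unrelated one. This is a teaching sketch under those assumptions; in practice a library implementation would be the usual choice, and elastic net would add an extra squared penalty on the weights that is not shown here.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero if it is small enough."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Lasso via cyclic coordinate descent on centered X and y (no intercept)."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    col_scale = (X ** 2).sum(axis=0) / n_samples
    residual = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(n_features):
            residual += X[:, j] * weights[j]   # remove feature j's contribution
            rho = X[:, j] @ residual / n_samples
            weights[j] = soft_threshold(rho, alpha) / col_scale[j]
            residual -= X[:, j] * weights[j]   # add the updated contribution back
    return weights

rng = np.random.default_rng(5)
n = 500
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(0, 0.05, n)   # nearly a copy of x0 (highly correlated)
x2 = rng.normal(size=n)            # unrelated to the target
X = np.column_stack([x0, x1, x2])
X = X - X.mean(axis=0)
y = 3.0 * x0 + rng.normal(0, 0.5, n)
y = y - y.mean()

weights = lasso_coordinate_descent(X, y, alpha=0.1)
print("lasso weights:", weights)
```

The penalty keeps the combined weight on the two correlated predictors stable and pushes the unrelated predictor's weight to (or very near) zero, which is the behaviour that makes lasso useful when predictors overlap.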
could u provide us with the code
loved this
Subscribing in hopes you get into more detail later. Maybe you could achieve lower MSE by splitting the data into genres ? I'm working on a similar project of my own (but for commercial purposes)
Hey man I really liked the video! Which method did you use to extract the feature importance after linear regression?
tbh did not learn anything haha just some graphs, but did not learn anything I hope my algorithm doesnt fail again, when it comes to clickbait
IF U WANT A VIDEO THAT IS NOT CLICKBAIT GO TO THIS LINK ua-cam.com/video/CBewV_akO9M/v-deo.html !!!!
(NOT CLICKBAIT)
what a video! Awesome stuff man. I haven't even watched the video, but from the title I can infer this one's going to be a banger.
new graph video les goooo
I feel like this video is going to be a banger
Edit : it was
hahah ty :)
do one for Instagram , yt, google seo
Pretty cool! Keep it up
Got it! I'ma go and do exactly this
Perhaps linear regression was not the best in your case. You’re losing too much info when normalizing the data to fit the model. In any case, good job. Are you doing a master's or a bachelor’s?
Really cool, I've been curious in applying ML for music too
i think this study left me with more questions than I went in with, especially about streamability and tiktokability for virality. There is also so much crazy stuff going on in the background of the music industry, with nepotism, abusive contracts, etc.
Some things can’t be quantified, neither explained. It’s the same thing as success. You can’t predict it.
@@wigglecollective couldn't agree more, but if you get creative there's many more use cases beyond prediction with ML
Go for it bro. Low competition for internships compared to other ML subfields. A lot of interviews so far too.
@@wigglecollective couldn't agree more, but if you get creative there's other use cases of ML beyond prediction here :)
very cool! how did you get this dataset?
you can use spotify api or here www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs
@@wigglecollective thank you very much
This is cool, I've done something similar but with aggregated soundcloud "mix" information.
I assume you got the data from the Spotify API?
oh cool can u send me urs id like to compare :) and yes but its also available on kaggle for ease of access
Nice vid, what model did you use?
Lets all work together on making music more mainstream!
bro hasnt even seen my song in this video ua-cam.com/video/CBewV_akO9M/v-deo.html T-T
I think your problem is that your data isn't normally distributed. Normal distribution is essential for linear regression.
Also i wonder if it's correct to use linear regression model to data with several clusters. It seems to me that linear regression should be applied to each cluster separately.
Also your correlation plot shows that track_popularity isn't correlating with anything, so it's no point in making regression.
However, before drawing such a conclusion you should compute the partial correlations and then find the significant ones. In the end, use only the features that give a significant partial correlation with track_popularity, if there are any, in your regression model.
And by significant correlation i mean r > 0, because you don't want negative correlation in your case.
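A sketch of the partial correlation mentioned in this comment, computed the residual way: regress both variables on the control variable(s), then correlate what is left over. The variable names and the data-generating setup are assumptions chosen so that two features look related only because a third one drives both.

```python
import numpy as np

def partial_correlation(x, y, controls):
    """Correlation between x and y after removing the linear effect of controls."""
    design = np.column_stack([np.ones(len(x)), controls])

    def residual(v):
        coef, *_ = np.linalg.lstsq(design, v, rcond=None)
        return v - design @ coef

    return float(np.corrcoef(residual(x), residual(y))[0, 1])

rng = np.random.default_rng(4)
n = 1000
energy = rng.normal(size=n)
loudness = energy + rng.normal(0, 0.3, n)     # driven by energy
popularity = energy + rng.normal(0, 0.5, n)   # driven by energy only

raw = float(np.corrcoef(loudness, popularity)[0, 1])
partial = partial_correlation(loudness, popularity, energy)
print("raw correlation:    ", raw)
print("partial correlation:", partial)
```

A formal significance test on the partial correlation would still be needed before using it to select features; this sketch only computes the coefficient itself.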
Mixed effect models? Or, more broadly, a hierarchical model. Y_{ij} is the score of the jth song in the ith category. We calculate the average score \mu for all songs, then the category-specific random effect \alpha_i. Lastly we calculate the song-specific effect (which is the deviation of the jth song's score from the average for the ith category)
@@Siroitin Thank you for the information. I don't know why i've never heard of this model before.
Thank you! I will research this and try to apply it next time!
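For reference, the random-intercept model described in this thread can be written out as follows. The symbols Y_{ij}, \mu and \alpha_i come from the comment; the song-level term \varepsilon_{ij} and the normal distributions with variances \sigma_\alpha^2 and \sigma^2 are the usual default assumptions, added here and not stated in the comment.

```latex
Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad
\alpha_i \sim \mathcal{N}(0, \sigma_\alpha^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here \mu is the average score over all songs, \alpha_i is the deviation of category i from that average, and \varepsilon_{ij} is the deviation of song j from its category's average.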