Building a Recommendation System in Python

  • Published Dec 28, 2024

COMMENTS • 101

  • @nayibahued5955
    @nayibahued5955 2 years ago +170

    deepest data scientist voice in the world

  • @nabila_1203
    @nabila_1203 2 days ago

    I am working on my final year project and this video is helping me understand the topic really well. Thank you for it!!!

  • @naderkhaled9410
    @naderkhaled9410 2 years ago +27

    Dude I know this is off topic, but ur voice is insanely satisfying !!

  • @AbhishekChandraShukla
    @AbhishekChandraShukla 1 year ago +4

    Holy cow! That is a really good recommendation system! Humbling tutorial as well!

  • @Agent7155
    @Agent7155 3 years ago +11

    Ended up searching up for movies to watch at the end xD

  • @ea1766
    @ea1766 1 year ago +1

    Easily the best video on this subject; all the other videos were so boring and mundane. I wish YouTube promoted this video more to the top.

  • @icequeen2778
    @icequeen2778 1 year ago +1

    Would love to see more of this type of video!

  • @folahan
    @folahan 1 year ago +1

    This is the first time I have followed a tutorial using my own dataset and didn't get a single error from start to finish.

  • @l3o_pl4ys51
    @l3o_pl4ys51 4 months ago +2

    Couldn't totally figure out what he used in this video. The only thing I could get was that he separated the users and movies from their ratings, then weighted all of them to put them on a sort of scale, and then grouped the predictions into clusters. Did I miss something?

  • @vincent_hall
    @vincent_hall 2 years ago

    Thank you sir.
    I have forked it and shall have a go collaborating with a friend.

  • @dan7582
    @dan7582 2 years ago

    Nice video, keep up the good work!!

  • @ahmadjunaidi21-l6l
    @ahmadjunaidi21-l6l 7 months ago

    Dude is not only learning deep learning but has a deep voice. Damn.

  • @robbillington1603
    @robbillington1603 5 months ago

    Jaba ah voice! Great video

  • @simyixiang3358
    @simyixiang3358 4 months ago

    bcs when I run the code at 9:10 in the video, the output is an error

  • @alexhort__
    @alexhort__ 4 months ago +1

    How would you do it from a real-time database, with real users?

  • @stmasanti
    @stmasanti 1 year ago

    Great video!

  • @ayushthombare9235
    @ayushthombare9235 3 years ago

    Very informative and useful video.... Thank you so much

  • @vinayvajrala4366
    @vinayvajrala4366 6 months ago

    A big like for that voice

  • @Thunderclap777
    @Thunderclap777 25 days ago

    Are there any metrics that you can use to test if what you're recommending is accurate at all?

  • @simyixiang3358
    @simyixiang3358 4 months ago

    Does the notebook setting use a T4 GPU?

  • @elisama2936
    @elisama2936 1 year ago +1

    Hello! :) Ty for the video. I have a question regarding the line " def __init__(self, n_users, n_items, n_factors=20)". Can you explain why 20?

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago +2

      Number of latent factors was arbitrary! Though, you could optimize for that value.
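
As a rough illustration of "optimizing for that value", one option is to train the model with several candidate sizes and keep the one with the lowest error on held-out ratings. The sketch below is not the video's PyTorch model; it is a small pure-Python matrix factorization, and the names `train_mf`, `rmse`, and `pick_n_factors`, along with the learning rate and regularization values, are assumptions made for the example.

```python
import random

def train_mf(ratings, n_users, n_items, n_factors=20, epochs=200,
             lr=0.01, reg=0.02, seed=0):
    """Fit user and item latent vectors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both vectors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error of the dot-product predictions."""
    errors = [(r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
              for u, i, r in ratings]
    return (sum(errors) / len(errors)) ** 0.5

def pick_n_factors(train, valid, n_users, n_items, candidates=(2, 5, 10, 20)):
    """Return the candidate size with the lowest validation RMSE, plus all scores."""
    scores = {}
    for k in candidates:
        P, Q = train_mf(train, n_users, n_items, n_factors=k)
        scores[k] = rmse(valid, P, Q)
    return min(scores, key=scores.get), scores
```

On a real dataset the candidate list and the number of epochs would need tuning, and the validation ratings must be kept out of training for the comparison to mean anything.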

    • @elisama2936
      @elisama2936 1 year ago +1

      @@SpencerPaoHere Thank you for your answer!

  • @nikhilsastry6631
    @nikhilsastry6631 4 months ago

    Deepest Learning

  • @ryderthewatermelon611
    @ryderthewatermelon611 6 months ago

    If I were to adapt this methodology to recommend songs based on user song selection, using a dataset with parameters of the songs, how would I do that?

  • @gauravpoudel7288
    @gauravpoudel7288 1 year ago

    Thanks for the awesome content.
    BTW Is that really your voice?

  • @Bjorn_R
    @Bjorn_R 11 months ago

    Hello Spencer, I'm split between collaborative recommender systems and a confirmation tree project for my master's thesis. Which would be most beneficial?

  • @marcelomlr
    @marcelomlr 8 months ago

    Hey man, nice video, and thanks for the tutorial. I'm actually trying to build a recommendation system for online courses, like udemy, but I can't find any datasets for user reviews to make the collaborative filtering. So I decided to manually create a dataset, and thought of choosing like 4 subjects and putting some users to rate like 10-15 courses of each subject. Do you know if something like that can work, or have any tips you can give me?

  • @tactical_savant01
    @tactical_savant01 1 month ago

    Hi Spencer, the github link for the code is not working. can you pls resolve it. Thank you

  • @vaiterius
    @vaiterius 1 year ago +1

    How do you know which libraries/functions to use to make these algorithms? I’m trying to make a videogame recommendation system from a Steam games dataset, similar to what you’re doing here

    • @hamzak5674
      @hamzak5674 11 months ago

      Hey, I’m making something similar using the RAWG dataset. Did you manage to get anywhere? I’m planning to start in the next few days

    • @SpencerPaoHere
      @SpencerPaoHere 10 months ago

      Python typically wraps a lot of theoretical applications implemented in C/C++. When it comes to a recommendation system base, TensorFlow/Keras are the building blocks and are quite effective whether you are building something from scratch or something fine-tunable.

  • @JaisonSimon-h7p
    @JaisonSimon-h7p 7 months ago

    Hello brother, can I use any movie dataset from Kaggle?

  • @obi666
    @obi666 1 year ago

    I'm not sure what these clusters are (for example Cluster #1 and the printed titles). Are they some sort of groups of similar movies?

    • @SpencerPaoHere
      @SpencerPaoHere 10 months ago +1

      Yep! Each cluster represents a group of data points that are similar.

  • @Historyiswatching
    @Historyiswatching 2 years ago +1

    I'm sorry I was distracted by your good looks xD

  • @nazrulabuzhar2210
    @nazrulabuzhar2210 3 years ago +9

    What is your skincare routine sir? You're looking good

    • @SpencerPaoHere
      @SpencerPaoHere 3 years ago +5

      😂😂😂
      Comment made my day!
      Cleanser + Moisturizer

  • @NobixLee
    @NobixLee 2 years ago

    Great video, but how do we then get scores for the User_ID? Something like there is this much probability that User_ID 2 will be in cluster 2? Thank you.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      One way that you can go about this:
      You'd need more data for a more accurate way of doing this, since there are only 4 features in the dataset I am using in this video: userID, movieID, rating, and timestamp. However, with the approach from the video, you can take the average of the ratings that each user has applied to the movies in each cluster. Normalize across all clusters and sort by highest rating per cluster for the user. Movies in that cluster that the user has not seen yet should then be recommended to the user. I am open to hearing your thoughts on this!
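
One possible reading of this suggestion, as a minimal sketch: average a user's ratings per cluster, normalize the averages so they sum to 1, and recommend unseen movies from the highest-scoring cluster. The function names and the dictionary layout (`user_ratings` mapping movie to rating, `movie_cluster` mapping movie to cluster id) are assumptions for the example, and the normalized scores are a relative preference measure rather than calibrated probabilities.

```python
from collections import defaultdict

def user_cluster_scores(user_ratings, movie_cluster):
    """Average the user's ratings per cluster, then normalize to sum to 1."""
    totals, counts = defaultdict(float), defaultdict(int)
    for movie, rating in user_ratings.items():
        cluster = movie_cluster.get(movie)
        if cluster is None:  # movie was never assigned to a cluster
            continue
        totals[cluster] += rating
        counts[cluster] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    norm = sum(means.values())
    return {c: m / norm for c, m in means.items()} if norm else {}

def recommend_unseen(user_ratings, movie_cluster, top_n=10):
    """Return unseen movies from the cluster the user scores highest."""
    scores = user_cluster_scores(user_ratings, movie_cluster)
    if not scores:
        return []
    best = max(scores, key=scores.get)
    unseen = [m for m, c in movie_cluster.items()
              if c == best and m not in user_ratings]
    return sorted(unseen)[:top_n]
```

Sorting the unseen movies alphabetically is only a placeholder; ranking them by average rating across all users would be closer to the "sort by highest rating" step.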

  • @bhadauriaji
    @bhadauriaji 2 years ago +1

    Hi Spencer. I was working on a similar problem where I have users who have listened to a set of songs, and based on their listening history I have to recommend new songs to the user, about 10. How do I do that?
    Also, I don't have ratings for songs; I have a listen count for each song, and the listen count is per user.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +4

      You'll probably need additional features such as length of listen, genre, artist, etc. for a better recommendation algorithm.
      You could do the frequentist approach (to start) where you recommend the song that has been listened to the most, and slowly make your application more advanced once you've accumulated more focused data.

    • @bhadauriaji
      @bhadauriaji 2 years ago

      @@SpencerPaoHere The problem is I can't have more features. My dataset has UserId, SongID,listen count , artist, song title, and date of the song only. I have to build a recommendation engine using that only. Also I tried using Kmeans and some brute force filtering techniques but not getting accuracy.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +2

      @@bhadauriaji Unfortunately, those features aren't going to be doing recommendations justice. You could, however, do a weighted sampling song recommendation based on hits. It's not perfect, but it may be what you are looking for.
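
A minimal sketch of weighted sampling based on hits, assuming the listen data is a list of `(user_id, song_id, listen_count)` tuples (a hypothetical layout): songs the user has not heard are drawn without replacement, with probability proportional to their total listen count across all users. The function name `recommend_by_popularity` is made up for the example.

```python
import random
from collections import Counter

def recommend_by_popularity(listens, user_id, k=10, seed=None):
    """Sample up to k unheard songs, weighted by total listen count."""
    heard = {song for user, song, _ in listens if user == user_id}
    totals = Counter()
    for _, song, count in listens:
        if song not in heard and count > 0:
            totals[song] += count
    rng = random.Random(seed)
    pool = dict(totals)
    picks = []
    while pool and len(picks) < k:
        songs = list(pool)
        weights = [pool[s] for s in songs]
        choice = rng.choices(songs, weights=weights, k=1)[0]
        picks.append(choice)
        del pool[choice]  # sample without replacement
    return picks
```

Because the draw is random, popular songs appear most often but less-played songs still get occasional exposure, which a pure "most listened" ranking would never give them.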

    • @bhadauriaji
      @bhadauriaji 2 years ago

      @@SpencerPaoHere Thanks a lot for the info, will try that surely. 🤗

  • @sachamallet5157
    @sachamallet5157 1 year ago

    Hi, I would like to know if the Mac mini M2 Pro with only 16 GB of RAM is enough for analyzing 8 GB of data. Thank you so much for your feedback

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago

      Yeah, it should be good for smaller datasets. Though you never know until you try! (Maybe try 2 GB, see how long that takes, and extrapolate from there.)

  • @abi_xyz
    @abi_xyz 1 year ago

    great

  • @appyviral8753
    @appyviral8753 2 years ago +2

    How much u charge for making a video recommendation system for Android app?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      If it's highly interesting, $0.00.

    • @appyviral8753
      @appyviral8753 2 years ago +1

      @@SpencerPaoHere it will be! how to contact u?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      @@appyviral8753 You can send me a message at business.inquiry.spao@gmail.com

    • @seankirbycordova3937
      @seankirbycordova3937 2 years ago +1

      Can I ask for the source code? I'm building a library system and have no idea how to implement the collaborative filtering algorithm. Thank you if you can help me 😊

  • @aumasandra9307
    @aumasandra9307 1 year ago

    Why do I keep getting KeyError: 46970 in the code train_set = Loader()?
    And how do I solve this error?

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago

      Is this my code? Did you run through all the cells? If so, check out the Loader(Dataset) class and add some logging statements to see which lines are throwing that error.

  • @casewhite5048
    @casewhite5048 2 years ago

    How do you set up a rating system for the output movies? Let's say it recommends a movie you never want to watch (for example, Fried Green Tomatoes recommends Avengers: Endgame). Can you tell it to rate a movie 10/10, train it to find more clusters with higher ratings, and keep training it to find more of these over time as more movies come out?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      There are many ways that you can go about doing this: I'd check out the ELO/FIDE rating system. Based on user input, they manually click either "Yes" or "No" depending on whether they like the recommendation. You can use this system to tailor prediction output to the customer.
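
For reference, the standard Elo update computes an expected score from the rating difference and then moves each rating by `k` times the gap between the actual and expected result. How to map "Yes"/"No" clicks onto that scheme is a design choice; the sketch below assumes each recommended movie plays a "match" against a fixed baseline rating, which is only one hypothetical mapping, and `apply_feedback` is a made-up helper name.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return the new (A, B) ratings after A scores score_a (1 win, 0 loss)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

def apply_feedback(item_ratings, item, liked, baseline=1500.0, k=32):
    """Assumed mapping: a 'Yes' click is a win for the item against the baseline."""
    current = item_ratings.get(item, baseline)
    new_rating, _ = elo_update(current, baseline, 1.0 if liked else 0.0, k)
    item_ratings[item] = new_rating
    return new_rating
```

Recommendations could then be re-ranked per user by these ratings, so that items the user keeps rejecting sink over time.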

  • @erick388
    @erick388 2 years ago

    Heyo, and thanks for the video! This was incredibly helpful to learn and understand how to make something rudimentary (even if I imagine a full fledged system would be SO much more complex in how you measure input from the user and live data to form a more robust recommendation). I do have one quick question though, since when I tried making my own slight version (mostly changing the dataset and some small aspects), I came across a slight issue regarding the loading aspect.
    To attempt to make this run faster, I used pandas to merge the ratings and movies CSVs, then shuffled and split them to get an even distribution with fewer rows (this is for a class of mine more than anything, and 100k entries is a lot to run during a presentation). The columns and headers remain the same, and all that has shifted is the order in which the rows appear (which is to say it's not a bunch of Toy Story reviews in a row, then a bunch of Star Wars reviews in a row, etc.), and I got this error.
    self.ratings.movieId = ratings_df.movieId.apply(lambda x: self.movieid2idx[x])
    self.ratings.userId = ratings_df.userId.apply(lambda x: self.userid2idx[x])
    It processes movieId correctly. But when we reach the application of the lambda to userId, it raises:
    Key Error, NaN.
    Given that the CSV is the same, save for the alteration to the order of the rows but not the headers, and the values are all indeed numeric, what would be a feasible way to fix this error? Or could it be the way that I shuffled the dataset that's causing it to assume that the numeric values are NaN, and is there a particular way I have to shuffle the values?
    Also on a fun sidenote, I've run this both with and without CUDA installed. I didn't particularly find anything that changed, but maybe that's just me. It runs regardless, though I presume that will create its own problems when it comes down to it.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      Glad you enjoyed it! This might be an issue with how you are shuffling the data together. There could be many reasons why this is the case. Though, I'd recommend taking a small subset of your dataset and running the cleaning algorithm from there. (It'd be easier to debug.)
      It seems you are attempting to combine 2 datasets based on movieId. Have you attempted to do join statements (an inner join, to be specific)? Also double-check that the casting is appropriate. You may be getting a null value due to the userID somehow becoming a string.
      Otherwise, could you provide an example of what the current dataset looks like and what you are trying to achieve?
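
To make the casting point concrete, here is a small pure-Python sketch (not the video's `Loader` class) that coerces IDs to integers, drops rows whose IDs are blank or NaN, and only then builds the ID-to-index dictionaries. Building the mapping from the same cleaned rows it is applied to is what prevents a lookup from hitting a key that was never indexed. The helper names and the row layout (dicts with `userId` and `movieId` keys) are assumptions.

```python
import math

def normalize_id(value):
    """Coerce '12', '12.0' or 12.0 to 12; return None for blank or NaN values."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(number):
        return None
    return int(number)

def build_index(ids):
    """Map each distinct raw ID to a contiguous index, in first-seen order."""
    mapping = {}
    for raw in ids:
        mapping.setdefault(raw, len(mapping))
    return mapping

def encode_rows(rows):
    """Clean the IDs, then encode every (userId, movieId) pair as indices."""
    clean = []
    for row in rows:
        user = normalize_id(row.get("userId"))
        movie = normalize_id(row.get("movieId"))
        if user is None or movie is None:
            continue  # skip rows that would otherwise raise KeyError later
        clean.append((user, movie))
    user2idx = build_index(u for u, _ in clean)
    movie2idx = build_index(m for _, m in clean)
    encoded = [(user2idx[u], movie2idx[m]) for u, m in clean]
    return encoded, user2idx, movie2idx
```

Counting how many rows get dropped is a quick way to tell whether a join or shuffle step introduced missing values.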

    • @erick388
      @erick388 2 years ago

      @@SpencerPaoHere Yeah I got it working. I think it was a messed up join on my end which prematurely ended my experimenting with the dataset, so all's good!
      On another sidenote, as I'm still learning some machine learning stuff, I have friends who keep talking about accuracy for machine learning algorithms, and the more I look into it I begin to wonder how that may apply here, or if it's even an actual possible thing to quantify here. I know that MSE calculates the error between predicted values, and actual rating values (do correct me if I'm wrong), which makes me question if 'accuracy' or 'error' are actual aspects of this algorithm, or if that's related to other forms of algorithms that are more specific with their goal?
      Regardless! Big thanks for the help and awesome video. This was honestly a pretty good starting point as it helped me get curious about a lot of topics I had never got to touch before.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      @@erick388 Glad you enjoyed the content!
      Regarding the accuracies, there are actually several metrics you can go about optimizing for. A great optimizer function would be adam. Accuracy by itself is not that 'accurate'; you need precision as well. Take a look into F1 scores. That'll help.
      Increasing "accuracy" comes down to additional features, more data, and different ML algorithms, or tuning algorithms. That's essentially the world of Data Science.
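
For a top-N recommender, precision, recall, and F1 need a definition of "relevant". One common convention, assumed here rather than taken from the video, is to hold out some of each user's ratings and treat a held-out movie as relevant when the user rated it at or above a chosen threshold. A recommended movie that is relevant then counts as a true positive, and a recommended movie that is not relevant counts as a false positive.

```python
def precision_recall_f1_at_k(recommended, relevant, k=10):
    """Score the first k recommendations against a set of relevant items."""
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)  # true positives
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Averaging these values over all users gives a single number for comparing models, alongside the MSE on predicted ratings.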

    • @erick388
      @erick388 2 years ago

      @@SpencerPaoHere Gotcha, I'll look into that too. It's a lot to take in but it's always fun and interesting to learn. Appreciate all the advice!

    • @erick388
      @erick388 2 years ago

      @@SpencerPaoHere Actually, I suppose one final question is how I would qualify something as a false positive, or a true positive (or really any of the prerequisite information) for the calculations of F1 Scores (such as the requirements for Precision, Recall, etc). I'm not quite sure how to do that given that in this example here we're giving a recommendation of ten movies based on their overall rating, and I don't really know what would quantify as a false positive (or a true positive).

  • @christianmoreno7390
    @christianmoreno7390 2 years ago

    dang bro do you practice retention ??

  • @dustinvo6097
    @dustinvo6097 2 years ago

    Hi Spencer. Nice video as always. I am working on a problem where the users interact with a banking website and app. So I have userid, the interaction name, timestamps and some demographic variables. I'm trying to cluster them into some "personas" based on their interaction and timestamp for biz use. Do you have any ideas how to do that? Thanks.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      Glad you enjoyed it! That use case can definitely be quite tricky. You'd first need to categorize what personas you are trying to bucket users in. Based on those personas, what actions (i.e., features) would link them to said persona?
      I'd suspect that a lot of A/B testing would be required to test your hypotheses. But if it's literally just something related to money management via banking, I'd probably look at it from the angle of on-time payments, quantity, frequency, tiered users, time of withdrawal from ATM, fees encountered, zip codes, and features related to that. (Excluding PII unless the TOS states as such.)

    • @dustinvo6097
      @dustinvo6097 2 years ago

      @@SpencerPaoHere Thanks for the advice. Another question: if I try to focus on just userid and interactionname, how can I cluster the userids based on the interactions (withdraw, request credit score, ...) when they are repeated categorical measurements? Is K-modes a good one?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      @@dustinvo6097 I think I have just the video for you :) ua-cam.com/video/NKQpVU1LTm8/v-deo.html
      (If you haven't seen it already)

  • @sssaturn
    @sssaturn 2 years ago

    Is there a reason you don't split the dataset?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      I just wanted to highlight the recommendation aspect (not necessarily the training aspect)
      Though, in an ML model, you definitely want to do the typical 60/20/20 split!
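
A minimal sketch of a 60/20/20 train/validation/test split over rating rows, using only the standard library. The function name `split_ratings` and the fixed seed are choices made for the example.

```python
import random

def split_ratings(rows, train=0.6, valid=0.2, seed=42):
    """Shuffle the rows and cut them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    train_rows = rows[:n_train]
    valid_rows = rows[n_train:n_train + n_valid]
    test_rows = rows[n_train + n_valid:]  # whatever remains (about 20%)
    return train_rows, valid_rows, test_rows
```

For recommenders, a purely random split can leave some users or movies without any training rows, so splitting per user or by timestamp is also worth considering.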

    • @sssaturn
      @sssaturn 2 years ago

      @@SpencerPaoHere cool, thank you spencer!

  • @ujjwal.kandel
    @ujjwal.kandel 3 years ago

    How would I pass a movie title to the recommender and get a list of recommendations?

    • @SpencerPaoHere
      @SpencerPaoHere 3 years ago +1

      Great question! You might have to change the model itself to be more 'linear' to return a movie title that is most similar to the input.
      With the Kmeans algorithm, you can technically "Pass in a movie title" and the list would be the cluster associated with that movie title. You can then sort by shortest distance and get the top most rated movie. Some additional coding will be required to do that.
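
The cluster lookup described here could be sketched as follows, assuming each movie title already has an embedding vector and a cluster label (for example, from the trained model and a K-means fit). The dictionaries `embeddings` and `labels` and the function name are hypothetical, and this version ranks by distance only; a rating-based tie-break would be an extra step.

```python
import math

def recommend_from_title(title, embeddings, labels, top_n=10):
    """Return the closest movies that share a cluster with the given title."""
    if title not in embeddings:
        raise KeyError(f"unknown title: {title}")
    target = embeddings[title]
    cluster = labels[title]
    neighbours = [
        (math.dist(target, vector), other)
        for other, vector in embeddings.items()
        if other != title and labels[other] == cluster
    ]
    neighbours.sort()  # shortest Euclidean distance first
    return [other for _, other in neighbours[:top_n]]
```

Calling it once per title and merging the results would give recommendations for a list of movies.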

    • @ujjwal.kandel
      @ujjwal.kandel 3 years ago +3

      @@SpencerPaoHere I could really use that extra code you're talking about. I'm doing a recommender for my final year project with zero experience in machine learning. Half this code is gibberish to me lol. I just need 10 recommendations for any list of movies. That's all I ask for😭

  • @sospixs
    @sospixs 2 years ago

    Hi Spencer
    Thanks for your video.
    I've arranged the code, but got stuck in the tqdm for-loop section
    len(losses) = 0
    for it in tqdm(range(num_epochs)):
    ....
    ....
    ZeroDivisionError Traceback (most recent call last)
    Input In [59], in ()
    11 optimizer.step()
    12 #print(loss.item())
    ---> 13 print("iter #{}".format(it), "Loss:", sum(losses) / len(losses))
    ZeroDivisionError: division by zero
    any ideas ?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      Yeah. Whatever is populating your losses is not being done correctly, or there is a divergence issue. The len(losses) == 0. You'd need to figure out why the length is zero.
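
While tracking down why `losses` stays empty, the reporting line can be guarded so that the loop does not crash. One common cause, stated here as a guess rather than a diagnosis, is that the inner loop over batches never runs because the data loader yields nothing. The helpers `mean_loss` and `report_epoch` are made-up names for the sketch.

```python
def mean_loss(losses):
    """Average of the recorded batch losses, or None if nothing was recorded."""
    return sum(losses) / len(losses) if losses else None

def report_epoch(epoch, losses):
    """Build the progress message without dividing by zero."""
    average = mean_loss(losses)
    if average is None:
        return f"iter #{epoch} Loss: n/a (no batches were processed)"
    return f"iter #{epoch} Loss: {average:.4f}"
```

Printing `len(train_loader)` (or the length of whatever iterable feeds the batches) before the epoch loop is a quick way to confirm whether any batches exist.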

    • @sospixs
      @sospixs 2 years ago

      @@SpencerPaoHere Yep,
      I'm using Jupyter on my PC, and it reports "Is running on GPU: False".
      I think that's the problem

  • @guitar300k
    @guitar300k 2 years ago

    How do you guys solve the large-scale problem?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      It depends on the use case, but there are many ways to scale a problem. All of which are somewhat unique. For deployment on a website for example, Kubernetes is quite popular.

  • @maximshidlovski23
    @maximshidlovski23 2 years ago

    Hi Spencer, thanks for the video. I am currently working on the problem of creating a tag-based recommendation system. The user has a list of tags of interest to him and needs to recommend content based on tags and words that are hyperonyms and hyponyms of these tags. I have the user's UserId, FavoriteUserTagsIds and the content's ContentID and ContentTagsIds. Do you have any ideas how to do that? What is best way to create tag-based recommendation system?
    Thanks.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      This seems like an NLP type problem! You can check out a generalized large language model to see if your keywords exist within its vocabulary. Then, using its word embeddings, you can perhaps utilize the distances between the vectors as a gauge behind the meaning. Then, you can plug in the output of the NLP model to a recommendation system.
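
A minimal sketch of the embedding-distance idea, assuming every tag already has a vector (in practice taken from a pretrained language model; the `tag_vectors` dictionary is a stand-in for that lookup). Each piece of content is scored by how close its tags are to the user's favorite tags under cosine similarity, so related words such as hypernyms and hyponyms can match even when the tags are not identical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_content(user_tags, content_tags, tag_vectors):
    """Mean, over the user's tags, of the best similarity to any content tag."""
    user_vecs = [tag_vectors[t] for t in user_tags if t in tag_vectors]
    content_vecs = [tag_vectors[t] for t in content_tags if t in tag_vectors]
    if not user_vecs or not content_vecs:
        return 0.0
    best = [max(cosine_similarity(u, c) for c in content_vecs) for u in user_vecs]
    return sum(best) / len(best)

def rank_content(user_tags, catalog, tag_vectors):
    """Order content IDs from most to least similar to the user's tags."""
    scores = {cid: score_content(user_tags, tags, tag_vectors)
              for cid, tags in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The resulting scores could also be fed into a recommendation model as features instead of being used directly for ranking.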

    • @maximshidlovski23
      @maximshidlovski23 2 years ago

      @@SpencerPaoHere Thanks, I came up with a similar solution yesterday, now I'm working on implementing it.

  • @kain5244
    @kain5244 2 years ago

    thanks

  • @hmhm2903
    @hmhm2903 3 years ago

    dataset link pls

    • @SpencerPaoHere
      @SpencerPaoHere 3 years ago

      You can try here: files.grouplens.org/datasets/movielens/ml-20m-README.html

  • @brahimsabiri3116
    @brahimsabiri3116 3 years ago

    Could you share the code plz

    • @SpencerPaoHere
      @SpencerPaoHere 3 years ago

      github.com/SpencerPao/Data_Science/tree/main/Recommendation%20Systems

  • @mainguyenhoang2667
    @mainguyenhoang2667 3 years ago

    can you share the code sir?

    • @SpencerPaoHere
      @SpencerPaoHere 3 years ago

      As requested, here is my code: github.com/SpencerPao/Data_Science/tree/main/Recommendation%20Systems

    • @mainguyenhoang2667
      @mainguyenhoang2667 3 years ago

      @@SpencerPaoHere thank you

  • @umershabir7045
    @umershabir7045 7 months ago +1

    is your voice AI generated?

  • @phatle-248
    @phatle-248 1 year ago

    I can't hear that "deep" voice clearly

  • @eda1198-w6m
    @eda1198-w6m 2 years ago