Solving real-world data analysis problems with Python Pandas! (Lego dataset analysis)

  • Published 31 May 2024
  • In this video we walk through a data analysis project on DataCamp. The project has us explore a Lego dataset and answer a few questions. To do our analysis we use Python's Pandas library.
    Check out DataCamp!
    bit.ly/KeithGalliDCFeb22
    Link to my GitHub:
    github.com/KeithGalli/lego-an...
    From the DataCamp website:
    The Rebrickable database includes data on every LEGO set that has ever been sold; the names of the sets, what bricks they contain, what color the bricks are, etc. It might be small bricks, but this is big data! In this project, you will get to explore the Rebrickable database and answer a series of questions related to the history of Lego!
    Link to Rebrickable database: rebrickable.com/downloads/
    Some skills worked on in this video:
    - Reading CSV files with Python
    - Filtering DataFrame based on conditional parameters
    - Grouping data by column values and aggregating it
    Btw, I apologize: at about the 25-minute mark I started having microphone issues. I'll have it solved by my next video.
    Thank you to DataCamp for sponsoring this video :)
    -------------------------
    Follow me on social media!
    Instagram | / keithgalli
    Twitter | / keithgalli
    -------------------------
    Song at the end
    good morning by Amine Maxwell / aminemaxwell
    Creative Commons - Attribution 3.0 Unported - CC BY 3.0
    Free Download / Stream: bit.ly/2vpruoY
    Music promoted by Audio Library • Good morning - Amine M...
    -------------------------
    If you are curious to learn how I make my tutorials, check out this video: • How to Make a High Qua...
    Practice your Python Pandas data science skills with problems on StrataScratch!
    stratascratch.com/?via=keith
    Join the Python Army to get access to perks!
    UA-cam - / @keithgalli
    Patreon - / keithgalli
    *I use affiliate links on the products that I recommend. I may earn a purchase commission or a referral bonus from the usage of these links.
    -------------------------
    Video Timeline!
    0:00 - Introduction
    1:05 - Getting started w/ Lego analysis project
    2:33 - How to follow along if you are not a premium DataCamp subscriber (GitHub)
    4:01 - Project tasks overview
    5:40 - Basic exploration of the dataset
    9:45 - Task #1: What percentage of all licensed sets ever released were Star Wars Themed?
    24:23 - Task #2: In which year was Star Wars not the most popular licensed theme?
    34:00 - Bonus Task: How many unique sets were released each year (1955-2017)?
    42:26 - Conclusion!

COMMENTS • 120

  • @KeithGalli
    @KeithGalli  2 years ago +31

    Level up your data science skills with courses, projects, and competitions offered by DataCamp! Use my link below and check out the first chapter of any course for FREE! :)
    bit.ly/KeithGalliDCFeb22

    • @masternobody1896
      @masternobody1896 2 years ago +1

      can you do some google job coding. so how can i get a job

  • @KeithGalli
    @KeithGalli  2 years ago +84

    Big shout-out to my mom for not throwing away my Legos! She's the real MVP

  • @DataProfessor
    @DataProfessor 2 years ago +21

    Wow the Lego stop motion was awesome!

  • @KenJee_ds
    @KenJee_ds 2 years ago +18

    dude, loved the intro!

    • @KeithGalli
      @KeithGalli  2 years ago +3

      Hahaha thanks man :). Very happy that my mom didn't throw out all of my legos!

  • @lVaNeSsA90
    @lVaNeSsA90 2 years ago +6

    Thanks for being honest while you search for syntax in the beginning. Love this raw, step by step video.
    I'm using your videos on my project to get inspired ❤️ thanks for being a good tutor 😊

  • @markomarjanovic8348
    @markomarjanovic8348 1 year ago +5

    Absolutely love the raw natural style you are doing, hope everyone else appreciates it too, keep going buddy, you are amazing!

  • @rafaelmello8194
    @rafaelmello8194 2 years ago +5

    I'm a beginner in Python and I'm learning a lot from you. You are an awesome teacher. Your pacing and didactics are perfect. Thanks a lot for your effort.

  • @JW-pu1uk
    @JW-pu1uk 2 years ago +1

    I really like the thought process in these videos. It's very raw, and really will translate well to an actual work project.

  • @alan6506305
    @alan6506305 2 years ago +5

    God, this is brilliant. I watched the other two videos of yours on Pandas. You are a great teacher and friend. Thank you very much for your hard work and kindness.

  • @simonvanwijk5178
    @simonvanwijk5178 2 years ago +2

    Man so good to have you back! If it was not for you I would have not gotten a role as a DA as you helped me the most in the beginning.

  • @H99x2
    @H99x2 2 years ago +1

    These type of videos are your strengths! Great tutorial and explanation Keith

  • @thebeeskhakis7145
    @thebeeskhakis7145 2 years ago +1

    I'm so happy you're back. Your videos helped me get my new job!

  • @logannon
    @logannon 2 years ago +1

    Dude, I thought you were dead. Your videos have helped me so much. Glad to see you back!

  • @qalinlekhaliif5518
    @qalinlekhaliif5518 2 years ago +2

    Thanks a lot man. Your videos are helpful and entertaining as well. We appreciate your great work.

  • @itsReshad
    @itsReshad 2 years ago

    Love the great content! Please dont stop! You have an impeccable way of teaching its amazing

  • @ben-tiki
    @ben-tiki 2 years ago +1

    Another great video Keith! Glad to see you back. Awesome that you got to work with DataCamp. If you can, please make a video on OpenAI; it would be awesome. I've been using their API and it's awesome.

  • @danielsantoyo2640
    @danielsantoyo2640 2 years ago +9

    Im so happy to see you are back! Panda and Numpy tutorials would be great !!! I’m currently trying to learn panda and numpy for data analytics and this video was super interesting !!! Thanks Keith keep going you are doing great 💯

  • @FIBONACCIVEGA
    @FIBONACCIVEGA 1 year ago +1

    This video has been a true inspiration to continue learning. I'm doing the DataCamp courses since I want to change my field and I've always liked programming and analyzing data. But I didn't know if I could apply what I had learned in real life. Now I know that everything I have learned is what is used in real-life data analysis. Saludos

  • @Sensei10238
    @Sensei10238 2 years ago +1

    Finally back! It helped me a lot in learning python! Thank you so much!

  • @kartikeyasharma9908
    @kartikeyasharma9908 1 year ago

    Hi Keith, loving the video tutorials!

  • @Omzodijacky
    @Omzodijacky 2 years ago

    Man , I'm happy you are back ! you were truly missed

  • @amansorout.6779
    @amansorout.6779 2 years ago

    Happy to see you back, fighting with something serious, you are not alone.

  • @PaYaMv2
    @PaYaMv2 2 years ago

    Good to have you back my dude! Loooooooved this!

  • @YunusFidan_
    @YunusFidan_ 2 years ago

    Good to see you uploading again!!

  • @patriciosebastiankellyfuen9547

    props for sharing your knowledge man, its really easy to understand and apply what you're doing (Y)

  • @cyrilodoi6868
    @cyrilodoi6868 2 years ago

    So good to have you back man! 💯

  • @weitingteng3241
    @weitingteng3241 2 years ago

    Great great and great to see you back

  • @MashiroRedo
    @MashiroRedo 2 years ago

    Waited so long! Thank you

  • @azrmuradl6420
    @azrmuradl6420 2 years ago

    Please provide more such kind of videos, or as you always do, give us tips about how we can find such kind of real world ds projects online.

  • @lucaspioli7970
    @lucaspioli7970 2 years ago

    Love your videos! Keep going

  • @manfungnewmanyu1426
    @manfungnewmanyu1426 2 years ago

    Yeah!!! Your tutorial is very great and help me so much at the AI master course .

  • @stratascratch
    @stratascratch 2 years ago

    Good to see you’re back!

  • @rksingh1997mp
    @rksingh1997mp 2 years ago +4

    He’s back baby!!

  • @terrytas13
    @terrytas13 2 years ago

    Love the introduction!!!

  • @freddy4videos
    @freddy4videos 1 year ago

    thank you, much love

  • @jongcheulkim7284
    @jongcheulkim7284 2 years ago

    Thank you, sir. I had lots of fun^^

  • @codewithkarthik7136
    @codewithkarthik7136 2 years ago

    nice video keith

  • @ocraking
    @ocraking 3 months ago

    Dude, you ROCK

  • @user-jl8vr4ff1e
    @user-jl8vr4ff1e 2 years ago

    keep up the good work!

  • @tuandino6990
    @tuandino6990 2 years ago

    I've been waiting for this

  • @politiqueriachile4973
    @politiqueriachile4973 2 years ago +1

    keep doing more videos pls :D

  • @terrytas13
    @terrytas13 2 years ago

    Welcome back Keith, so good to see your face again. Stay well my friend!

  • @putyah
    @putyah 2 years ago +1

    Awesome video. Small detail: for the new_era answer you typed the value in by hand. It would be nicer to drop every row where the top theme is Star Wars and then assign the remaining year to the variable. That way the variable is computed dynamically, so the answer stays correct when the dataset changes.

    • @KeithGalli
      @KeithGalli  2 years ago

      Good suggestion! I agree that would be a better way to go about it :)
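
The suggestion in this thread can be sketched roughly as follows. This is only an illustration: the `counts` frame, its `n_sets` column, and all of the numbers are made up, and it assumes per-year counts per licensed theme have already been computed.

```python
import pandas as pd

# Hypothetical per-year counts of licensed sets by theme (made-up numbers).
counts = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017],
    "parent_theme": ["Star Wars", "Super Heroes"] * 3,
    "n_sets": [50, 30, 60, 40, 55, 70],
})

# Keep the top theme per year: sort by count, then keep the first row per year.
top_per_year = (
    counts.sort_values("n_sets", ascending=False)
          .drop_duplicates("year")
          .sort_values("year")
)

# Drop every year where Star Wars was on top; the remaining year is the answer.
not_star_wars = top_per_year[top_per_year["parent_theme"] != "Star Wars"]
new_era = int(not_star_wars["year"].iloc[0])
print(new_era)
```

Because `new_era` is derived from the data rather than typed in, it follows the dataset if the counts change.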

  • @kirubaselvi6754
    @kirubaselvi6754 2 years ago +2

    Keith, Pytorch tutorial please

    • @KeithGalli
      @KeithGalli  2 years ago +1

      I definitely want to! I need to spend considerable time reviewing and building up my own PyTorch skills before I make a tutorial on it.

  • @Magmatic91
    @Magmatic91 2 years ago

    Did this project on DataCamp. Was a lot of fun.

  • @kotharidhruv75
    @kotharidhruv75 2 years ago

    w8ing fr more such videos

  • @Viralvlogvideos
    @Viralvlogvideos 2 years ago

    welcome back to your first tutorial after long back :P

  • @aditiparashar9171
    @aditiparashar9171 11 months ago

    you are freakingly smart!

  • @rafaelcastellarmartinez3498
    @rafaelcastellarmartinez3498 2 years ago +1

    Hi Keith, I just tried to do the project with you and I got that Star Wars was not the most popular theme in 2004 (Harry Potter) and 2017 (Super Heroes). Weird that the DataCamp test said OK, but I did the math manually and Harry Potter was the most popular in 2004. Thanks for your videos. A student from Colombia, Latin America!

    • @adelekeemmanuel4917
      @adelekeemmanuel4917 11 months ago

      omg... I just did the exercise myself and I discovered the same thing too... Came to check the video but I'm seeing something else

  • @shahrose786
    @shahrose786 2 years ago

    Question: when you merge using left_on and right_on, we get the merged df.
    In the merged df, under parent_theme, why are most if not all of the values "Legoland", with all IDs 411?
    Also, how do we check the full tabular data? print(df)?
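
A minimal sketch of how the merge and the full-table check might look. The two small frames are hypothetical stand-ins for the real CSVs (only the `Legoland` / 411 pairing comes from the comment above; the other rows and ids are placeholders). One possible explanation for seeing only "Legoland" is that the default preview shows just the first and last rows, which may all share a theme.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (made-up rows).
lego_sets = pd.DataFrame({
    "set_num": ["a-1", "b-1", "c-1"],
    "parent_theme": ["Legoland", "Star Wars", "Town"],
})
parent_themes = pd.DataFrame({
    "id": [411, 158, 50],
    "name": ["Legoland", "Star Wars", "Town"],
    "is_licensed": [False, True, False],
})

# Join each set to its theme row by matching parent_theme against name.
merged = lego_sets.merge(parent_themes, left_on="parent_theme", right_on="name")

# print(df) abbreviates long frames; to_string() renders every row and column.
print(merged.to_string())

# value_counts() shows how the themes are really distributed across all rows.
print(merged["parent_theme"].value_counts())
```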

  • @leomiao5959
    @leomiao5959 2 years ago +1

    The man is back. The hero is back for us!!

  • @davida99
    @davida99 2 years ago

    Yoooo love the vids

  • @tuandino6990
    @tuandino6990 2 years ago

    Question 2:
    theme_count_by_year = licensed_lego_set.groupby('year')['parent_theme'].value_counts().unstack()
    theme_count_by_year.fillna(0, inplace=True)
    theme_count_by_year = pd.DataFrame.transpose(theme_count_by_year)
    Or you can use the pivot_table function. Approaching it this way, you get a data frame that's easy to plot (e.g. as a heatmap), which makes the high numbers pop out.
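
The `pivot_table` alternative mentioned above might look like the sketch below. The tiny `licensed_lego_set` frame is invented for illustration, and counting on `set_num` is an assumption about which column identifies a set.

```python
import pandas as pd

# Hypothetical sample of licensed sets (made-up rows).
licensed_lego_set = pd.DataFrame({
    "year": [1999, 1999, 2001, 2001, 2001],
    "parent_theme": ["Star Wars", "Star Wars", "Star Wars",
                     "Harry Potter", "Harry Potter"],
    "set_num": ["s1", "s2", "s3", "h1", "h2"],
})

# One row per theme, one column per year, cells hold the number of sets.
theme_count_by_year = licensed_lego_set.pivot_table(
    index="parent_theme",
    columns="year",
    values="set_num",
    aggfunc="count",
    fill_value=0,   # years in which a theme released nothing become 0
)
print(theme_count_by_year)
```

The resulting themes-by-years table has the same shape as the transposed `unstack()` result, so it can be handed to a heatmap function directly.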

  • @baburamchaudhary159
    @baburamchaudhary159 1 year ago

    In line [99],
    i.e. .groupby(['year', 'parent_theme'])
    and in the next line: .drop_duplicates(['year'])
    Since we have already grouped by 'year' and 'parent_theme' (I think it groups by unique year and parent_theme), why do we need to drop duplicates by 'year'?
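
A small sketch that may help with this question, using made-up data and a hypothetical `n_sets` column name. Grouping by both columns yields one row per (year, theme) pair, so each year can still appear several times; sorting by count and dropping duplicate years is what reduces that to a single top theme per year.

```python
import pandas as pd

# Hypothetical licensed sets (made-up rows).
licensed_sets = pd.DataFrame({
    "year": [2016, 2016, 2016, 2017, 2017, 2017],
    "parent_theme": ["Star Wars", "Star Wars", "Super Heroes",
                     "Star Wars", "Super Heroes", "Super Heroes"],
})

# One row per (year, theme) pair: each year appears once per theme.
counts = (
    licensed_sets.groupby(["year", "parent_theme"])
                 .size()
                 .reset_index(name="n_sets")
)
print(counts)

# Sort so the biggest count comes first, then keep only the first row per year.
top = (
    counts.sort_values("n_sets", ascending=False)
          .drop_duplicates("year")
          .sort_values("year")
)
print(top)
```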

  • @gersonchadijunior7499
    @gersonchadijunior7499 1 year ago

    Hey Keith, I love so much your videos. I've been learning Pandas with you since your pokemon's video, but I feel that the last answer is not accurate and in fact the right year should be 2006, because it was the year with less Star Wars Sets released. Can I send you my code somehow?

  • @rodrigodasilva9176
    @rodrigodasilva9176 2 years ago

    This dude is cool, this chanel too.

  • @admonitoring-pi9os
    @admonitoring-pi9os 2 months ago

    Hello there. I hope you are good. I am a little late with this comment because this video is already more than 2 years old, but since I have started learning Python now, it's the right time for me. Where can I find the code you explained in the video? No code is available in the project file at the GitHub link provided.

  • @soldierbirb
    @soldierbirb 2 years ago +1

    Hey Keith, I'm divided between going towards data science or cyber security. I love both but I kinda need to make money by now. Do you think I can earn money in a short time in data science? Working as a freelancer or supporting small companies...
    Edit: I'm glad that you came back. Really love your videos

    • @adeshmishra1671
      @adeshmishra1671 2 years ago +1

      Go for cybersecurity brother, since the difficulty level is medium..
      But while earning 💰 you can also learn data science!!

  • @ratchakoon
    @ratchakoon 2 years ago

    The themes.csv which you provided on GitHub does not have an 'is_licensed' field. Is the 'parent_id' field the same as the 'is_licensed' field?

    • @KeithGalli
      @KeithGalli  2 years ago

      A little confusing, but you want to use parent_themes.csv, not themes.csv !!

    • @ratchakoon
      @ratchakoon 2 years ago

      @@KeithGalli Thank you
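
One quick way to catch this kind of mix-up is to inspect the columns right after loading. The sketch below reads from an in-memory string so that it runs on its own; the column layout and rows are assumptions for illustration, and with the real project you would pass the path to parent_themes.csv instead.

```python
import io
import pandas as pd

# Stand-in for parent_themes.csv (assumed columns, made-up rows).
csv_text = "id,name,is_licensed\n158,Star Wars,True\n411,Legoland,False\n"
parent_themes = pd.read_csv(io.StringIO(csv_text))

# Confirm the expected field exists before building any filters on it.
print(parent_themes.columns.tolist())
if "is_licensed" not in parent_themes.columns:
    raise KeyError("is_licensed column missing: wrong file loaded?")

licensed_names = parent_themes.loc[parent_themes["is_licensed"], "name"].tolist()
print(licensed_names)
```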

  • @Levy957
    @Levy957 2 years ago +1

    that task #2 was really hard to do alone

  • @alkiviadessavoullis2021
    @alkiviadessavoullis2021 2 years ago

    does anyone know why when I press continue or start project the Python Use python ... code checks gets highlighted pink and I can't work on the project ?

  • @baggid6257
    @baggid6257 2 years ago

    He is back~!

  • @raghavgoyal3324
    @raghavgoyal3324 2 years ago

    please upload a project every week

  • @nitiknayyar7659
    @nitiknayyar7659 2 years ago

    Damn I also started this project on Datacamp.

  • @clayherz_articles
    @clayherz_articles 1 year ago

    if i solve the second question with this code,
    counted_2 = licensed_sets.groupby(["year", "parent_theme"])[["is_licensed"]].count()
    counted_2 = counted_2.reset_index().sort_values("is_licensed", ascending=False)
    counted_2.drop_duplicates("year").sort_values("year", ascending=True)
    is it wrong

  • @gopikaprasad8607
    @gopikaprasad8607 1 year ago

    How to export the for loops result into excel?? Please reply

  • @elianmoralespina5851
    @elianmoralespina5851 2 years ago

    Hey guys, would it be a good idea to use Datacamp projects in my resume?

  • @sanjeetlal1873
    @sanjeetlal1873 2 years ago

    Legend's back❤️

  • @sabbirahmed8012
    @sabbirahmed8012 2 years ago

    Hello Keith, can you please mention some resource to master natural language processing?

    • @KeithGalli
      @KeithGalli  2 years ago

      Hey! I actually did a PyCon lecture on NLP. That should be pretty helpful: ua-cam.com/video/vyOgWhwUmec/v-deo.html

  • @rabinmainali3373
    @rabinmainali3373 1 year ago

    I did it the following way (question 2):
    1. I counted the licensed sets released each year.
    2. Then I counted only the Star Wars sets released each year.
    3. Then I calculated the ratio of step 2 to step 1.
    Is that okay? By the way, the result is also 2017 for me.
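
The three steps above can be sketched as follows, on invented data. Note the caveat that the Star Wars share falling below one half does not by itself mean a single other theme overtook it, so this reading can disagree with a per-theme comparison on other data.

```python
import pandas as pd

# Hypothetical licensed sets (made-up rows).
licensed_sets = pd.DataFrame({
    "year": [2016] * 3 + [2017] * 4,
    "parent_theme": ["Star Wars", "Star Wars", "Super Heroes",
                     "Star Wars", "Super Heroes", "Super Heroes", "Super Heroes"],
})

# Step 1: all licensed sets per year.
total_per_year = licensed_sets.groupby("year").size()

# Step 2: Star Wars sets per year.
is_star_wars = licensed_sets["parent_theme"] == "Star Wars"
star_wars_per_year = licensed_sets[is_star_wars].groupby("year").size()

# Step 3: Star Wars share per year (a year with no Star Wars sets counts as 0).
share = (star_wars_per_year / total_per_year).fillna(0)
print(share)

# Year with the smallest Star Wars share.
lowest_share_year = int(share.idxmin())
print(lowest_share_year)
```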

  • @merterisen
    @merterisen 2 years ago

    16:52 how did you change 'Star wars' text immediately?

    • @KeithGalli
      @KeithGalli  2 years ago

      Lol that was just video editing xD.

  • @manu93ize
    @manu93ize 2 years ago

    bro Can you do a tutorial on data cleaning with Pyspark with real world example.

  • @letsjoinhands
    @letsjoinhands 2 years ago

    hello again Keith. For Q#2 I am getting a different result for new_era using this code:
    So the lego_all_lic is the DF containing all licensed lego set themes with the shape (1179 x 8) and that has been grouped by year to form lego_all_lic_yr. And the rest of the code I have written is quite simple to understand. Looks as if I have made a big mistake in aggregation but can't seem to locate it.
    lego_all_lic_yr = pd.DataFrame(lego_all_lic.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('set_num', 'count')))
    lego_all_lic_yr.reset_index( inplace = True)
    lego_all_lic_yr.replace(to_replace = [theme for theme in lego_all_lic_yr['parent_theme'] if theme != 'Star Wars'], value = 'Others', inplace = True)
    lego_all_lic_yr = pd.DataFrame(lego_all_lic_yr.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('Parent_Theme', 'sum')))
    lego_all_lic_yr
    When you look at the result it shows that 2006 was the first year in which Star Wars lost to other themes in terms of the sets released in that year.

    • @letsjoinhands
      @letsjoinhands 2 years ago

      OK, so I basically misunderstood the question. It wasn't about Star Wars themed sets vs. all the rest; rather, it was about the year in which Star Wars lost out to some other individual theme. Got the correct answer using:
      lego_all_lic_yr = pd.DataFrame(lego_all_lic.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('set_num', 'count')))
      lego_all_lic_yr.reset_index( inplace = True)
      lego_all_lic_yr = pd.DataFrame(lego_all_lic_yr.groupby(by = ['year', 'parent_theme'], axis = 0).agg(Parent_Theme = ('Parent_Theme', 'sum')))
      lego_all_lic_yr = lego_all_lic_yr.sort_values(by = ['year','Parent_Theme'], ascending = False)
      lego_all_lic_yr.head(50)

  • @user-ty4jy4cp3r
    @user-ty4jy4cp3r 1 year ago

    why didn't you use .agg?

  • @damarbowo
    @damarbowo 2 years ago

    Can I see your membership playlist? I can't find that playlist

    • @KeithGalli
      @KeithGalli  2 years ago

      Hmm I'm not sure what you are asking to see, can you clarify?

    • @damarbowo
      @damarbowo 2 years ago

      @@KeithGalli you have a membership benefits. One of the benefit is got playlist or videos for member. Do you have an example the video or playlist for member join your channel?
      Hope you understand

    • @KeithGalli
      @KeithGalli  2 years ago +1

      I just started my memberships last week so I haven't posted any exclusive videos there yet. To get an idea of the types of content I'll post there, check out these videos ua-cam.com/video/qnSF8YaPx78/v-deo.html
      ua-cam.com/video/OkEoPIOwvhg/v-deo.html

    • @damarbowo
      @damarbowo 2 years ago

      @@KeithGalli I'll wait Keith.
      Regards

    • @KeithGalli
      @KeithGalli  2 years ago

      Sounds good!

  • @zeasammy7572
    @zeasammy7572 2 years ago

    Does DataCamp have video learning platform?

    • @KeithGalli
      @KeithGalli  2 years ago +1

      The typical structure of classes is short videos that overview the concepts and then a bunch of interactive problems with a code editor to drill down the technical side of those concepts.

  • @dharshankumar2522
    @dharshankumar2522 2 years ago

    Keith is back...yeahhhh

  • @mufasao6776
    @mufasao6776 2 years ago +1

    I see that you posted some of your hidden videos. Thank you.

  • @letsjoinhands
    @letsjoinhands 2 years ago

    Hi Keith! this is how I solved Q # 1. Pls let me know if this is a bad coding practice, is acceptable or is good in your opinion. so I first made a function called is_lic.
    def is_lic(df_1, df_2):
        df_1['is_licensed'] = bool
        theme_1 = list(df_1['parent_theme'])
        theme_2 = list(df_2['name'])
        lic_status = list(df_2['is_licensed'])
        for i, s in enumerate(theme_1):
            for r, t in enumerate(theme_2):
                if s == t:
                    df_1['is_licensed'][i] = lic_status[r]
    Then is_lic(lego_sets, lego_themes)
    Then all_themes = [ ]
    for r in lego_sets.itertuples():
        all_themes.append([ r[6], r[1], r[7] ]).
    Then all_lic_themes = [x for [x, y, z] in all_themes if y is not np.NaN and z == True]
    star_wars = [theme for theme in all_lic_themes if theme == 'Star Wars']
    the_force = int(len(star_wars)/len(all_lic_themes) * 100)
    the_force = 51%

    • @KeithGalli
      @KeithGalli  2 years ago +1

      So my biggest recommendation based on your code is to be more explicit with how you name your variables. So instead of "df_1" & "df_2" you might name those dataframes "parent_themes_df" & "lego_sets_df" respectively. Furthermore it would be better to name variables "i" & "s" something like "parent_theme_index" & "parent_theme_value". These types of changes will make your code more readable. Functionally, everything looks sound though. Nice work!

    • @letsjoinhands
      @letsjoinhands 2 years ago

      @@KeithGalli Thanks a bunch Keith. In retrospect, when I think about how you were solving this question in the video, I realised that you were using pandas built-in methods the whole time. So yes, we could use a smattering of plain Python to do this (like I did), but using the library's built-in methods would be simpler and more advantageous most of the time. Is that correct?
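
For comparison, here is a rough sketch of how the same Q1 calculation might look with pandas built-ins. It is not the code from the video. The two small frames are made up, and dropping rows with a missing `set_num` is an assumption meant to mirror the `np.NaN` check in the loop version above.

```python
import pandas as pd

# Hypothetical miniature tables (made-up rows).
lego_sets = pd.DataFrame({
    "set_num": ["a-1", "b-1", "c-1", "d-1", None],
    "parent_theme": ["Star Wars", "Star Wars", "Harry Potter", "Town", "Star Wars"],
})
parent_themes = pd.DataFrame({
    "name": ["Star Wars", "Harry Potter", "Town"],
    "is_licensed": [True, True, False],
})

# The merge replaces the nested loops: each set picks up its theme's is_licensed flag.
merged = lego_sets.merge(parent_themes, left_on="parent_theme", right_on="name")

# Keep licensed sets that have a set number.
licensed = merged[merged["is_licensed"]].dropna(subset=["set_num"])

# The mean of a boolean column is the fraction of True values.
the_force = int((licensed["parent_theme"] == "Star Wars").mean() * 100)
print(the_force)
```

The merge does the lookup in a single vectorised step, which tends to be both shorter and faster than looping over every pair of rows.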

  • @RED_S0N
    @RED_S0N 9 months ago

    keith moment

  • @igor-xadrezxadrez8541
    @igor-xadrezxadrez8541 2 years ago

    Hey, there's a red dot on your nose.

    • @KeithGalli
      @KeithGalli  2 years ago

      I got in a fight playing hockey!

  • @Viralvlogvideos
    @Viralvlogvideos 2 years ago

    Big nose :P

  • @54peace
    @54peace 2 years ago

    you got injured on your nose???

    • @KeithGalli
      @KeithGalli  2 years ago

      I got into a little ice hockey fight!

  • @AbhishekSharma-hy4nl
    @AbhishekSharma-hy4nl 2 years ago

    Bro what happened to your nose😟?

    • @KeithGalli
      @KeithGalli  2 years ago +1

      Got into a little fight playing ice hockey! We won the game though so it's cool xD