Python Pandas Tutorial (Part 8): Grouping and Aggregating - Analyzing and Exploring Your Data
Вставка
- Опубліковано 3 гру 2024
- In this video, we will be learning how to group and aggregate our data.
This video is sponsored by Brilliant. Go to brilliant.org/cms to sign up for free. Be one of the first 200 people to sign up with this link and get 20% off your premium subscription.
In this Python Programming video, we will be learning how to group and aggregate our data. This will allow us to explore our data in ways we have not yet done in this series. We will be able to answer questions such as: "What is the most popular social media site for each country?" We will be using the groupby method, and also some aggregate functions such as mean, median, value_counts, etc. Let's get started...
Video Timestamps:
Aggregate Column - 2:00
Aggregate DataFrame - 3:55
Value Counts - 7:51
Grouping - 12:30
Multiple Aggregates on Group - 26:00
People Who Know Python By Country - 27:20
Practice Question - 34:20
Concat Series - 37:27
The code for this video can be found at:
bit.ly/Pandas-08
StackOverflow Survey Download Page - bit.ly/SO-Surve...
✅ Support My Channel Through Patreon:
/ coreyms
✅ Become a Channel Member:
/ @coreyms
✅ One-Time Contribution Through PayPal:
goo.gl/649HFY
✅ Cryptocurrency Donations:
Bitcoin Wallet - 3MPH8oY2EAgbLVy7RBMinwcBntggi7qeG3
Ethereum Wallet - 0x151649418616068fB46C3598083817101d3bCD33
Litecoin Wallet - MPvEBY5fxGkmPQgocfJbxP6EmTo5UUXMot
✅ Corey's Public Amazon Wishlist
a.co/inIyro1
✅ Equipment I Use and Books I Recommend:
www.amazon.com...
▶️ You Can Find Me On:
My Website - coreyms.com/
My Second Channel - / coreymschafer
Facebook - / coreymschafer
Twitter - / coreymschafer
Instagram - / coreymschafer
#Python #Pandas
I hope everyone had a great week! We've got a long video this week, but we go over a lot of important topics about how to analyze data in Pandas. We will learn how to answer very interesting questions such as "What is the most popular social media site by country?". I put timestamps together for this video so that you all can skip around if you need to go back and watch a specific section. Here are those timestamps:
Aggregate Column - 2:00
Aggregate DataFrame - 3:55
Value Counts - 7:51
Grouping - 12:30
Multiple Aggregates on Group - 26:00
People Who Know Python By Country - 27:20
Practice Question - 34:20
Concat Series - 37:27
Have a great weekend everybody!
Thanks so much bro,,,,much love from kenya
Hey, is this series gonna continue?
Can you explain all the types of joins
And if possible please explain multi-level Indexing as well
Corey, is it safe to assume if your coming from a SQL background, that you can effectively use things like the 'pd.concat()' to replace the various joins (left, right, inner etc) workflows in SQL and just use SQLAlchemy or pyodbc libs to load the data and then do all the calculations with python that you would normally do in whatever SQL dialect?
3:10 median function
5:00 describe function
7:20 count()
8:05 value_counts()
12:51 grouping the data
14:39 groupby() function
16:07 get_group(), grabbing a specific group by name
17:30 doing same by using the filters
18:40 using value_counts on filters
20:20 value_counts() for groups
21:49 using loc to find for one country
23:40 percentage by using normalize
25:00 median by country group
26:13 agg function for multiple functions
27:30 using filtering to get python users by country
30:20 error on using same approach for groups
31:40 apply method to run that on group
35:40 finding the percentage of people using python in each country(group)
37:40 using concat for combining series in a dataframe
45:30 adding percentage column
thx, this is very useful. The videos already are very concise and to the point, but if I am just looking for how to do a proper groupby quickly on my own dataset....
thanks a lot.
@@afdqwfqwqwdfqwdawdas sure
FYI, the percentage problem can be solved alternatively as follows: country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum()/x.count())
@@80expertube It throws a RuntimeWarning
I love the fact that there are no ads interrupting in the middle. So thoughtful. ❤️
Yeah, I didn’t want the to ruin the flow of the videos. Glad you noticed :)
Corey Schafer OMG! Your reply just made my day!
@@coreyms Thank you so much for doing this. You are such a considerable person with a big heart.
Corey, do you have a Patreon page? Thank you for your exceptional videos; a huge help for me and so many people!
Yes please, do a video on the topic of MULTIPLE INDEXING!!
I would pay to see Corey's tutorial on this. Every time I encounter a multi-index, I'm on stack overflow. It just doesn't seem to stick.
Let's all admit that this dude is a hard working man and his work is just a wow!
I've been following him for quite some time now and I am always impressed by how thoughtful, tactical and clear his explanation is in every tutorial he makes.
Hat off to you, dude!
I'd like to share my solution to the practice question:
country_grp['LanguageWorkedWith'].apply( lambda x: x.str.contains('Python').sum() / len(x) * 100 )
As you can see it's just as symple as adding " / len(x) * 100 " in the lambda function, where len(x) is the total number of users for each country.
como se te ocurrio esto? ahem I mean, How did u come up with this? well played
I guess x.count() in place len(x) makes more sense, since there can be people who did not answer language(I highly doubt XD)
@@BCS_FahadAhmad I think x.count() will not count NaN so I think len is better if you want to include people that skipped that question. I am assuming that was an option.
I was thinking the same, so i downloaded his repository and tried it and it works. Came here to comment and saw your comment. so , i just wasted 5-10 minutes of my day. whatever.hallelujah.
Damn well played
I'd like to share my solution to the practice question.
ctr_knows_python = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python', na=False).value_counts(normalize=True))
ctr_knows_python.rename({False:'Don\'t know', True:'I know'}, inplace=True)
ctr_knows_python
Hope this helps too!
Nice!
Hi, can you please explain what "na=False" means here and why do we have to put this in the code? Thanks in advance
@@moushumitamanna not applicable
@@tplano3794 thanks. But why should we put na=false in this code
@@moushumitamanna in a column which is expected to have numbers, na does not make sense so we filter out these values. also if you run any functions (mean, median) then you may run into syntax errors
numeric_columns = df.select_dtypes(include="number")
medians=numeric_columns.median()
print(medians)
# this is a way of getting the medians of numerical values as I use df.median(), it gave me value error that says could not convert string to float"I am not a student who is learning to code" thanks for great work. I learn more from you than from my professors. Thank you so much for great efforts!😎
You saved my learning session, thanks!
Thank you bro
Thank you, you save me alot of time,
I found that you can shorten it by writing numeric_columns.median()
Thanks man
The level of my programming in Python has been substantially improved since I have started watching your great videos. Many thanks, Corey. Would you please prepare some videos regarding the networkx module as well?
other gurus are just like "here's the code for this, copy it and don't ask why" but you are the only one who shows how things work. Thank you very much
I love how you are just using the same data throughout the whole series. Thank you so much, Corey!
I just discover that your way of teaching is to tell not only how to do it but why this is how to do it. thumb up!!
YESSSS !!! Corey`s video for groupby. I press like before I watch it. Groupby is just a superpower. Thank you for this awesome series Corey. You are the best.
I don't care if some of your videos are long, in other channels they're just redundant but that's not your case! If you start doing short videos we may be losing all that valuable information that you provide to us. So far, this is the best Python channel I've seen. Greetings from Medellín, Colombia.
For your practise question, use .mean() instead of .sum()
.mean() on a Series of bool will give you the fractions in a quick and easy way. Multiply by 100 for %.
edit: As per discussion below, .mean() ignores NA values whereas Corey's approach treats NA as '0'. An alternative is thus:
mygroups['LanguageWorkedWith'].apply(lambda x:x.str.contains('Python').fillna(0).mean())
Now, the results should be equal to Corey's.
It looks like a nice solution but numbers from Corey's video are slightly different than those with .mean() and I do not know why.
For example:
for Afghanistan PctKnowsPython 18.181818, with mean is 20.512821
for Albania PctKnowsPython 26.744186, with mean 27.710843
Does anyone know why?
@@davidsp7949 yeah i have the same doubt,,i guess that s due to the fact that mean() take in consideration only the respondents who effectively answered the question,,and sum() take all respondents even the ones with NaN for the question.so corey solution is Pct over all respondents,, and the mean() is over only the ones who answered this Q.
.mean() neglects those "NAN" responses while if u use x.str.contains('Python').sum()/x.size instead, it would count those "NaN" as "no pythoner" which is what Corey was doing
Well played. For me your solution is quite elegant.
BTW, in case you wanted to treat NA as zeros (to get the same results from the video), you can simply use .mean() with its 'skipna' option disabled. This would make:
mygroups['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').mean(skipna = False)
I just changed sum to value_count(normalize=True) and it worked
You make all your subscribers happy with just uploading your videos !!! Thanks, Corey
Thank you all for watching!
This has to be one of the best videos on youtube about Pandas, thank you so much. Greetings from Perú.
Hey Corey! Thanks a million for the Pandas Series. As always, very intuitive and easy to follow.
Now that you've taught Matplotlib and Pandas, would love to see a new Numpy series in order to complete the Data Science trinity. Please consider adding a Numpy Series.
one of the best thing that happened to me when I woke up (I am on the opposite side of the world to Corey Schafer) is finding that Corey just upload another Pandas tutorial video , thank you !
I signed up for brilliant org just for Corey Schafer. Thanks for sponsoring him.
Best video in the series loving them and normally can’t wait for the next.
Working on a project evaluating an employee survey and this is just what the doctor ordered. Thanks! One of the best channels in UA-cam for data analysis hands down
after this lecture i started loving the complex coding of pandas and matplotlib. really you are an amazing teacher
For your exercise (What % knows Python) , I divided the sum in the lambda function by x.count() then multiplied by 100 :
country_group['LanguageWorkedWith'].apply(lambda x: 100 * x.str.contains('Python', ).sum() / x.count())
Thanks for the free content, awesome
use len(x) instead of x.count() because count() function exclude respondents who does not know any language.
Thank you so much for this video. I learnt way more from this than the many hours I spent sitting in class listening to a teacher who just wanted to end the lesson early or have long lunch breaks. This is really precious. And thanks for the reassurance that if I find this difficult, there's nothing wrong with me LOL.
Corey Again. Very fantastic tutor. I press the like button before I watch.
I am a GIS student and I want to thank you because I'm doing my last assignment for university and I'm using Geopandas, matplotlib, pandas, cartopy and forth on and you helped so much with your videos, I have build a nice map and I have produced different tables with my data. Thanks god you are out there and sharing your knowledge free
Hey Corey! For the question you gave: The percentage of people by country who use python. There is an efficient solution too (Without creating a separate dataframe).
country_grp["LanguageWorkedWith"].apply(lambda x:(x.str.contains("Python").sum()/len(x))*100)
Actually what I am doing here is, in the lambda function, at the return part, I divide the No. of python users by the length of the given series and then multiply it by 100. This gives the percent of python users in each country. This approach might be a bit code efficient but can be a bit confusing for some.
len(x) might not give u total respondent since it is series on LanguageWorkedWith and people might have skipped.
@@maheshmmmec Correct
You can use len(x) in the lambda, it will include the na in the series.
You cannot use count(x) because it skips na. But you can use value_counts(x).
If you run the following code, it will verify that Prasanna solution is corret:
len_total = country_grp["LanguageWorkedWith"].apply(lambda x: len(x)).sort_values(ascending=False)
us_no_answer = country_grp["LanguageWorkedWith"].apply(lambda x: x.isna().sum())
us_answer = country_grp["LanguageWorkedWith"].apply(lambda x: x.notna().sum())
df_counts = df['Country'].value_counts()
df_counts = pd.concat([df_counts,len_total,us_answer,us_no_answer],axis='columns')
df_counts.columns=['df_value_counts','lambda_len','user_respond','user_did_not_respond']
df_counts
produces:
df_value_counts lambda_len user_respond user_did_not_respond
United States 20949 20949 20769 180
India 9061 9061 8844 217
Mr. Schafer, I am so happy I found your teaching. I have been on a journey to become a data analyst, and after completing the Google Analytics Course , I realized that I needed to learn much more. I am currently finishing a Python Course through Coursera offered by IBM.
Not every professional, no matter how good they are, have the natural ability to teach. Your method and technique are so amazing and helped me to overcome some of the confusions I had with coding in Python. I learned so much from just this video alone.
I will definitely visit the site you referenced, and look forward to learning more from your videos.
Thank you so much!
American?
hey if ur df.median() doesn't work and ur getting typeerror and valueerror u can do df.median(numeric_only=True)
Thanks man
Thanks a bunch!
was looking for an advice on this, thanks a lot!
thanks a lot
Hey, Corey. I'm a network engineer and I'm learning pandas to be able to do some packet analysis and your videos really help me a bunch!
This is my solution for the coding problem from the video. Hope it helps!
know_python = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
total_respondents = country_grp['LanguageWorkedWith'].apply(len)
know_python / total_respondents
Im browsing thru some of the videos to brush up on Python, and this is the first python video that didnt get me bored. Concise and brillliant. Love your videos! keep up the good work :)
Thanks Corey for sharing these videos. Pretty clear explanations. You are a great teacher.
Perfect timing. Thank you for all the effort put into these videos. I've done that last jibe crash so many times, with the feet just off centre
Thanks so much Corey ! It's clearly impossible not to understand what you explain on all your videos : It's fluid, straightforward, crystal clear ! And more over your english : Whaoooo ... Congratulations !! I wonder if I've learned more Pandas or english !!
200% great !!
thanks so much for this series. started from the first video two weeks ago, now in the 8th. this series so far made a lot progress in me,. thanks so much, .May God Bless You. Love from Sri Lanka...
NAMASTE!! Corey Schafer.. Love From India
perfect content. one flaw is that , you never explained what a "lamda" function is and went straight to using the function in your previous videos. you did the same thing here. now i have to first study lamda.
Yeah I had to take a look into it. For anyone curious, lambda and "def return" is the same thing, but lambda are throway functions while "def return" assign a name to be applid further in the code. The "def return" syntax is easier to read and more clean than lambda
Every second of your python video are really worth it!!!
Best videos on pandas on UA-cam by Corey Sir. Loving them and normally wait for the next videos. Lots of love for you from India.
Really appreciate the addition of practice problems. It helps me to grasp the material
Man you are one of the best teachers who ever learned me something, much love and support ♥️
Every time I saw a new post, I click the 'like' button before watching:D
The reassurance at the end was so appreciated as a beginner. Thank you for your help !!!
My solution to the practice question:
def pctKnowsPython(x):
try:
return x.str.contains('Python', na=False).value_counts(normalize=True).loc[True]
except KeyError:
return 0
pctKnowsPython = country_grp['LanguageWorkedWith'].apply(pctKnowsPython)
Thanks, Corey. This is one of the most comprehensive pandas tutorials on UA-cam. Love from India.
I also noticed a subtle issue. We are adding the number of respondents who filled their Country and not who filled LanguageWorkedWith. Someone can fill Country and not LanguageWorkedWith.
Hi Corey. I think this might be a relevant simpler approach for getting percentage. I used value_counts(normalize=True) instead of sum.
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
This of course return both percentage who know Python and Who don't know. So if we want to get for a specific country, for instance Japan, then:
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True)).loc['Japan'][1]
I have a similar approach to,i am happy to see that i am not the only one who always sees the easiest ways =)))
instead of using : value_counts(), use sum() in second line. Then you'll get appropriate answer.
df.groupby(['Country'])['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
Great approach. Thanks for sharing
Here is my approach : filt=df['LanguageWorkedWith'].str.contains('Python',na=False)
python_count=df.loc[filt]['Country'].value_counts()
python_count.rename('p_c',inplace=True)
python_count
--
total_count=country_grp['Country'].value_counts()
total_count.rename('t_c',inplace=True)
total_count
--
result_horizontal = pd.concat([total_count, python_count], axis=1)
import numpy as np
result_horizontal.replace({'p_c':np.nan},0,inplace=True)
result_horizontal['perc']=(result_horizontal['p_c']/result_horizontal['t_c'])*100
result_horizontal
I just went through Part 8 a second time. Thanks a bunch!! I learned a lot.
Hey, Corey, I see many people gave their own answer which are wonderful. So want to give my one here, which looks like more a beginner answer:
#create a new column called: 'Python use'
df['Python_use'] = (
df['LanguageWorkedWith'].str.contains('Python')
)
#.value_counts() the 'Python use' column as it is a boolean type:
country_grp = df.groupby(['Country'])
Python = country_grp['Python_use'].value_counts(normalize=True)
Thanks for your great Lectures! I watched from part 1. I am your big fan now.
Best Pandas playlist in youtube. One point solution for python learners
Just want to share here my solution for the practice question (but with the survey of 2022):
---
country_group['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))
---
And also give thanks to your wonderful videos, Corey!
It's been 3 years and they're still among one of the best tutorials.
I thought the same thing
We can do this in the lambda function for getting the percentage :
lambda x : x.str.contains ('Python').value_counts(normalize=True)
And I would love to thankyou for all these videos!
"a = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True))"
this what i did is it correct or not i dont know??
Thanks Corey.
I have waited for this video whole week.
Great explaination
This video is so useful , the simplification that python does for the problems is so helping. is the best language in the programming of code. And the proffesor of this video is really a genius. !!! thanks.
im just diving into pandas and would like to say - GREAT THANK YOU for such nice and detailed explanation.
great job!
Thanks a lot for your teaching! Here is the my solution at the end of the video:
# group object['column'] is a Series object, so the input of the function is a Series, ana the output value of the function is a float
def percent_know_python_each_country(countrySeries):
num_know_python = countrySeries.str.contains('Python').sum()
num_all = len(countrySeries)
percent = round((num_know_python / num_all * 100), 2)
return percent
country_group['LanguageWorkedWith'].apply(percent_know_python_each_country).sort_values(ascending=False).head(30)
Thanks for doing this video in a detailed way, like you always do. Just under an hour is a good length for a video like this. Thanks!
LOL @ "If I did this correctly, and it's definitely possible I made a mistake." Happy I found these videos, thanks.
A much simpler way of finding the percentage of people from each country who know python is by using the mean() function
Something like this :
country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').mean())
P.S: Corey's method produces a much detail-oriented version of this result
Corey Shafer deserves a UA-cam Teacher award
Helps me to get into my PhD. Thanks a lot for uploading this!
The groupby explanation comparing it to a filter is really good.
47 minutes of a pure pandas tutorial from a god in python, man you're a hero🔥🔥
I've watched quite a variety of different data analysis tutorials and this one was easily one of the most engaging for me. Having interesting data really helps.
Perfect explanation. Making a convoluted yet so important concept crystal clear through step by step explaning and also giving connections to pandas object types. Cheers!
I found amusing explaining what a percentage% is after going over apply & lambda methods, but that's exactly the thoroughness that makes your videos so loved :)
One of the best and most easily understandable vid on Pandas. Thank you for creating this !!
I've been doing data analysis using Python & Pandas for a few years now. Still, I'm picking up new things from your videos. Very clearly explained! Thank you!
Thanks Corey! This is pure gold!
Thanks a lot Corey! Got to learn complex syntax in simple ways. You are amazing teacher.
This was a lot to learn in one video. That's why I went back and watched it again this week. At the end, I added something I think would be useful in what I do. I added a filter to select records where the number of respondents is >= 5.
filt = python_df['NumRespondents'] >= 5
python_df.loc[filt]
You're Genius
Love From India
You mentioned India in your tutorial
Love this gesture 🇺🇸🇮🇳
Hey Corey,
first of all thank you very much for all those fantastic videos. I also have tried to answer the question of percentage knowing python for each country. I came up with following solution:
PrctKnowPython = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python', na=False).value_counts(normalize=True)).loc[:,True]
One advantage of this approach is, that you can just remove na=False and ignore NaN values in your data.
Always looking forward for your videos, Corey.
nice video :) i liked the q/a approach at the end where you left it open.
You are just wonderful, it makes so much fun watching your tutorials. I finde directly the answer, those I need.
God bless you brother.
thank you Corey!! this is a wonderful pandas series!! you make the concept so easy that even a python beginner (that is me) without programming background in colleague can understand!!
i'd like to share my solution as well: (since i don't know concat method, i calculate the answer first then convert them into a dataframe by dictionary)
know_py = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
answer = df['Country'].value_counts()
per_cent = know_py/ answer
result = pd.DataFrame({'answer': answer, 'Python': know_py, 'percentage': per_cent})
i find you stop uploading new tutorials for a long time, hope everything goes well with you. and strongly looking forward to hearing from you soooooon!!
thank you & greeting at 2022 sep 7th :)
Thank you for the last example (percent that knows python). It was great to see how the different methods learnt can come together in a practical example; this really helped consolidate the knowledge gained.
you teach really well, I am learning a lot...Thanks. I have also learned other topics from your videos....I was stuck on opening and reading a csv file in python, till I saw your video and learned it....I am an absolute beginner...😅...Thanks
Corey, the content of your videos are amazing. This tutorial in special is sensational.
Corey, I learn a lot from your videos. But this one has been especially helpful. Thanks.
best tutorials ever, i have read lot of books, but your technique is global and works best...keep on the good word @Corey Schafer
I really like the way you speak. Your language is clearly understandable and you have a great accent. :)
I am following your pandas series since the last 3 days and may complete in 1 or 2 days max, I will come back to the series to revise it, very well made Series and keep up the good work !😀
At 35:40 since you want to get the percentage of boolean values that are 1 you can pretty much just replace .sum() with .mean(). It's sum/length.
Corey, thank you very much for your free videos!
Hello, about the question in the video, I don't know if my attempt is right but here it is:
1) To take into account the NaN values (so they are involved in the calculation):
country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum() / len(x))
2) Not taking into account the NaN values:
country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum() / x.count())
Since the information of how many entries are in country_grp and are grouped by country I think it is inefficient to look for the number of entries elsewhere, thank you for the course !
Thank you very much for the great video. for the percentage calculation, I use the following code based on what is learned:
country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').value_counts(normalize=True)).loc[:,True]
Thank you,I am very lucky that I found your tutorials.
Really appreciate your teaching strategy man..have been learning a lot from you since the quarantine started.Love from bangladesh
Best Valentine's Video!!
Thank you for this. Had clearer understanding of pandas than before. Wish you the very best
you are a great teacher Corey.
Hey Corey Schafer Thanks a lot for this amazing series which helped me to upgrade my skills which I was unaware of that.
Really enjoying this series, thank u Corey!
Yes Corey, having a future video on multi-index will be very helpful!
Wow! what a fantastic explanation! I saw lot of videos! but you are different! i know why? because you are using pandas with great examples and nice explanation! thank you again!
These tutorials are well thought out and really great in explainatory purposes. Greats skills here from Corey! Thank you.
Heyyy coreyyy I got drops in my eyes after watching the way you taught....you made my day❣️✨love you so much corey
First of all, Thank you so much for all your tutorials. You are a great teacher.
and yes, please make a video on Multiple Indexing.