When should I use a "groupby" in pandas?
Вставка
- Опубліковано 30 вер 2024
- The pandas "groupby" method allows you to split a DataFrame into groups, apply a function to each group independently, and then combine the results back together. This is called the "split-apply-combine" pattern, and is a powerful tool for analyzing data across different categories. In this video, I'll explain when you should use a groupby and then demonstrate its flexibility using four different examples.
SUBSCRIBE to learn data science with Python:
www.youtube.co...
JOIN the "Data School Insiders" community and receive exclusive rewards:
/ dataschool
== RESOURCES ==
GitHub repository for the series: github.com/jus...
"groupby" documentation: pandas.pydata.o...
"agg" documentation: pandas.pydata.o...
"plot" documentation: pandas.pydata.o...
== LET'S CONNECT! ==
Newsletter: www.dataschool...
Twitter: / justmarkham
Facebook: / datascienceschool
LinkedIn: / justmarkham
Seriously this was an amazing explanation. I cant like it enough.
Thanks! :)
It is indeed
Great video. It's better than many of the so- called great courses on the internet! Cheers!
Awesome, thank you! :)
I am a beginner to pandas. Amazing videos that were suggested by a friend. I have a doubt where I am stuck. For me in Python 3 drinks.groupby('continent').mean() does not work. min, max and count works but this does not (not without specifying 1 or more column names)
Excellent question! In the current version of pandas, if a DataFrame contains non-numeric data and you want to calculate the mean of all numeric columns after a groupby, you have to include the argument numeric_only=True. Thus: drinks.groupby('continent').mean(numeric_only=True). Hope that helps!
@@dataschool Yes it sure helps!!! Thanks for pointing out this argument
%matplotlib inline was answer to my 1 hour of headache to search why the hell plot() is not working! :P
Ah yes! That is very important, and easy to miss :)
Can I ask how to calculate the group by aggregation and only count unique number?Thanks ( BTW YOU ARE AMAZING)
Thanks for your kind words! As for your question, could you clarify? I don't quite understand. Thanks!
Your content is amazing , I visited some others channel , they are not taught like you,
I see this video in suggestions & I learn perfectly how to use groupsby and where to use ...
Thanks man........
Thank you so much! I never used the groupby function until now and this helped me complete my pandas assignment! very clear explanation and your videos have helped me a ton!
Glad it helped! 🙌
Me a Ph.D. student in experimental high energy physics getting spirit from all these tricks, and avoiding some boring tasks. Thank you very much, sir.
You're very welcome!
What do you recommend if I want to group by conditionals? Let's say I have a column "Blood Pressure" and I want to group by values 120.
Could you use the "for each" formulation when you're grouping by multiple categories?
Great question! Yes, you would say "for each combination of these categories".
Group By multiple column
data_frame.groupby(by=['continent','country']).beer_servings.mean()
Thanks for sharing!
Thanks, top video!
Only suggestion would be to expand on this with some strategies for creating a new DF with the grouped/agg values.
Thanks again
Thanks for your suggestion!
Thank you sir! Eu really saved my life, hugs from Brazil!!
Glad it helped!
Y there are less views🙃🙃.. Its an amazing explanation..
Thank you!
You’re my go to person for informative videos on pandas, Numpy, and Matplotlib.
Thanks!
Q&A Series: Video #14 _06(TC02:10)
==========================================
If you are trying to figure out how this actually works, this might help:
Let's filter the dataframe by Africa for a moment.
===========================
drinks[drinks.continent == 'Africa'] > drinks (A) [ ] (B) drinks.continent
===========================
drinks[]
Q&A Series: Video #14 _04(TC01:14)
==========================================
drinks.beer_servings.mean()
What if you wanted to know beer_servings across continents? You wanted to know, how it varied from Europe to Asia to Africa.
So this is where a GROUPBY (groupby) is USEFUL. (TC01:34)
That is exactly what I was looking for and couldn't find at the same time. Thank you!!!
You are welcome!
Great videos,helped a lot.Thanks!
You're welcome!
On the agg method. Is there a way to simply return which countries the min and max belong to? I know that now that, min 0 for specific continent I could just filter dataframe based on continent and beer_servings value of 0, but could I somehow do that in the initial agg call to also print out the biggest and smallest consumers?
Great question! If there's an easy way to do this, I can't think of it at the moment. Let me know if you come up with something!
Any update on that? Thanks!
import pandas as pd
drinks =pd.read_csv('data/drinks.csv')
print('*****************************')
max_bear= drinks['beer_servings'].max()
min_bear= drinks['beer_servings'].min()
print ('max_bear:',max_bear,',min_bear',min_bear)
print('*****************************')
print(drinks[drinks['beer_servings']==max_bear])
print('+++++++++++++++++++++++++')
print(drinks[drinks['beer_servings']==min_bear])
You can get the corresponding index values of a maximum with groupby().idxmax(). But if the country is in a column rather than the index, I think this is a 2-step process where you then index the country column by the results of the groupby.idxmax.
df.loc[df.reset_index().groupby('continent').beer_servings.idxmax()]
df.loc[df.reset_index().groupby('continent').beer_servings.idxmin()]
Q&A Series: Video #14 _03(TC01:10)
==========================================
drinks.beer_servings.mean()
The average beer_servings is 106
4:49 and dot aggggggggg~ hahaha 😂
But seriously tho, what an absurdly good explanation of groupby function. I wish you could be a professor in my university🤪
Thank you! 😄
drinks.groupby('continent').mean() does not work. but drinks.groupby('continent').min() works well
In the current version of pandas, if a DataFrame contains non-numeric data and you want to calculate the mean of all numeric columns after a groupby, you have to include the argument numeric_only=True. Thus: drinks.groupby('continent').mean(numeric_only=True). Hope that helps!
Thank you very much for an amazing explanation. I doubted .groupby method for a long time and finally found this video, This video totally clear my doubt out :D
You are very welcome! So glad it was helpful to you :)
You might be interested in my latest video, which has some usage examples of groupby: ua-cam.com/video/dPwLlJkSHLo/v-deo.html
greetings BROTHER,
at 1:47 you have written command drinks.groupby('continent').beer_servings.mean() , is there another way of writing which includes df['beer_servings'] instead of writing in attribute format (.beer_servings)??
drinks. groupby('continent').mean().plot(kinds='bar')
OR
d.c.plot()
Just sayin'
i recently noticed someone use this approach, drinks.groupby("continent").agg({"beer_servings":"mean"}) instead of drinks.groupby("continent").beer_servings.agg("mean") . Is it more preferable to use the first solution as it returns "beer_serving" label on column?
Even for my bad english level, this explanation was so clear. Big Thanks!
Great to hear!
Awesome as usual! thanks very much!
Thank you!
What if in groupby function....i want continents grouped with only 2 columns ( beer_servings and spirit _servings)
Hello,
what if i want to group by a string type of columns
i mean i could use groupby('first_column).second_column.sum() but this will give me the strings attached to each others without a space between them :/
so how can i add the space ??
Thanks! I realize this is years old, but worth a shot...
One thing I would love to see clearly explained is how to change values in one column based on a condition on values in another column. For example, if drinks.spirits > 100 and drinks.total_liters_of pure_alcohol >10, change drinks.continent (or other column) to "lush". As an addition, if values in one column (say drinks.name) contain the word "dirty" change drinks.briny to True. Etc
For your first example, the code would look something like this:
drinks.loc[(drinks.spirits > 100) & (drinks.total_liters_of pure_alcohol >10), 'continent'] = 'lush'
loc is super useful! Here's more information on loc: ua-cam.com/video/xvpNA7bC8cs/v-deo.html
Newbie in python. Can anyone tell me if this group by will work for a large dataset having 1 lakh rows and 500 columns?
Unfortunately, some of the values in the drinks data frame turned out to be NaN. I wonder if you could show us how to handle missing values to this. Thanks.
1:01 I know this is pedantic, but the values aren't weighted by the population size of teh countries so it's not actually a meaningful mean
Whether or not that is the "correct" calculation depends on the question you are trying to answer.
Great explanation! How could you show the information for the country in each continent with, say, the highest beer servings?
sir why in the last 2 cases where we didn't specify, in there with mean it is not executed but with min, max and count, it is being executed without showing any error? same for the visual form also?? help
I really like your videos, thanks. What if you wanted to check for condition before doing any math? Life if the min > 0, then do some math operation?
I will start your Sci-kit series too as I enjoyed so much watching all these pandas series as very clear.
succinct and effective, many thanks
Thanks!
Could you explain the difference between count() vs nunique()
For the multiple aggregation, is there a way to round the result to 2 decimal places
Hello. Firstly, thank you for this magnificent video series. I really mean that! I have a question, continuing with your beer example. If you didn't have a continent column in the source data how would you accomplish the same output?
Glad you like the series! Regarding your question, the continent column has to be part of the DataFrame, otherwise you can't group by it. Hope that answers your question!
Amazing explanation bro.
Thanks!
how can we do different functions for multiple columns - mean for 2 columns and sum for other three columns? 2 question: how can we do linear regression for each group after group by ?
This is great! I'm taking an absolutely hateful pandas course that's basically a needle-in-a-haystack scavenger hunt. Faced with a .groupby() problem where the course materials in no way logically lead to the answer, I wound up here. The problem doesn't precisely conform to your tutorial, but just three of your lines were enough to convey the basic mechanism of action that led to an answer. Thanks!
Awesome to hear!
nice presentation. if it is possible can you help me with error by using get_group()? it gives an error.thnx in advance
How about multiple inner connected groups?
Countries has cities, cities has streets, street has houses...
Amazing video as always, Kevin! Thank you for great tip at 03:48
My pleasure!
You explain pandas very well! Pls keep posting ...would love to see something on how to apply custom functions to tables (avoiding iteration use)
Hi, it sounds like what you are looking for is vectorization. I would recommend reading up on numpy where() and numpy select(). I looked through Kevin's (data school's) videos and didn't see anything on this topic. He is amazing at explaining things, in the mean time you many be interested in this video where I talk about vectorizing a little bit at the end of the video here:
ua-cam.com/video/CG3EV7UBELA/v-deo.html
Fantastic explanation :D
Glad it was helpful!
WOW... I wish all the tutorials were that clear. Thanks man!
You are too kind! 😄
Corey and Kevin you are brilliant. Thanks Corey as he advertised your channel, so pleased. It's like here go how to make a you tube content to teach
Excellent tutorials as always Kevin. Thanks for sharing.
Thank you!
Your'e rate of speech is inline with your coding..., I appreciate it.. thanks
You're welcome!
Does this mean that these are equivalent? Assuming dplyr in R:
## Python
drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max', 'mean'])
## R
drinks %>%
group_by(continent) %>%
summarise(count = n(), min = min(beer_servings), max = max(beer_servings), mean = mean(beer_servings))
I don't keep up with dplyr any more, I'm sorry! Perhaps someone else can answer.
thank you for this tutorial series you are really a good teacher it helped me a lot
I have a question would you consider making a tutorial video for pandas memory handling methods(I am not sure if the terminology is right ) how can we remove objects from memory when we are done or is there a way to control these processes
thank you!!
Thanks for the suggestion! I'll consider it for the future!
I second this request!
I second this request even
What happens if i have more that 264000 rows, and they are strings ?
Great Tutorials. :)
Thanks for such a nice series.
Could you please come up with videos on Heat Map & Correlation Table?
thanks in advance.
Thanks for your suggestion!
You made that so easy. Thank you.
You're very welcome!
Hiya,
The following code:
topcommenters = pd.DataFrame(dfcom.groupby("author")["score"].sum().sort_values(ascending = False))
topcommenters.head(5)
Gives me a long error saying "TypeError: '
It's hard to say, without more details. Good luck!
How to SUM across columns (horizontaly) filtered by parameter?
I think this video will help you to understand: ua-cam.com/video/PtO3t6ynH-8/v-deo.html
i am a great fan of you .. you rocks !!
Thank you! 🙌
Hi Kevin,
Thanks for the wonderful Video and i should really thank you for the service you are rendering for Data sciences community ....
@UCnVzApLJE2ljPZSeQylSEyg in the below Code Snippet, i would like to apply aggression on two columns i.e, say beer.servings & wine_servings. How i can do this? is any way we can group by on one column and apply aggression on two column?
drinks.groupby('continent').beer_servings.agg(['max','mean'])
Not sure I understand, sorry! But something on this page might help: www.dataschool.io/python-pandas-tips-and-tricks/
Great! What about grouping by several columns?
df.groupby(['col1', 'col2'])...
This is very helpful video. How can I apply this to the dataframe and want to keep all columns?
I have a dataframe called Dataprep with 19 columns and 18484 raws. I have aggregated the columns I wanted with the code below:
Dataprep.groupby('CustomerKey').OrderQuantity.agg('count')
CustomerKey is the ID number for customers and orderquantity is the number of orders by each customer. The data look like(Sorry I only include the columns required):
11000 - 1
11000 - 1
11000 - 1
11000 - 1
11001 - 1
11001 - 1
11001 - 1
11001 - 1
11002 - 1
11002 - 1
11002 - 1
11002 - 1
11002 - 1
11003 - 1
11003 - 1
11003 - 1
11003 - 1
11003 - 1
11004 - 1
11004 - 1
11004 - 1
11004 - 1
First column is customerkey and other one is orderquantity.
I'm sorry, I don't understand this dataset well enough in order to help. Good luck!
I have a groupby question that I can't find on the internet. What if my dataframe columns are dates: 1/1/2021, 1/15/2021, 2/1/2021...5/15. Can I groupby by month? The dates run along the x axis as columns and not the Y axis as rows.
I think you will need to convert to datetime format and then use resample. Hope that helps!
This video was just amazing!! You just got a new subscriber.
Thank you!
First video I clicked on and it answered all my confusion. Many thanks.
Great to hear!
This is really helpful and easy to understand! Looking forward to your future great works, keep it up.
Great to hear!
is there a way to sort the results of a groupby mean on the servings? and filter by top 'n' per continent?
i.e. I'd like to see the the top 3 results for beer servings sorted in descending order per continent instead of an alphabetized list
Continent, beer servings mean
Europe, 193.777778
South America, 175.083333
N America, 145.434783
thanks in advanced, your videos are great!
It's likely possible, though I can't think how to do it off-hand, sorry!
I really enjoy ur video. i was wondering if you had any videos on how to do a report and if not then can u point to some resources that may have a way to explain it to me.
Very clear explanation..Thank you very much for your contribution!! I am really enjoying while watch your video!!
Why don't you publish a Course in Udemy?
Thanks for your kind words and your suggestion! I know that many students like Udemy courses, but my values as a course creator don't align with their business model, and so I'm not currently interested in publishing a course there. I prefer to offer courses directly to interested students. Thanks for understanding!
7:50 drinks.groupby('continent').mean().plot(kind='bar'), how to show number for each bar?
Not sure off-hand, sorry!
Hello guys,
First of all, great videos, tahnks man, secondly i wanted to filter the dataframe (drinks) by continent and only print those who are min or max of their continent :
drinks[(drinks.beer_servings.isin(drinks.groupby('continent').beer_servings.min()) | (drinks.groupby('continent').beer_servings.max()) )]
I find this line of code a little bit long, and i was wondering if i could make it easier to write (optimize it) ?
I'd have to play around with it to figure this out... let me know if you come up with something!
Hey! I was trying to run this snippet from a Notebook
mean_tissue = {chunk.Taxon[0]:chunk.Tissue.mean() for chunk in data_chunks} # This part is returning a Key Error
Can we do this using the groupby method or any fix to get this running?
I'm sorry, it's hard for me to say without knowing a lot more about the data and data types. Good luck!
Hi, can i possibly use groupby to assign items in a dataframe as keys in a dictionary, and then reference other items as their values? For instance, in a two column DF and you have name,age,sex,country in one column and the corresponding answers in the next column, can you use groupby to make a nested dictionary that looks like this
biodata={name:{age:20,sex:male,country:latvia}}
If it isn't possible, kindly advise on which pandas function is more appropriate for this.
Thanks
I'm not sure how to do what you are suggesting. Sorry!
U r awesome man. Pls Pls make numpy series for data science, stats
It was really useful for me. I don't know english well, but I understood almost all words in this video
That's awesome to hear! 🙌
Well hopefully you could help me out, I'm currently working with a dataset that has a lot of redundant description values in a certain column, can a groupby be used to find the maximum number value for each of those redundant values? Say I have 4 rows that have the ACC value, but each of those rows have the number value as follows: 1, 4, 6, 3.... can a groupyby be used to find the one row that has the max number value?
I'm sorry, I don't quite understand. Could you provide a short example to explain what you are trying to do? This link might help you to create a reproducible example: stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples
great video
I have a question >> how to specify two columns ?? beer_servings and wine_servings for example
thanx in advance
drinks[['beer_servings', 'wine_servings']]
Q&A Series: Video #14 _01(TC00:05)
The question for today is: When should I use a "groupby" in pandas?
================
Kevin says, "Let's jump right in and pull up an example dataset."
import pandas as pd
--------------------------------
and our dataset is ALCOHOL consumption by COUNTRY.
Really cool vid but -
Cant I use a pivot_table for all group_by operations? I'm a little confused when to use each...can anyone help?
This video might help: ua-cam.com/video/tcRGa2soc-c/v-deo.html
Hi, Can you please explain transform,filter and apply methods with groupby? Those are becoming headache for me
Thanks for the suggestion - I'll consider it for a future video!
I do have one video on using 'apply', but not as it relates to a groupby: ua-cam.com/video/P_q0tkYqvSk/v-deo.html
Thank you Man! You are doing a great job here by showing Pandas tutorials in such easy way. Pray that you can carry out this benevolent work more
Thanks very much for your kind words!
Best explanation !
Thank you!
I love your videos bro
Thank you!
Excellent tutorial! If possible maybe another video to refresh and/or to address Pandas 1.0.x?
Thanks for your suggestion!
How could I show the names of each country in each continent?
agg is the best. thanks a lot
You're welcome!
Great, clear tutorial. Plus now I know Europe is the most drunk continent
🍺😉
Hey, thanks for the video on GroupBy. I am curious how you can use GroupBy in the context of quicker data processing of slicing and dicing. Rather than use a full dataset, use first groupby function to condense dataset and then work with that condensed dataset. Can you think about that scenario and elaborate how to do it in the most efficient way?
Sure, you can save the groupby object as then use that for future calculations!
Hello. using your example, id like to make histograms for each continents. How do I make a poisson distribution out of the groupby mean? & fit the distribution to the histograms
Sorry, I won't be able to assist you with this!
Are you fucking kidding me.... you sir are the man. These tutorials have been blessing me. Where have you been all my life 😂
Awesome! Thanks so much for your very kind comments :)
Can Sir we get some practice problem based on that series
one quick question, do i need to verify that country fiedl type is a string or the groupBy works with Objects. i have a lot of fields in my dataset which are Objects and can't use the groupby. it that is the case then how to convert Object to string?
Strings are objects in Pandas (so you don't need to convert objects to strings). Groupby will work with object/string columns.
Nice job with the explanation. I combed through dozens of videos and articles online specific to leveraging groupby and your's by far is the best.
Thanks very much for your kind words!
Can anybody xplain me..that how to see which country in africa having maximum beer serving
You can see this video ua-cam.com/video/ni9ng4Jy3Z8/v-deo.html if for an explanation on how the first part of this code works.
First we filter the df to only include countries in Africa.
Then we look at only country and beer servings and sort it from largest to smallest number of beer servings
Then we get the answer Namibia with 376 beers served
africa = drinks[drinks['continent']=="Africa"]
africa[['country','beer_servings']].sort_values(by='beer_servings', ascending=False)
Nailed it! Exactly what I was looking for. Thanks
You're welcome!
Easily understandable. Thanks for the video and keep making such videos.
You're very welcome!
Let's say for instance, you have a column for mobile numbers and you want to find out the top 10 numbers that appeared more. How can i solve that?
Try df.column_name.value_counts()