I have a dataframe that has around 20 columns and 800 rows. One column contains duplicated values that I am using as the group, and based on one of the other columns I want to filter the dataframe down to one row per group: the row with the highest number in that column, using max(). I still want to retain all of the other columns and end up with a dataframe that contains these unique rows including the original columns.

group = df_UE5_Compatability_info.groupby('lookup')['Function Count'].max()

where "lookup" is the column I want to group by (containing multiples of the same value), filtered to show the rows with the highest number for "Function Count". How do I make the resulting dataframe contain the other columns associated with the rows the groupby picks? I am struggling. Difficult to describe in words... sorry.
Hi Alan, you did a great job explaining; thanks for providing an example of what you have done. 😀 If I'm understanding correctly (please correct me if I'm wrong), you have one column that contains categories and you want to get the max value for each of those categories in every column you have (using groupby). Here is a simple example I made that gets the max value for every column in the dataframe based on the groups in Col_4:

import pandas as pd

# Create practice df
df = pd.DataFrame({'Col_1': [1, 2, 3, 4, 5],
                   'Col_2': [6, 7, 8, 9, 10],
                   'Col_3': [11, 12, 13, 14, 15],
                   'Col_4': ['Group_1', 'Group_2', 'Group_1', 'Group_1', 'Group_2']})

# groupby Col_4 (in your case, use lookup)
group = df.groupby('Col_4').max()
group.head()

You will notice that instead of passing a list of columns to perform the groupby function on, I excluded it. This performs the operation on all the columns. In your example, you should be able to do the following to get your answer:

group = df_UE5_Compatability_info.groupby('lookup').max()
@@ChartExplorers Thanks for the reply. Below is a made-up sample dataset to try and explain better, one that is more representative of my actual dataset:

df = pd.DataFrame({'lookup': ['abc123', 'abc124', 'abc123', 'abc125', 'abc125'],
                   'Supported': ['no', 'yes', 'no', 'yes', 'yes'],
                   'Percentage': [0.9, 0.6, 0.6, 0.7, 0.6],
                   'Number of features': [1, 6, 10, 8, 11],
                   'Platform': ['Release 1.0', 'Release 1.0', 'Release 2.0', 'Release 1.0', 'Release 2.0']})

The dataframe looks like this:

   lookup  Supported  Percentage  Number of features  Platform
0  abc123  no         0.9         1                   Release 1.0
1  abc124  yes        0.6         6                   Release 1.0
2  abc123  no         0.6         10                  Release 2.0
3  abc125  yes        0.7         8                   Release 1.0
4  abc125  yes        0.6         11                  Release 2.0

In column "lookup", rows 0 and 2 are common values, as are rows 3 and 4. My goal is to have one row per value in column "lookup", filtered on the highest value in column "Number of features", and all the other column values for the selected row should be shown in the output dataframe.

Using group = df.groupby('lookup').max() creates:

        Supported  Percentage  Number of features  Platform
lookup
abc123  no         0.9         10                  Release 2.0
abc124  yes        0.6         6                   Release 1.0
abc125  yes        0.7         11                  Release 2.0

But the Percentage is wrong for rows abc123 and abc125, as it has included the highest percentage in each of the groups. My desired result is as follows:

abc123  no   0.6  10  Release 2.0
abc124  yes  0.6  6   Release 1.0
abc125  yes  0.6  11  Release 2.0

where the values for "Supported" and "Percentage" are taken as-is from the row that contains the highest "Number of features". In my script I am using

group = df.groupby('lookup')['Number of features'].max()

which returns the following, but I am missing the other columns, in this example Supported, Percentage and Platform:

lookup
abc123    10
abc124     6
abc125    11

Also, if I try to save the dataframe to csv, I only get the following:

Number of features
10
6
11

I would have expected this csv output:

lookup,Number of features
abc123,10
abc124,6
abc125,11

Thanks again, and I hope this is more descriptive.
@@apz9022 thanks for providing the example, that clarifies things a lot. If you use the same dataframe you created in your example, you should be able to use the following code:

new_df = pd.DataFrame(columns=df.columns)
for item in df['lookup'].unique():
    temp_df = df[df['lookup'] == item]
    row = temp_df[temp_df['Number of features'] == temp_df['Number of features'].max()]
    new_df = pd.concat([new_df, row], ignore_index=True)
new_df

Sadly, this uses a for loop. There might be another way to do this that avoids the for loop (I need to work on it a little more to get it to work; I'll let you know if I do). I'm also going to look into groupby a little more. There are some cool things you can do with groupby, but this has several constraints that I do not think groupby will support. With 800 rows and 20 columns performance should not be an issue (but it's always nice to squeeze out as much performance as possible, just for fun!). Hope this works. Let me know.
@@ChartExplorers Thanks... updated my code and it's working like a charm! Thanks. One point: alist.append(row) did not work for me, so I left it out and it still seems to work. What does that line do?
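The loop-free alternative the reply above alludes to can be sketched with idxmax, which returns the index label of each group's maximum so .loc can pull the full rows back out (a minimal sketch on made-up data shaped like the example, assuming "Number of features" has no missing values):

```python
import pandas as pd

# Made-up data in the shape of the example above
df = pd.DataFrame({'lookup': ['abc123', 'abc124', 'abc123'],
                   'Supported': ['no', 'yes', 'no'],
                   'Number of features': [1, 6, 10]})

# idxmax gives the row label of each group's maximum;
# .loc then retrieves those complete rows, all columns intact
best = df.loc[df.groupby('lookup')['Number of features'].idxmax()]
```

This keeps every original column and picks exactly one row per lookup value.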
Good question, I should have explained this in the video. In the csv file, missing data is represented with '?'. When we read the data into pandas we can tell it that missing data is represented by '?' (for example via the na_values parameter of read_csv); pandas will then treat those entries as missing values rather than getting confused.
Thanks a lot! You saved me days! I'm literally crying rn. So precise and to the point. Love the content.
I'm glad it helped! Groupby was always a sore spot for me learning, but now that I know it I use it all the time.
🙏🙏🚩🚩🙏🙏Truly, sir, a great lecture. I had been trying to understand groupby in pandas for the last 25 days, but no one was able to clear my confusion. But you, sir, explained it brilliantly and I am really so obliged to you. Thanks; I subscribed and shared on my Facebook page. From Banaras City, India 😄😄😄🙏🙏🙏🙏🙏🙏
Dude, thank you sooo much. Finally someone with proper English explained things properly.
Good step-by-step tutorial. But one thing you missed: grouping by multiple columns and applying a different aggregate function to each column. Example: [column A, column B] with A=sum, B=average, something like that.
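The per-column aggregates asked for above can be done by passing a dict to .agg (a small sketch on made-up columns standing in for the video's Titanic data):

```python
import pandas as pd

# Hypothetical stand-in data: one group key, two numeric columns
df = pd.DataFrame({'pclass': [1, 1, 2, 2],
                   'fare':   [70.0, 30.0, 20.0, 10.0],
                   'age':    [40.0, 20.0, 30.0, 10.0]})

# Different aggregate per column: fare is summed, age is averaged
result = df.groupby('pclass').agg({'fare': 'sum', 'age': 'mean'})
```

Each key of the dict is a column name and each value is the aggregate applied to that column.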
I had to watch this a couple of times to hear the part around 4:18 about why groupby will only return those who survived. It is good you added that. Now that I understand it, I can take a shot at age groups for the Titanic.
It's great to have a 5-min quick-and-dirty dive, but a couple more seconds here and there to say that "agg" means "aggregate", that if we want more than one column summarised we must provide a list (hence the double brackets), etc., would give the simple explanations that make things stick.
OK, this is madly comprehensive information, explained amazingly briefly and clearly, within just 7 minutes.
I have seen three of your videos so far, all were very well thought out. Really helpful. You deserve many more subscribers!
Thanks for your kind words Imad Uddin!
Brilliant. It had exactly what i needed. Multiple groups and the splitting trick
Perfect! I'm glad it had what you needed.
Thanks a lot, I have been searching for this for entire weeks in articles.
Simple and informative i love this video and am saving it for future references! Thank you!
Thanks a lot! You saved my day; now I can calculate means by categorizing datasets.
Thanks a lot, it's really informative for my upcoming exam.
This is a very good video for explanation. Thanks so much from Hong Kong.
THANK YOU!!! that last tip is a life saver
Really helpful tricks. Thank you!
You're welcome!
Thank you very much for sharing! It really helped me; it was exactly what I was looking for. People like you are blessed, good people helping to develop this world! I just subscribed, followed, and will share it in my groups!
Concise, short, illustrative!! Thanks a lot!!
You're welcome!
Great video! So clear... It helped me a lot! Thanks from Brazil! :)
Thank you for your detailed demonstrations.
This is one of the best videos EVER! really helpfull! Thanks a LOT!
Hi, thank you for your video. May I ask how you know exactly which age group is divided into which bin? The ages are put into 3 bins, but I am unclear which exact ages each bin contains. For example: what is the age range for 'young' in this case?
@Adeel Khan I can think of 3 approaches to this:
- Group by age_bins, then take the minimum and maximum age: df.groupby('age_bins')['age'].agg(['min', 'max'])
- Use retbins=True in the pd.cut() function; I think retbins returns the bounds of your bins.
- Define the bins yourself, i.e. bins=[0, 20, 60, 120] (instead of bins=3 as in the video) will divide the passengers into 0-20, 20-60 and 60-120 bins.
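The retbins suggestion above, sketched on made-up ages: pd.cut with retbins=True returns both the categorized values and the array of bin edges it chose, which answers which ages ended up in which bin.

```python
import pandas as pd

# Made-up ages standing in for the Titanic age column
ages = pd.Series([2, 25, 30, 50, 70])

# retbins=True also returns the edges pd.cut picked for the 3 bins
binned, edges = pd.cut(ages, bins=3,
                       labels=['young', 'middle_age', 'old'],
                       retbins=True)
```

With 3 bins there are 4 edges; the lowest edge is nudged slightly below the minimum age so that the minimum is included.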
great video. it would be great if you also provide the link for the notebook
Awesome 👌. Crystal clear 🔮.
I especially like the bin trick, straightforward. That is really amazing 👏 😍. I used to have to break ages into intervals using numpy select() or a user-defined function with apply() to get the same result as the bin method.
Keep it up.
Just what we needed . Awesome content 🙌🏼
Great explanation! Good JOB! Thumbs up!
Wow, you made it easy for me and saved a lot of time... THANK YOU
Very clear. Please, if you can, explain how to do an intersection when we have a one-to-many relational database. Thanks.
Thanks for the great video. I'm wondering how you could group the ages in intervals of 10 years. I feel like you probably wouldn't use cut for that, since you would need to know the highest/lowest age in order to determine how many cuts you need. Do you have a recommendation for how to do that?
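One hedged way to answer the decade-interval question above: pd.cut also accepts an explicit list of edges, so you can pass every multiple of 10 up to a generous maximum instead of counting cuts (made-up ages below):

```python
import pandas as pd

# Made-up ages; real data would come from the Titanic frame
ages = pd.Series([4, 15, 22, 38, 67, 81])

# Explicit decade edges 0,10,...,100; right=False makes bins [0,10), [10,20), ...
decades = pd.cut(ages, bins=list(range(0, 101, 10)), right=False)
print(decades.value_counts().sort_index())
```

If you prefer not to hard-code 100, the top edge could be derived from ages.max() instead.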
At 5:39, when setting labels for 'age_bins', how did it know which age group is young, which is middle and which is old? You did not set parameters like 0 to 20 for young, 21 to 60 for middle and above 60 for old. Or does it do that implicitly?
Using bins=3 as a parameter to the pd.cut() function automatically divides the age range into 3 equal-width intervals (note: equal width, not equally sized groups; for groups of equal size you would use pd.qcut()). See my comment to Xuan Tran for an explanation of how you can find out what it does or what you could do differently.
Neat and objective!!!
Thanks for sharing. I do appreciate your content.
Very good and funny videos bring a great sense of entertainment!
The as_index=False tip is great! When doing this with .count() instead of .sum() — for example, in a project I'm using code like df.groupby(['x'], as_index=False)['y'].count() — is there any way to keep the original y column along with the new y "count" column in the resulting data frame? With this method it replaces the original y with the count of y.
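The question above (keep the original column next to the per-group count) is what transform is for: it returns one value per row instead of collapsing the groups. A small sketch with hypothetical column names x and y:

```python
import pandas as pd

# Hypothetical frame: 'x' is the group key, 'y' the column being counted
df = pd.DataFrame({'x': ['a', 'a', 'b'], 'y': [10, 20, 30]})

# transform keeps the original shape, so y survives next to its group count
df['y_count'] = df.groupby('x')['y'].transform('count')
```

The dataframe now has both the original y and a y_count column repeated across each group.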
you got a new follower Sir!
Really clear, really well explained. God, finally I understand :D Thanks so much!
thanks for the great video, it really helped me.
You explain very well, professor :))
Great explanation! Thank you.
What if I have a dataframe with two date columns (start-date, end-date) along with other attributes and I wish to create bins for each year incorporating both those date columns.
How do you think I can manage to do that?
Great video! One question... Let's say you do the first example, grouping survivors by class with sum(), but I want the result sorted in descending order (the class with the most survivors to the least). How would you do that?
.sort_values(ascending=False)
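Spelled out on made-up data, the chain suggested in the reply above looks like this:

```python
import pandas as pd

# Made-up stand-in for the Titanic columns used in the video
df = pd.DataFrame({'pclass': [1, 1, 2, 3, 3, 3],
                   'survived': [1, 1, 1, 0, 1, 0]})

# Survivors per class, sorted most to least
by_class = df.groupby('pclass')['survived'].sum().sort_values(ascending=False)
```

sort_values runs on the aggregated Series, so it orders the classes by their survivor counts.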
Hey there, for some reason when I try doing Single Group, Multiple Columns (like at 2:19), I keep getting an error basically stating that it thinks my 'fare' column is filled with strings, as opposed to floats. As such, I can't use sum/mean/numeric methods on that data. I can't seem to get around it.
Hey Cole DD, sometimes when you read in your data pandas thinks the data is a string even though it should be integers or floats. This video here ua-cam.com/video/evKYySLSzyk/v-deo.html discusses how to convert datatypes of columns and some common problems that you may run into when doing so. Let me know if that works.
Great video, thanks!
When I use groupby on multiple columns like you did, it shows me a warning message saying to use a list instead of the square brackets.
How can I combine groupby, then do a distinct count on one of the categorical columns and a sum on some of the numeric columns?
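A hedged sketch for the question above: .agg with a dict lets you mix a distinct count (nunique) on one column with a sum on another (hypothetical column names below):

```python
import pandas as pd

# Hypothetical data: distinct count on 'product', sum on 'amount'
df = pd.DataFrame({'region':  ['N', 'N', 'N', 'S'],
                   'product': ['a', 'b', 'a', 'c'],
                   'amount':  [10, 20, 5, 30]})

# nunique gives the distinct count per group, sum the numeric total
result = df.groupby('region').agg({'product': 'nunique', 'amount': 'sum'})
```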
How would we put the result of a groupby function as a column in our dataframe?
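For the question above, transform broadcasts a group statistic back onto every row, so the result lands directly in a new column (a sketch with assumed Titanic-style column names):

```python
import pandas as pd

# Assumed stand-in columns from the video's Titanic data
df = pd.DataFrame({'pclass': [1, 1, 3], 'fare': [80.0, 40.0, 10.0]})

# Each row gets its group's mean fare as a new column
df['class_mean_fare'] = df.groupby('pclass')['fare'].transform('mean')
```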
Great video ! how can I get max() value grouped by column and yet get the intire dataframe colums to be presented ?
Thank you so much sir !!
Hi Bradon, awesome tutorial. At 4:41, survived by class, mean and sum: a proportion would have been more meaningful. How do I get a percentage there, I mean the proportion of survived (the survival rate) by class? Using transform?????
For aggregation, only sum, mean, count, ... seem to be allowed.
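One hedged answer to the proportion question above: because survived is coded 0/1, the per-group mean already is the survival rate, so no transform is needed (made-up data below):

```python
import pandas as pd

# Made-up 0/1 survival data per class
df = pd.DataFrame({'pclass': [1, 1, 2, 2], 'survived': [1, 1, 1, 0]})

# Mean of a 0/1 column is the proportion; multiply by 100 for a percentage
rate = df.groupby('pclass')['survived'].mean() * 100
```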
How do you sort the data when different conditions are involved in the groupby?
Great explanation!
i love it, keep it up mate
In the Quick Tip Section, How did the program know that 29 is Middle_age, 2 is Young_age and 50 is old???
How did python determine which age_bin to place the individual into? You never specified the age-ranges associated with the categories?
Hi Michael, good question. The age bins were created with the pandas cut method. The cut method turns continuous data into categorical data by splitting the range of values into bins; with bins=3 (as in the video) it makes three equal-width intervals. Note that the bins have equal width, not equal counts: if your ages span 0 to 60, three bins split at 20 and 40, regardless of how many passengers fall in each. pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
Maybe this will allow you to specify the ranges of the bins. The labels list has to be one element shorter than the bins list:

df['age_cat'] = pd.cut(df['age'],
                       bins=[x for x in range(0, 100, 5)],
                       labels=[x for x in range(5, 100, 5)],
                       right=True)
I have a problem: it keeps giving me a KeyError; it doesn't recognize the names of the columns. How can I solve it? Please help me.
Very helpful , keep it up ❤
How do I make a graph with the result of a groupby?
Thank you for your time and effort!
I have a question: I have a dataset with a few columns, including "Fuel_Type". The fuel types are petrol, diesel and CNG. All I want is to group by Fuel_Type and store a copy of the data for each group in variables, for both petrol and diesel. How can I do that? I have been searching for hours :))) Please answer me.
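No reply to this appears in the thread, but one common pattern fits it: a groupby object is iterable as (key, sub-dataframe) pairs, so each fuel type's rows can be stored in a dict (a sketch on made-up car data; column names are assumptions):

```python
import pandas as pd

# Hypothetical car data with a Fuel_Type column
df = pd.DataFrame({'Fuel_Type': ['petrol', 'diesel', 'petrol', 'CNG'],
                   'price': [10, 12, 9, 7]})

# Iterate the groupby object: one (name, sub-frame) pair per fuel type
groups = {fuel: sub.copy() for fuel, sub in df.groupby('Fuel_Type')}
petrol = groups['petrol']
diesel = groups['diesel']
```

.copy() makes each variable an independent dataframe rather than a view of the original.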
Hi, how do I get the sum for one specific value of the pclass column, for example 1 only?
I'm not sure I understand your question. Are you looking to filter the dataframe so that only pclass = 1 is contained in the dataframe? You could use a boolean mask pclass1 = df[df['pclass'] == 1]. If that's what you are looking for you can check out this video on filtering which I think you will find helpful ua-cam.com/video/ni9ng4Jy3Z8/v-deo.html
Hi, thanks for this. How do I show groupby results for more than one variable with more than one aggregate function, without the index? So basically multiple groups as columns, aggregated with more than one function.
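A hedged sketch for the question above: aggregate with a list of functions, then reset_index() to turn the group keys back into ordinary columns (made-up Titanic-style data):

```python
import pandas as pd

# Made-up stand-in data with two group keys and one numeric column
df = pd.DataFrame({'pclass': [1, 1, 2], 'sex': ['f', 'm', 'f'],
                   'fare': [80.0, 40.0, 10.0]})

# Two group columns, two aggregates, then flatten the index away
out = df.groupby(['pclass', 'sex'])['fare'].agg(['mean', 'sum']).reset_index()
```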
So please, I need a personal favor: I need to make labels for a plot I generated from a groupby method. Any help with that?
Hey, thanks for the video. I have a dataframe with a column whose values are 0-4, but I wish to group it by 0 versus 1-4. How would that be possible? Is it a big difference?
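One hedged way to do the 0 versus 1-4 split asked about above: groupby accepts any Series aligned with the frame, so a derived key can collapse 1-4 into a single group (hypothetical column names below):

```python
import pandas as pd

# Hypothetical data: 'score' holds values 0-4
df = pd.DataFrame({'score': [0, 1, 3, 0, 4], 'value': [5, 6, 7, 8, 9]})

# Derive a two-level key: 0 stays its own group, 1-4 collapse together
key = df['score'].apply(lambda s: '0' if s == 0 else '1-4')
out = df.groupby(key)['value'].sum()
```

It is not a big difference from a plain groupby; only the grouping key changes.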
Hey, I'm having a problem with groupby: it is giving a DataError, "No numeric types to aggregate". Could you please help?
Hi Priti, will you run df.dtypes and let me know if there are any numeric (float or int) datatypes in your dataframe? If they are all objects, check out this video on how to convert objects into numeric values: ua-cam.com/video/evKYySLSzyk/v-deo.html (hopefully that will solve your problem). If it doesn't, will you copy and paste your groupby statement and send it to me, please?
@@ChartExplorers
# Visualize Churn Rate by Gender
import plotly.graph_objs as go   # assuming the usual plotly aliases
import plotly.offline as po

plot_by_gender = churn_dataset.groupby('gender').Churn.mean().reset_index()
plot_data = [
    go.Bar(
        x=plot_by_gender['gender'],
        y=plot_by_gender['Churn'],
        width=[0.3, 0.3],
        marker=dict(color=['orange', 'green'])
    )
]
plot_layout = go.Layout(
    xaxis={"type": "category"},
    yaxis={"title": "Churn Rate"},
    title='Churn Rate by Gender',
    plot_bgcolor='rgb(243,243,243)',
    paper_bgcolor='rgb(243,243,243)',
)
fig = go.Figure(data=plot_data, layout=plot_layout)
po.iplot(fig)

This is giving me the error. Can you suggest an alternative?
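No fix appears in the thread, but one common cause of "No numeric types to aggregate" in code like the above is a Yes/No text Churn column; mapping it to 0/1 first lets .mean() work. This is a guess at the data, sketched on a made-up frame:

```python
import pandas as pd

# Hypothetical churn data where Churn is stored as Yes/No text
churn_dataset = pd.DataFrame({'gender': ['F', 'M', 'F', 'M'],
                              'Churn': ['Yes', 'No', 'No', 'Yes']})

# Map Yes/No to 1/0 so the groupby mean can aggregate it
churn_dataset['Churn'] = churn_dataset['Churn'].map({'Yes': 1, 'No': 0})
plot_by_gender = churn_dataset.groupby('gender').Churn.mean().reset_index()
```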
You are a savior
I want to group by mobile number and merge the messages received. How can I do that?
Saved me looots of hours haha! thanx!
Great video sir
When trying out this example: df['age_bins'] = pd.cut(df['age'], 3, labels=('young', 'middle_age', 'old')), I got an error: TypeError: can only concatenate str (not "float") to str. I don't know why; I looked at the manual and the code seems good to me.
It seems to me like this function doesn't really need to exist. I feel like I could do all of these manipulations relatively easily with Boolean operations.
Can someone explain the advantage of using groupby()? Is it just easier, or is there something I'm missing?
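To make the comparison in the question above concrete, here is both styles side by side on made-up data: the mask version needs one expression per category and requires knowing the categories in advance, while groupby discovers them in one pass:

```python
import pandas as pd

# Made-up Titanic-style data
df = pd.DataFrame({'pclass': [1, 2, 1, 3], 'survived': [1, 0, 1, 1]})

# Boolean-mask version: one lookup per category, categories listed by hand
mask_sums = {c: df.loc[df['pclass'] == c, 'survived'].sum()
             for c in (1, 2, 3)}

# groupby version: one pass, works for any number of categories
group_sums = df.groupby('pclass')['survived'].sum()
```

Both give the same numbers; groupby's advantage grows with the number of groups and aggregates.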
I have a data frame containing three columns: one for restaurant_id, the second for its categories (one or more), and the third for its zone. For each restaurant I need to calculate how many restaurants in its zone share at least one category with it, and put the result in a new column.
Hi F Ashaikh, is it possible for you to email me your data (or provide me with some made up data that is similar to the data you have). That will help me see what is going on a little better. My email is bradonvalgardson@gmail.com
I did, thank you very much for your help.
I came here to understand the concept of groupby but left emotional about what we men sacrificed. 🥺
Thanks for the video.
excellent tutorial
Thanks for the video! I have a question: if you want to select one specific biological sex, how could I write that code? For example just females:
df.groupby(["pclass", ["sex"] == "female"])["survived"].sum()
Would it be right to write it like this?
Thanks in advance!
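No reply to this appears in the thread; one usual pattern is to filter the rows first and then group, rather than putting the condition inside groupby (a sketch with Titanic-style column names assumed from the video):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame
df = pd.DataFrame({'pclass': [1, 1, 2],
                   'sex': ['female', 'male', 'female'],
                   'survived': [1, 0, 1]})

# Filter to females first, then group by class and sum survivors
female_survivors = df[df['sex'] == 'female'].groupby('pclass')['survived'].sum()
```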
How do I make a poisson distribution of a groupby column?
I'm not sure. I would need to see your data and know more context to better understand what you are trying to accomplish.
Hello, is there any way to put all values in one column based on their index? If the value I'm trying to group by is, let's say, Switzerland, and it has multiple happiness ratings, one for each year, how do I put all the ratings in the same column, separated by commas, without summing them up?
Great question Ivan. Try this out and see if it works for you.
First I create a dictionary of data with 3 different countries and some happiness scores.
Then I create a DataFrame with this data.
Then I use the groupby function to group each country and apply(list) to create a list of all the values in each group.

data_dict = {'country': ['country_1', 'country_2', 'country_3', 'country_1', 'country_1',
                         'country_2', 'country_3', 'country_2', 'country_3', 'country_1'],
             'happiness': [3, 1, 3, 5, 7, 4, 1, 2, 3, 4]}
df = pd.DataFrame(data_dict)
df_grouped = df.groupby('country')['happiness'].apply(list)
@@ChartExplorers thank you for the swift answer. I managed to do it for one column, but I'm trying to do it for multiple columns, basically uniting rows with the same country values but separating the values with commas. It works when I do it for Happiness Score, but if I try to add Happiness Rank, it just throws out the strings "Happiness Score" and "Happiness Rank", not the values. I tried it as a list, but it's still not working.
I did it with this code, which works for Happiness Score:
frame.groupby(['Country'])['Happiness Score'].apply(lambda x: ' , '.join(x.astype(str))).reset_index()
@@houndofjustice5 I think I see what you are asking. So you want to groupby country and then list out all the values for that country in the happiness and rank columns.
Let me know if this works. If not, I am setting up a discord server for Chart Explorers. That might be a better medium for problem solving.
# Example Data
data_dict = {'country':['country_1','country_2','country_3','country_1','country_1',
'country_2','country_3','country_2','country_3','country_1'],
'happiness':[3,1,3,5,7,4,1,2,3,4],
'rank':[1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(data_dict)
# groupby with list for multiple columns
df_grouped = df.groupby('country')[['happiness','rank']].agg(lambda x: list(x))
Please help as I have data of employees in which they did multiple sale, I want if any employee did sale more the 50000 againt it each emp I'd of that person print excellent rest low.
Like:
Emp Id    Sale    Status
Emp1001   5000    Excellent
Emp1001   45000   Excellent
Emp1001   2000    Excellent
Emp1002   5000    Low
Emp1003   2500    Low
Hi @@SudhirKumar-ry4gk, so you want to group by employee Id and, for employees whose total sales are greater than $50,000, mark all their records as Excellent, otherwise mark them as Low? Is that correct?
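If that reading is right, one hedged sketch uses groupby().transform('sum') to broadcast each employee's total back to every row and compare it against the 50,000 threshold (column names here are guesses based on the example table):

```python
import pandas as pd

# Hypothetical data modeled on the example above
df = pd.DataFrame({'emp_id': ['Emp1001', 'Emp1001', 'Emp1001', 'Emp1002', 'Emp1003'],
                   'sale': [5000, 45000, 2000, 5000, 2500]})

# Total sales per employee, broadcast back to every row of that employee
totals = df.groupby('emp_id')['sale'].transform('sum')

# Mark every row of an employee as Excellent if their total exceeds 50,000
df['status'] = totals.gt(50000).map({True: 'Excellent', False: 'Low'})
print(df)
```

transform (unlike agg) keeps the original row count, which is what lets the per-employee total line up with every individual sale.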
Can we then plot a graph of any sort using the generated table we've just grouped ?
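Yes, the grouped result is an ordinary Series/DataFrame, so pandas' built-in plotting works on it directly. A minimal sketch with made-up titanic-style data (assumes matplotlib is installed, since pandas delegates plotting to it):

```python
import pandas as pd

# Hypothetical titanic-style data
df = pd.DataFrame({'pclass': [1, 1, 2, 2, 3, 3],
                   'survived': [1, 1, 1, 0, 0, 0]})

# Survival rate per class, then plot the grouped result as a bar chart
rates = df.groupby('pclass')['survived'].mean()
ax = rates.plot(kind='bar')  # pandas hands this off to matplotlib
```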
@Chat Explorers
@Chart Explorers*
Sir how will you solve the problem when you have to determine who are the top5 highest rated players for every position in fifa dataset?
Hi, it might be fifa.groupby(by='position').apply(lambda group: group.sort_values(by='rate', ascending=False).head(n=5))
Great video
Wow, you're amazing man :))
very clear, thxxx
Thanks a lot!
Thanks by heart
Thank you so f much!
At 5:54, applying pd.cut did not work for me; it gives the error:
TypeError: can only concatenate str (not "float") to str
Solution: the following two lines fixed the issue.
df['age'] = df['age'].replace('?',0) #clean data
df['age']=df.age.astype('float64') #convert data type to float
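A hedged alternative sketch that keeps the missing ages as NaN instead of 0 (replacing '?' with 0 would otherwise put those passengers in the youngest bin):

```python
import pandas as pd

# Hypothetical ages with '?' marking missing values, as in the csv
df = pd.DataFrame({'age': ['22', '?', '38', '26']})

# Coerce non-numeric entries to NaN instead of replacing them with 0
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# pd.cut leaves NaN ages out of every bin automatically
df['age_bins'] = pd.cut(df['age'], bins=3, labels=['young', 'middle_age', 'old'])
```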
Can you send dataset
Nice. helpful
nice ! thanks :)
# function that groups data by attribute1 and calculates per-group statistics for attribute2
mean and count , how do we make a function for this
def get(data, attr1, attr2, statistic):
Hi Pursh, I'm not sure if I understand exactly what you are trying to accomplish.
Are you trying to obtain the mean and count on groups based on multiple columns/attributes?
df.groupby(['pclass','sex'], as_index=False)['survived'].agg(['mean','count'])
If this is the case I'm not sure the purpose of creating a function to do this.
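That said, if a wrapper function is wanted, a minimal sketch (the function and column names here are illustrative, not from the video):

```python
import pandas as pd

def group_stats(data, group_cols, value_col, stats):
    """Group data by group_cols and apply the given stats to value_col."""
    return data.groupby(group_cols)[value_col].agg(stats).reset_index()

# Hypothetical titanic-style usage
df = pd.DataFrame({'pclass': [1, 1, 2, 2],
                   'sex': ['male', 'female', 'male', 'female'],
                   'survived': [0, 1, 0, 1]})
result = group_stats(df, ['pclass', 'sex'], 'survived', ['mean', 'count'])
```

Passing the statistics as a list parameter is what lets one function cover both mean and count (or any other combination agg accepts).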
Thanks man
I have a dataframe that has around 20 columns and 800 rows. One column contains duplicated values that I am using as the group key, and based on one of the other columns I want to filter the dataframe to show unique values based on the highest number in that column using max(). I still want to retain all of the other columns and end up with a dataframe that contains these unique values including the original columns.
group = df_UE5_Compatability_info.groupby('lookup')['Function Count'].max()
where "lookup" is the column I want to group by (it contains multiples of the same value), filtered to the rows with the highest "Function Count". How do I make the resulting dataframe contain the other columns associated with those rows? I am struggling; it's difficult to describe in words, sorry.
Hi Alan, you did a great job explaining, thanks for providing an example of what you have done. 😀 If I'm understanding correctly (please correct me if I'm wrong), you have one column that contains categories and you want to get the max value for each of those categories in every other column (using groupby).
Here is a simple example I made that will get the max value for every column in the dataframe based on the groups in Col_4.
import pandas as pd
# Create practice df
df = pd.DataFrame({'Col_1':[1,2,3,4,5],
'Col_2':[6,7,8,9,10],
'Col_3':[11,12,13,14,15],
'Col_4':['Group_1','Group_2','Group_1','Group_1','Group_2']
})
# groupby Col_4 (in your case use lookup)
group = df.groupby('Col_4').max()
group.head()
You will notice that here, instead of passing a list of columns to perform the groupby operation on, I excluded it. This performs the operation on all the columns. In your example, you should be able to do the following to get your answer:
group = df_UE5_Compatability_info.groupby('lookup').max()
@@ChartExplorers Thanks for the reply. Below is a sample dataset (made up) to try and better explain and one that is more representative to my actual dataset.
df = pd.DataFrame({'lookup':['abc123','abc124','abc123','abc125','abc125'],
'Supported':['no','yes','no','yes','yes'],
'Percentage':[0.9,0.6,0.6,0.7,0.6],
'Number of features':[1,6,10,8,11],
'Platform':['Release 1.0','Release 1.0','Release 2.0','Release 1.0','Release 2.0']
})
The output should look like the following:
lookup Supported Percentage Number of features Platform
0 abc123 no 0.9 1 Release 1.0
1 abc124 yes 0.6 6 Release 1.0
2 abc123 no 0.6 10 Release 2.0
3 abc125 yes 0.7 8 Release 1.0
4 abc125 yes 0.6 11 Release 2.0
Column "lookup", Row 0 and 2 are common values, as are rows 3 and 4.
My goal is to have one row per value in column "lookup", filtered on the highest value in column "Number of features" and all other columns values for the selected row should be shown in the output data frame.
Using the following group = df.groupby('lookup').max() creates:
Supported Percentage Number of features Platform
lookup
abc123 no 0.9 10 Release 2.0
abc124 yes 0.6 6 Release 1.0
abc125 yes 0.7 11 Release 2.0
But the percentage is wrong for rows abc123 and abc125, as it has included the highest percentage in each of the groups. My desired result is as follows:
abc123 no 0.6 10 Release 2.0
abc124 yes 0.6 6 Release 1.0
abc125 yes 0.6 11 Release 2.0
where the values for the "Supported" and "Percentage" columns are taken as-is from the row that contains the highest "Number of features"
In my script I am using group = df.groupby('lookup')['Number of features'].max() which returns the following, but I am missing the other columns, in this example Supported, Percentage and Platform.
lookup
abc123 10
abc124 6
abc125 11
Also, if I try to save the dataframe to csv, I only get the following
Number of features
10
6
11
I would have expected to have this csv output?
lookup Number of features
abc123 10
abc124 6
abc125 11
Thanks again.. and I hope this is more descriptive?
@@apz9022 thanks for providing the example, that clarifies things a lot. If you use the same dataframe you created in your example you should be able to use the following code:
new_df = pd.DataFrame(columns=df.columns)
for item in df['lookup'].unique():
    temp_df = df[df['lookup']==item]
    row = temp_df[temp_df['Number of features'] == temp_df['Number of features'].max()]
    alist.append(row)
    new_df = pd.concat([new_df, row], ignore_index=True)
new_df
new_df
Sadly, this uses a for loop. There might be another way to do this that would avoid the for loop (I need to work on it a little more to get it to work; I'll let you know if I do). I'm also going to look into groupby a little more. There are some cool things you can do with groupby, but this problem has several constraints that I do not think groupby will support. With 800 rows and 20 columns, performance should not be an issue (but it's always nice to squeeze as much performance out as possible just for fun!).
Hope this works. Let me know.
@@ChartExplorers Thanks.. what is "alist.append" ? I get an error stating "alist" is not defined?
@@ChartExplorers Thanks.. updated my code and its working like a charm! Thanks. One point, alist.append(row) did not work for me? I have left it out and it still seems to work. What does this do?
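Following up on the loop-free idea mentioned above, here is a sketch using idxmax, which returns the row index of each group's maximum so the full rows can be selected in one step (it assumes "Number of features" has a single maximum per group):

```python
import pandas as pd

# Same made-up data as earlier in the thread
df = pd.DataFrame({'lookup': ['abc123', 'abc124', 'abc123', 'abc125', 'abc125'],
                   'Supported': ['no', 'yes', 'no', 'yes', 'yes'],
                   'Percentage': [0.9, 0.6, 0.6, 0.7, 0.6],
                   'Number of features': [1, 6, 10, 8, 11],
                   'Platform': ['Release 1.0', 'Release 1.0', 'Release 2.0',
                                'Release 1.0', 'Release 2.0']})

# Index of the row with the highest 'Number of features' per lookup group,
# then select those full rows so every other column comes along as-is
new_df = df.loc[df.groupby('lookup')['Number of features'].idxmax()]
```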
thanks
You're welcome! 😀
Awesome.
Why is '?' needed while reading a csv file?
Good question, I should have explained this in the video. In the csv file missing data is represented with '?'. When we read the data into pandas we can tell it that missing data is represented by '?', and then pandas will treat it as a missing value rather than getting confused.
@@ChartExplorers oh, thank you (^ ^)
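For reference, a sketch of that read_csv call; the csv content here is simulated, the key part is the na_values parameter:

```python
import pandas as pd
from io import StringIO

# Simulated csv where '?' marks missing ages
csv_text = "name,age\nAlice,22\nBob,?\n"

# Telling read_csv that '?' means missing makes the age column numeric with NaN
df = pd.read_csv(StringIO(csv_text), na_values='?')
```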
to the point!
How do you know what counts as young, middle_age, or old? This is not defined.
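One way to answer this yourself: pd.cut with retbins=True also returns the numeric edges pandas computed for the bins. A sketch with made-up ages:

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([2.0, 25.0, 40.0, 62.0, 80.0])

# retbins=True also returns the boundary values pandas chose for the 3 bins
binned, edges = pd.cut(ages, bins=3, labels=['young', 'middle_age', 'old'],
                       retbins=True)
print(edges)  # 4 boundary values covering the 2-80 age range
```

Alternatively, passing explicit edges such as bins=[0, 20, 60, 120] (instead of bins=3) lets you define the age ranges yourself.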
Perfect
very good
Thanks!
goat
nice
simpler way to explain things...
i love u