I spent hours trying to figure this stuff out through reading chapters and chapters in Python books. Then I come here, and everything I was trying to figure out was explained in 9 minutes. This was IMMENSELY helpful, thanks!
Awesome!! That's so great to hear!
I like your concise and precise videos. I really appreciate your efforts.
Thanks, I appreciate your comment!
Thanks so much for this! You helped me combine 629 files and remove 250k duplicate rows!
You're the man! *Subscribed*
Great to hear! 😄
lol, just when I thought you wouldn't cover the exact subject I was looking for, there came the bonus! Thanks!
wow! You were already teaching data science in 2014, when it was not even popular! Btw, your videos are really good: you speak slowly and clearly, easy to understand and for me to follow. Kudos to you!
Thanks very much for your kind words!
I have watched a lot of your videos, and I must say that the way you explain things is really good. Just to let you know, I am new to programming, let alone Python.
I want to learn something new from you. Let me give you a brief. I am working on a dataset to predict app ratings from the Google Play Store. There is an attribute named "Rating" which has a lot of null values. I want to replace those null values using medians based on another attribute named "Reviews". But I want to split the attribute "Reviews" into multiple categories like:
1st category would be for the reviews less than 100,000,
2nd category would be for the reviews between 100,001 and 1,000,000,
3rd category would be for the reviews between 1,000,001 and 5,000,000 and
4th category would be for anything more than 5,000,000.
Although I tried a lot, I failed to create multiple categories. I was able to create only 2 categories, using the command below:
gps['Reviews Group'] = [1 if x
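One way to build all four categories at once is pandas' cut function, which bins a numeric column by edges. A minimal sketch, assuming a hypothetical DataFrame named gps like the commenter's:

import pandas as pd

# hypothetical data standing in for the Google Play dataset
gps = pd.DataFrame({'Reviews': [50_000, 500_000, 2_000_000, 9_000_000]})

# bin edges for the four requested groups; each bin's right edge is inclusive
bins = [0, 100_000, 1_000_000, 5_000_000, float('inf')]
gps['Reviews Group'] = pd.cut(gps['Reviews'], bins=bins, labels=[1, 2, 3, 4])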
Just found your channel, watched this as my first of your videos, and pressed subscribe!!! Your explanation of the idea as a whole is very remarkable 😃 Thanks a lot.
Thank you!
You are the greatest teacher in the world
I always find what I need in your channel.. and more... Thank you
Great to hear!
I didn't find much in Duplicates. Thanks so much sir. I can't thank u enough.
You're welcome!
Exactly what I needed! Why not set up a Patreon so we can show some love?
Thanks for the suggestion! I am planning to set one up soon, and will let you know when it's live :)
I just launched my Patreon campaign! I'd love to have your support: www.patreon.com/dataschool/overview
Thank you! Here is a way to extract the non-duplicate rows:
df = df.loc[~df.A.duplicated(keep='first')].reset_index(drop=True)
Thanks for sharing!
Love you brother, you are changing so many lives, thank you.... The best teacher award goes to Data School.
Thanks very much for your kind words!
Very much appreciated efforts. Thanks a million for sharing your Python knowledge with us. It has been a wonderful journey with your precise explanations. Keep up the hard work! Warm regards.
Thanks very much! 😄
I am from Punjab, studying at IIT, and even then I got my grasp of pandas from your videos only. Thanks!
Please provide everything you've done in text format, like a written tutorial.
Is this what you are looking for?
nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb
Really, your teaching method is very good, and your videos give so much knowledge. Thanks, Data School
You're very welcome!
THANK YOU for the 'keep' tip, that's exactly what I was looking for!
Great to hear!
Kevin your videos are super helpful! thank you!!!
You're very welcome!
This is so helpful!
Pandas has the best duplicate handling. Better than spreadsheets and SQL.
Thanks!
Great video, Kevin! Super useful!
Thanks Jeff! :)
That's exactly what I was looking for, great explanation, thanks for sharing!
You're welcome!
I like the way you explain things... it's very clear and precise. My problem is a little more complex: I want to remove an entire row when it meets the following conditions.
If a row in the Latitude column has the same value as the previous row AND the same row in the Longitude column has the same value as the previous row, THEN remove the entire duplicated row. Basically we have to compare two consecutive ROWS across both COLUMNS, and IF both conditions are met, then remove the entire row. Let's say 15 rows have the same values in both the Latitude and Longitude columns (i.e., if Lat[1,1] == Lat[0,1] & Lon[1,2] == Lon[0,2] then remove, else skip; # Lat = Col1, Long = Col2); then remove them all except keep one.
Hope you got my point... :-). Looking forward to seeing your code.
Glad you like the videos! It's not immediately obvious to me how I would approach this problem, but I think that the 'shift' function from pandas might be useful. Good luck! Sorry that I can't provide any code.
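To make the shift() idea concrete: keep a row only if it differs from the row immediately above it in at least one of the two columns. A minimal sketch, assuming columns named Latitude and Longitude:

import pandas as pd

# hypothetical coordinates with a run of consecutive duplicate rows
df = pd.DataFrame({'Latitude':  [10.0, 10.0, 10.0, 20.0, 10.0],
                   'Longitude': [30.0, 30.0, 30.0, 40.0, 30.0]})

# keep a row when it differs from the previous row in either column
# (the first row is always kept, because shift() yields NaN there)
changed = (df['Latitude'].ne(df['Latitude'].shift()) |
           df['Longitude'].ne(df['Longitude'].shift()))
df = df[changed]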
Hi, in the above video at 1:12, the pandas DataFrame is displayed in tabular form, with all the variables separated by vertical lines. But in the latest Jupyter notebook, we get a single line below the variable names. Can we get the same display as before with the new Jupyter version?
There's probably a way, but it's probably not easy. I'm sorry!
I was able to fix the duplicate data in my CSV file~~~ Thank you.
However, I suggest you could do a bit more in this video. I think you could show the resulting data after the deletion. Such as:
>> new_data = df.drop_duplicates(keep='first')
>> new_data.head(24898)
If you add that, I think this video will be even more complete~~~
How can I remove duplicate rows based on 2 column values?
I want to drop a row only if two column values are both duplicated. E.g. I have one column with Country = [USA, USA, Canada, USA] and an income column with values = [1000, 900, 900, 900]. I only want to drop the duplicate where both the country AND the income match an earlier row. Whereas if one row has country = Canada with income 900 and a second row has USA with income 900, I want to keep them both. Answers appreciated!
Your videos are really helpful for learning pandas. Keep up the good work!
Sorry, I'm not quite clear on what the rules are for when a row should be kept and when it should be dropped.
Perhaps you could think of this task in terms of filtering the DataFrame, rather than using the drop duplicates functionality?
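If "duplicate" here means that both columns match an earlier row, drop_duplicates with a subset may already do the job. A sketch using the values from the question (the Income column name is an assumption):

import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'Canada', 'USA'],
                   'Income':  [1000, 900, 900, 900]})

# a row is dropped only when Country AND Income both repeat an earlier row,
# so Canada/900 and the first USA/900 are both kept
df = df.drop_duplicates(subset=['Country', 'Income'])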
Thanks for the reply! I managed to improve my code to avoid the duplicates in the first place. Keep up your great work with the videos, really helpful for improving my skills!
Great to hear! :)
You have done a very good job of making DataFrames easy to understand; it's especially easy for people who work in Excel.
Best wishes from me
Thanks!
OMG I WANT TO THANK YOU SOOOO MUCH 😊 I'd been stuck on this problem for days, and the way you explain it makes it so much easier than how I learned it in class. I was so happy not to see that error message 😂 Thank you
You're so very welcome! Glad I could help!
HOW DO YOU KNOW WHAT I NEED? YOU ARE MY FAV TEACHER FROM NOW ON
Ha! Thank you! 😊
Thanks a lot. It was a great help. Much appreciated!
You're welcome!
You're amazing, we need more videos on your channel
I do my best! I've got 20+ hours of additional videos available to Data School Insiders at various levels: www.patreon.com/dataschool
At the end, are you saying that "age" + "zip code" must TOGETHER be duplicates? Or are you saying "age" duplicates and "zip code" duplicates must each be removed from their respective columns? Thanks
Clean and informative!
Thanks!
Thank you so much 💕 your videos are really amazing... can you tell me how to read a CSV (without a header on the first line) and set the first row with non-null values as the header...
Amazing, and thanks bro. The right place for data queries
Happy to help
It helped me a lot. Can you explain how we get the count of each duplicated value?
simple and useful. thanks Kevin.
You're welcome!
Yo! You are a superb teacher!
Thank you!
Great work man!
Thanks!
Very methodical explanation
Thanks!
full of useful info. Thanx man
You're very welcome! :)
Great video! Btw, how do you know all this stuff? Do you take classes or read books?
Work experience, reading documentation, trying things out, teaching, reading tutorials, etc.
Great video. This helped me tremendously.
How would you go about finding duplicates case-insensitively within a certain field?
Great! Very well explained.
Thanks!
Awesome videos Kevin. Thanks a ton for the knowledge share.
Thanks Prakash!
Thank you so much, you made my day. Finally I found the line of code that I really needed to finish my task :) (code line 17)
Glad I could help!
hello, thank you for the video. I'm wondering if you can make some tutorials about API requests
Thanks for your suggestion!
If I have a dataframe with a million rows and 15 columns, how do I figure out whether any column in my dataframe has mixed data types?
Wait Kevin, keep='first' means the rows marked as duplicates are the ones towards the bottom, meaning they have a much higher index. So what does keep='last' mean?? Oh man, I'm getting mixed up. Could someone please explain it to me? Kevin, please?
Hi, I am wondering whether you could identify an issue that I am having whilst cleaning a dataset with the help of your tutorials. I will post the commands that I have used below:
df["is_duplicate"]= df.duplicated() # make a new column with a mark of if row is a duplicate or not
df.is_duplicate.value_counts()
-> False 25804
True 1591
df.drop_duplicates(keep='first', inplace=True) #attempt to drop all duplicates, other than the first instance
df.is_duplicate.value_counts() #
-> False 25804
True 728
I am struggling to identify why there are still some duplicates that are marked 'True'?
Kind regards,
That's an excellent question! The problem is that by adding a new column called "is_duplicate", you actually reduce the number of rows which are duplicates of one another! Instead of adding that column, you should first check the number of duplicates with df.duplicated().sum(), then drop the duplicates, then check the number of duplicates again. Hope that helps!
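The suggested workflow, as a short sketch:

df.duplicated().sum()                  # count the duplicate rows first
df.drop_duplicates(keep='first', inplace=True)
df.duplicated().sum()                  # re-check: should now be 0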
Hey Buddy, You are amazing and you remind me of Sheldon Cooper (BBT) because of the way you talk and also both of you are super smart. :-)
One request- Please cover outliers sometime. Thanks.
Ha! Many people have commented something similar :) And, thanks for your topic suggestion!
Trying to figure out how to replace values above/below a threshold with the mean or median. If I find values that are skewing the data in a column, but don't want to exclude and drop the whole row, I just want to replace the value in that one column with the mean/median. I can't figure out how to do this! I.e., I want to replace all values in the 'age' column that are above 130 (erroneous data) with the mean of all the other values in the 'age' column.
I'm sorry, I don't know the code for this off-hand. However, this would be a great question to ask during one of my monthly live webcasts with Data School Insiders: www.patreon.com/dataschool (join at the "Classroom Crew" level to participate)
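For anyone with the same question, one possible approach is boolean indexing with .loc: compute the mean from the plausible values only, then overwrite the out-of-range ones. A sketch, assuming the 'age' example:

import pandas as pd

df = pd.DataFrame({'age': [25, 40, 200, 35, 150]})

# mean of the plausible ages only, then overwrite the erroneous values
valid_mean = df.loc[df['age'] <= 130, 'age'].mean()
df.loc[df['age'] > 130, 'age'] = valid_mean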
beneficial videos. ❤
Thanks!
Really great job. Thank you very much!!
Thank you for this useful tutorial. Quick question: how do you check whether a value in column A is present in column B or not, not necessarily on the same row? It is like what the VLOOKUP function does in Excel. Many thanks for your feedback!
I'm not sure I understand your question, I'm sorry!
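If the goal is simply to test whether each value in column A appears anywhere in column B (membership, not row-by-row equality), the isin method may be what's wanted. A minimal sketch, assuming a DataFrame df with columns A and B:

# True where the value in column A occurs somewhere in column B, on any row
df['A_in_B'] = df['A'].isin(df['B'])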
Bonus Question 7:55
cheers for this :) will definitely consider purchasing the package
You're very welcome! The pandas library is open source, so it's free!
sorry, I meant the course on your website ;)
Awesome! Let me know if you have any questions about the course. More information is here: www.dataschool.io/learn/
Brilliant video .
Thanks!
love to have more videos like this
Thanks for your support!
I have some missing dates in my dataset and want to add the missing dates to the dataset. I used isnull() to track these dates but I don't know how to add those dates into my dataset..Can you please help.Thanks
You might be able to use fillna and specify a method: pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
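Note that fillna fills values within existing rows; to add rows for the missing dates themselves, reindexing against a complete date range is a different technique worth considering. A sketch, assuming a daily DatetimeIndex:

import pandas as pd

# hypothetical daily data with two missing dates
df = pd.DataFrame({'value': [1, 2, 3]},
                  index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05']))

# reindex against the full range; the missing dates appear as new rows with NaN
full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
df = df.reindex(full_range)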
When I use the parameter keep=False, I get a number of rows less than 'first' and 'last' combined. What is the reason for that??
very well explained, ty!
You're very welcome!
great video!!
Thanks!
Because of your quality pandas series I started following you. Regarding duplicates: in my use case, instead of dropping duplicates, I would like to keep the 1st instance and just remove the other duplicate values from a specific column, so the shape will remain the same after removing the duplicate values from that column. I'd really appreciate it if you have some time to answer this, thanks.
Glad you like the series! I'm not sure I understand your question - perhaps the documentation for drop_duplicates will help? pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
How do I access the IPython/Jupyter Notebook link? It is not available in the GitHub repository.
Is this what you were looking for? nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb
Hi, when you mention "inplace" in the video, I am happy that pandas has this parameter for experimenting, but a problem arises: should I remember all the methods that have the inplace parameter, and remember which methods affect the original DataFrame, in case I end up calculating with a DataFrame that has already changed?
That is a huge job, to remember which methods have an 'inplace' parameter and which do not, isn't it..... TOT
The 'inplace' parameter is just for convenience. I do recommend trying to memorize when that parameter is available. But if you forget, that's fine, because you can always write code like this:
ufo = ufo.drop('Colors Reported', axis=1)
...instead of this:
ufo.drop('Colors Reported', axis=1, inplace=True)
Is the inplace argument False by default in every method?
My problem is that I worry that sometimes a method changes the original DataFrame via the "inplace" parameter, and sometimes a method does not change the original DataFrame.
So I get confused about when it affects the original DataFrame, since a wrong judgement might lead to a bad conclusion.
I think that 'inplace' is always False (by default) for all pandas functions.
This is what I want, thanks for sharing :)
Great!
Thank you for this content! I have a question: how can we handle quasi-redundant values in different columns? (Imagine two different columns, each containing 80% similar values.) Thanks a lot
When you say "handle", what is your goal? If you want to identify close matches, you can do what is called "fuzzy matching". Here's an example: pbpython.com/record-linking.html Hope that helps!
@dataschool Thank you very much for the reply. Let me explain my question: I have two variables/features named categories (milk, snack, pasta, oil, etc.) and categories_en (en:milk, en:snack, en:pasta). My goal is to keep only one feature since both features share the same information. It was suggested that running a chi-square test would help me decide which feature to keep, but that seems silly to me :( (I have almost 2 million records)
It probably doesn't matter which feature you keep, if they contain roughly the same information.
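One quick way to confirm that the two features really carry the same information is to measure how often they agree. The column names below come from the comment; stripping the 'en:' prefix this way is an assumption:

# fraction of rows where the two category columns agree after removing the prefix
agreement = (df['categories'] ==
             df['categories_en'].str.replace('en:', '', regex=False)).mean()
# a value close to 1.0 suggests either feature can safely be dropped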
Is there any simple tutorial on regular expressions in Python available?
For learning regular expressions, I like these two resources:
developers.google.com/edu/python/regular-expressions
www.pythonlearn.com/html-270/book012.html
Hi, I have a doubt: how do I remove duplicates from rows which are text or sentences, like in the RCV1 dataset?
The same process shown in the video will work for text data, as long as the duplicates are exact matches. Does that answer your question?
Hey Kevin, I am confused about the drop duplicates here: the number of duplicated age and zip code rows is 14, but after you drop the duplicates, the shape is 927. The total shape is 943, so shouldn't the correct shape be 943 - 14 = 929? Thanks a lot for your help!!!
I disagree with your statement "the number of duplicated age and zipcode is 14"... could you explain how you came to that conclusion? Thanks!
you're doing god's work son!
Thanks!
Hi, need help. Suppose we have a transactions table where each transaction contains at least 1 item in the Item column. How do I code which transactions include coffee?
Transaction Item
1 Tea
2 Cookies
2 Coffee
3 cookies
4 Bread
4 Cookies
4 Coffee
I'm not sure off-hand, good luck!
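For what it's worth, one way to express this is to collect the transaction IDs that include coffee and then select all of their rows. A sketch built from the table in the question:

import pandas as pd

df = pd.DataFrame({'Transaction': [1, 2, 2, 3, 4, 4, 4],
                   'Item': ['Tea', 'Cookies', 'Coffee', 'cookies',
                            'Bread', 'Cookies', 'Coffee']})

# IDs of transactions containing coffee (case-insensitive), then all their rows
coffee_ids = df.loc[df['Item'].str.lower() == 'coffee', 'Transaction']
coffee_transactions = df[df['Transaction'].isin(coffee_ids)]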
This is the case of complete duplicates. So what should we do when we have to deal with incomplete duplicates? E.g. age, gender, and occupation are the same, but zip is different.
Could you also make a video on that, please?
I get an error when I run users.drop_duplicates(subset=['age','zip_code']).shape: "'bool' object is not callable". I get the same error even when I run users.duplicated().sum()
Remove the .shape, and see what the results look like. Also, compare your code against mine in this notebook: nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb
That was so accurate, thanks a lot genius!
You're very welcome!
How can we efficiently find near-duplicates in a dataset?
Great video. But I'd just like to find a duplicate in one column, then go to another column and find its duplicate, and so on, keeping only one row with the relevant information.
What is the best way to compare data from two files (with the same schema)?
I don't know if there's one right way to do this... it depends on the details. Sorry I can't give you a better answer!
I had a weird error with this one. Setting the index column with index_col='user_id' does not work for me; it raises a KeyError: 'user_id'. Instead, I had to run users = pd.read_table('bit.ly/movieusers', sep='|', header=None, names=user_cols) first and then users.set_index('user_id') for this tutorial to work
Interesting! I'm not sure why that would be. But thanks for mentioning the workaround!
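One note on the workaround: set_index returns a new DataFrame rather than modifying the original, so the result needs to be assigned back (or inplace=True passed). A sketch of the full sequence, with the column names assumed from the video's example:

import pandas as pd

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']  # assumed names
users = pd.read_table('bit.ly/movieusers', sep='|', header=None, names=user_cols)

# set_index returns a new DataFrame, so assign the result back
users = users.set_index('user_id')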
thanks for tips and bonus ideas
You're welcome!
Hi, I have a question here. I want to mark consecutive duplicate values in a sequence like [1,1,1,0,2,3,2,4,2]; my expected result is [True,True,True,False,False,False,False,...].
But pandas .duplicated(keep=False) returns
[True,True,True,False,True,False,True,False,True]. The function treats the '2's in the 2,x,2,y,2,z,2 sequence as duplicates, but that is not what I want. How do I avoid that? I just want to mark the 1,1,1 run as True. Thanks.
How about just using code like this:
df.columnname == 1
Does that help?
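Since duplicated() has no notion of adjacency, one way to mark only consecutive runs is to label each run with shift/cumsum and then flag runs longer than one. A sketch using the sequence from the question:

import pandas as pd

s = pd.Series([1, 1, 1, 0, 2, 3, 2, 4, 2])

# a new run starts wherever the value differs from the previous one
run_id = s.ne(s.shift()).cumsum()

# True only for values inside a run of length > 1 (here: the three 1's)
is_consecutive_dup = s.groupby(run_id).transform('size') > 1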
Very useful videos.. can you please tell me how to find duplicates of just one specific row?
Sorry, I don't fully understand. Good luck!
Thanks for the good channel. I like it very much.
I have a query.
I am working on tweets. I have to remove duplicate tweets, as well as tweets which differ in at most one word.
I can do the first part. Will you please guide me on how I can do the second part?? Thanks
That's probably beyond the scope of what you can do with pandas. Perhaps you can take advantage of a fuzzy string matching library.
Thanks...I will look into it.
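As a starting point before reaching for a fuzzy-matching library, a plain-Python check for "differs in at most one word" might look like the sketch below. It assumes the two tweets have the same number of words, and the pairwise comparison is O(n²), so it only suits modest collections:

def differ_by_at_most_one_word(a, b):
    # True if two texts with equal word counts differ in at most one word
    wa, wb = a.split(), b.split()
    return len(wa) == len(wb) and sum(x != y for x, y in zip(wa, wb)) <= 1

tweets = ["good morning world", "good morning everyone", "hello there"]
near_dups = [(a, b) for i, a in enumerate(tweets)
             for b in tweets[i + 1:]
             if differ_by_at_most_one_word(a, b)]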
You should have used the sort_values option: users.loc[users.duplicated(keep=False)].sort_values(by='age')
Thanks for your suggestion!
you are amazing.
thank you so very much
You're very welcome!
I really need help, guys.
I have a table that has a column named "Neighbourhood".
This column has A LOT of names repeated MANY times.
To be specific, the column "Neighbourhood" has 10 names that are each repeated A LOT of times.
My question is:
I NEED HELP CREATING A SEPARATE COLUMN SPECIFYING HOW MANY TIMES EACH ELEMENT IN "NEIGHBOURHOOD" OCCURS.
Please help me, anyone.
I'm not positive this would work, but I might start by creating a dictionary out of value_counts, and then use that as a mapping for the new column. Anyway, I hope you were able to figure out a solution!
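Both the dictionary idea and a groupby transform can work. A sketch with hypothetical neighbourhood names:

import pandas as pd

df = pd.DataFrame({'Neighbourhood': ['Centro', 'Centro', 'Norte', 'Centro']})

# one-step version: each row gets the total count of its neighbourhood
df['Neighbourhood Count'] = df.groupby('Neighbourhood')['Neighbourhood'].transform('count')

# equivalent version following the value_counts-as-mapping idea from the reply
df['Neighbourhood Count'] = df['Neighbourhood'].map(df['Neighbourhood'].value_counts())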
Hello, I want to understand the concept of resampling. Please help
I'm sorry, I don't have any resources to offer you. Good luck!
you are a hero...
That's very kind of you! :)
hi, good afternoon. How do I remove the letters from values? For example, I have a column which contains customer income like J:10,000 and P:50,000. I want to make it 10000 and 50000
You can use string methods to strip the first two characters, and then the astype function to change the type from string to integer. These videos might be helpful to you:
ua-cam.com/video/bofaC0IckHo/v-deo.html
ua-cam.com/video/V0AWyzVMf54/v-deo.html
Good luck!
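Putting those pieces together, and also stripping the thousands separator before the integer conversion, a sketch:

import pandas as pd

income = pd.Series(['J:10,000', 'P:50,000'])

# drop the two-character prefix, remove the commas, then convert to integer
income = income.str[2:].str.replace(',', '', regex=False).astype(int)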
How do I keep the rows that contain null values in any column, and remove the complete rows?
Does this help? ua-cam.com/video/fCMrO_VzeL8/v-deo.html
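For the first half of the question, selecting only the rows with at least one missing value can be done with a boolean mask. A minimal sketch, assuming a DataFrame df:

# keep only the rows that have a null in any column; the complete rows are dropped
incomplete_rows = df[df.isnull().any(axis=1)]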
How do I replace similar duplicate values with one of the values? How do I solve this??
I think the process would depend a lot on the particular details of the problem you are trying to solve.
The user ids are not the same, so how can the rows be duplicates?
Thanks for this video :)
How can we remove duplicates, delete columns, delete rows, and insert new columns using a Python script?
Glad you liked the video! This video shows how to remove rows or columns: ua-cam.com/video/gnUKkS964WQ/v-deo.html
Does that help to answer your question?
Thanks! My question: how can we sort month names?
This video might be helpful to you: ua-cam.com/video/yCgJGsg0Xa4/v-deo.html
Hi, I am a big fan of your work, and I have learned a lot from the videos. Can you please help me with how I can use Excel's VLOOKUP in pandas?
This might help: medium.com/importexcel/common-excel-task-in-python-vlookup-with-pandas-merge-c99d4e108988
Good luck!
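The core idea from that article is that a left merge plays the role of VLOOKUP. A sketch with hypothetical tables:

import pandas as pd

employees = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cat']})
lookup = pd.DataFrame({'emp_id': [1, 2, 3], 'dept': ['HR', 'IT', 'HR']})

# the left merge keeps every employee row and pulls in the matching dept value
result = employees.merge(lookup, on='emp_id', how='left')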
You are amazing!
Thank you!
How do I remove leading and trailing spaces in a DataFrame?
Thanks for the video
You're welcome!
Good lesson, but the data types have to match. I found I had to process my pandas tables with .astype(str) before this worked.
💯+ like. Thank you very much sir.
Thank you!