The DataType! I had no idea this was even possible.
So many great tricks once you read the docs
Awesome tips, the data type trick is bonkers 😀
I've honestly never had enough data to get a memory error from pandas. I really like this vid tho bc I do use pandas a bit, and knowing this will help me if I ever work with huge datasets.
Start working with DNA.
Useful, but... if I just want to take a quick look at a dataset and I don't know its structure, what should I do?
I know it's an old vid, but how can I limit the string length? Say I know the max length — how do I avoid over-allocation? More importantly, how do I specify a fixed-size ("non-growing") string, since dynamic allocation is performance hell? Does anyone know?
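One option (my suggestion, not something the video covers): NumPy's fixed-width string dtypes allocate a fixed buffer per element instead of a growable Python string object; pandas itself keeps variable-length strings as `object` columns, so this trick applies at the NumPy level:

```python
import numpy as np

# "U8" reserves space for up to 8 Unicode characters per element,
# in a fixed-size buffer (no per-element dynamic allocation).
codes = np.array(["AB123", "CD4567", "E"], dtype="U8")
assert codes.dtype == np.dtype("U8")
assert codes.itemsize == 8 * 4  # 4 bytes per UCS-4 code point

# Beware: values longer than the width are silently truncated.
clipped = codes.astype("U3")
# clipped[1] == "CD4"
```

The trade-off is the truncation behavior shown at the end — pick a width you are sure covers the longest value.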
How do you reduce the font size of the cell output (only)?
Just by specifying the correct type for each column I dropped the memory usage by almost 50%. Thanks!!!
I have a problem with my data. I have lots of dataframes (Excel files), each from a different vendor, all with product descriptions (code, name, size, color, price, etc.). The problem is that there is no fixed pattern: every vendor sends me their own Excel file daily, but they don't all have the same parameters. For example, some have a color column and others don't.
For context, I'm using Django. My goal is a Product model with all attributes, but I only want to create or update the information a given vendor provides. The first time, while creating (bulk create), I add all fields and set a default for those missing. When updating, though, I should only update the fields with new, different values, like price, since descriptions should never change — otherwise it would be a new Product.
I started with simple looping code, and for a 2,000-row Excel file it takes 15 minutes to check all the info and handle each field based on preset conditions.
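The per-row loop is probably the bottleneck. One common fix (a sketch with made-up column names, not your actual schema) is to diff the vendor file against the existing data in pandas first, then feed the resulting subsets to Django's `bulk_create`/`bulk_update`:

```python
import pandas as pd

# Hypothetical data: existing catalog vs. a vendor's daily file.
existing = pd.DataFrame({"code": ["A1", "B2", "C3"],
                         "price": [10.0, 20.0, 30.0]})
incoming = pd.DataFrame({"code": ["A1", "B2", "D4"],
                         "price": [10.0, 25.0, 40.0]})

# Left-join the incoming file onto what is already stored.
merged = incoming.merge(existing, on="code", how="left",
                        suffixes=("", "_old"))

# Unknown codes need to be created; known codes whose price
# actually changed need to be updated; everything else is skipped.
to_create = merged[merged["price_old"].isna()]
to_update = merged[merged["price_old"].notna()
                   & (merged["price"] != merged["price_old"])]
# to_create contains code D4; to_update contains code B2
```

Two bulk queries instead of 2,000 individual checks should cut the runtime from minutes to seconds.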
Pretty awesome, I really liked specifying the data types. Reducing by an order of magnitude is fantastic
I agree with most of your talk, but the choice of int16 seems a little risky. With a maximum positive value of 32,767, that is less than a factor of ten above the maximum in the sample presented (4,611). I would not feel safe when the maximum of the current data and the maximum of the type representing it are in the same order of magnitude, certainly when also running models on future data, which may be quite different from the current dataset. An int32 type therefore seems better, although the uint variant is also applicable in this case and roughly doubles the usable range, since "units sold" should not be negative, I presume.
Kind regards
I don't see it as a risk; the best way to figure out the size is to get the column's max value, and based on that we can decide which type to use.
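You can also let pandas make that decision, which sidesteps the judgment call (illustrative data, using the 4,611 figure from the thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"units_sold": [120, 4611, 87]})

# Check the observed range against a candidate type's limits.
lo, hi = df["units_sold"].min(), df["units_sold"].max()
info = np.iinfo(np.int16)
assert info.min <= lo and hi <= info.max  # int16 holds this data

# Or let pandas pick the smallest safe integer type automatically:
downcast = pd.to_numeric(df["units_sold"], downcast="integer")
# downcast.dtype is int16 for this sample
```

The future-data concern still stands, though: `downcast` only guarantees safety for the values it has seen.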
So you're telling me I haven't used Pandas to its full potential yet?
This is very interesting, though if I'm being honest, every time I actually ran into a MemoryError with pandas, it was because I had made a stupid mistake, and these tips wouldn't have helped much. Still, thanks for the tips.
Great advice, provided you can read the CSV file in one go — the file I need to read is so big that I can't even load it with Pandas directly.
How do you read it then? I'm having a similar issue with some sales data I'm using.
@babsNumber2 I went low-level: I used the io library to read row by row and built a limited-size dataframe, so I can't read all of the rows or my program will crash. I think the dask library might also help you, from what I know of it.
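Before going fully low-level, it may be worth trying pandas' built-in streaming via `chunksize` (sketch below — the in-memory `StringIO` stands in for a file too big to load at once):

```python
import io
import pandas as pd

# Toy stand-in for a huge CSV on disk.
big_csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
# chunksize=4 yields the file in 4-row pieces, so only one
# chunk's worth of rows is in memory at any time.
for chunk in pd.read_csv(big_csv, chunksize=4):
    total += chunk["value"].sum()

# total == sum(range(10)) == 45
```

This lets you aggregate or filter files larger than RAM without leaving pandas; dask generalizes the same idea across many files and cores.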
@manoeljose3321 Thank you very much. I'll check that out.
i have the same record player 💙
awesome, thanks a lot!
Great tips