Polars: The Next Big Python Data Science Library... written in RUST?

  • Published 9 Feb 2025

COMMENTS • 248

  • @rahuldev2380
    @rahuldev2380 2 years ago +342

    Polars is built on top of Apache Arrow which pandas supports. So you can easily convert your polars dataframe to pandas with almost zero overhead. I use polars to do the hard part and jump back to pandas for the visualization stuff
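    For example, the round trip I mean is roughly this (untested sketch; group_by is the newer spelling of groupby, and to_pandas()/from_pandas() need pyarrow installed):

    import polars as pl

    df = pl.DataFrame({"airline": ["AA", "UA", "AA"], "delay": [5, 12, 3]})

    # do the heavy lifting in polars
    agg = df.group_by("airline").agg(pl.col("delay").mean().alias("mean_delay"))

    # hop over to pandas (both sides speak Arrow, so the copy is cheap) for plotting etc.
    pdf = agg.to_pandas()

    # ...and back again if needed
    agg2 = pl.from_pandas(pdf)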

    • @cryptoworkdonkey
      @cryptoworkdonkey 2 years ago +11

      Only if you use pyarrow first. Pandas converts Arrow into its internal representation (NumPy arrays managed by the BlockManager) and back, so it's not zero cost.

    • @rahuldev2380
      @rahuldev2380 2 years ago +2

      @@cryptoworkdonkey Ah my bad. I thought they had updated their internals from numpy

    • @jakobullmann7586
      @jakobullmann7586 2 years ago +4

      Same here. There are some things where Pandas is more convenient, but for most stuff I strongly prefer Polars. It’s not just execution performance, but also the speed of writing the code.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago +11

      ​@@cryptoworkdonkey good news, Pandas 2.0 release candidate now uses pyarrow as the backend. Polars Pandas conversions will be zero cost.

    • @HermanWillems
      @HermanWillems 8 days ago

      @@adrianjdelgado Pandas 3.0

  • @bigphab7205
    @bigphab7205 2 years ago +37

    10000 points for printing the version. Every tutorial video should do that.

    • @robmulla
      @robmulla  2 years ago +7

      Thanks! I forget to do it on all of my videos but your comment is going to remind me to do it in the future.

  • @jakobullmann7586
    @jakobullmann7586 2 years ago +26

    13:20 Regarding learning the syntax… It’s worth mentioning that Polars syntax is very similar to PySpark, so it’s really two birds with one stone.
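    To show what I mean, the same group-by reads almost identically in both (rough sketch from memory, not exact for every version):

    import polars as pl

    df = pl.DataFrame({"carrier": ["AA", "UA", "AA"], "dep_delay": [5.0, 12.0, 3.0]})

    # Polars
    out = df.group_by("carrier").agg(pl.col("dep_delay").mean().alias("avg_delay"))

    # PySpark equivalent (assumes a SparkSession named `spark` already exists):
    # from pyspark.sql import functions as F
    # sdf = spark.createDataFrame(df.to_pandas())
    # out = sdf.groupBy("carrier").agg(F.mean("dep_delay").alias("avg_delay"))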

    • @robmulla
      @robmulla  2 years ago +8

      That’s a good point. Thanks for pointing it out. I really need to do a spark vs polars comparison video.

  • @Joselias156216
    @Joselias156216 2 years ago +16

    Nice video. Very interesting to see how Polars works; I hope to see it more frequently in your future streams to learn more about its practical use.

    • @robmulla
      @robmulla  2 years ago +3

      Thanks Jose! I appreciate the feedback. I'm definitely going to give it a try in a future stream. I just need to find a good dataset for it.

  • @brd5548
    @brd5548 2 years ago +175

    Our team tried to integrate Polars into our analytics pipeline last year, and the result was kind of on and off. To be honest, the performance of pandas is not that bad; we spent some time on several rounds of fine tuning, like rewriting key bottlenecks with our native modules or with vectorized pandas methods, and the result turned out just OK. On the other hand, the integration work for Polars did require some major revamping and refactoring, due to API gaps and implementation differences between the two. However, the performance gains didn't seem to justify the effort. What's worse, while pandas does come with pitfalls and caveats here and there, Polars is a relatively young project and it came with bugs in basic text manipulation operations.
    But don't get me wrong, that was my experience last year. I do think polars has the potential. It has a much more robust and modern architecture than pandas in my opinion. Its API style is cleaner and more consistent. And it comes with a query optimization engine, which many users can appreciate if you are familiar with tools like apache spark or some databases. Given time, I think polars should become another powerful player in the future. So, definitely give it a try if you're building something new!

    • @robmulla
      @robmulla  2 years ago +14

      Thanks for sharing! I haven't used polars in production yet, so it's interesting to hear about your experience. I guess there are limitations I didn't consider in this video. I totally agree it's worth giving a try.

    • @BiologyIsHot
      @BiologyIsHot 2 years ago +5

      This is the major bit. Who is actually bottlenecked by pandas? I think the bottlenecks happen in ML or other modeling libraries, which work with the data in the form of NumPy arrays.

    • @leventelajos5078
      @leventelajos5078 2 years ago +2

      "Its API style is cleaner" Really? I think Pandas is much more pythonic.

    • @incremental_failure
      @incremental_failure 1 year ago

      @@leventelajos5078 Agree. Column assignment in Pandas seems more pythonic.

    • @konstagold
      @konstagold 1 year ago +5

      @@BiologyIsHot When you're working with large data sizes, you will be bottlenecked by pandas in no time. Typically at that point, you switch to spark, which has its advantages, but also downsides. Polars looks to be a good middle fit between the two that dask was trying to achieve.

  • @tmb8807
    @tmb8807 1 year ago +2

    I'm blown away by how fast this is. Sure there are some things it can't do, but man, even for just reading large data sets it's absolutely blazing.

  • @calum.macleod
    @calum.macleod 2 years ago +16

    Thanks for a good explanation of how Polars could benefit people who use pandas and need more speed. In my project we already have a heavy emphasis on multiprocessing and fast inter-process communication, so I am especially interested to see a pandas vs Polars single-core performance comparison for group-by and join. I hope that someone does the comparison and posts it to YouTube.

    • @robmulla
      @robmulla  2 years ago +2

      Glad it was helpful! If you look in the Polars repo they have some queries that they benchmark. H2O also has a benchmark comparison of a few different libraries.

    • @calum.macleod
      @calum.macleod 2 years ago +1

      @@robmulla Thanks for the reply. I will look into the benchmarks and h2o.

  • @santiagoperman3804
    @santiagoperman3804 2 years ago +7

    Great timing, I was looking to start playing with Polars since Mark Tenenholtz mentioned it some days ago. I went back to Pandas because couldn't find the assign() and astype() equivalents in Polars, I thought they were lacking, but they seem to be with_columns() and cast(). Now I will resume more persistently.
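    For anyone searching for the same thing, this is the mapping I landed on (untested sketch; newer Polars accepts bare expressions in with_columns, older versions want a list):

    import pandas as pd
    import polars as pl

    # pandas
    pdf = pd.DataFrame({"a": [1, 2, 3]})
    pdf = pdf.assign(b=pdf["a"] * 2).astype({"a": "float64"})

    # polars
    df = pl.DataFrame({"a": [1, 2, 3]})
    df = df.with_columns(
        (pl.col("a") * 2).alias("b"),      # assign() equivalent
        pl.col("a").cast(pl.Float64),      # astype() equivalent
    )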

    • @robmulla
      @robmulla  2 years ago +1

      Glad you found this video helpful. It does seem like polars may be worth the time investment now that it's becoming more established.

  • @gregharvey8574
    @gregharvey8574 2 years ago +34

    Thanks for bringing this to my attention; I think I might include Polars in some productionization processes. For data exploration I typically only use parts of dataframes for plotting or investigation. Given that you can convert a Polars dataframe to pandas, it seems like a good approach would be to keep the full dataset in Polars and then filter into a pandas dataframe for plotting.

    • @robmulla
      @robmulla  2 years ago +7

      That's a good point about how you can convert the dataframe to pandas when you need to do exploration. I'll have to think about how to use this in my EDA pipelines.

    • @headbangingidiot
      @headbangingidiot 2 years ago +1

      ​@@robmulla you can pass polars columns into plotting libs like plotly

    • @BiologyIsHot
      @BiologyIsHot 2 years ago

      The question though is do you save much time when doing this? Instantiation of Numpy arrays and Pandas dataframes themselves isn't the fastest. I guess if you have multiple "slow" actions to perform on the data you might have some benefits? Or if you really are working at such a massive scale with many many users that saving compute time is really valuable.

  • @scraps7624
    @scraps7624 2 years ago +2

    I saw some tweets about Polars but seeing it in action is something else
    Also, I can't believe it took me this long to find your channel, subbed!

    • @robmulla
      @robmulla  2 years ago

      That’s awesome! Glad you found my channel. Feel free to share with others!

  • @rackstar2
    @rackstar2 1 year ago +1

    I recently decided to fully transition from pandas to Polars for a data pipeline project.
    The primary reason I'm liking Polars over pandas is not just the speed (the speed is nice, don't get me wrong) but the space usage!
    Almost all of my operations entail working with data larger than memory.
    One of the operations I have to do is pivoting a dataframe. My end result has thousands of columns!
    My kernel never seems to hold steady when doing this with pandas, but Polars is really doing the trick for me.
    One small problem I did face, though, is exporting the results of the pipeline.
    I still have to resort to something like pyarrow and use its writer to do the export in chunks.
    This might just be because of how low my system memory is. Regardless, Polars seems to be an excellent option for data processing and manipulation, and if you do want to showcase your data, you can always convert back and forth with pandas!

  • @jcbritobr
    @jcbritobr 2 years ago +3

    Nice stuff. Polars seems like a killer tool. Thanks for sharing.

    • @robmulla
      @robmulla  2 years ago

      Thanks for watching. It does seem promising.

  • @DiegoSilva-dv9uf
    @DiegoSilva-dv9uf 1 year ago +1

    Thanks!

  • @juan.o.p.
    @juan.o.p. 2 years ago +2

    Thanks for the recommendation, I will definitely give it a try 😊

    • @robmulla
      @robmulla  2 years ago +1

      Please do and let me know what you think. There might be negatives about it that I'm not aware of.

  • @nikjs
    @nikjs 2 years ago +4

    To the Python library developers: please create a wrapper lib that does the job of converting regular pandas syntax into the wee-bit more complicated Polars syntax. I can see that not all ops would be readily convertible, but there's definitely some low-hanging fruit here which would cover a lot of simple use cases.

    • @robmulla
      @robmulla  2 years ago

      That would be nice. But I also think it’s nice to have it different to make it clear it’s not the same.

  • @GiasoneP
    @GiasoneP 2 years ago +30

    Like PySpark AND pandas. The second half mirrors PySpark. Given the speed and out-of-the-box parallelization, I wonder how it stacks up against Spark and how its functionality compares to a cluster of machines. Take AWS, for example: can it be applied to an EMR cluster? As a side note, I'm super excited about Rust and its future in data.

    • @cryptoworkdonkey
      @cryptoworkdonkey 2 years ago +6

      There are some Apache Arrow-based Spark competitors (still too young) like Ballista (distributed DataFusion, written in Rust).
      We "buy" Spark for the Resilient in RDD. Polars can process ~50 GB on one machine where Spark manages ~35 GB, because of Spark's less effective row-based abstraction (a trade-off of being "distributed"), Scala case classes blowing up memory, etc., versus the skinny Rayon runtime in Polars.
      The Ray platform has the same Arrow format backend and is more efficient than Spark, but it can't do streaming (yet).
      In the Polars repo, the polars-dask integration is empty.

    • @pabtorre
      @pabtorre 2 years ago +1

      Yeah the syntax is very similar to pyspark
      Wonder how well it'll run on a spark cluster...

    • @robmulla
      @robmulla  2 years ago +3

      Good question. I don't think Polars is meant as a replacement for PySpark because, from what I can tell, it doesn't distribute computation across nodes.

    • @AWest-ns3dl
      @AWest-ns3dl 2 years ago +5

      I can confirm, Polars does not use nodes.

    • @RyanApplegatePhD
      @RyanApplegatePhD 2 years ago +2

      @@robmulla With ever-improving compute, I think Polars could be in a sweet spot between Spark and pandas. When I was parsing very large raw datasets in pandas I did sometimes feel constrained and moved to Spark; however, there is a lot of overhead in using Spark effectively, and this might split the difference.

  • @curlyman_
    @curlyman_ 2 years ago +3

    This is my little trick for hyper optimizing data processing haha. Pivots are insanely fast in polars

    • @robmulla
      @robmulla  2 years ago

      Ohh. Never tried pivots in it.

  • @sonnix31
    @sonnix31 2 years ago +2

    This is fantastic. Thank you

    • @robmulla
      @robmulla  2 years ago

      You're very welcome!

  • @ChaiTimeDataScience
    @ChaiTimeDataScience 2 years ago +4

    DataTable is also pretty legendary, you might also find it super awesome.
    Thanks again for your amazing videos, I have watched and learned from every one of them. I hope I'll interview you about your 100k celebration sometime next year 🙏

    • @robmulla
      @robmulla  2 years ago +1

      Thanks Sanyam! I need to check it out. Hopefully 100k will come next year, but maybe 2024! Talk soon.

  • @patrickonodje1428
    @patrickonodje1428 2 years ago +2

    I love your work. You should have a course on data science for folks like us who are just learning.

    • @robmulla
      @robmulla  2 years ago +3

      Maybe one day! Thanks for watching Patrick!

    • @patrickonodje1428
      @patrickonodje1428 2 years ago +1

      @@robmulla Looking forward

  • @tonik2558
    @tonik2558 2 years ago +3

    The usage in Python seems to mirror a lot of the standard Rust iterator API. Looks like it would be even better if used directly in Rust. Thanks for making a video about this.

    • @brainsniffer
      @brainsniffer 2 years ago +1

      I think there is so much data tooling already built in Python that it's easier to use an abstraction like this than to do things in Rust, especially for interactions. It's an interesting idea.

    • @robmulla
      @robmulla  2 years ago +1

      I have learning RUST on my todo list. Will you teach me? 😝

    • @tonik2558
      @tonik2558 2 years ago +3

      @@robmulla The Book is an amazing starting resource. It's how I learned Rust, and it's probably the fastest way to get started with the language

    • @shadowangel8005
      @shadowangel8005 2 years ago

      @@robmulla google just posted a small course a week or so back

  • @Mari_Selalu_Berbuat_Kebaikan
    @Mari_Selalu_Berbuat_Kebaikan 2 years ago +2

    Let's always do good and encourage more people to do the same 🙏

  • @gabrielperfumo1122
    @gabrielperfumo1122 2 years ago +1

    Great channel!! Thanks for sharing. I'll check it out for sure!

  • @bubbathemaster
    @bubbathemaster 2 years ago +7

    Extremely interesting. It’ll be hard to dethrone pandas due to the huge community support but I really like the lib.

    • @robmulla
      @robmulla  2 years ago +2

      I agree pandas is too entrenched at this point to be easily dethroned.

  • @bryanwilly4086
    @bryanwilly4086 1 year ago

    Perfect, thank you!

  • @samstanton-cook1419
    @samstanton-cook1419 2 years ago +6

    Great video, thanks Rob! Our data science teams use Polars a lot. For long time-series aggregation queries (100M+ rows) we use the pykx Python package to access the q/kdb+ language, for higher performance still over pandas and Polars. Have you seen it?
    kx.q.qsql.select(qtab,
                     columns={'minCol2': 'min col2', 'medCol3': 'med col3'},
                     by={'groupCol1': 'col1'},
                     where=['col30.7'])

    • @robmulla
      @robmulla  2 years ago

      I need to check that out. Pykx… first time hearing of it. Sounds cool though. Thanks for watching.

  • @BiologyIsHot
    @BiologyIsHot 2 years ago +16

    I think the big problem is that it isn't interoperable with NumPy-based libraries. I'm honestly struggling to think of many cases where pandas is too slow. Some of the features like the lazy/eager API could be nice, but I think most of the slow computations people run are inside libraries that are going to require conversion to NumPy arrays anyway.
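    For reference, the lazy API I'm talking about looks roughly like this (sketch only; flights.csv is a made-up file name):

    import polars as pl

    lazy = (
        pl.scan_csv("flights.csv")                       # nothing is read yet
        .filter(pl.col("dep_delay") > 0)
        .group_by("carrier")
        .agg(pl.col("dep_delay").mean().alias("avg_delay"))
    )
    result = lazy.collect()                              # the optimized query runs here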

    • @robmulla
      @robmulla  2 years ago +3

      Yea, I guess it really depends on your use case. I've run across a few recently where polars was helpful.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago +2

      You can convert to and from Pandas very easily. Now that Pandas 2.0 will use pyarrow as the backend, that conversion will be truly zero cost.

  • @rohitnair4268
    @rohitnair4268 2 years ago

    As usual, Rob, nice video. I have learned a lot from you.

  • @mutley11
    @mutley11 1 year ago +2

    Very compelling presentation; many thanks. I would have liked to see an example of how user-friendly the error messages are. Rust error messages are surprisingly good in general and I was wondering if that is true of polars. You missed at least one opportunity to illustrate a typo. 😊

    • @robmulla
      @robmulla  1 year ago +1

      Glad it was helpful! Next time I'll try to throw more errors :D

  • @두두-b2d
    @두두-b2d 2 years ago +1

    OMG.. thank you!!

  • @ApeWithPants
    @ApeWithPants 2 years ago +5

    Pandas has some strange quirks that always bothered me. Strange syntax or unintuitive copy/not copy behavior. Glad to see more competitors

    • @robmulla
      @robmulla  2 years ago +1

      I’m a big fan. But also think polars and others like it have good potential. Thanks for watching! Are you a kraken fan? Go Caps!

  • @pimziengs2900
    @pimziengs2900 2 years ago +1

    Thanks for this video! I am a data scientist always looking for some new techniques xD.
    Cheers from the Netherlands!
    PS: There is some background noise in your video around 3:30.

    • @robmulla
      @robmulla  2 years ago

      Welcome! Glad to have a viewer from the Netherlands. Sorry about the noise at 3:30 - I didn't notice it until after I was done editing and then it was too late.

  • @aminehadjmeliani72
    @aminehadjmeliani72 2 years ago +1

    Hi @rob, I think it's a good approach to diversify our tools these days, especially when it comes to dealing with memory (sometimes I find myself running out of time with pandas).

    • @robmulla
      @robmulla  2 years ago +2

      Absolutely! Well said.

  • @hensonjhensonjesse
    @hensonjhensonjesse 2 years ago +2

    It looks surprisingly similar to pyspark. Especially the lazy implementation. Pretty cool stuff!

    • @robmulla
      @robmulla  2 years ago +1

      Yea, a lot of similarities to pyspark!

  • @MaavBR
    @MaavBR 2 years ago +1

    7:10 Quick correction, SAN is San Diego, not San Francisco
    San Francisco airport's code is SFO

  • @jackychan4640
    @jackychan4640 2 years ago +1

    Happy New Year 2023

    • @robmulla
      @robmulla  2 years ago

      Same to you Jacky! 🎆

  • @AlexanderHyll
    @AlexanderHyll 2 years ago +4

    As a btw: if you want to plot something quickly, converting to pandas is super fast (if of course a bit memory-inefficient). You can also just pass columns to plt. Just my 2 cents.
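    Roughly like this (untested sketch; matplotlib should coerce the Polars columns to NumPy arrays, or call .to_numpy() yourself if it complains):

    import matplotlib.pyplot as plt
    import polars as pl

    df = pl.DataFrame({"day": [1, 2, 3], "delay": [5.0, 12.0, 3.0]})

    # option 1: hand columns straight to matplotlib
    plt.plot(df["day"], df["delay"])

    # option 2: convert and use pandas' built-in plotting
    df.to_pandas().plot(x="day", y="delay")
    plt.show()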

    • @robmulla
      @robmulla  2 years ago

      Good point, I do use df.plot() a lot though so it would take some getting used to.

    • @adrianjdelgado
      @adrianjdelgado 1 year ago

      Now that Pandas 2.0 uses pyarrow as backend, conversions will be truly zero cost.

  • @chris_kouts
    @chris_kouts 2 years ago +1

    You should do a benchmarking video; I was waiting for you to tell me if I should start using it.

    • @robmulla
      @robmulla  2 years ago

      I made a video about it just yesterday! Check it out on my channel.

  • @CaribouDataScience
    @CaribouDataScience 2 years ago

    Good stuff!!

    • @robmulla
      @robmulla  2 years ago

      Glad you enjoyed it

  • @bazoo513
    @bazoo513 2 years ago +1

    "Split, apply, combine" approach sounds like it could employ massively parallel processing of graphics cards. Is there a CUDA implementation?

    • @robmulla
      @robmulla  2 years ago +1

      Yes! It's called RAPIDS. I need to make a video about it.
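      From memory it's close to a drop-in pandas API, something like this (untested sketch; needs an NVIDIA GPU, cudf installed, and data that fits in GPU memory; flights.csv is a made-up file):

      import cudf

      gdf = cudf.read_csv("flights.csv")
      out = gdf.groupby("carrier")["dep_delay"].mean()
      print(out.to_pandas())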

    • @bazoo513
      @bazoo513 2 years ago

      @@robmulla Thanks!

  • @PlatinumDragonProductions999
    @PlatinumDragonProductions999 2 years ago +4

    I love Pandas, but I prefer Spark. This looks very Spark-like to me; I'm eager to make it my goto dataframe processor. :-)

    • @robmulla
      @robmulla  2 years ago

      If you prefer spark I’m guessing this will be a great package for you.

  • @HyperFocusMarshmallow
    @HyperFocusMarshmallow 2 years ago +6

    The Rust community really produces brilliant stuff. Very impressive!
    Did you find any areas where Polars is lacking vs pandas?
    Btw, have you checked out nu-shell? It's essentially a new shell language designed to follow the Unix philosophy but with dataframes for data flow, at least as far as I understand it. Written in Rust, of course.
    It's in pretty early development, but it feels pretty great to play around with and can probably produce some nice workflows.

    • @robmulla
      @robmulla  2 years ago +1

      Never heard of nu-shell but I'll check it out. I am not too familiar with the Rust community, but this package is pretty solid. As people have mentioned, the syntax is much more verbose and it lacks some of the built-in pandas features.

  • @akhil-menon
    @akhil-menon 1 year ago

    Hi Rob, thank you for this super informative video! In one of your takeaways, you mentioned that Polars is a good fit if we have some really heavy data processing work. Would you be able to share some insight on how Polars would stack up against pandas when having to perform heavy NumPy-specific computations? (Think linear and vector algebra, trigonometry, matrix operations.)
    I read on SO that it is imperative not to kill the parallelization that Polars provides by using Python-specific code, so my intuition is that applying NumPy operations on Polars columns could result in a loss of parallelization. It would be great if you could share your thoughts on this (a sketch of what I mean is below). Thank you again for the amazing content you produce!
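    Concretely (sketch only; the column names are made up):

    import numpy as np
    import polars as pl

    df = pl.DataFrame({"x": [0.0, 0.5, 1.0]})

    # stays inside the Polars expression engine (can be parallelized/optimized)
    a = df.with_columns(pl.col("x").sin().alias("sin_x"))

    # drops out to NumPy on a materialized copy of the column
    b = np.sin(df["x"].to_numpy())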

  • @The-KP
    @The-KP 2 years ago +2

    @Rob Mulla Nice that Polars can perform RDBMS-like ops, but what about the computation libs that bind to pandas dataframes, like NumPy, SciPy, scikit-learn? If it can be used with those, or somehow replaces them, I'm in! Hopefully Polars is not an island.

    • @robmulla
      @robmulla  2 years ago

      I know you can easily convert from Polars back to a pandas dataframe, and they share the Apache Arrow format.

  • @AaronWoodrow1
    @AaronWoodrow1 2 years ago +2

    I don't fully get why it's geared more toward data pipelining rather than data exploration (as mentioned @ 13:33) if the data needs to be contained on a single host. Even with parallelization across multiple CPUs, there's still a data size cap limited by available memory. A tool such as PySpark (or Dask) seems better suited for pipelining, which ultimately consumes larger amounts of data.

    • @robmulla
      @robmulla  2 years ago +1

      Yea, I see your point. Sometimes you have data sizes in between, or you just want a faster pipeline for a small job you run on a regular basis. Either way, if it were identical to pandas but faster, people would use it for sure!

    • @AaronWoodrow1
      @AaronWoodrow1 2 years ago +1

      @@robmulla True, just a minor nit. Great video btw!

  • @neronjp9909
    @neronjp9909 1 year ago

    How come every time you click a column name, the column name gets copied into the code you're typing? Is there a hotkey for that? My company's raw data column names are really long, with underscores / spaces / dots... I always get slowed down when typing code with those column names. May I know how you do that at 8:07? Thanks.

  • @cryptoworkdonkey
    @cryptoworkdonkey 2 years ago +4

    I think Polars should replace pandas for ETL tasks, but it still has some rough edges around constructing expressions comfortably.
    And in the Arrow universe there is the DataFusion project as an alternative.

    • @robmulla
      @robmulla  2 years ago

      I agree. I haven't fully tested out the expressions to notice what I use in pandas that Polars is missing. What is the DataFusion project? I'm not familiar with that.

    • @cryptoworkdonkey
      @cryptoworkdonkey 2 years ago

      @@robmulla DataFusion is more of an "Arrow-society" project (part of the Apache Arrow project), positioned as a Spark/Hive/MapReduce challenger. It is designed to be more modular, with SQL and DataFrame APIs, and it can be used as a library (it positions itself as a query engine for Arrow) by higher-level projects.
      Polars positions itself as a challenger to the classical DataFrame libraries. But you can use both as a SQL CLI. Both have plan optimizers, Rayon parallelism, SIMD optimizations, etc.
      Both are cool. I don't know about the larger-than-memory capabilities of DataFusion. DataFusion is the foundation of the Blaze/Ballista distributed computing engines. The Polars Dask integration repo is currently not active.

  • @nikjs
    @nikjs 2 years ago +1

    3:35 - some audio interference starts from around this point, pls check the video

    • @robmulla
      @robmulla  2 years ago

      Thanks for the heads up. I noticed that when editing. Sorry about it.

  • @adityasrivastav7159
    @adityasrivastav7159 11 days ago

    Polars is not working in my Jupyter Notebook; whenever I import it, it shows that the kernel died.

  • @simplemanideas4719
    @simplemanideas4719 2 years ago +1

    Speed is always a priority, because it equals resource optimization. However, this leads to the question: how efficient is each lib per core?

    • @robmulla
      @robmulla  1 year ago

      Good question. I'd guess polars is faster on all fronts but it would depend on a lot of things.

  • @JustinGarza
    @JustinGarza 2 years ago +1

    I like this, but I wish it covered graphs. Does this use matplotlib or something else to make graphs and charts?

    • @robmulla
      @robmulla  2 years ago

      It doesn’t. But you can always convert it back to a pandas data frame to plot.

    • @JustinGarza
      @JustinGarza 2 years ago

      @@robmulla umm maybe I’ll wait til it gets more graphic/chart support or until pandas gets updated

  • @Myektaie
    @Myektaie 1 year ago +1

    Hi, thanks for this great video! It looks like Polars is very similar to Spark; do you know how they compare?

    • @robmulla
      @robmulla  1 year ago

      Thanks for the comment. They are very similar. Check out my most recent video where I compare the two.

  • @JordiRosell
    @JordiRosell 2 years ago +2

    For plotting Polars dataframes, I think plotnine is a good option.

    • @robmulla
      @robmulla  2 years ago +1

      I have a video all about my favorite plotting libraries (including plotnine): ua-cam.com/video/4O_o53ag3ag/v-deo.html&feature=shares

  • @chintansawla
    @chintansawla 2 years ago +3

    The library feels like it's based on the syntax/methods of PySpark. A lot of the methods are similar to how RDDs are converted to DataFrames in PySpark.

    • @robmulla
      @robmulla  2 years ago +2

      Yes, definitely a lot of similarities between pyspark and polars. Pyspark has always been much slower for me when running on a single node.

    • @chintansawla
      @chintansawla 2 years ago

      @@robmulla that's a bit shocking! Both seem to be performing in a similar fashion theoretically (lazy evaluation, parallel computing). Going to try and compare polars soon. Thanks

    • @jordanfox470
      @jordanfox470 2 years ago +1

      @@robmulla have you tried pandas on spark? Databricks has that running.

    • @robmulla
      @robmulla  2 years ago

      @@jordanfox470 no. Have you? How does it compare?

  • @Pedro_Israel
    @Pedro_Israel 2 years ago +1

    Hey Rob, can you do a video about automatic EDA libraries? I used them and they blew my mind. I'm amazed I didn't know about them earlier.

    • @robmulla
      @robmulla  2 years ago

      That's a good suggestion. What libraries have you used that you like? The main one I've seen is pandas profiling.

  • @donnillorussia
    @donnillorussia 2 years ago +1

    Isn't this "split-apply-combine" approach similar to map-reduce? Just curious 😉

    • @robmulla
      @robmulla  2 years ago +1

      Yes! Exactly. MapReduce (like in Spark) is very similar. Polars only runs on a single node, while MapReduce, I believe, can be done across nodes.

  • @user-fv1576
    @user-fv1576 11 months ago

    Looks a bit like SQL with the select. Newbie question: why not just use the pandasql library?

  • @georgiyveter6391
    @georgiyveter6391 2 years ago +1

    Using Python 3.10. I created a dictionary and a dataframe:
    import polars as pl
    d = {'a': [1, 2, 3], 'b': [4, -5, 6]}
    df = pl.DataFrame(d)
    print(type(df))
    print(df)
    It all works. But if I change any number in dictionary d to a float, for example 6.8, then print(type(df)) still shows it's a DataFrame, yet the next print silently does nothing, like 'pass', and the script ends. Why?

    • @robmulla
      @robmulla  2 years ago

      That’s a great question. Is it only with 3.10?

  • @K-mk6pc
    @K-mk6pc 1 year ago +1

    I am working with large data in pandas, but it's not a problem for me; pandas does fine in a few minutes.

  • @bazoo513
    @bazoo513 2 years ago +1

    I wonder why the authors of these tabular data manipulation libraries didn't adopt relational algebra terminology (or even SQL as a, if not the, manipulation language). For example, why isn't choosing only some columns called "projection"?
    Subtle syntax (and _especially_ semantics) differences between libraries designed to do essentially the same tasks make users' lives unnecessarily more difficult.

    • @robmulla
      @robmulla  2 years ago +1

      That’s a good point. Some libraries (like spark) do have the ability to write SQL directly on flat files like this.

  • @suvidani
    @suvidani 2 years ago +1

    How does the performance compare to PySpark? The syntax is very similar to PySpark.

    • @robmulla
      @robmulla  2 years ago

      Good question. I might need to test it out. Haven’t used spark in years and had some bad experiences but it’s probably gotten better since then.

  • @grabani
    @grabani 2 years ago +1

    Interesting.

    • @robmulla
      @robmulla  2 years ago

      Glad you think so!

  • @ankan650
    @ankan650 2 years ago +1

    Wow. It looks like Apache Spark might be obsolete soon. Can you also compare the Ray package with Polars? I think Ray is not exactly for data processing, but rather for more compute-intensive tasks. Thanks.

    • @robmulla
      @robmulla  2 years ago +1

      I benchmark ray in a different video if you want to check it out.

  • @FabioRBelotto
    @FabioRBelotto 7 months ago

    You should have tested polars with the same test as you did with dask, modin and vaex

  • @Matias-eh2pn
    @Matias-eh2pn 2 years ago +1

    How did you configure that theme in Jupyter?

    • @robmulla
      @robmulla  2 years ago

      I have a whole video on my setup. Check it out here: ua-cam.com/video/TdbeymTcYYE/v-deo.html

  • @valuetraveler2026
    @valuetraveler2026 2 years ago +1

    URLError:

    • @robmulla
      @robmulla  2 years ago

      Strange. Did you get this error when trying to pip install? Otherwise polars shouldn't be using anything to connect to the internet.

  • @michaeldeleted
    @michaeldeleted 2 years ago +2

    OMG I just completely replaced pandas with polars and all the regular pandas commands worked

    • @robmulla
      @robmulla  2 years ago

      Wait, what? I think the syntax should be very different. Unless they released a new version that I don't know about. Can you show an example?

    • @michaeldeleted
      @michaeldeleted 2 years ago +1

      Oops, didn't change all my pd to pl. LOL was still using pandas

    • @robmulla
      @robmulla  2 years ago

      @@michaeldeleted oh! That explains it.

  • @akshaydushyanth9720
    @akshaydushyanth9720 2 years ago +1

    Is it similar to PySpark? What's the difference between the two?

    • @robmulla
      @robmulla  2 years ago +1

      It only runs on a single node, and it's much faster than PySpark when working with data that fits in memory.

  • @JonLikesStats
    @JonLikesStats 5 months ago

    Why do we compare Polars to pandas instead of Polars to Dask? I dabble in Rust myself, so I'm interested in Polars. But the comparison most people make seems inherently unfair because of multithreading.

  • @yayasssamminna
    @yayasssamminna 11 months ago

    Please make a tutorial on Dask!!!

  • @JohannPetrak
    @JohannPetrak 2 years ago +2

    Your timeit presentation includes the time to read the data, which might not be such a good idea.

    • @robmulla
      @robmulla  2 years ago

      Nice catch, but I actually did that intentionally because data I/O is one area where polars can be much faster.

    • @JohannPetrak
      @JohannPetrak 2 years ago +1

      @@robmulla It is just very bad practice to do this, and there are other issues that can totally distort the measurements, like the OS caching read data in buffers from a previous read.

    • @robmulla
      @robmulla  2 years ago

      @@JohannPetrak that’s a good point. Any idea how I could properly compare the read time in a way that wouldn’t be messed up by the caching?

    • @JohannPetrak
      @JohannPetrak 2 years ago

      @@robmulla I think there is no way to avoid it, but it may be possible to reduce the effect by loading files that are much larger than what the OS might use for caching, and also by loading a sequence of many different files in a single benchmarking run, then repeating this several times and taking the average (and stdev). Also maybe check how much the external storage is the bottleneck by also loading from SSDs or memcached files.
      With HDDs this will be A LOT slower than the CPU-based benchmarks, so I would argue for separating these benchmarks from each other.
      But even with the CPU-based ones, running on larger data structures (on a computer that has even larger RAM) may give better results, as the impact of OS, memory management, and (JIT) interpreter optimizations gets reduced.
      Sorry, I do not want to claim I know how to do proper benchmarks, but I do know (from experience) it is easy to not do it properly :)
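      Roughly what I have in mind, at minimum, is timing the two parts separately (sketch; big_file.csv and the column names are made up):

      import time
      import polars as pl

      t0 = time.perf_counter()
      df = pl.read_csv("big_file.csv")                        # I/O-bound part
      t1 = time.perf_counter()
      out = df.group_by("key").agg(pl.col("value").mean())    # CPU-bound part
      t2 = time.perf_counter()
      print(f"read: {t1 - t0:.3f}s, groupby: {t2 - t1:.3f}s")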

  • @ArnabAnimeshDas
    @ArnabAnimeshDas 2 years ago +1

    I would import another plotting library which produces a better plot anyways.

    • @robmulla
      @robmulla  2 years ago

      Yep, that's totally reasonable. Thanks for watching.

    • @ArnabAnimeshDas
      @ArnabAnimeshDas 2 years ago +1

      @@robmulla also you can convert polars dataframe to pandas if you want to

  • @张世濠-j8e
    @张世濠-j8e 2 years ago +2

    Somehow it's very similar to Spark on AWS Glue?

    • @robmulla
      @robmulla  2 years ago +1

      Yes, very similar but I think polars is intended for a single machine vs. spark which can be distributed across nodes.

  • @EircWong
    @EircWong 2 years ago +1

    Noise at 3:29, for about 10 seconds.

    • @robmulla
      @robmulla  2 years ago +1

      Yes! I noticed that. I forgot to put my phone further away from the mic. I tried to edit it out as much as possible. Hopefully it wasn't too distracting.

  • @praveenmogilipuri4524
    @praveenmogilipuri4524 2 years ago +1

    Hi, can anyone help me connect Polars to Snowflake? I can do it through pandas, but I don't want to use pandas.
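    I was thinking of going through Arrow, something like this (untested sketch; assumes snowflake-connector-python and its fetch_arrow_all(); the table name is made up). Would that be the right approach?

    import polars as pl
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="...", password="...", account="...", warehouse="...", database="..."
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM my_table")
    df = pl.from_arrow(cur.fetch_arrow_all())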

    • @robmulla
      @robmulla  2 years ago

      I’ve never done anything like that before but maybe others will know how.

  • @fredgavin
    @fredgavin 2 years ago +2

    Tried Polars multiple times, and felt that it was too verbose. Just cannot give up R's data.table, which is the best data manipulation package in the data science world, no competitor at all.

    • @robmulla
      @robmulla  2 years ago

      Yea. Definitely more verbose than pandas. I haven’t used R in years but don’t remember it ever being the fastest.

  • @JayRodge
    @JayRodge 2 years ago +1

    Have you tried RAPIDS cuDF?

    • @robmulla
      @robmulla  2 years ago

      A little bit. It can be really fast but requires that your data is small enough to fit into your GPU memory.

  • @AyahuascaDataScientist
    @AyahuascaDataScientist 1 year ago

    Polars doesn’t have a .info() method? I can’t use it…

  • @XavierSoriaPoma
    @XavierSoriaPoma 2 years ago +1

    So why should we use polars instead of pandas?

    • @robmulla
      @robmulla  2 years ago +1

      Did you watch the video? 😂 speed is the main reason.

    • @XavierSoriaPoma
      @XavierSoriaPoma 2 years ago +1

      @@robmulla Yeah, but I'm still not convinced. It's like TensorFlow or PyTorch: they are not as fast as Flux, but we still use them in Python.

  • @Capsaicinophile
    @Capsaicinophile 2 years ago +1

    Unless you need to run your scripts over and over, I believe Polars cannot replace pandas, as it takes more effort to write a simple aggregation. Two seconds of faster execution is not worth 20 seconds of writing a line for every aggregation column and giving it an alias.
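    For example, compare (rough sketch; group_by vs groupby depends on the Polars version):

    import pandas as pd
    import polars as pl

    pdf = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
    df = pl.from_pandas(pdf)

    # pandas: one short line
    pdf.groupby("key")["val"].mean()

    # polars: an expression and an alias per aggregated column
    df.group_by("key").agg(pl.col("val").mean().alias("val_mean"))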

    • @robmulla
      @robmulla  2 years ago +1

      Yea. For quick scripts on small data and EDA, I’m sticking with pandas.

  • @rhard007
    @rhard007 2 years ago

    Is it not possible to use Matplotlib or Seaborn with Polars?

    • @robmulla
      @robmulla  2 years ago

      It probably is possible. It's just not built into the dataframe as methods like it is in pandas. Just one additional step or you can convert the final data to pandas after processing.

  • @mishmohd
    @mishmohd 2 years ago +1

    Can we suggest they change the name to Polaris?

    • @robmulla
      @robmulla  2 years ago

      Why do you suggest that?

  • @jay_wright_thats_right
    @jay_wright_thats_right 3 months ago

    Orders of magnitude faster? What does that even mean?

  • @rahulrjb
    @rahulrjb 1 year ago

    Very PySpark-like syntax.

  • @hanabimock5193
    @hanabimock5193 2 years ago +1

    I already see books and videos about Polars, the same as with pandas. It's like, come on, who needs a book for pandas? Are you kidding me?

    • @robmulla
      @robmulla  2 years ago

      Why do you dislike the fact that there are books about it? Honestly curious. Thanks for watching!

  • @leonidgrishenkov
    @leonidgrishenkov 2 years ago +1

    In some cases Polars syntax seems like PySpark

    • @robmulla
      @robmulla  2 years ago

      I've been hearing that a lot :D

    • @leonidgrishenkov
      @leonidgrishenkov 2 years ago +1

      @@robmulla ahaha sorry, I’m just a captain obvious 😂

    • @robmulla
      @robmulla  2 years ago

      @@leonidgrishenkov No it's a good point that I didn't realize until people pointed it out. I personally don't use pyspark a ton. Thanks for watching.

  • @ibekweobinna3514
    @ibekweobinna3514 2 years ago +1

    Rob, can I add you to my website as one of the best data science tutors? Man, you are good. But funny enough, I was still learning pandas, and then boom, along came Polars.

    • @robmulla
      @robmulla  2 years ago

      Thanks Ibekwe. Never stop learning!

  • @rolandheinze7182
    @rolandheinze7182 1 year ago

    Polars syntax seems very similar to pyspark, and in my opinion therefore hurts readability vs pandas

  • @richardbennett4365
    @richardbennett4365 2 years ago +1

    It is a problem with the people who use pandas: they by and large don't know about Polars. But why? Either it's the Polars creator's fault for not promoting the product, or laziness by pandas users who just don't look for something better.
    Also, if one writes import polars as pd, then one doesn't need to rewrite code written for pandas. Or one can import polars as po. I never understood why people import this package as pl. That would be for a package called plank, like the dock replacement.

    • @robmulla
      @robmulla  2 years ago

      Importing as pl makes the most sense to me and it’s what their docs recommend.

  • @AWest-ns3dl
    @AWest-ns3dl 2 years ago +1

    Polars syntax is similar to spark

    • @robmulla
      @robmulla  2 years ago +1

      I’ve been hearing that 😃

  • @BillyT83
    @BillyT83 2 years ago +1

    So... Pandas + Dask = Polars?

    • @robmulla
      @robmulla  2 years ago

      Kinda… but it's really just its own thing.

  • @whitebai6367
    @whitebai6367 2 years ago +1

    Okay, I'd like to use rust directly.

    • @robmulla
      @robmulla  2 years ago

      You can do it! Polars has a rust API too. Try it out and let me know what you think.

  • @commonsense1019
    @commonsense1019 2 years ago +1

    Well, the core of pandas could also be rewritten in Rust, no big deal.

    • @robmulla
      @robmulla  2 years ago +1

      It can. But will it?

  • @NickWindham
    @NickWindham 1 year ago +1

    Just use Julia instead of Python. Then you can do all this at speeds similar to Rust, in one language with even simpler syntax than Python.

    • @robmulla
      @robmulla  1 year ago

      Oh really? I haven’t had a chance to need to use Julia but I know it’s popular to use with spark.

  • @cradleofrelaxation6473
    @cradleofrelaxation6473 2 years ago +1

    Is it just me, or is the syntax a bit more complicated than pandas whenever they differ?!

    • @robmulla
      @robmulla  2 years ago

      Yes. I agree, it ends up being more verbose.

  • @richardbennett4365
    @richardbennett4365 2 years ago +1

    He said 15, but he wrote 10 at 7min 05s.

  • @ErikS-
    @ErikS- 2 years ago +1

    Just take a huge amount of RAM.
    I did that also...

    • @robmulla
      @robmulla  1 year ago

      I used Polars on a live stream and crashed my computer because it ate all my memory. There is a way to limit the amount it uses, I think.
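      If I remember right, the lazy streaming mode is the main knob, something like this (untested sketch; the file and column names are made up):

      import polars as pl

      out = (
          pl.scan_parquet("big_data.parquet")
          .group_by("key")
          .agg(pl.col("value").sum())
          .collect(streaming=True)     # process in chunks instead of all at once
      )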

  • @nitinkumar29
    @nitinkumar29 2 years ago +1

    I will let it mature before dealing with this.

    • @robmulla
      @robmulla  2 years ago

      That’s a fair approach. Adopting things too early can be problematic.

  • @ryanwhite7887
    @ryanwhite7887 1 year ago +1

    At 6:59 in the video, you can clearly hear him or her say "fifteen", but he or she types a 10 and continues without acknowledging his or her mistake. This is the sign of unambiguous processing and clearly his or her words can only be taken at face value. This has totally discredited all tutorials produced by this channel and I (they/them) will be withdrawing the like that I (they/them) had previously awarded the video.

  • @richardbennett4365
    @richardbennett4365 2 years ago +1

    What??? Polars is supposed to give the same result as pandas. Duh. Polars is a pandas replacement.