Dask in 15 Minutes | Machine Learning & Data Science Open-source Spotlight #5
- Published 21 Jul 2024
- Should you use Dask or PySpark for Big Data? 🤔
Dask is a flexible library for parallel computing in Python.
In this video I give a tutorial on how to use Dask for parallel computing, handling Big Data and integration with Deep Learning frameworks.
I compare Dask to PySpark and list the relative advantages I see of choosing Dask as your primary choice for Big Data handling.
Link to Notebook:
nbviewer.jupyter.org/github/d...
With these "Machine Learning & Data Science Open Source Spotlight" weekly videos, my objective is to introduce many game-changing libraries, which I believe many people can benefit from.
I would love to hear your feedback!
Did this video teach you something new?
Are there any open-source libraries you think deserve a spotlight?
Let me know in the comments! 👇🏻
Good job man, I'm starting with Dask and I'm excited about its capabilities! Thanks a lot
Great video! Great use of examples and explained in a way that made Dask very approachable. Thanks!
Really great Dask introduction and the explanation is so easy to understand. That was useful. Thank you!
Thanks for a great introductory video on Dask!
Really liked how you showed how to pipeline the data to keras! None of the other Dask videos I've seen show those next steps, it's just pandas stuff
Very nice video!!!! Thank you so much!!! Please make more hands-on videos like this! You explain things very clearly and helpfully!!!!
Hi Dan, this was a great introductory video. I am learning Dask and this was very helpful.
Great explanations for beginners!! Thanks for this...
Good work sir, your video has helped me to get started with Dask. Thank you very much.
Many thanks Dan. Good job, sir!
Very helpful in understanding Dask, I think I will start using it on my projects.
Very nice explanation! Thanks!
I really appreciate the batch-on-the-fly example with keras.
Really good explanation of working with Dask :)
Thanks for this super informative video!
Thank you so much man... you saved a project... 🙂🙂❤❤🙏🙏
Thank you so much. This video is really helpful to me.
Great intro. Thank you
Great explanation. It was not long, and it was very interesting.
Literally man, you saved me. Thanks a ton
Awesome Video. Thanks a ton !
amazing. Very interesting theme.
Man, great video, thank you!
Great video! Thank you man 👍
Thanks man! very informative
Fantastic video! Keep it up!
very good ! thank you ! :)
Excellent!
Amazing video!
thanks man you made my morning
Chapeau, well explained and healthy usage of memes!
awesome intro
Great. I would also like to see more on DASK and Deep Learning. How exactly would this generator be used in pytorch? Instead of the DataLoader. Thanks for the video(s)
Great intro of dask for a pandas guy like me...
Thanks man. It's really hard to find some information about dask
Hi 😀 Dan.
Thank you for this video.
Do you have an example which uses the apply() function?
I want to create a new column based on a data transformation.
Thank you!
Great
Hello. Can I use Dask with NumPy without using "def"s? Thank you.
Very well explained! Thank you.
Very happy to hear!
If there's any topic in particular which interests you, feel free to suggest it!
@@danbochman Maybe how to build custom UDFs like in Spark?
@@NitinPasumarthy hmm to be honest I don't know much about this topic. But I just released a video about visualizing big data with Datashader + Bokeh (I remember you've asked for that before):
www.linkedin.com/posts/danbochman_datashader-in-15-minutes-machine-learning-activity-6639434661288263680-g62W
Hope you'll like it!
This is really great
I've added dask.delayed to some functions. When I visualize it, there are several parallel tasks planned, but my CPU does not seem to be affected (it's only using a small percentage of it).
This was awesome, more tutorials for Dask sir please!!
Thank you! Glad you liked it. What kind of Dask tutorials specifically will help you?
@@danbochman Data preprocessing; not sure if dummy encoding can be done through Dask.
Great video, looking forward to exploring more of your content! If you don’t mind, could you share any details about your jupyter notebook setup (extensions, etc)? It looks great.
Hey Jodie, really glad you liked it!
It's actually not a Jupyter Notebook, it's Google Colaboratory, which is essentially very similar, but runs on Google's servers and not your local machine.
I highly recommend it if you're doing work that is not hardware/storage demanding.
Amazing. Please expand and create a section on Dask for Machine Learning.
Thank you! Dask's machine learning support got even better since I've made this video.
The Dask-ML package mimics the scikit-learn API and model selection, and the new SciKeras package wraps Keras models in the scikit-learn API, so the transition is pretty seamless.
Is there anything specific you wish you could have more guidance on?
4:12, visualize(): where can I get documentation? I tried to use it with sorted() and it did not work.
Many thanks for this excellent video! It is really clear and helpful!
I just have one question, I tried to run the notebook, and ran pretty well after some minor updates. Just the last line I was not able to make it run:
# never run out-of-memory while training
"model.fit_generator(generator=dask_data_generator(df_train), steps_per_epoch=100)"
Gives me an error message:
InvalidArgumentError: Graph execution error:
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Traceback (most recent call last):
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
[[{{node PyFunc}}]]
[[IteratorGetNext]] [Op:__inference_train_function_506]
Any recommendation on how I should modify it to make it run?
Thanks
AG
I liked it Dan. I wan to try how it works with Scikit Learn and all the models.
For Scikit-learn and other popular models such as XGBoost, it's even easier! Everything is already implemented in the dask-ml package:
ml.dask.org/
Many thanks. Now I understand why the file was not read
How do you schedule a Dask task? For example, how would you set a script to run every day at 10:00 with Dask?
Hi sir, for Dask to work on my laptop, do I need to have more than 1 core? What if I only have 1 core on my laptop and no other nodes to work with? Will Dask still be helpful for reading a CSV file that has millions of rows, and help me speed up the process?
Hello dendi,
If you only have 1 core, you won't be able to speed up performance with parallelization, as you don't have the ability to open different processes/threads. However, you would still be more memory efficient if you have a huge dataset on a limited RAM.
Dask can still help you work with data that is bigger or close to your RAM size (e.g. 4GB-8GB for most laptops), by only fetching what is relevant for the current computation in-memory.
Would it still work if the fraction of data you're sampling for the model is still larger than your memory?
Sorry for the late reply, this comment didn't pop up in my notifications...
If a fraction of the data is larger than your memory, then no, it can't handle data bigger than your in-memory capabilities.
It just means that fraction should be fractioned even more :)
Great video! Thank you for sharing. But I think your code has some errors in the machine learning with Dask part.
There is no X in the code (model.add(..., input_dim=X.shape[1], ... )
And when I train with model.fit_generator, TensorFlow says model.fit_generator is deprecated, and finally it displays an error: AttributeError: 'tuple' object has no attribute 'rank'
Hey!
Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process, it probably didn't pop a message to me because X was still in my notebook workspace.
You can either change df_train to X, or change X to df_train.
Just be consistent and it should work!
Great video and explanation! Thank you very much for it! It is really helpful!
I tried to run the notebook, and it ran pretty well after some minor updates. I just had problems running the last part, "never run out-of-memory while training"; it seems the generator or steps-per-epoch part is giving some problem I can't figure out how to solve.
Any possible suggestion on how to fix the code?
Thanks!
InvalidArgumentError: Graph execution error:
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Traceback (most recent call last):
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Interesting, I have the same problem in the last part of the notebook. Seems it is related to the IDE; it needs an update.
It's really helpful for me, thank you! May I include a link to your channel in my notebook on Kaggle?
Very happy to hear that you've found the video helpful!
(About link) Sure! The whole purpose of these videos is to share them ;)
Hello! Thank you for your tutorial. I downloaded and ran your notebook; at the Keras steps I'm getting an ('X' is not defined) error. I can't see where it was created either. Any ideas on how I can fix this to run it?
Hey Bilal!
Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process, it probably didn't pop a message to me because X was still in my notebook workspace.
You can either change df_train to X, or change X to df_train.
Just be consistent and it should work!
@@danbochman Great! Thank you, I will try that. Have you been using Dask for everything now, or do you still use pandas and NumPy?
@@BilalAslamIsAwesome Hope it worked for you! Please let me know if it didn't.
Yes, I use Dask for mostly anything. Some exceptions:
1. Pandas Profiling - A great package (I have an old video on it) for EDA, which I ALWAYS use when I first approach a new dataset, but it doesn't play well with Dask.
2. Complex (interactive) Visualizations - My favorite package for this is Plotly, which doesn't synergize well with Dask. If plotting all the data points is a must, then I will use Dask + Datashader + Bokeh.
Hey thanks for the awesome video and the explanation. I have a use case. I am trying to build a Deep learning tensorflow model for time series forecasting. For this I need to use multinode cluster for parallelization across all nodes. I have a single function which can take data for any 1 store and predict for that store.
Likewise, I need to do predictions for 2 lakh (200,000) outlets. How can I use Dask to parallelize this huge task across all nodes of my cluster? Can you please guide me? Thanks in advance.
Hi Madhu,
Sorry, wish I could help, but node cluster parallelization tasks depend more on the orchestration framework itself (e.g. Kubernetes) than on Dask.
You have the dask.distributed module (distributed.dask.org/en/stable/), but handling the multi-worker infrastructure is where the real challenge lies...
Great content! One question: isn't it strange to use a function in this form: function(do_something)(on_variable) instead of function(do_something(on_variable))?
Hey Mark! Thanks.
I understand what you mean, but when a function returns a function this makes sense, as opposed to a function which outputs the input to the next function.
delayed(fn) returns a new function (e.g. "delayed_fn"), and this new function is then called regularly: delayed_fn(x). So it's delayed(fn)(x).
All decorators are functions which return callable functions. In this example they are used quite unnaturally because I wanted to keep both versions of the functions.
Hope the explanation helped!
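To make the delayed(fn)(x) pattern concrete, a tiny sketch:

```python
import dask

def inc(x):
    return x + 1

# dask.delayed(inc) returns a new, lazy version of inc (the
# "delayed_fn" above); calling it builds a task, runs nothing yet.
lazy = dask.delayed(inc)(1)
result = lazy.compute()  # only now does inc actually run
print(result)
```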
Hey Dan, I really like the video! I have a question regarding the for-loop example. Why did it only save half of the time? To my understanding, delayed(inc) took about 1s because of the parallel computing. What else took time to operate?
Hey Lyann,
There's a limit to how much you can parallelize.
It depends on how many cores you have and how many threads can be opened within each process, and also on how much of your computer's resources are available (e.g. whether several Chrome tabs are open).
@@danbochman Thank you!
Thanks sir... very well explained and covered a wide range of topics!!
btw what are chunks in dask.array, I find it a bit hard to understand...
Can you explain?
Hey! Sorry for the delay, for some reason YouTube doesn't notify me about comments on my videos... (although it's set to do so)
If your question is still relevant,
Chunks are basically, as the name suggests, chunks of your big arrays split into smaller arrays (in Dask -> Numpy ndarrays).
There exists a certain sweet spot of computational efficiency (parallelization potential) and memory capability (in-memory computations) for which a chunk size is optimal.
Given your number of CPU cores, available RAM and file size, Dask finds the optimal way to split your data into chunks that your machine can process most efficiently.
You might ask: "Isn't fetching only a certain number of data rows each time sufficient?"
Well, not exactly; the bottleneck may be at the column dimension (e.g. 10 rows, 2M columns), so row operations should split each row into chunks of columns for more efficient computation.
Hope it helped clear some concepts for you!
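A small sketch of chunks in dask.array (the sizes here are arbitrary):

```python
import dask.array as da

# A 1000x1000 array stored as a 4x4 grid of 250x250 NumPy blocks.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.chunks)  # chunk sizes along each dimension

# The sum is computed per chunk in parallel, then combined.
print(x.sum().compute())
```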
@@danbochman It helped a lot, thanks sir !
Thanks !
Nice tutorial, thanks 🙌🏼
Can we use any ML library with it?
Like scikit-learn, PyCaret, etc.
To the best of my knowledge, not directly.
You can use any ML/DL framework after you persist your Dask array to memory. However, Dask has its own dask-ml package, in which contributors have migrated most of the common use cases from scikit-learn and PyCaret.
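A minimal sketch of that persist-then-train handoff, here with toy data and scikit-learn as the stand-in library (any in-memory framework works the same way once the Dask array is materialized):

```python
import dask.array as da
from sklearn.linear_model import LogisticRegression

# Toy features/labels as Dask arrays (hypothetical data).
X = da.random.random((1000, 5), chunks=(250, 5))
y = (X[:, 0] > 0.5).astype(int)

# Materialize to NumPy, then hand off to any in-memory ML library.
X_np, y_np = X.compute(), y.compute()
clf = LogisticRegression().fit(X_np, y_np)
print(clf.score(X_np, y_np))
```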
@@danbochman Can you make a simple use case 🙂:
1) Integrate Dask with PyCaret, it would be so helpful.
Because I have used all the approaches like Ray, Modin, RAPIDS and PySpark, but all have some limitations.
If you help integrate Dask with PyCaret, it would help the open-source community a lot :)
Looking forward to hearing from you :)
Thanks
Came to handle a 5-gig CSV, stayed for the capybara with friends.
How do I handle it if my data is .xlsx?
To be honest, if you're working with a huge .xlsx file and best performance is necessary, then I would recommend rewriting the .xlsx to .csvs with:
import csv  # needed for csv.writer below
from openpyxl import load_workbook

# read_only streams the workbook; data_only reads cached cell values
# instead of formulas (load_workbook has no values_only parameter)
wb = load_workbook(filename="data.xlsx", read_only=True, data_only=True)
for name in wb.sheetnames:
    ws = wb[name]
    with open("output/%s.csv" % name.replace(" ", ""), "w", newline="") as f:
        c = csv.writer(f)
        for r in ws.rows:
            c.writerow([cell.value for cell in r])
If this is not an option for you, you can call Dask's delayed decorator on pandas' read_excel function like so:
delayed_ddf = dask.delayed(pd.read_excel)("data.xlsx", sheet_name=0)
ddf = dd.from_delayed(delayed_ddf)
But you won't see a huge performance increase.
Hope it helped!