Dask in 15 Minutes | Machine Learning & Data Science Open-source Spotlight #5
- Published 21 Jul 2024
- Should you use Dask or PySpark for Big Data? 🤔
Dask is a flexible library for parallel computing in Python.
In this video I give a tutorial on how to use Dask for parallel computing, handling Big Data and integration with Deep Learning frameworks.
I compare Dask to PySpark and list the relative advantages I see of choosing Dask as your primary choice for Big Data handling.
Link to Notebook:
nbviewer.jupyter.org/github/d...
With these "Machine Learning & Data Science Open Source Spotlight" weekly videos, my objective is to introduce many game-changing libraries, which I believe many people can benefit from.
I would love to hear your feedback!
Did this video teach you something new?
Are there any open-source libraries you think deserve a spotlight?
Let me know in the comments! 👇🏻
Good job man, I'm starting with Dask and I'm excited about its capabilities! Thanks a lot
Great video! Great use of examples and explained in a way that made Dask very approachable. Thanks!
Really great Dask introduction and the explanation is so easy to understand. That was useful. Thank you!
Thanks for a great introductory video on Dask!
Really liked how you showed how to pipeline the data to keras! None of the other Dask videos I've seen show those next steps, it's just pandas stuff
Very nice video!!!! Thank you so much!!! Please make more hands-on videos like this! You explain things very clearly and helpfully!!!!
Hi Dan, this was a great introductory video. I am learning Dask and this was very helpful.
Great explanations for beginners!! Thanks for this...
Good work sir, your video has helped me to get started with Dask. Thank you very much.
Many thanks Dan. Good job, sir!
Very helpful in understanding Dask, I think I will start using it on my projects.
Very nice explanation! Thanks!
I really appreciate the batch-on-the-fly example with keras.
Really good explanation of working with Dask :)
Thanks for this super informative video!
Thank you so much man... you saved a project... 🙂🙂❤❤🙏🙏
Thank you so much. This video is really helpful to me.
Great intro. Thank you
Great explanation. It was not long, and it was very interesting.
Literally man, you saved me. Thanks a ton
Awesome Video. Thanks a ton !
amazing. Very interesting theme.
Man, great video, thank you!
Great video! Thank you man 👍
Thanks man! very informative
Fantastic video! Keep it up!
very good ! thank you ! :)
Excellent!
Amazing video!
thanks man you made my morning
Chapeau, well explained and healthy usage of memes!
awesome intro
Great. I would also like to see more on DASK and Deep Learning. How exactly would this generator be used in pytorch? Instead of the DataLoader. Thanks for the video(s)
Great intro of dask for a pandas guy like me...
Thanks man. It's really hard to find some information about dask
Hi 😀 Dan.
Thank you for this video.
Do you have an example which uses the apply() function?
I want to create a new column based on a data transformation.
Thank you!
Great
Hello. Can I use Dask with NumPy without using "def"s? Thank you.
Very well explained! Thank you.
Very happy to hear!
If there's any topic in particular which interests you, feel free to suggest it!
@@danbochman Maybe how to build custom UDFs like in Spark?
@@NitinPasumarthy hmm to be honest I don't know much about this topic. But I just released a video about visualizing big data with Datashader + Bokeh (I remember you've asked for that before):
www.linkedin.com/posts/danbochman_datashader-in-15-minutes-machine-learning-activity-6639434661288263680-g62W
Hope you'll like it!
This is really great
I've added dask.delayed to some functions. When I visualize it, there are several parallel tasks planned, but my CPU does not seem to be affected (it's only using a small percentage of it).
This was awesome, more tutorials for Dask sir please!!
Thank you! Glad you liked it. What kind of Dask tutorials specifically will help you?
@@danbochman Data preprocessing; not sure if dummy encoding can be done through Dask.
Great video, looking forward to exploring more of your content! If you don’t mind, could you share any details about your jupyter notebook setup (extensions, etc)? It looks great.
Hey Jodie, really glad you liked it!
It's actually not a Jupyter Notebook, it's Google Colaboratory, which is essentially very similar, but runs on Google's servers and not your local machine.
I highly recommend it if you're doing work that is not hardware/storage demanding.
Amazing. Please expand and create a section on Dask for Machine Learning.
Thank you! Dask's machine learning support got even better since I've made this video.
The Dask-ML package mimics the scikit-learn API and model selection, and the new SciKeras package wraps Keras models in the scikit-learn API, so the transition is pretty seamless.
Is there anything specific you wish you could have more guidance on?
4:12, visualize(): where can I get documentation? I tried to use it with sorted() and it did not work.
Many thanks for this excellent video! It is really clear and helpful!
I just have one question, I tried to run the notebook, and ran pretty well after some minor updates. Just the last line I was not able to make it run:
# never run out-of-memory while training
"model.fit_generator(generator=dask_data_generator(df_train), steps_per_epoch=100)"
Gives me an error message:
InvalidArgumentError: Graph execution error:
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Traceback (most recent call last):
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
[[{{node PyFunc}}]]
[[IteratorGetNext]] [Op:__inference_train_function_506]
Any recommendation on how I should modify it to make it run?
Thanks
AG
I liked it Dan. I wan to try how it works with Scikit Learn and all the models.
For Scikit-learn and other popular models such as XGBoost, it's even easier! Everything is already implemented in the dask-ml package:
ml.dask.org/
Many thanks. Now I understand why the file was not read
How do you schedule a Dask task? For example, how would you set a script to run every day at 10:00 with Dask?
Hi sir, for Dask to work on my laptop, do I need to have more than 1 core? What if I only have 1 core on my laptop and no other nodes to work with? Will Dask still be helpful for reading a CSV file that has millions of rows, and help me speed up the process?
Hello dendi,
If you only have 1 core, you won't be able to speed up performance with parallelization, as you don't have the ability to open different processes/threads. However, you would still be more memory efficient if you have a huge dataset on a limited RAM.
Dask can still help you work with data that is bigger or close to your RAM size (e.g. 4GB-8GB for most laptops), by only fetching what is relevant for the current computation in-memory.
Would it still work if the fraction of data you're sampling for the model is still larger than your memory?
Sorry for the late reply, this comment didn't pop up in my notifications...
If a fraction of the data is larger than your memory, then no, it can't handle data bigger than your in-memory capabilities.
It just means that fraction should be fractioned even more :)
Great video! Thank you for sharing. But I think your code has some errors in the machine learning with Dask part.
There is no X in the code (model.add(..., input_dim=X.shape[1], ... )
And when I train with model.fit_generator, TensorFlow says model.fit_generator is deprecated, and finally it displays an error: AttributeError: 'tuple' object has no attribute 'rank'
Hey!
Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process, it probably didn't pop a message to me because X was still in my notebook workspace.
You can either change df_train to X, or change X to df_train.
Just be consistent and it should work!
Great video and explanation! Thank you very much for it! It is really helpful!
I tried to run the notebook, and it ran pretty well after some minor updates. I just had problems running the last part, "never run out-of-memory while training"; it seems the generator or steps-per-epoch part is giving some problem I can't figure out how to solve.
Any possible suggestion on how to fix the code?
Thanks!
InvalidArgumentError: Graph execution error:
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Traceback (most recent call last):
TypeError: `generator` yielded an element of shape (26119,) where an element of shape (None, None) was expected.
Interesting, I have the same problem in the last part of the notebook. Seems it is related to the IDE; it needs an update.
It's really helpful for me, thank you! May I include a link to your channel in my notebook on Kaggle?
Very happy to hear that you've found the video helpful!
(About link) Sure! The whole purpose of these videos is to share them ;)
Hello! Thank you for your tutorial. I downloaded and ran your notebook; at the Keras steps I'm getting an ('X' is not defined) error. I can't see where it was created either. Any ideas on how I can fix this to run it?
Hey Bilal!
Whoops, I must've changed the variable name X to df_train and wasn't consistent in the process, it probably didn't pop a message to me because X was still in my notebook workspace.
You can either change df_train to X, or change X to df_train.
Just be consistent and it should work!
@@danbochman Great! Thank you, I will try that. Have you been using Dask for everything now, or do you still use pandas and NumPy?
@@BilalAslamIsAwesome Hope it worked for you! Please let me know if it didn't.
Yes, I use Dask for mostly anything. Some exceptions:
1. Pandas Profiling - A great package (I have an old video on it) for EDA, which I ALWAYS use when I first approach a new dataset, but it doesn't play well with Dask.
2. Complex (interactive) Visualizations - My favorite package for this is Plotly, which doesn't synergize well with Dask. If plotting all the data points is a must, then I will use Dask + Datashader + Bokeh.
Hey thanks for the awesome video and the explanation. I have a use case. I am trying to build a Deep learning tensorflow model for time series forecasting. For this I need to use multinode cluster for parallelization across all nodes. I have a single function which can take data for any 1 store and predict for that store.
Likewise, I need to do predictions for 2 lakh (200,000) outlets. How can I use Dask to parallelize this huge task across all nodes of my cluster? Can you please guide me? Thanks in advance.
Hi Madhu,
Sorry, wish I could help, but node cluster parallelization tasks depend more on the orchestration framework itself (e.g. Kubernetes) than on Dask.
You have the dask.distributed module (distributed.dask.org/en/stable/), but handling the multi-worker infrastructure is where the real challenge lies...
Great content! One question: isn't it strange to use a function in this form: function(do_something)(on_variable) instead of function(do_something(on_variable))?
Hey Mark! Thanks.
I understand what you mean, but when a function returns a function this makes sense, as opposed to a function which outputs the input to the next function.
delayed(fn) returns a new function (e.g. "delayed_fn"), and this new function is then called regularly: delayed_fn(x). So it's delayed(fn)(x).
All decorators are functions which return callable functions. In this example they are used quite unnaturally because I wanted to keep both versions of the functions.
Hope the explanation helped!
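To make the delayed(fn)(x) pattern concrete, a tiny sketch:

```python
import dask

def inc(x):
    return x + 1

# dask.delayed(inc) returns a new, lazy version of inc (the
# "delayed_fn" above); calling it builds a task, runs nothing yet.
lazy = dask.delayed(inc)(1)
result = lazy.compute()  # only now does inc actually run
print(result)
```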
Hey Dan, I really like the video! I have a question regarding the for-loop example. Why did it only save half of the time? To my understanding, delayed(inc) took about 1s because of the parallel computing. What else took time to operate?
Hey Lyann,
There's a limit to how much you can parallelize.
It depends on how many cores you have and how many threads can be opened within each process, and also on how much of your computer's resources are available (e.g. whether several Chrome tabs are open).
@@danbochman Thank you!
Thanks sir... very well explained and covered a wide range of topics!!
btw what are chunks in dask.array, I find it a bit hard to understand...
Can you explain?
Hey! Sorry for the delay, for some reason YouTube doesn't notify me about comments on my videos... (although it's set to do so)
If your question is still relevant,
Chunks are basically, as the name suggests, chunks of your big arrays split into smaller arrays (in Dask -> Numpy ndarrays).
There exists a certain sweet spot of computational efficiency (parallelization potential) and memory capability (in-memory computations) for which a chunk size is optimal.
Given your number of CPU cores, available RAM and file size, Dask finds the optimal way to split your data into chunks that your machine can process most efficiently.
You might ask: "Isn't fetching only a certain number of data rows each time sufficient?"
Well, not exactly; the bottleneck may be at the column dimension (e.g. 10 rows, 2M columns), so row operations should split each row into chunks of columns for more efficient computation.
Hope it helped clear some concepts for you!
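A small sketch of chunks in dask.array (the sizes here are arbitrary):

```python
import dask.array as da

# A 1000x1000 array stored as a 4x4 grid of 250x250 NumPy blocks.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.chunks)  # chunk sizes along each dimension

# The sum is computed per chunk in parallel, then combined.
print(x.sum().compute())
```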
@@danbochman It helped a lot, thanks sir !
Thanks !
Nice tutorial, thanks 🙌🏼
Can we use any ML library with it?
Like scikit-learn, PyCaret, etc.
To the best of my knowledge, not directly.
You can use any ML/DL framework after you persist your Dask array to memory. However, Dask has its own dask-ml package, in which contributors have migrated most of the common use cases from scikit-learn and PyCaret.
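A minimal sketch of that persist-then-train handoff, here with toy data and scikit-learn as the stand-in library (any in-memory framework works the same way once the Dask array is materialized):

```python
import dask.array as da
from sklearn.linear_model import LogisticRegression

# Toy features/labels as Dask arrays (hypothetical data).
X = da.random.random((1000, 5), chunks=(250, 5))
y = (X[:, 0] > 0.5).astype(int)

# Materialize to NumPy, then hand off to any in-memory ML library.
X_np, y_np = X.compute(), y.compute()
clf = LogisticRegression().fit(X_np, y_np)
print(clf.score(X_np, y_np))
```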
@@danbochman Can you make a simple use case 🙂:
1) Integrate Dask with PyCaret, it would be so helpful.
Because I have used all the approaches like Ray, Modin, RAPIDS and PySpark, but all have some limitations.
If you help integrate Dask with PyCaret, it would help the open-source community a lot :)
Looking forward to hearing from you :)
Thanks
Came to handle a 5-gig CSV, stayed for the capybara with friends.
How do I handle it if my data is .xlsx?
To be honest, if you're working with a huge .xlsx file and best performance is necessary, then I would recommend rewriting the .xlsx to .csvs with:
import csv  # needed for csv.writer below
from openpyxl import load_workbook

# read_only streams the workbook; data_only reads cached cell values
# instead of formulas (load_workbook has no values_only parameter)
wb = load_workbook(filename="data.xlsx", read_only=True, data_only=True)
for name in wb.sheetnames:
    ws = wb[name]
    with open("output/%s.csv" % name.replace(" ", ""), "w", newline="") as f:
        c = csv.writer(f)
        for r in ws.rows:
            c.writerow([cell.value for cell in r])
If this is not an option for you, you can call Dask's delayed decorator on pandas' read_excel function like so:
delayed_ddf = dask.delayed(pd.read_excel)("data.xlsx", sheet_name=0)
ddf = dd.from_delayed(delayed_ddf)
But you won't see a huge performance increase.
Hope it helped!