Coiled
  • 121
  • 111 909
AWS Batch is Kinda Hard to Use
We ask ChatGPT how to use AWS Batch to run a simple Hello world application from the terminal.
This is in contrast to other tools like Coiled Batch, which we think are easier to use.
Views: 156

Videos

Submit Job Scripts with Coiled Job Arrays
117 views · 1 month ago
HPC-style job scripts now available in the cloud. Coiled job arrays make it easy to submit a script to run many times on parallel hardware in the cloud. This video gives a basic walkthrough with simple code examples.
Coiled Overview
323 views · 2 months ago
Coiled makes it easy to use the cloud right from within your Python environment. It can help you run a single script in the cloud, or scale up to thousands of VMs, all with robust security and efficient techniques. This two-minute video gives a brief overview of what's possible. For more information, visit coiled.io or ua-cam.com/users/coiled
Schedule Python Jobs with Prefect and Coiled
541 views · 10 months ago
Prefect makes it easy to write production workflows in Python. Getting started on a laptop usually takes just a few minutes. Coiled makes it easy to deploy Prefect in the cloud. You might want to run a workflow, or specific task within a workflow, on the cloud because: - You want an always-on machine for long-running, or regularly scheduled, jobs - You want to run close to your cloud-hosted dat...
Churn Through Cloud Files in Parallel
171 views · 11 months ago
People often want to run the same function over many files. However, processing files in cloud storage is often slow and expensive due to transferring cloud data in and out of AWS/GCP/Azure. In this webinar recording we’ll show how to run this “same function on many files” pattern on the cloud with Coiled, so you can run existing code faster and cheaper with minimal changes. We’ll also highligh...
Analyzing the National Water Model with Xarray, Dask, and Coiled
512 views · 1 year ago
Mean weekly water table depth for US counties from 1979-2020. Water table depth fluctuates seasonally, decreasing with more precipitation in the winter and increasing with more periods of drought in the summer. 1m is optimal for many types of agriculture. Blog post: docs.coiled.io/blog/coiled-xarray.html Code: github.com/coiled/examples/tree/main/national-water-model
Dask DataFrame is Fast Now
1.3K views · 1 year ago
In this webinar, Patrick Höfler and Rick Zamora show how recent development efforts have driven performance improvements in Dask DataFrame. Key Moments 00:00 Intro 00:19 Dask DataFrame is fast now 02:06 Historical pain points 03:51 PyArrow-backed strings in Dask 06:04 Demo: PyArrow strings 08:53 Demo: Task-based shuffling is slow 11:11 Better performance with P2P shuffling 16:29 Sub-optimal que...
Spark, Dask, DuckDB, Polars: TPC-H Benchmarks at Scale
8K views · 1 year ago
We run the common TPC-H Benchmark suite at 10 GB, 100 GB, 1 TB, and 10 TB scale on the cloud and on a local machine and compare performance for common large dataframe libraries. No tool does universally well. We look at common bottlenecks and compare performance between the different systems. This talk was originally given at PyData NYC 2023. These results are preliminary, and come from only a couple ...
How do I Set Up Coiled?
442 views · 1 year ago
Set up Coiled to run Dask or other cloud processing APIs easily 1. Create an account 2. Register an API token 3. Connect to your cloud 00:00 Introduction 00:34 pip install coiled 00:51 Authenticate 01:25 Connect your Cloud 03:48 Add a Region 05:00 Hello, world! 06:25 Teams 07:11 Summary
Run Your Jupyter Notebooks in the Cloud
1.1K views · 1 year ago
When you're only processing 10-100GB of data, a hundred-worker cluster is probably overkill when a single, big VM will do. You can use Coiled notebooks to start a JupyterLab instance on any machine you’d like, whether that’s a better GPU or a single VM with hundreds of GBs of memory. Examples in our docs: docs.coiled.io/user_guide/usage/notebooks/index.html Get started with Coiled: coiled.io/st...
Coiled Overview
522 views · 1 year ago
Learn how to easily process data on the cloud with Coiled. This 15-minute video is an overview of many aspects of Coiled. For a more in-depth treatment, please consider the more topic-specific videos at youtube.com/@coiled 00:00 Introduction 01:14 API: CLI commands 02:41 API: Serverless Functions 03:40 API: Dask 06:25 API: Jupyter Notebooks 07:38 Management Dashboard 09:56 Architecture and Data Pri...
Run Python Scripts with Coiled Functions & Coiled Run
387 views · 1 year ago
Run a script or Python function in any cloud region on any hardware. Sometimes you don’t need a huge cluster for your workflows, and you just want to run your Python function on a VM in the cloud. In this webinar, we'll walk through these two APIs: Coiled Functions and Coiled Run. We'll see how to run a computation on a VM close to our data, train a PyTorch model on a GPU in the cloud, and scal...
Run Python Scripts in the Cloud with Coiled
955 views · 1 year ago
Sometimes you don’t need a huge cluster for your workflows, and you just want to run your Python function on a VM in the cloud. You might want to do this for a few reasons: you want a big machine; you want a GPU; you want to run close to your data; or you want to run the script many times while scaling out. With Coiled, you can run any Python function, script, or executable in your AWS or GCP account,...
How do I get my software onto cloud VMs? Automatic Package Synchronization with Coiled
173 views · 1 year ago
Getting your software onto cloud VMs is hard. Coiled makes it easy...mostly. This video talks about how Coiled manages software for Python development in the cloud, and methods to escape when things go wrong. More information available at docs.coiled.io/user_guide/software/ Blog posts: How many PEPs does it take to install a package? medium.com/coiled-hq/how-many-peps-does-it-take-to-install-a-...
Coiled Cluster Configuration
220 views · 1 year ago
Learn how to configure your Coiled resources, including selecting instance types, regions, and different hardware choices. Documentation: docs.coiled.io/user_guide/clusters/ More videos to help you set up Coiled: ua-cam.com/video/QXql9O8kSPk/v-deo.html ua-cam.com/video/ukkOJPF2URY/v-deo.html ua-cam.com/video/eXP-YuERvi4/v-deo.html Get started with Coiled for free: coiled.io/start
Jupyter Notebooks with Coiled
439 views · 1 year ago
Dask Futures Tutorial: Parallelize Python Code with Dask
2K views · 1 year ago
Dask DataFrames Tutorial: Best practices for larger-than-memory dataframes
2.7K views · 1 year ago
Databricks vs. Dask and Coiled
499 views · 1 year ago
Coiled Xarray Example
639 views · 1 year ago
Coiled Dashboard: Monitor Teams and Manage Costs Easily and Efficiently
221 views · 1 year ago
Dask + Pandas for Parallel ETL
1.3K views · 1 year ago
XGBoost and HyperParameter Optimization
939 views · 1 year ago
Dask Futures for General Parallelism
1K views · 1 year ago
Engineering a Technical Newsletter: A transparent analysis of the Coiled newsletter
59 views · 1 year ago
Six Coiled features for Dask users
467 views · 1 year ago
Dask Infrastructure with Coiled for Pangeo
398 views · 1 year ago
Dask on Single Machine with Coiled
446 views · 1 year ago
Dask and Optuna for Hyper Parameter Optimization
2.4K views · 1 year ago
Measuring the GIL | Does pandas release the GIL?
581 views · 2 years ago

COMMENTS

  • @RyPeck
    @RyPeck 25 days ago

    Hey - Great videos! Would be nice if they were in the proper order in the UA-cam playlist.

  • @FabioRBelotto
    @FabioRBelotto 2 months ago

    How would delayed work with pandas functions that are not available in Dask (e.g., json_normalize)?

  • @edzme
    @edzme 4 months ago

    thanks for making this, coiled seems to be what I'm looking for

  • @fida47
    @fida47 4 months ago

    can someone share dataset link? from where to download 10 csv files of nyc flights dataset?

  • @Andikan4U
    @Andikan4U 4 months ago

    Thank you

  • @FabioRBelotto
    @FabioRBelotto 5 months ago

    If I run Dask without importing the client, it does not work on many workers ?

  • @FabioRBelotto
    @FabioRBelotto 5 months ago

    The source was one only big parquet file ? Dask set partitions by itself ?

  • @FabioRBelotto
    @FabioRBelotto 5 months ago

    My main issue with dask is the lack of support of the community (very different from pandas!)

  • @richerite
    @richerite 6 months ago

    Great talk! What would you recommend for ingesting about 100-200GB of geospatial data on premise?

  • @mohitparwani4235
    @mohitparwani4235 7 months ago

    { "name": "CancelledError", "message": "('mul-floordiv-3770c7fe5e6231d62ed3d68e48276fbd', 0)", "stack": "--------------------------------------------------------------------------- CancelledError Traceback (most recent call last) File <timed eval>:2 File c:\\Users\\mohit.parwani\\.conda\\envs\\parApat\\Lib\\site-packages\\dask_expr\\_collection.py:476, in FrameBase.compute(self, fuse, **kwargs) 474 out = out.repartition(npartitions=1) 475 out = out.optimize(fuse=fuse) --> 476 return DaskMethodsMixin.compute(out, **kwargs) File c:\\Users\\mohit.parwani\\.conda\\envs\\parApat\\Lib\\site-packages\\dask\\base.py:375, in DaskMethodsMixin.compute(self, **kwargs) 351 def compute(self, **kwargs): 352 \"\"\"Compute this dask collection 353 354 This turns a lazy Dask collection into its in-memory equivalent. (...) 373 dask.compute 374 \"\"\" --> 375 (result,) = compute(self, traverse=False, **kwargs) 376 return result File c:\\Users\\mohit.parwani\\.conda\\envs\\parApat\\Lib\\site-packages\\dask\\base.py:661, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs) 658 postcomputes.append(x.__dask_postcompute__()) 660 with shorten_traceback(): --> 661 results = schedule(dsk, keys, **kwargs) 663 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)]) File c:\\Users\\mohit.parwani\\.conda\\envs\\parApat\\Lib\\site-packages\\distributed\\client.py:2235, in Client._gather(self, futures, errors, direct, local_worker) 2233 else: 2234 raise exception.with_traceback(traceback) -> 2235 raise exc 2236 if errors == \"skip\": 2237 bad_keys.add(key) CancelledError: ('mul-floordiv-3770c7fe5e6231d62ed3d68e48276fbd', 0)" } I'm getting this error when i use client can someone please help with any possible solution i definitely need that. please!

  • @as978
    @as978 7 months ago

    So happy to see this. Better late than never. Hopefully Dask gets the popularity it deserves and becomes a serious contender to Spark down the line.

  • @gemini_537
    @gemini_537 7 months ago

    Gemini 1.5 Pro: This video is an introduction to Dask DataFrames, covering when to use them, how to use them, and performance tips. The video explains that pandas is great for tabular data sets that fit into memory, but Dask is useful for working with data sets that are larger than your machine can handle. Dask can cut up your big data set into smaller bits and execute those smaller parts in parallel. Here are the key points covered in the video:
    * **When to use Dask DataFrames:** You should use Dask DataFrames if your data doesn't fit into memory and your computations are complex. Pandas might run into a memory error if the data is too large, but Dask can handle those types of large-scale computations comfortably.
    * **Dask DataFrames vs Pandas DataFrames:** Dask DataFrames are similar to Pandas DataFrames and implement a well-used portion of the Pandas API. This means that a lot of Dask DataFrames code will look and feel pretty familiar to Pandas users. However, there are some key differences. For instance, unlike Pandas DataFrames, Dask DataFrames are lazy, meaning they only create the task graph (a recipe or road map to the final result) and don't actually execute it until you explicitly tell Dask to do so by calling compute.
    * **Working with Partitions:** Dask DataFrames are cut up into small pieces called partitions, and each partition is just a Pandas DataFrame. This means you can perform Pandas operations on these partitions.
    * **Performance tips:** The video also covers performance tips, such as when to call compute. It is recommended to call compute once on several results when you want to combine their computations into a single task graph. Because the task graphs for these results are merged, Dask only needs to read the data from the CSV file once instead of twice.
    The video concludes by mentioning that this is module two of the introduction to Dask tutorial and the next module will cover processing array data with Dask Arrays.

  • @zapy422
    @zapy422 8 months ago

    How does this setup resolve dependencies for the Python code?

    • @MatthewRocklin
      @MatthewRocklin 8 months ago

      We scrape the local environment for package versions, move those to the target architecture, use mamba to solve and fill in any missing pieces, then we download the new packages on the fly onto each machine. It all happens seamlessly in the background. Users don't need to care about this detail (other than that it works)
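
The first step described in the reply above (scraping the local environment for package versions) can be roughly sketched with the standard library. This only illustrates the idea and is not Coiled's actual implementation: the helper name `snapshot_environment` is hypothetical, and the later steps (solving for the target architecture with mamba, installing packages on each machine) are not shown.

```python
from importlib.metadata import distributions
from typing import Dict


def snapshot_environment() -> Dict[str, str]:
    """Map each installed distribution name to its version (hypothetical helper)."""
    packages: Dict[str, str] = {}
    for dist in distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip distributions with unreadable metadata
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            packages[name.lower()] = version
    return packages


snapshot = snapshot_environment()

# Render the snapshot as pip-style pins, e.g. "somepackage==1.2.3".
pins = sorted(f"{name}=={version}" for name, version in snapshot.items())
print(f"{len(pins)} packages found")
```

A list of pins like this is the kind of input an environment solver could take to rebuild a matching environment elsewhere.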

  • @maksimhajiyev7857
    @maksimhajiyev7857 9 months ago

    The problem is that in fact RUST based tooling actually wins and all the paid promotions just suck . The actual reason why RUST based tooling is sort of suppressed is very simple , hyperscalers (big cloud tech) earn a lot of money and if things are faster there is no huge bills for your spark clusters 😊)) , I was playing with RUST and huge datasets myself without external benchmarks course I don t trust all this market shit .Rust based EDA is maybe witch kraft but this thing runs as beast . try yourself guys with a huge datasets .

  • @carlostph
    @carlostph 9 months ago

    When you say "now", from what version are we talking about? To future-proof the video.

  • @manojjoshi4321
    @manojjoshi4321 10 months ago

    It's a great introduction with very cool and easy to follow illustrations. Great job....!!

  • @kokizzu
    @kokizzu 10 months ago

    Clickhouse ftw

  • @giselleandreaulloadelarosa1869
    @giselleandreaulloadelarosa1869 11 months ago

    Would you please share a link to the github ?

  • @henrywittler5046
    @henrywittler5046 11 months ago

    Great work 🙂 Dask will help many people solve some of their computational data analysis issues

  • @snowaIker
    @snowaIker 11 months ago

    How does delayed get around the GIL?

  • @wayne7936
    @wayne7936 11 months ago

    This is such a clear, simple, yet extremely powerful introduction. Alright, you convinced me to try coiled again.

    • @Coiled
      @Coiled 11 months ago

      Achievement unlocked! If you tried out Coiled more than a year ago then it's definitely worth trying again. Admittedly, the product was kinda bad early on. Now it is quite delightful.

  • @ravishmahajan9314
    @ravishmahajan9314 11 months ago

    But DuckDB is good if your data fits on a single machine. The benchmarks show a different story when data is distributed. What about that?

  • @henrywittler5046
    @henrywittler5046 11 months ago

    Thanks for this tutorial and the other material at Dask and Coiled, will help heaps in a large data project 🙂

  • @taylorpaskett3703
    @taylorpaskett3703 1 year ago

    What software did you use for generating / displaying your plots? It looked really nice

    • @taylorpaskett3703
      @taylorpaskett3703 1 year ago

      Nevermind, if I just kept watching you showed the GitHub where it says ibis and altair. Thanks!

  • @randywilliams7696
    @randywilliams7696 1 year ago

    Great video! Recently switched from Dask to Duckdb on my ~1TB workloads, interesting to see some of the same issues I found brought up here. One gotcha I've found is that it is REALLY easy to blunder your way into making non-performant queries in dask (things that end up shuffling, partitioning, etc. a lot behind the scenes). It was more straightforward for my use case to write performant SQL queries for duckdb since that is much more of a common, solved problem. The scale-out feature of Dask and Spark is interesting too, as we are considering the merits of a natively clustered solution vs just breaking up our queries into chunks that can fit on multiple single instances for duckdb.

    • @MatthewRocklin
      @MatthewRocklin 1 year ago

      Yup. Totally agreed. The query optimization in Dask Dataframe should handle what you ran into historically. The problem wasn't unique to you :)

    • @ravishmahajan9314
      @ravishmahajan9314 11 months ago

      But what about distributed databases. Is DuckDB able to query distributed databases? Is this technology replacing spark framework??

  • @rjv
    @rjv 1 year ago

    Such a good video! So many good insights clearly communicated with proper data. Also love the interfaces you've built, very meaningful, clean and minimalistic. Have you got comparison benchmarks where cloud cost is the only constraint and the number of machines or their size and type (GPU machines with cudf) is not restricted?

  • @mooncop
    @mooncop 1 year ago

    you are most welcome (suffered well) worth it for the duck

  • @bbbbbbao
    @bbbbbbao 1 year ago

    It's not clear to me if you can use autoscaling with coiled.

    • @Coiled
      @Coiled 1 year ago

      You can use autoscaling with Coiled. See the `coiled.Cluster.adapt` method.

  • @o0o0oo00oo00
    @o0o0oo00oo00 1 year ago

    I don’t see duckdb and polars kick spark dask ass on 10gb level in my practical usage.😅 we can’t always trust TPC-H benchmarks.

  • @andrewm4894
    @andrewm4894 1 year ago

    Great talk, thanks

  • @Amapramaadhy
    @Amapramaadhy 1 year ago

    Some ppl were meant to teach and Matt is one of them! One feedback: I know you have covered it elsewhere but it might be helpful to talk about the graphs (like what does a yellow vs red block mean). You have them up on the screen. They must be serving some purpose. Again, brilliant presentation

  • @kamranpersianable
    @kamranpersianable 1 year ago

    Thanks, this is amazing! I have tried integrating Optuna hyperparameter search with Dask and it works great, but I have noticed if I increase the number of iterations, at some point my system crashes due to insufficient memory. From what I can see dask keeps a copy of each iteration so it ends up consuming more memory than needed; any way I can release all the memory usages after each iteration?

    • @Coiled
      @Coiled 1 year ago

      The copy that Dask keeps is just the result of the objective function (scores, metrics). This should be pretty lightweight. That's not to say that there isn't some memory leak somewhere (XGBoost, Pandas, ...). If you're able to provide a reproducer to a Dask issue tracker that would be welcome. Alternatively if you run on Coiled infrastructure there's lots of measurement tools there that get run automatically that could help to diagnose.

    • @kamranpersianable
      @kamranpersianable 1 year ago

      @Coiled thanks, I will check further to see what is going wrong! From what I can see, for 500 iterations there is 9 GB of added material in memory.

  • @ButchCassidyAndSundanceKid

    Does the Task Delayed use GPU as well ?

  • @UmmadikTas
    @UmmadikTas 1 year ago

    I had an issue with parallelization and the random sampler for hyperparameter search. When I submit the optimize function in parallel, Optuna keeps repeating the same hyperparameters across all processes. I could not figure out how to reseed the sampler for different processes.

    • @Coiled
      @Coiled 1 year ago

      Are the different processes communicating hyperparameters with a central Optuna Storage object? This video shows using the DaskStorage, which helps all of the Optuna search functions coordinate and share results between each other using Dask. Other ways to do this include using things like a database (although we think that Dask is easier).

  • @ButchCassidyAndSundanceKid

    What about Dask Bag and Dask Future ?

  • @irfams
    @irfams 1 year ago

    Would you please share a link to the notebook ?

  • @UmmadikTas
    @UmmadikTas 1 year ago

    Thank you so much. This is very helpful with my research.

  • @chaitanyamadduri5826
    @chaitanyamadduri5826 1 year ago

    The video is very informative, and kudos to Richard for making it intuitive. Could you help me with the questions below? 1. How can we perform a time series regression using Dask? I see we are breaking the huge dataset into chunks; how do we maintain the time continuity between the chunks? 2. You have used Coiled clusters, and I believe these are external CPU clusters; how is Dask more powerful than PySpark in this case? 3. So Dask can only be utilised for CPU execution, or might it also be used for parallel GPU execution? Please share your comments on this. Thanks in advance.

    • @Coiled
      @Coiled 1 year ago

      Thanks for the questions! First, you can always post more detailed questions on the Dask Forum: dask.discourse.group/ For your question on a time series regression, you may find this example helpful: examples.dask.org/applications/forecasting-with-prophet.html If you're curious to learn more about pros/cons of Dask vs. Spark, check out our blog post: www.coiled.io/blog/spark-vs-dask You can use Dask (and Coiled!) with GPU-enabled machines. Learn more in the Coiled docs (docs.coiled.io/user_guide/clusters/gpu.html) or the Dask documentation (docs.dask.org/en/stable/gpu.html).

  • @Lemuz90
    @Lemuz90 1 year ago

    This looks great! I remember trying to use coiled jobs to do something like this a while ago.

    • @Coiled
      @Coiled 1 year ago

      Thank you! Let us know how you end up using this!

  • @orlandogarcia885
    @orlandogarcia885 1 year ago

    What are the coming features that coiled plans to do?

    • @Coiled
      @Coiled 1 year ago

      We are working on lots of new things - check out Coiled Notebooks: ua-cam.com/video/mibhDHYun0M/v-deo.html and our upcoming webinar about Coiled Functions and Jobs, which allow you to run any python function in the cloud: ua-cam.com/video/JuBmG39zLY8/v-deo.html.

  • @thomasmoore3175
    @thomasmoore3175 1 year ago

    great stuff, Matt !

  • @bvenkateshx
    @bvenkateshx 1 year ago

    I have a use case to read data from an Oracle table, split it into files, zip them, and move them to S3. Would Dask be a benefit or overhead for such a use case? (cx_Oracle is used. Currently using multiprocessing on a 20-core server.)

    • @Coiled
      @Coiled 1 year ago

      Thanks for the question! It's hard to answer without more details on the size of your data, but feel free to post your question on the Dask Forum dask.discourse.group/

  • @Coiled
    @Coiled 1 year ago

    Update: pandas 2.0 has been released! See www.coiled.io/blog/pyarrow-strings-in-dask-dataframes for the latest on PyArrow strings improvements.

  • @ДаниилСеров-ж4ч

    Thank you very much for this useful information

  • @billyblackburn864
    @billyblackburn864 1 year ago

    the one at 15min is really nice...what is the cluster you're running it on?

  • @exeb1t_solopharm
    @exeb1t_solopharm 1 year ago

    Thank you very much! Great video series, keep up the good work!

  • @Akademik-o2g
    @Akademik-o2g 1 year ago

    Good video! Can you help me? Where can i find notebook from this video?

  • @mikecmw8492
    @mikecmw8492 1 year ago

    This is a very good video. I have to ask cause I am in the situation of setting up a DASK cluster that will be querying large weather datasets in AWS S3. I have never done it. Do you have a video on setting up the cluster? Have not explored your channel yet...thx

  • @pieter5466
    @pieter5466 1 year ago

    33:00 surprising that there aren’t existing open source solutions that support “marginal” arrays, so to speak… has this changed?