Hi, this video is really beneficial. I have a question about the best practice for handling data transmission between tasks. I am building MLOps pipelines using Airflow. My model training DAG contains data preprocessing -> model training, so there is massive data transmission between these two tasks. I am using XCom to pass data between them, but XCom has roughly a 2 GB limitation. So what's the best practice to deal with this problem? Using S3 to send/pull data between tasks? Or should I simply combine these two tasks (data preprocessing -> model training)? Thank you.
Thank you! For passing larger amounts of data between tasks you have two main options: a custom XCom backend, or writing to intermediary storage directly from within the tasks. In general we recommend a custom XCom backend as the best practice in these situations, because your DAG code stays the same; only the way data is serialized to and retrieved from XCom changes. You can find a tutorial on how to set up a custom XCom backend here: docs.astronomer.io/learn/xcom-backend-tutorial. Merging the tasks is generally not recommended because it makes it harder to get observability and to rerun individual actions.
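A custom XCom backend typically overrides `serialize_value` / `deserialize_value` so that large payloads go to object storage and only a small reference passes through Airflow's metadata database. Here is a minimal, self-contained sketch of that pattern; it uses an in-memory dict (`FAKE_S3`) to stand in for S3 so it runs anywhere, and the threshold, key scheme, and function names are illustrative assumptions. A real backend would subclass Airflow's `BaseXCom` and call `boto3` where the comments indicate.

```python
import json
import uuid

# Stand-in for an object store such as S3 (illustrative only).
FAKE_S3 = {}

def serialize_value(value, threshold=1024):
    """Store large values externally; pass only a small reference through XCom."""
    raw = json.dumps(value)
    if len(raw) <= threshold:
        return raw  # small enough to keep in the metadata DB
    key = f"xcom/{uuid.uuid4()}.json"
    FAKE_S3[key] = raw  # real backend: s3.put_object(Bucket=..., Key=key, Body=raw)
    return json.dumps({"__xcom_ref__": key})

def deserialize_value(stored):
    """Resolve a reference back to the full payload, or return the inline value."""
    obj = json.loads(stored)
    if isinstance(obj, dict) and "__xcom_ref__" in obj:
        # real backend: s3.get_object(Bucket=..., Key=obj["__xcom_ref__"])
        return json.loads(FAKE_S3[obj["__xcom_ref__"]])
    return obj

# A payload far over the threshold round-trips through the "external store".
big = {"rows": list(range(5000))}
assert deserialize_value(serialize_value(big)) == big
```

Because the tasks still just return values and read XCom as usual, switching storage backends later requires no DAG changes.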
@@Astronomer Hi, thanks for your valuable reply. I would also like to ask what level of granularity we should aim for when allocating tasks. The more tasks there are, the more pushes to and pulls from external storage happen, and when the data is large, this adds network overhead.
That was an excellent illustration. Superb!
Thank you! Cheers!
@@Astronomer Certainly!
I love the presentation. Would you mind sharing it with me? Thanks.
Hey Joseph, definitely, would you mind emailing me and I'll send it over that way? My email is george.yates@astronomer.io
Thanks. Exciting topic. How reasonable is it to run long-running ML processes in Airflow?
Hi, Oleg! If and when possible, the process should be broken up into separate tasks. The crucial aspect of the decision is where the compute will actually happen. It is best for Airflow to submit the ML process to a compute service and then retrieve the results. That way the Airflow task doesn't have to be "always-on," so you aren't limited by Airflow's constraints. If you would like to discuss the matter further, you can set up a meeting with us and get the support you need: www.astronomer.io/office-hours
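The submit-then-retrieve shape described above usually becomes two task bodies: one that fires off the job on the external compute service and returns the job id, and one that polls for completion. The sketch below is runnable and self-contained; `JobService` and its `submit`/`status` methods are illustrative stand-ins for a real service such as SageMaker or a Spark cluster, not a real API.

```python
import time

class JobService:
    """Fake external compute service; illustrative assumption only."""
    def __init__(self):
        self._jobs = {}

    def submit(self, config):
        job_id = f"job-{len(self._jobs)}"
        # Pretend the job finishes after two status checks.
        self._jobs[job_id] = {"polls_left": 2, "config": config}
        return job_id

    def status(self, job_id):
        job = self._jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return "RUNNING"
        return "SUCCEEDED"

def submit_training(service, config):
    """Airflow task body: fire-and-forget submission; return the job id via XCom."""
    return service.submit(config)

def wait_for_training(service, job_id, poll_interval=0.01, timeout=5.0):
    """Sensor-style task body: poll until the external job completes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if service.status(job_id) == "SUCCEEDED":
            return True
        time.sleep(poll_interval)
    raise TimeoutError(f"{job_id} did not finish in {timeout}s")

svc = JobService()
jid = submit_training(svc, {"model": "xgboost"})
assert wait_for_training(svc, jid)
```

In Airflow the polling half is typically a sensor or a deferrable operator, so no worker slot is held while the heavy lifting happens elsewhere.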
Could you share the presentation with us?
Sure, just email me george.yates@astronomer.io and I'll send it over!