Machine Learning in Production with Airflow

  • Published Nov 18, 2024

COMMENTS • 13

  • @chrisogonas
    @chrisogonas 1 year ago +3

    That was an excellent illustration. Superb!

  • @JosephRivera517
    @JosephRivera517 1 year ago

    I love the presentation. Would you mind sharing it with me? Thanks.

    • @Astronomer
      @Astronomer  1 year ago

      Hey Joseph, definitely, would you mind emailing me and I'll send it over that way? My email is george.yates@astronomer.io

  • @ryank8463
    @ryank8463 7 months ago

    Hi, this video is really beneficial. I have a question about the best practice for handling data transmission between tasks. I am building MLOps using Airflow. My model-training DAG contains data preprocessing -> model training, so there is massive data transmission between these two tasks. I am using XCom to pass data between them, but XCom has roughly a 2 GB limitation. What's the best practice to deal with this? Using S3 to send/pull data from tasks? Or should I simply combine these two tasks (data preprocessing -> model training)? Thank you.

    • @Astronomer
      @Astronomer  7 months ago

      Thank you! For passing larger amounts of data between tasks you have two main options: a custom XCom backend or writing to intermediary storage directly from within the tasks.
      In general we recommend a custom XCom backend as a best practice in these situations, because you can keep your DAG code the same; the change happens in how data sent to and retrieved from XCom is processed. You can find a tutorial on how to set up a custom XCom backend here: docs.astronomer.io/learn/xcom-backend-tutorial.
      Merging the tasks is generally not recommended because it makes it harder to get observability and to rerun individual actions.
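The offloading idea behind a custom XCom backend can be sketched without Airflow itself: values over a size threshold go to external storage and only a small reference string travels through XCom. This is a minimal illustration, not the real API; the names (`STORAGE_DIR`, `THRESHOLD_BYTES`, the `external://` prefix) are all made up for the sketch. A real backend would subclass `airflow.models.xcom.BaseXCom`, override its serialize/deserialize hooks as shown in the linked tutorial, and write to S3/GCS instead of a temp directory.

```python
import json
import os
import tempfile
import uuid

# A local temp directory stands in for S3/GCS in this sketch.
STORAGE_DIR = tempfile.mkdtemp(prefix="xcom_store_")
THRESHOLD_BYTES = 1024          # hypothetical cutoff; tune to your metadata DB
PREFIX = "external://"          # marker distinguishing offloaded values

def serialize_value(value):
    """Return what would be stored in the XCom table."""
    payload = json.dumps(value)
    if len(payload.encode()) <= THRESHOLD_BYTES:
        return payload                           # small value: store inline
    key = str(uuid.uuid4())                      # large value: offload it
    with open(os.path.join(STORAGE_DIR, key), "w") as f:
        f.write(payload)
    return PREFIX + key                          # only the reference is kept

def deserialize_value(stored):
    """Resolve an XCom value, fetching from storage if it was offloaded."""
    if stored.startswith(PREFIX):
        key = stored[len(PREFIX):]
        with open(os.path.join(STORAGE_DIR, key)) as f:
            stored = f.read()
    return json.loads(stored)
```

Because the offloading happens inside these two hooks, task code keeps calling plain `xcom_push`/`xcom_pull` (or returning values) unchanged, which is exactly why the reply above recommends this over rewriting the DAG.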

    • @ryank8463
      @ryank8463 7 months ago

      @Astronomer Hi, thanks for your valuable reply. I would also like to ask what level of granularity we should aim for when splitting work into tasks, since the more tasks there are, the more pushing/pulling of data to and from external storage happens, and when the data is large, that brings some network overhead.

  • @ministryNoiz
    @ministryNoiz 2 years ago

    Thanks. Exciting topic. How reasonable is it to run long-running ML processes in airflow?

    • @Astronomer
      @Astronomer  2 years ago

      Hi, Oleg! If and when possible, the process should be broken up into separate tasks. The crucial aspect of the decision is where the compute will actually happen. It is best for Airflow to submit the ML process to a compute service and then retrieve the results. That way the Airflow task doesn't have to be "always-on," so you aren't limited by Airflow's constraints. If you would like to discuss the matter further, you can set up a meeting with us and get the support you need: www.astronomer.io/office-hours
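The submit-then-poll pattern described in the reply can be sketched as follows. `FakeComputeService` is a hypothetical stand-in for a real compute API (Spark, Kubernetes, SageMaker, etc.); in an actual DAG the polling loop would be an Airflow sensor or deferrable operator so no worker slot is held while the job runs.

```python
import time

class FakeComputeService:
    """Stand-in for an external compute API that runs the long ML job."""

    def __init__(self, ticks_until_done=3):
        self._jobs = {}
        self._ticks = ticks_until_done

    def submit(self, job_spec):
        # Returns immediately with a handle; the heavy work runs elsewhere.
        job_id = f"job-{len(self._jobs)}"
        self._jobs[job_id] = self._ticks
        return job_id

    def status(self, job_id):
        # Simulates the job finishing after a few status checks.
        if self._jobs[job_id] > 0:
            self._jobs[job_id] -= 1
            return "RUNNING"
        return "SUCCEEDED"

def run_training(service, job_spec, poke_interval=0.01):
    """Submit the job, then poll cheaply until it finishes."""
    job_id = service.submit(job_spec)
    while service.status(job_id) == "RUNNING":
        time.sleep(poke_interval)  # in Airflow: a sensor/deferrable wait
    return job_id
```

The design point is that the Airflow task only submits and checks status, each a quick call, so Airflow's own execution limits never constrain the duration of the training job itself.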

  • @mohamedchafiq7793
    @mohamedchafiq7793 1 year ago

    could you share with us the presentation?

    • @Astronomer
      @Astronomer  1 year ago

      Sure, just email me george.yates@astronomer.io and I'll send it over!