Manage your data pipelines with Dagster | Software defined assets | IO Managers | Updated project

  • Published 22 Jul 2024
  • In this video we will revisit Dagster and talk about changes to this workflow orchestration system due to recent updates (from version 0.15 to 1.3.1).
    Dagster is an orchestrator that's designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.
    We will cover Software-Defined Assets, as Dagster is moving towards the Software-Defined Asset model. By default, our pipeline outputs are stored as pickle files in the Dagster home folder. What if we want to store the outputs in a database table, or in a readable file such as a CSV or Parquet file? Dagster provides Input/Output managers (IO managers) that enable reading and writing data to storage systems. Using IO managers we can save the outputs to the file system or store our data as tables in a database. We will define CSV/Parquet file IO managers and a database IO manager.
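    As a reference for that IO manager idea, here is a minimal sketch of a CSV IO manager in the current (1.x) Dagster API. The class name CsvIOManager and the base_dir field are illustrative assumptions, not the exact code from the video:

    import os
    import pandas as pd
    from dagster import ConfigurableIOManager, InputContext, OutputContext

    class CsvIOManager(ConfigurableIOManager):
        # Hypothetical IO manager: writes each asset's DataFrame output to a CSV
        # file and reads it back when a downstream asset requests it.
        base_dir: str = "data"  # illustrative output folder

        def _path(self, context) -> str:
            # One CSV file per asset, named after the asset key.
            return os.path.join(self.base_dir, f"{context.asset_key.path[-1]}.csv")

        def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
            os.makedirs(self.base_dir, exist_ok=True)
            obj.to_csv(self._path(context), index=False)

        def load_input(self, context: InputContext) -> pd.DataFrame:
            return pd.read_csv(self._path(context))

    It would then be attached to assets through the project's Definitions, e.g. resources={"io_manager": CsvIOManager(base_dir="data")}.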
    Link to previous video: • Getting started with D...
    Link to GitHub repo: github.com/hnawaz007/pythonda...
    Get started with Dagster in just three quick steps:
    Install Dagster, Define assets and Materialize assets.
    Create a virtual environment: python -m venv env
    Activate the virtual environment: env\Scripts\activate
    To install Dagster into an existing Python environment, run:
    pip install dagster dagit
    Command to create a new project
    dagster project scaffold --name my-dagster-project
    Additional libraries required: Pandas, psycopg2
    To run Dagster, issue the following commands:
    dagit
    dagster-daemon run
    Access Dagit UI on port 3000: 127.0.0.1:3000
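    As a rough illustration of the "Define assets" step, the scaffolded project's assets.py could contain something along these lines (the asset names and the sample data are illustrative, not taken from the video):

    import pandas as pd
    from dagster import Definitions, asset

    @asset
    def source_orders() -> pd.DataFrame:
        # Illustrative upstream asset; in the video the data comes from a source
        # database, but a literal DataFrame keeps this sketch self-contained.
        return pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

    @asset
    def orders_summary(source_orders: pd.DataFrame) -> pd.DataFrame:
        # Downstream asset: Dagster hands the upstream output to this function
        # through the configured IO manager.
        return pd.DataFrame({"total_amount": [source_orders["amount"].sum()]})

    # Register the assets (and any custom IO managers) with the project.
    defs = Definitions(assets=[source_orders, orders_summary])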
    💥Subscribe to our channel:
    / haqnawaz
    📌 Links
    -----------------------------------------
    #️⃣ Follow me on social media! #️⃣
    🔗 GitHub: github.com/hnawaz007
    📸 Instagram: / bi_insights_inc
    📝 LinkedIn: / haq-nawaz
    🔗 / hnawaz100
    -----------------------------------------
    #Python #ETL #Dagster
    Topics covered in this video:
    ==================================
    0:00 - Introduction to Dagster
    2:11 - Dagster create new project
    3:03 - Dagster Project Structure
    4:18 - Software Defined Assets
    5:35 - Install Required Libraries
    5:58 - Source DB Connection
    6:27 - Source Asset
    10:05 - File IO Manager
    14:16 - Second Asset
    16:19 - Parquet IO Manager
    16:26 - Database IO Manager
    19:05 - Materialize Assets
  • Science & Technology

COMMENTS • 17

  • @BiInsightsInc
    @BiInsightsInc  1 year ago

    Link to previous video on Dagster: ua-cam.com/video/t8QADtYdWEI/v-deo.html&t
    ETL with Python: ua-cam.com/video/dfouoh9QdUw/v-deo.html&t

  • @MrMal0w
    @MrMal0w 1 year ago +4

    Love it! Dagster is my favorite tool for data orchestration and your video is very well built 🎉 need more on this topic :)

  • @jeanguerrapty
    @jeanguerrapty 11 months ago

    Hi @BiInsightsInc, thank you very much for posting this awesome content. Could you please create an ETL video or series that works with these tools and MongoDB?

    • @BiInsightsInc
      @BiInsightsInc  10 months ago +1

      I will try and add the IO Manager for MongoDB.

  • @whalesalad
    @whalesalad 10 months ago

    A popular practice with BigQuery is to process data in stages where each stage is effectively a table. So you might have a raw table that takes all the raw data in, and then a pivot or aggregation process that would take the data from table A and write it to table B. I am trying to wrap my head around how to do this correctly with Dagster. The data would always live inside of BQ, never coming out into these python functions. Is there a best practice for this sort of thing? Effectively there is no IO, it is all remote, and Dagster would just be orchestrating the commands. Is this possible?

    • @BiInsightsInc
      @BiInsightsInc  10 months ago

      I think this is a standard ELT approach if you are building a data mart or database using SQL. dbt would be perfect for this use case. Your data lives in your database and you can transform it with SQL using dbt. You can have raw sources, build intermediate tables for transformation, and final dims and facts for analytics. Dagster can orchestrate the whole process ad hoc or on a schedule.
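      If you do want to keep that orchestration in Dagster itself rather than dbt, one option is an asset that only runs SQL remotely and returns nothing, so no data passes through an IO manager. A rough sketch assuming the google-cloud-bigquery client; the dataset, table names, and query are made-up placeholders:

      from dagster import asset
      from google.cloud import bigquery

      @asset
      def raw_table() -> None:
          # Placeholder for the load step that lands raw data in BigQuery (table A).
          ...

      @asset(non_argument_deps={"raw_table"})
      def pivoted_table() -> None:
          # Runs the aggregation inside BigQuery; no rows leave the warehouse,
          # so there is nothing for an IO manager to persist (returns None).
          client = bigquery.Client()
          sql = """
              CREATE OR REPLACE TABLE analytics.table_b AS
              SELECT customer_id, SUM(amount) AS total_amount
              FROM analytics.table_a
              GROUP BY customer_id
          """
          client.query(sql).result()  # block until the query job finishes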

  • @zamanganji1262
    @zamanganji1262 1 year ago

    If we need to process multiple .sav files, convert them into multiple CSV files, and make some modifications to them, how can we accomplish this using Dagster?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      I saw your comment on the reference data ingestion video. You can borrow the code for ingesting multiple files from there. You can easily convert the Python functions to "op" and/or "asset" with the help of Dagster decorators.
      I have covered how to convert a Python script to an "op" in this video here:
      ua-cam.com/video/t8QADtYdWEI/v-deo.html&t
      Code to convert sav files:
      import pandas as pd
      df = pd.read_spss("input_file.sav")
      df.to_csv("output_file.csv", index=False)

  • @MrMal0w
    @MrMal0w 1 year ago +1

    Question: to implement an incremental-load IO manager we need to pass the ‘append’ arg instead of ‘replace’ to SQLAlchemy. Is it possible to send this parameter directly from the asset?

    • @BiInsightsInc
      @BiInsightsInc  1 year ago +1

      It is possible. I have seen an example of this on Stack Overflow, but it requires a little more configuration; link below. Another idea would be to have two versions of the IO manager: one for incremental loads (append) and a second one for truncate-and-load (replace).
      stackoverflow.com/questions/76173666/how-to-implement-io-manager-that-have-a-parameter-at-asset-level
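      For context, here is a rough sketch of that second idea, with the write mode exposed as configuration on a Postgres IO manager so append vs. replace is picked when the resource is configured rather than hard-coded (the class name, connection string, and field names are illustrative assumptions):

      import pandas as pd
      from sqlalchemy import create_engine
      from dagster import ConfigurableIOManager, InputContext, OutputContext

      class PostgresIOManager(ConfigurableIOManager):
          # Hypothetical DB IO manager: the write mode is a config field.
          connection_string: str = "postgresql://user:pass@localhost:5432/dw"  # placeholder
          if_exists: str = "replace"  # "replace" for full loads, "append" for incremental

          def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
              engine = create_engine(self.connection_string)
              table = context.asset_key.path[-1]
              obj.to_sql(table, engine, if_exists=self.if_exists, index=False)

          def load_input(self, context: InputContext) -> pd.DataFrame:
              engine = create_engine(self.connection_string)
              return pd.read_sql_table(context.asset_key.path[-1], engine)

      Incremental assets could then be assigned an instance configured with if_exists="append", while full-refresh assets keep the default "replace".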

    • @MrMal0w
      @MrMal0w 1 year ago

      @@BiInsightsInc thanks a lot, I will check it out :)

    • @henrikvaher697
      @henrikvaher697 1 year ago +2

      This is great, I've had similar issues. I want to query an API and APPEND the retrieved data to the existing asset.

  • @Sebastian-xw4mp
    @Sebastian-xw4mp 3 months ago

    @BiInsightsInc, between 03:05 and 05:39 the requirements.txt magically appears in your etl folder. Makes it hard to follow along with your video...

    • @BiInsightsInc
      @BiInsightsInc  3 months ago +1

      You can clone the repo; that way you will have all the requirements and can then follow along. All links are in the description. Here is the link to the repo:
      github.com/hnawaz007/pythondataanalysis/tree/main/dagster-project/etl
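      For anyone following along without cloning, the requirements.txt presumably just lists the packages mentioned in the description and video; something along these lines (a guess at the contents, not the exact file from the repo):

      dagster
      dagit
      pandas
      psycopg2
      sqlalchemy  # likely needed for the database IO manager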

  • @akmalhafiz7830
    @akmalhafiz7830 9 months ago +1

    Thanks, this is helpful. However, I do have a question: let's say I want to build an ELT pipeline and ingest an entire database into a data warehouse. Is it better for me to separate the tables into multiple data assets and ingest them one by one, or just use one data asset?

    • @BiInsightsInc
      @BiInsightsInc  9 months ago +2

      It’s better to split each table out as its own asset. Each source table should have an asset, then stage this data; after that step it depends on your data modeling strategy how you want to model it.
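      To illustrate that one-asset-per-table idea, a common pattern is a small asset factory that loops over the source table names. The table list, connection string, and asset name prefix below are illustrative:

      import pandas as pd
      from dagster import Definitions, asset
      from sqlalchemy import create_engine

      SOURCE_TABLES = ["customers", "orders", "products"]  # illustrative table names

      def make_raw_asset(table_name: str):
          @asset(name=f"raw_{table_name}")
          def _raw_table() -> pd.DataFrame:
              # One asset per source table: pull the full table from the source DB.
              engine = create_engine("postgresql://user:pass@localhost:5432/source")  # placeholder
              return pd.read_sql_table(table_name, engine)
          return _raw_table

      defs = Definitions(assets=[make_raw_asset(t) for t in SOURCE_TABLES])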

    • @akmalhafiz7830
      @akmalhafiz7830 9 months ago

      @@BiInsightsInc thank you for the input