The Missing Piece in Many Data Pipelines

  • Published 26 Jul 2024
  • ►► The Starter Guide for Modern Data → bit.ly/starter-mds
    Simplify “modern” architectures + better understand common tools & components
    All data teams (large & small) have at least one thing in common.
    Source data.
    But not everyone handles it the same way in their pipelines.
    Some reference raw source tables directly in many queries.
    Others create ad-hoc custom tables to address subtle formatting changes,
    but without any real overarching strategy or consistent naming behind them.
    While data modeling (e.g. Kimball, One Big Table, etc.) is the more popular topic,
    I believe an equally important area to consider is what you do BEFORE you start creating those core data models.
    For many, this "before" layer doesn't exist at all.
    In previous videos I've talked about a 3-Layered Data Model.
    And today I want to focus solely on Layer 1, which addresses this concept.
    It's called a "Staging" layer.
    When done right, it can help you establish reliable pipelines from the very start.
    Timestamps:
    00:00 - Intro
    00:52 - What is a Staging Layer?
    03:23 - Reason #1: Modularity
    05:03 - Reason #2: Consistency
    07:21 - Reason #3: Clarity
    Title & Tags:
    The Missing Piece in Many Data Pipelines
    #kahandatasolutions #dataengineering #datamodeling
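
A minimal sketch of the staging idea described above, using Python's built-in sqlite3 module so it runs anywhere. The table name `raw_crm_customers`, the view name `stg_crm__customers`, and all column names are hypothetical; the video does not prescribe them. The point is only that renaming, casting, and trimming happen once, in one place, and downstream queries read the staging view instead of the raw table.

```python
import sqlite3


def build_staging(conn: sqlite3.Connection) -> None:
    """Create a hypothetical raw table and a staging view on top of it."""
    # The raw table keeps the source system's naming and text-typed columns.
    # The staging view applies consistent names, types, and basic cleanup.
    conn.executescript(
        """
        CREATE TABLE raw_crm_customers (ID TEXT, Cust_Name TEXT, SignupDt TEXT);
        INSERT INTO raw_crm_customers VALUES
            ('1', '  Ada Lovelace ', '2024-01-05'),
            ('2', 'Grace Hopper',    '2024-02-10');

        CREATE VIEW stg_crm__customers AS
        SELECT
            CAST(ID AS INTEGER) AS customer_id,
            TRIM(Cust_Name)     AS customer_name,
            DATE(SignupDt)      AS signup_date
        FROM raw_crm_customers;
        """
    )


conn = sqlite3.connect(":memory:")
build_staging(conn)

# Downstream models query the staging view, never the raw table.
rows = conn.execute(
    "SELECT customer_id, customer_name, signup_date "
    "FROM stg_crm__customers ORDER BY customer_id"
).fetchall()
print(rows)
```

If the source system later renames `Cust_Name`, only the view definition changes; queries built on `stg_crm__customers` keep working.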

COMMENTS • 13

  • @andresarmua
    @andresarmua 14 days ago

    Nice! I use a staging layer as a view and then 4 more layers for the pipeline until I get to the mart. I usually alternate between views and materialized tables, but I am not quite sure how to decide between tables and views at each step. How do you compare performance, storage and other practical factors?

  • @bertjanvdberg
    @bertjanvdberg 23 days ago +2

    Nice! Question: Do you also use views in your warehouse and mart layers? I've been at companies where the marts were basically views based on views based on views times 10 which was terrible for the performance of getting the data.

    • @ramtadam1469
      @ramtadam1469 23 days ago +2

      We always use tables as marts and then sometimes on top build views that do things with the materialized marts data.
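
A rough sketch of the trade-off discussed in this thread, again with sqlite3 and hypothetical names (`raw_orders`, `stg_orders`, `mart_revenue`, `rpt_revenue`). It does not measure performance; it only shows the behavioral difference: a view is recomputed on every read and always reflects the source, while a materialized table is a stored snapshot that is cheap to read but must be refreshed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Staging is a view, the mart is a stored table, and a thin reporting
# view sits on top of the stored mart (the pattern described above).
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER);
    INSERT INTO raw_orders VALUES (1, 1500), (2, 2500);

    CREATE VIEW stg_orders AS
    SELECT order_id, amount_cents / 100.0 AS amount FROM raw_orders;

    CREATE TABLE mart_revenue AS
    SELECT SUM(amount) AS total_revenue FROM stg_orders;

    CREATE VIEW rpt_revenue AS
    SELECT total_revenue FROM mart_revenue;
    """
)


def refresh_mart(conn: sqlite3.Connection) -> None:
    """Rebuild the stored mart from the staging view in one transaction."""
    with conn:
        conn.execute("DELETE FROM mart_revenue")
        conn.execute("INSERT INTO mart_revenue SELECT SUM(amount) FROM stg_orders")


# A new source row arrives after the mart was built.
conn.execute("INSERT INTO raw_orders VALUES (3, 1000)")

view_total = conn.execute("SELECT SUM(amount) FROM stg_orders").fetchone()[0]
stale_total = conn.execute("SELECT total_revenue FROM rpt_revenue").fetchone()[0]

refresh_mart(conn)
fresh_total = conn.execute("SELECT total_revenue FROM rpt_revenue").fetchone()[0]

print(view_total, stale_total, fresh_total)
```

Stacking many views means every read re-runs the whole chain, which matches the slowdown described above; materializing the mart moves that cost to the refresh step instead.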

  • @thedavidabides
    @thedavidabides 23 days ago +2

    Nice work! Where should the staging layer come when using a bronze, silver, gold medallion structure?

    • @muhammadbadar6089
      @muhammadbadar6089 23 days ago +2

      From my understanding, you would use your bronze layer as a staging layer, pulling from all source systems.

    • @personalbranddata
      @personalbranddata 23 days ago +1

      It's the silver layer. Bronze = raw data in this video. Silver = "staging"/cleaned data in this video. Gold = Warehouse in this video. I don't like that he's using the term "staging" to refer to cleaned data because in traditional data warehousing a staging table typically refers to uncleaned data straight after you've loaded it from a source system and the cleaning happens later.

    • @ArmandsPutnis
      @ArmandsPutnis 23 days ago +2

      It does not really matter what you call them if you have agreed on the purpose. The bronze layer can be raw_source or it can be staging.
      Personally, I like to keep the source out of the way and use bronze for staging (cleaning/transforming),
      silver for joining multiple bronze tables that I know can be reused for multiple use cases in a gold layer,
      and gold for the final solution/consumption, joining some silver and bronze tables.

    • @gatorpika
      @gatorpika 20 days ago

      @ArmandsPutnis yeah, this. Bronze, silver and gold is an abstraction to help you think about your structure, not something with set rules you have to follow dogmatically. Figure out what layers you need to solve your problems and then just structure your layers appropriately. Staging serves a purpose: it helps you shift the transforms left, so changes are easier down the road because they propagate through all your downstream transforms. Then transform on top of that, assuming the stage takes care of most of the cleaning/formatting for you. If your management makes you pick a metal, I suggest the titanium layer.
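
To make the "changes propagate downstream" point concrete, here is a small sketch that treats layers as a dependency graph, whatever they are called. The model names are hypothetical, and `graphlib` is from the Python standard library (3.9+). It computes a valid build order and lists which models inherit a fix made in one staging model.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key depends on the set of models it reads from.
DEPENDS_ON = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "int_customer_orders": {"stg_orders", "stg_customers"},
    "mart_revenue": {"int_customer_orders"},
    "mart_churn": {"int_customer_orders", "stg_customers"},
}


def build_order(graph: dict) -> list:
    """Return one valid order in which every model is built after its inputs."""
    return list(TopologicalSorter(graph).static_order())


def downstream_of(model: str, graph: dict) -> set:
    """Return every model that directly or indirectly reads from `model`."""
    affected = set()
    frontier = [model]
    while frontier:
        current = frontier.pop()
        for child, parents in graph.items():
            if current in parents and child not in affected:
                affected.add(child)
                frontier.append(child)
    return affected


order = build_order(DEPENDS_ON)
affected = downstream_of("stg_orders", DEPENDS_ON)
print(order)
print(sorted(affected))
```

A cleanup applied once in `stg_orders` reaches every model in `affected` without touching their definitions, which is the practical benefit of shifting transforms left.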

  • @johnpower1458
    @johnpower1458 19 days ago

    Do you truncate the data on each batch pipeline run in staging and capture the cleaned data in snapshots? If not, how do you avoid duplicates downstream if you’re using, say, SCD Type 2?
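
One common way to handle this, sketched in plain Python, is to hash the tracked attributes of each staged row and open a new SCD Type 2 version only when the hash differs from the current version. This is an assumption about a typical approach, not the video author's answer; the `Version` class, `hash_attrs`, `apply_batch`, and the field names are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    key: str
    attrs: dict
    row_hash: str
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current version


def hash_attrs(attrs: dict) -> str:
    """Hash the tracked attributes in a stable key order."""
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_batch(history: list, batch: list, loaded_at: str) -> None:
    """Append a new version only when a row's attributes actually changed."""
    current = {v.key: v for v in history if v.valid_to is None}
    for row in batch:
        key, attrs = row["key"], row["attrs"]
        new_hash = hash_attrs(attrs)
        existing = current.get(key)
        if existing is not None and existing.row_hash == new_hash:
            continue  # unchanged or re-delivered row: no duplicate version
        if existing is not None:
            existing.valid_to = loaded_at  # close the previous version
        version = Version(key, attrs, new_hash, loaded_at)
        history.append(version)
        current[key] = version


history = []
batch = [{"key": "c1", "attrs": {"tier": "free"}}]
apply_batch(history, batch, "2024-07-01")
apply_batch(history, batch, "2024-07-02")  # same data re-run
apply_batch(history, [{"key": "c1", "attrs": {"tier": "pro"}}], "2024-07-03")
print(len(history))
```

With this check in place, whether staging is truncated or appended on each run matters less for correctness, because re-delivered rows are ignored when the history is updated.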

  • @williamchurch711
    @williamchurch711 11 days ago

    The staging layer would be equivalent to a landing zone?

    • @senarl
      @senarl 6 days ago

      Might be wrong, but I take it that the staging layer would be the bronze layer in the Medallion architecture, so we would have landing with raw data, bronze with cleaned raw data, silver with any new columns or enhancements to the data, and gold with the joins and business logic. But that's just how I use it at work, and it can be changed to fit your needs.

  • @Milhouse77BS
    @Milhouse77BS 23 days ago

    Stage All the Things