The Parquet Format and Performance Optimization Opportunities Boudewijn Braams (Databricks)

  • Published 15 Aug 2024
  • The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads. As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is 'many small files', and will discuss the open-source Delta Lake format in relation to this and Parquet in general. This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
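
Two of the optimizations named in the description, dictionary encoding and min/max skipping, can be illustrated with a minimal pure-Python sketch. This is a toy model, not Parquet's actual implementation: `dictionary_encode`, `RowGroup` and `scan_equals` are hypothetical names chosen for illustration, and a real reader takes the min/max statistics from file metadata rather than recomputing them.

```python
from dataclasses import dataclass


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index in `dictionary`
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


@dataclass
class RowGroup:
    """Toy stand-in for a row group holding a single integer column."""
    values: list

    @property
    def min(self):
        return min(self.values)

    @property
    def max(self):
        return max(self.values)


def scan_equals(row_groups, target):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for rg in row_groups:
        if target < rg.min or target > rg.max:
            continue  # statistics prove no row can match, so skip the read
        groups_read += 1
        matches.extend(v for v in rg.values if v == target)
    return matches, groups_read


# Low-cardinality column: three distinct values instead of six strings.
countries = ["NL", "US", "NL", "NL", "US", "DE"]
dictionary, indices = dictionary_encode(countries)

# Sorted data gives narrow, non-overlapping min/max ranges per row group.
row_groups = [RowGroup([1, 2, 3]), RowGroup([4, 5, 6]), RowGroup([7, 8, 9])]
matches, groups_read = scan_equals(row_groups, 5)
```

With sorted input only one of the three toy row groups is read for the predicate `value == 5`. If the same values were shuffled across row groups, the min/max ranges would overlap and fewer groups could be skipped.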

COMMENTS • 56

  • @prabhumaganur 4 years ago +38

    The best representation of Parquet file structure!! Simply Awesome!!

  • @manishsingh455 3 years ago +13

    This content explained most things, and it is really amazing.

  • @robinjamwal1 3 years ago +7

    Great talk, great teaching, excellent tutor! One of the best presentations I have ever watched and listened to.

  • @SunilBuge 3 years ago +7

    Great overview to address performance issues with storage layer design 👍

  • @lhok 11 months ago

    Best Parquet file presentation I've watched.

  • @YinghuaShen-kw5ys 3 months ago

    Great, this makes me know more about Parquet. Thanks for the pre!

  • @vt1454 1 year ago +1

    Great presentation 👏 👌

  • @BuvanAlmighty 3 years ago +1

    Best presentation in Parquet.

  • @raviiit6415 1 year ago

    great talk with simple explanations.

  • @mallikarjunyadav7839 2 years ago +1

    Awesome video with great content and explanation. Very very useful.

  • @kehaarable 3 years ago +1

    Awesome video - not too much extraneous or labored points. Thank you!

  • @flaviofukabori2149 3 years ago +1

    Amazing. All concepts really well explained.

  • @payalbhatia6927 24 days ago

    Superb

  • @AM-iz8gk 1 year ago

    Impressive presentation well structured explanations.

  • @higiniofuentes2551 1 year ago

    Thank you for this very useful video!

  • @raghudesparado 3 years ago +1

    Great Presentation. Thank you

  • @tadastadux 3 years ago +2

    @databricks - what is the best practice on using or avoiding nested columns? For example, I have a customer struct with Age, Gender, Name, etc. attributes. Is it better to keep it as a struct or to split it into separate columns?

  • @hatemsiyala4944 1 year ago

    Great talk. Thank you!

  • @Pavi950 4 years ago +2

    Thanks for the content!

  • @ashokkumarsivasankaran5428 1 year ago

    Great! Well explained!

  • @user-zz9lk2op1f 1 year ago

    Just excellent 👍

  • @AmitParopkari 5 months ago

    Finally understood what the Parquet format is, thanks.
    I have one small question: does this mean the footer metadata is nothing but schema details, like the underlying table details? Like a way to record the table name, column names, etc.?
    I'll also dig into it myself, but in the meantime...
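
On the footer question: in the Parquet file layout, the footer holds the file metadata (schema, row count, and per-row-group, per-column-chunk details such as offsets, encodings and statistics), serialized with Thrift. It is followed by a 4-byte little-endian footer length and the 4-byte magic `PAR1`, which also opens the file. The format defines no table-name field; table names typically live in an external catalog. A minimal sketch of locating the footer bytes, where `footer_span` is a hypothetical helper and the input is a synthetic byte string rather than a valid Parquet file:

```python
import struct

MAGIC = b"PAR1"


def footer_span(data: bytes):
    """Return (start, end) byte offsets of the footer metadata in a file image."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file image")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("corrupt footer length")
    return start, end


# Synthetic stand-in (assumption): only the framing matches the real layout,
# the payloads are placeholders and cannot be parsed as Parquet.
fake_pages = b"\x00" * 100
fake_footer = b"thrift-encoded FileMetaData would go here"
image = MAGIC + fake_pages + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

start, end = footer_span(image)
footer_bytes = image[start:end]
```

Because the footer sits at the end, a reader can fetch it with one seek and then decide which row groups and column chunks to read at all.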

  • @aratithakare8016 2 years ago

    too good video. Excellent

  • @pavanreddy3321 3 years ago

    Thanks for great explanation

  • @ravann123 2 years ago

    Very helpful, thank you 😊

  • @tasak_5542 10 months ago

    great talk

  • @dayserivera 1 year ago

    Great!

  • @higiniofuentes2551 1 year ago

    It seems the time and I/O needed to sort the data first, before it can be used, are not considered?

  • @chrisjfox8715 2 years ago

    I haven't watched this yet but for the sake of prioritizing when I do, how well does this topic apply to platforms and systems other than Spark?

  • @Azureandfabricmastery 3 years ago

    Thank you!

  • @salookie8000 10 months ago

    interesting how Parquet (columnar analytical focused) data can be optimized using dictionary-based compression and partitioning

  • @maxcoteclearning 2 years ago

    Thankyou :)

  • @rum81 3 years ago +12

    anyone who says parquet is columnar format is having just bookish knowledge

    • @immaculatesethu 3 years ago +1

      It's a mixture of horizontal and vertical partitioning and combines the best of both worlds.

    • @jeremygiaco 2 years ago

      i like the way it compresses the data into dictionaries per file. reminds me a bit of an EAV database stored as a file

  • @spacedustpi 4 years ago +1

    Thanks for posting this presentation. Could you clarify something? How does performance improve when you compress pages, only to decompress them again to read them? I'm sure I'm misunderstanding something, but I'm not sure what.

    • @rescuemay 4 years ago +4

      He mentions around @19:30 that you only see a benefit when the I/O savings outweigh the cost of decompressing.

    • @SQLwithManoj 4 years ago +1

      I/O is more expensive compared to the time taken by CPU to decompress the data, thus ColumnStore is faster compared to RowStore.

    • @rajeshgupta4466 4 years ago +1

      Snappy provides good compression with a low CPU overhead during compression/decompression. The real win in performance comes from reduced I/O cost when reading a column chunk's page. The overall cost (CPU+I/O) is generally lower for reading snappy compressed as compared to uncompressed.
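
The trade-off described in these replies can be put into rough numbers. The sketch below is only a back-of-the-envelope model: `zlib` from the standard library stands in for Snappy (a different codec with different speed and ratio), and the bandwidth and decompression-time figures are assumed placeholders, not measurements.

```python
import zlib

# A repetitive, low-cardinality column, loosely like the contents of one page.
page = ("NL,US,DE,US,NL," * 20000).encode()

# zlib is a stand-in here; Snappy is not part of the standard library.
compressed = zlib.compress(page)
ratio = len(page) / len(compressed)


def read_seconds(num_bytes, bandwidth_bytes_per_s, cpu_seconds=0.0):
    """Toy cost model: transfer time plus CPU time spent decompressing."""
    return num_bytes / bandwidth_bytes_per_s + cpu_seconds


ASSUMED_BANDWIDTH = 100e6          # bytes per second, hypothetical storage speed
ASSUMED_DECOMPRESS_SECONDS = 0.001  # hypothetical CPU cost for this page

uncompressed_cost = read_seconds(len(page), ASSUMED_BANDWIDTH)
compressed_cost = read_seconds(
    len(compressed), ASSUMED_BANDWIDTH, ASSUMED_DECOMPRESS_SECONDS
)
```

Under these assumptions the compressed read is cheaper, because far fewer bytes cross the I/O boundary. With poorly compressible data or very fast storage the CPU term can dominate instead, which is the caveat mentioned around 19:30 in the talk.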

    • @spacedustpi 4 years ago +10

      @MGLondon How old are you? I am American (and not from China), and stick to common meats. This is an example of hate/harassment. Are you a high school kid?

    • @spacedustpi 4 years ago +1

      @harsh savla Good for you. Ecoli enters the body on vegetables.

  • @jeremygiaco 3 years ago +1

    How is storing json/xml (not parquet) more efficient than csv? You literally store the "column names" in each "row" in xml/json (at least when stored in a text file) . Also, there is definitely the notion of a "record" in csv.

    • @happywednesday6741 2 years ago

      Example 1. If you want to add new properties to records over time, you only need to add them to the new records (no need to backfill blanks for legacy records, for example). So think scale, and change at scale.

    • @happywednesday6741 2 years ago

      Example 2. You can leverage hash/dictionary data structures in programming; these find records with much better scaling (look up hash functions and big O). Again, think scaling of data access: hashing versus, at best, search trees.

    • @happywednesday6741 2 years ago

      Example 3. You can more easily partition records via collections paradigm. Again storage and access at scale.

    • @happywednesday6741 2 years ago

      Example 4. You will more easily access and operate xml / json - like data from applications via APIs. Systems and interoperability at scale.

    • @jeremygiaco 2 years ago +1

      @happywednesday6741 I asked how it was more efficient to store it. If I have 500 million "entries" in a text file, I'm definitely storing it in a delimited format or Parquet to take advantage of said dictionaries, not JSON/XML. You can parse either into objects directly from the file, or bulk insert into a DB table. The JSON/XML formats would be 10x slower to parse/read in based on sheer disk/network I/O alone, if we're talking about efficiency in processing it. No one is going to load CSV into memory and start scanning row by row for data; it's going to get converted into objects or a DB anyway. My concern is when people store JSON-formatted files to disk to be read into objects later. What does that buy you?
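
The storage-size point in this thread (JSON repeats the field names in every record, CSV states them once in a header) is easy to check with the standard library. The sketch below uses made-up sample records; the actual size gap depends entirely on the field names and values, so no particular ratio should be read into it.

```python
import csv
import io
import json

# Hypothetical sample data: 1000 small records with three fields each.
rows = [{"id": i, "country": "NL", "amount": i * 10} for i in range(1000)]

# CSV: the column names appear once, in the header row.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "country", "amount"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buf.getvalue()
csv_bytes = len(csv_text.encode())

# JSON Lines: every record carries its own copy of the field names.
json_text = "\n".join(json.dumps(r) for r in rows)
json_bytes = len(json_text.encode())

# Both encodings round-trip the same records (CSV yields strings, so convert).
from_csv = [
    {"id": int(r["id"]), "country": r["country"], "amount": int(r["amount"])}
    for r in csv.DictReader(io.StringIO(csv_text))
]
from_json = [json.loads(line) for line in json_text.splitlines()]
```

The flexibility argued for in the replies (records with differing fields) is exactly what the repeated keys buy; a columnar format such as Parquet stores the schema once in the footer instead.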

  • @thevijayraj34 2 years ago

    Bucketing explanation was not great. Rest was fantabulous.

  • @chriskeo392 3 years ago

    Or whatever.... 😂

  • @lax976 8 months ago

    Worst lecture ever