[Notes Part-2]
Netflix tried to address the problems with the Hive table format by creating a new table format.
The goals Netflix wanted to achieve with this format were:
- Table correctness/consistency
- Faster query planning and less expensive execution - eliminate file listing and enable better file pruning.
- Allow users to not worry about the physical layout of the data. Users should be able to write queries naturally and still get the benefit of partitioning.
- Table evolution - allow adding and removing table columns and changing the partition scheme.
- Accomplish all of these at scale.
With this, we land on the concept of Iceberg.
"With Iceberg, a table is a canonical list of files."
Instead of saying a table is all the files in a directory, the files in an Iceberg table could theoretically be anywhere. Files can live in different folders and do not have to sit in a nicely organized directory structure.
This is because Iceberg maintains the list of files in its metadata, and the engine leverages that metadata. This allows the engine to get to the files faster.
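As a minimal sketch of what "a table is a canonical list of files" looks like in practice (the catalog and table names below are made up, and this assumes the PyIceberg library plus an already-configured catalog), an engine or script can get the file list straight from metadata:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical names: assumes a catalog named "default" is configured
# (e.g. via ~/.pyiceberg.yaml) and already contains this table.
catalog = load_catalog("default")
table = catalog.load_table("analytics.page_views")

# Planning a scan touches only Iceberg metadata -- no directory listing happens.
for task in table.scan(row_filter="event_date >= '2024-01-01'").plan_files():
    print(task.file.file_path)  # the canonical list of data files for the current snapshot
```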
[Notes Part-1]
What is a table format?
- A way to organize a dataset’s files to present them as a single table. Our data lake may contain millions of data files representing multiple datasets, and analytical tools need to know which files belong to which dataset. This is the problem that table formats solve.
- It’s a way to answer the question “what data is in this table”?
Hive Table Format:
- Old table format.
- Abstracted the complexity of MapReduce. Internally, it converted SQL statements into MapReduce jobs.
- Created at Facebook
- Used “directories” on storage (HDFS/object storage) to indicate which files are part of which table. This format takes a simple approach: a particular directory is a table, all of the contents of that directory are part of the table, and any subfolders are treated as partitions of the table. (A small sketch of this layout, and of the directory listing it forces on engines, follows the cons list below.)
- Pros:
o Works with almost every engine since it became the de-facto standard and stayed so for a long time.
o Partitioning allows more efficient query access patterns than full table scans.
o File format agnostic.
o Atomically update a whole partition.
o Single, central answer to “what data is in this table” for the whole ecosystem.
- Cons:
o Small updates are very inefficient. Updating a few records wasn’t easy: a partition was the smallest unit of a transaction, so to update a few records the entire partition (or the entire table) had to be rewritten and swapped.
o No way to atomically update multiple partitions. There is a risk that after one partition is changed and before we start work on the next partition, someone could query the table and read inconsistent data.
o Multiple jobs modifying the same dataset cannot do so safely, i.e. safe concurrent writes are not possible.
o All of the directory listing needed for large tables takes a long time. Engines have to list all of the contents of the directories before returning results so that they know which files make up the table/partition.
o Filtering out files (pruning) required the engine to read those files. Opening and closing every file just to check whether it contains the data of interest is time consuming.
o Users of the dataset have to know the physical layout of the table. Without knowing it, we might write inefficient queries that lead to costly full table scans.
o Hive table statistics are often stale, and data engineers have to keep running ANALYZE queries to keep the stats up to date. Collecting stats needs extra compute, and the stats are only as fresh as the most recent ANALYZE run.
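To make the listing problem concrete, here is a rough sketch (all paths invented) of the directory-as-table layout and the recursive listing a Hive-style engine has to do just to discover a table's files:

```python
import os

# Hypothetical Hive-style layout: the directory is the table, subfolders are partitions, e.g.
#   /warehouse/sales/order_date=2024-01-01/part-00000.parquet
#   /warehouse/sales/order_date=2024-01-02/part-00000.parquet
table_root = "/warehouse/sales"

data_files = []
for dirpath, _dirnames, filenames in os.walk(table_root):  # full recursive directory listing
    for name in filenames:
        data_files.append(os.path.join(dirpath, name))

# On a table with thousands of partitions and millions of files, this listing alone
# (before a single row is read) is a big part of why query planning is slow.
print(f"{len(data_files)} files discovered by listing directories")
```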
[Notes Part-4]
Iceberg Design Benefits:
- Efficiently make smaller updates. Updates no longer happen at the directory level; changes are made at the file level.
- Snapshot isolation for transactions. Every change to the table creates a new snapshot, and all reads go against the latest complete snapshot. Reads and writes do not interfere with each other, all writes are atomic, and a read never sees a partial snapshot.
- Faster planning and execution. Metadata is maintained at the individual file and partition level, allowing better file pruning while running a query. Column stats are maintained in the manifest files and are used to eliminate files.
- Reliable metrics for CBOs (vs. Hive). Column stats and metrics are calculated on write instead of via frequent, expensive stats jobs.
- Abstract the physical, expose a logical view. Users don’t have to be familiar with the underlying physical structure of the table. This enables features like hidden partitioning, compaction of small files, a table layout that can change over time, and the ability to experiment with the table layout without breaking the table (see the hidden-partitioning sketch after these notes).
- Rich schema evolution support.
- All engines see changes immediately.
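As a hedged sketch of what hidden partitioning looks like to a user (the catalog name, table name, and warehouse path are illustrative only, and this assumes the Iceberg Spark runtime jar is on the classpath):

```python
from pyspark.sql import SparkSession

# Illustrative config only: a local Hadoop-type Iceberg catalog named "demo".
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")

# Partition by a transform of a column; the physical layout stays hidden from users.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id   BIGINT,
        event_time TIMESTAMP,
        payload    STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Readers filter naturally on the column; Iceberg prunes partitions for them.
spark.sql("""
    SELECT count(*)
    FROM demo.db.events
    WHERE event_time >= TIMESTAMP '2024-01-01 00:00:00'
""").show()
```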
[Notes Part-3]
What Iceberg is:
1) Table Format Specification:
It’s a standard for how we organize the data that makes up a table.
Any engine reading or writing data in an Iceberg table should implement this open specification.
2) A set of APIs and libraries for interacting with that specification (Java and Python APIs).
These libraries are leveraged by other engines and tools, allowing them to interact with Iceberg tables.
What Iceberg is not:
1) Not a storage engine. (Storage options we could use: HDFS, object stores like S3.)
2) Not an execution engine. (Engine options we could use: Spark, Presto, Flink, etc.)
3) Not a service. We do not have to run a server of any sort. It’s just a specification for storing and reading data and metadata files.
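To make the "no service, just a spec plus libraries" point concrete, here is a hedged sketch using PyIceberg's StaticTable to read a table directly from its metadata file, with nothing running server-side (the path below is invented):

```python
from pyiceberg.table import StaticTable

# Hypothetical metadata file location; any library or engine that implements the
# spec can read the table straight from these files in storage.
table = StaticTable.from_metadata(
    "/tmp/iceberg-warehouse/db/events/metadata/v1.metadata.json"
)

print(table.schema())            # current schema, read from metadata.json
print(table.current_snapshot())  # pointer to the latest snapshot
```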
Your voice is like an angel to fall asleep😇
Thank you for this Series! It’s great!
amazing content!! great work
Great start to Iceberg series. Thanks a lot !
Lots more content here as well: www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/
18:31 Doesn't the Hive Metastore already allow referencing the data files that make up a table and collecting column statistics?
For the next video, I highly recommend uncovering the problem within a minute. I'm at 10:00 and still wondering why I should integrate this into my stack...
you speak so nice!
Thanks
Great video... can you please advise where the metadata for an Iceberg table is stored?
The location of the metadata is covered in later videos in this series, but essentially there are three types of files that get created:
- metadata.json: table-level definitions (table id, list of snapshots, current/previous schema and partitioning scheme, etc.)
- snapshot/manifest list: a list of the manifest files belonging to one table snapshot, with partition stats used for partition pruning
- manifests: files listing a batch of the files holding the table's data, with metadata on each column that can be used for min/max pruning.
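If it helps to see those layers, here is a hedged sketch of inspecting them through Spark's Iceberg metadata tables (reusing the made-up demo.db.events table and Iceberg-enabled `spark` session from the earlier sketch):

```python
# Assumes `spark` is an Iceberg-enabled SparkSession and demo.db.events exists;
# all names are illustrative only.
spark.sql("SELECT snapshot_id, manifest_list FROM demo.db.events.snapshots").show()  # snapshots and their manifest lists
spark.sql("SELECT path, partition_spec_id FROM demo.db.events.manifests").show()     # manifest files for the current snapshot
spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()         # data files with per-file stats
```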
@@AlexMercedCoder Got it. Thank you so much, you are doing us a favor - this is an awesome learning channel.
Hi, thanks, this is really great information to get started with Apache Iceberg. But I have a question: when modern databases already prune and scan data with such advanced technology, why would we need to store the data as files instead of directly loading it into a database table?
When you start talking about 10TB+ datasets you run into issues with whether a database can hold the dataset and still perform. Also, different purposes need different tools, so you need your data in a form that can be used by different teams with different tools.
Also, with data lakehouse tables there doesn’t have to be any running database server when no one is querying the dataset, since they are just files in storage, while traditional database tables need a persistently running environment.
@@Dremio wow! Now I have got full clarity. Thank you so much for your response.
@@Dremio cost saving. Thanks for the tip 😀.
Could you please share the slides? Thanks in advance.
Send an email to alex.merced@dremio.com
thanks very good.
Nice video but a little mistake :)
Java was used to develop MapReduce jobs, not JavaScript.
You probably know that, you just got them mixed up.
100% agreed, I was trying to say a “Java” script, not JavaScript, but said it too fast. I always fumble trying to express writing a single Java “thing”, the way you'd say a “Python script” or “Ruby script”.
- Alex
@@Dremio nice save!!
haha
Alright. It makes sense. Anyway I knew that you knew.
Please tell me, how is this different from or better than the Databricks Lakehouse?
I would point you to watching our State of the Data Lakehouse presentation on this channel.
Very poor sound quality!