When a DE at my work showed me how to rewrite some of my queries with indexed temp tables to join all the model features into a final table I was so happy
Love the video! I am a huge proponent of CTEs. Nothing worst when you open a table’s source code and see on massive block of join to multiple tables and sub queries all with their own unique filtering logic at the bottom of the joins or nested in the sub query. I feel like I can understand the logic and read the code much quicker with the CTEs at the top.
The CROSS APPLY or LATERAL JOIN clause is also nice if you’re doing multiple steps of transformations on the same relation and don’t want to break out a separate CTE for each layer.
You mentioned troubleshooting as a potential issue with temp tables at around 13:00, but I would say that if you want to troubleshoot, you can insert statements in the middle of your script that clone the tempt tables, so you can take a look at the intermediary steps. Something like CREATE TEMP TABLE staging_data AS {some sql logic}; -- Dump data into a sandbox table for debugging, delete snippet once you're done. CREATE TABLE CLONE staging_data; MERGE INTO target_schema.table_name {some sql logic}
this is so helpful Zach, we get table apis and do a bunch of dataset out of it to build up our sass applications. This sheds light since my current role as a DE in this company is really heavy on SQL but not that part of the etl, they kind of separated that and is more on the backend de. Im trying to bridge that since this is also my first role as de
Not sure about other DB systems but i do know that Oracle does offer Materilze hints for CTE that basically store the query results in a GTT (Global Temp Table) like structure and then drops it once the query finishes. I have used this strategy a lot at my work. Thanks for the explanation though Zach.
Important thing to note here is that at 6:30, if you are creating a permanent view in your DB, you need to do it only once as its a DDL statement adding a new object to your schema. It does not have to be repeated everytime else your code will start throwing error. Modern DBs may let you create temporary views whose scope is only during the execution and is not global. You may choose to use that.
I’m a BI developer who occasionally has to dabble in some heavy SQL scripting and your content is hitting some nails on the head that I’ve always been curious about. Humorously, I arrived at the same idea that CTEs are (most often) the best use case scenario. We have another developer that uses subqueries heavily and the performance is really bad. Great content. You def earned a sub!
If only this was posted earlier id probably pass my technical interview on materialized transformation. 😢 thank you for the great explanation and use cases.
i used CTEs , views , materialized views, temp tables but never found any detail in depth explanations how they works internally and performance comparison. Now everything is clear to me Thank you for the contents you shared. Another thing, as you described here 15:14 does this happens only in the relational DBMS ? or in both relational and data warehouses like snowflake BigQuery ?
Great question! I haven’t tested that exact case. My guess is it’s still blocking based on how refreshes work but that might be something that varies with data warehouses
Great video! What could have been briefly included, in my opinion, is the comparison between materialized views vs (normal) tables, as they seem pretty similar. When to use which one. Keep rocking the great work👍
@@EcZachly_ I'd be interested to know your thoughts on the different architecture patterns like the lakehouse, vault, mesh, etc. Looking forward to more from you! :D
Great content, as ever! I'd like to see a video on Data Vault (with hashes) or at least if you can recommend some resources for it... a practical example would be best. Keep up the good work
Hey Zach, thank for such informative videos. Next can you please explain something about dbt usecases. Points like how it tests sql, orchestration, pipline. Other alternativer if any.
Hey Zach, been following you almost religiously now, cheers for the clear conceptual content. What I am not clear on is why the computation for the CTE 9:55 teammates and enemies took that long. Shouldn't the creation of a CTE and the creation of a materialized table take the same amount of time? Does it have anything to do with the complex JOIN condition?
If an MV is not too complex, they can be incrementally (delta) refreshed on some RDBMSes. MVs are great for rollup ( counts, min,max, histograms). Sadly pivots usually are "complex" and require a full rebuild.
Wow I never heard about Jane Table API before. Something new to look into. Does it work together with any other database platform like MySQL or PostgreSQL? I have only worked with MySQL so far and never used combinations of databases in projects so I am curious to know
The spark "registerAsTempTable" goes into what type of those? It works like a temporary table, saving the data just for that session? I guess I've been using it wrong =p
I was reviewing some SQL code today, and I came across an example of nested SQL query aka an infinite series of subqueries, and I was asking myself: WHY?
Great explanation Zach! This is one of the best comparisons I've heard between all these - with all their pros and cons! 🌟🌟🌟 On this topic, I would be really thankful for an advice about performance between temporary tables and using staging tables, which are truncated and reloaded on every execution of the ETL. My use case is a complex transformation in which we have multiple stages of preparing the dataset moving it from one temporary table to another after applying some logic, joining a table or deleting some rows. It all happens within one script/session, so we don't need to keep the intermediary results, only the final dataset. We use Redshift and the performance is definitely not great. Do you think we can get any performance gains if we turn these temp tables into actual staging tables and we truncate and reload them? Also, do you know if defining DISTRIBUTION/SORT key on a temporary table in Redshift is as efficient as dist/sort key on an actual table?
Personally I use staging tables over temp tables for debugging purposes. When your end/user-facing table is returning incorrect or duplicate data, you can always trace back to an upstream staging table layer by layer to identify the root cause of the issue. Also, it's a lot easier to fine-tune the performance with staging table because you can test and validate different optimization techniques on them.
I have a question Zach. In snowflake, you cannot use join in materialized view. So, does materialized views vary from software to software? Also, there are many restrictions on using materialized view on snowflake.
Great video thank you Zach! Where do stored procedures come in with this conversation? Are sprocs materialized or non-materialized? Or are sprocs used for a purpose besides storing transformations?
Delivery tip: if something is awesome, you don't have to say so. It will be apparent. Having said that, seeing the 1 to 1 translation between different concepts WAS awesome.
we have a really long sql statement that uses a sequence of CTE. Each CTE is then used in the next CTE through a join condition. Do you think we would benefit from going to temp tables instead?
hi Zach, great video. What would be the difference between creating a materialized view and creating a table of the results? Would it be the same thing performance wise?
If the materialized view has the same indices as the table, they’ll be the same. The main difference is you can automatically refresh the materialized view. But you can incrementally build a table
This is really great. Can you make videos about SQL aggregate functions, triggers, and transactions? especially aggregate functions and transactions if you can
I don't really get the benefits of using a materialized view instead of a regular table other that it's a little easier to update (with regular table you would need tp update using keys and check for changed rows, probably using hash)
Hi Zach, thanks for this vid, it was such a great breakdown! I had 1 question if you have time:
At 17:40 you mention how temp tables have incremental gains, I'm struggling to understand how that is, seeing as how similar to materialized views it appears. Is this because if you're adding new data 'A' to existing data 'B', temp tables can just focus on adding 'A', whereas materialized views require you to handle 'B' as well due to the refresh? This might be me not understanding the concept of incremental gains, thanks!
Materialized views are oftentimes like the fine print or the auction speak at the end of a commercial..... SO many limitations (depending on the vendor or edition or featureset). It's like dating the perfect 10...... as long as you pay the 5k dinner bill and have the person home by 8pm because that person has a data engineering conference to speak at in the morning.... 😂😂😂😂
When a DE at my work showed me how to rewrite some of my queries with indexed temp tables to join all the model features into a final table I was so happy
fist time learning that materialized can be run concurrently with unique index. thanks for great content.
Thanks! Anything else you’d want to learn from me?
Yep, it's really helpful function.
Great video! How do you go about learning complex data engineer concepts? What's your learning algorithm? Thanks
If you want them to be current then yep
Subscribed on TikTok when I came across your videos a couple of months ago. Now UA-cam has recommended you as well. Seems like a good sign for you!
Huge thanks for this informative video! Picking up some handy stuff even after +10 years of data engineering or whatever we call it now 🙂
Man you're on fire, producing content at the pace I can't consume it :D
Nice job!
Love the video! I am a huge proponent of CTEs. Nothing worst when you open a table’s source code and see on massive block of join to multiple tables and sub queries all with their own unique filtering logic at the bottom of the joins or nested in the sub query. I feel like I can understand the logic and read the code much quicker with the CTEs at the top.
The CROSS APPLY or LATERAL JOIN clause is also nice if you’re doing multiple steps of transformations on the same relation and don’t want to break out a separate CTE for each layer.
You mentioned troubleshooting as a potential issue with temp tables at around 13:00, but I would say that if you want to troubleshoot, you can insert statements in the middle of your script that clone the tempt tables, so you can take a look at the intermediary steps.
Something like
CREATE TEMP TABLE staging_data AS
{some sql logic};
-- Dump data into a sandbox table for debugging, delete snippet once you're done.
CREATE TABLE
CLONE staging_data;
MERGE INTO target_schema.table_name
{some sql logic}
Zach…sir…this video is effin awesome!!! Like the speed of explanation and the examples!!! 👊
this is so helpful Zach, we get table apis and do a bunch of dataset out of it to build up our sass applications. This sheds light since my current role as a DE in this company is really heavy on SQL but not that part of the etl, they kind of separated that and is more on the backend de. Im trying to bridge that since this is also my first role as de
Not sure about other DB systems but i do know that Oracle does offer Materilze hints for CTE that basically store the query results in a GTT (Global Temp Table) like structure and then drops it once the query finishes. I have used this strategy a lot at my work. Thanks for the explanation though Zach.
Important thing to note here is that at 6:30, if you are creating a permanent view in your DB, you need to do it only once as its a DDL statement adding a new object to your schema. It does not have to be repeated everytime else your code will start throwing error.
Modern DBs may let you create temporary views whose scope is only during the execution and is not global. You may choose to use that.
That's a nice suggestion to use materialized views or Indexed views when the sub query is taking longer time than the actual query.
Awesome video! Would love to see you cover Change Data Capture - with all your breadth and experience, i'm sure it would result in an amazing video.
Thanks for the suggestion!
I’m a BI developer who occasionally has to dabble in some heavy SQL scripting and your content is hitting some nails on the head that I’ve always been curious about. Humorously, I arrived at the same idea that CTEs are (most often) the best use case scenario. We have another developer that uses subqueries heavily and the performance is really bad.
Great content. You def earned a sub!
If only this was posted earlier id probably pass my technical interview on materialized transformation. 😢 thank you for the great explanation and use cases.
i used CTEs , views , materialized views, temp tables but never found any detail in depth explanations how they works internally and performance comparison. Now everything is clear to me Thank you for the contents you shared. Another thing, as you described here 15:14 does this happens only in the relational DBMS ? or in both relational and data warehouses like snowflake BigQuery ?
Great question! I haven’t tested that exact case. My guess is it’s still blocking based on how refreshes work but that might be something that varies with data warehouses
Great video! What could have been briefly included, in my opinion, is the comparison between materialized views vs (normal) tables, as they seem pretty similar. When to use which one.
Keep rocking the great work👍
Mv is updated simply by refresh and you can keep the log in mv_stats. Tables can be updated theough complex etl
9:29 can you explain why this is slow in more detail? I assumed that everything would run once and get reused in-memory. What exactly re-ran and why?
Great work, thank you so much Zach! Very informative and concise. Keep these coming!
I appreciate your support! Any topics in particular you'd want me to cover?
@@EcZachly_ I'd be interested to know your thoughts on the different architecture patterns like the lakehouse, vault, mesh, etc. Looking forward to more from you! :D
Great content, as ever! I'd like to see a video on Data Vault (with hashes) or at least if you can recommend some resources for it... a practical example would be best. Keep up the good work
Hey Zach, thank for such informative videos.
Next can you please explain something about dbt usecases. Points like how it tests sql, orchestration, pipline. Other alternativer if any.
Great idea!
Hey Zach, been following you almost religiously now, cheers for the clear conceptual content. What I am not clear on is why the computation for the CTE 9:55 teammates and enemies took that long. Shouldn't the creation of a CTE and the creation of a materialized table take the same amount of time? Does it have anything to do with the complex JOIN condition?
No because the CTE also had an aggregation step afterward
@@EcZachly_ so temp table + aggregation was far lower time than CTE with aggregation. would it always be the case though?
@@muhammadraza3290 Temp Tables can be useful for improving performance though. It's very difficult to say something will always be the case
This is new to me. Great comparison.
If an MV is not too complex, they can be incrementally (delta) refreshed on some RDBMSes. MVs are great for rollup ( counts, min,max, histograms). Sadly pivots usually are "complex" and require a full rebuild.
Wow I never heard about Jane Table API before. Something new to look into. Does it work together with any other database platform like MySQL or PostgreSQL? I have only worked with MySQL so far and never used combinations of databases in projects so I am curious to know
The spark "registerAsTempTable" goes into what type of those?
It works like a temporary table, saving the data just for that session?
I guess I've been using it wrong =p
I was reviewing some SQL code today, and I came across an example of nested SQL query aka an infinite series of subqueries, and I was asking myself: WHY?
So u mean cte are better tha subqueries
Great explanation Zach! This is one of the best comparisons I've heard between all these - with all their pros and cons! 🌟🌟🌟
On this topic, I would be really thankful for an advice about performance between temporary tables and using staging tables, which are truncated and reloaded on every execution of the ETL.
My use case is a complex transformation in which we have multiple stages of preparing the dataset moving it from one temporary table to another after applying some logic, joining a table or deleting some rows. It all happens within one script/session, so we don't need to keep the intermediary results, only the final dataset. We use Redshift and the performance is definitely not great.
Do you think we can get any performance gains if we turn these temp tables into actual staging tables and we truncate and reload them? Also, do you know if defining DISTRIBUTION/SORT key on a temporary table in Redshift is as efficient as dist/sort key on an actual table?
Personally I use staging tables over temp tables for debugging purposes. When your end/user-facing table is returning incorrect or duplicate data, you can always trace back to an upstream staging table layer by layer to identify the root cause of the issue. Also, it's a lot easier to fine-tune the performance with staging table because you can test and validate different optimization techniques on them.
I have a question Zach. In snowflake, you cannot use join in materialized view. So, does materialized views vary from software to software? Also, there are many restrictions on using materialized view on snowflake.
Please make more videos like this. Can you just do a series on query optimization
Great video thank you Zach! Where do stored procedures come in with this conversation? Are sprocs materialized or non-materialized? Or are sprocs used for a purpose besides storing transformations?
They are mostly non-materialized
Delivery tip: if something is awesome, you don't have to say so. It will be apparent. Having said that, seeing the 1 to 1 translation between different concepts WAS awesome.
we have a really long sql statement that uses a sequence of CTE. Each CTE is then used in the next CTE through a join condition. Do you think we would benefit from going to temp tables instead?
Probably. Try it out!
Timeseries databases like TimescaleDB has continuous aggregates on materialiazed views , which is incremental.
That’s rad!
hi Zach, great video. What would be the difference between creating a materialized view and creating a table of the results? Would it be the same thing performance wise?
If the materialized view has the same indices as the table, they’ll be the same.
The main difference is you can automatically refresh the materialized view. But you can incrementally build a table
Can u create a video on airflow best practices with sql / python?
This is really great. Can you make videos about SQL aggregate functions, triggers, and transactions? especially aggregate functions and transactions if you can
Thanks for the suggestion!
thanks for great content.
Glad you enjoy it!
So good. Thanks!
I don't really get the benefits of using a materialized view instead of a regular table other that it's a little easier to update (with regular table you would need tp update using keys and check for changed rows, probably using hash)
If the scripy is not complex, i prefer mv
Hi Zach, thanks for this vid, it was such a great breakdown! I had 1 question if you have time:
At 17:40 you mention how temp tables have incremental gains, I'm struggling to understand how that is, seeing as how similar to materialized views it appears.
Is this because if you're adding new data 'A' to existing data 'B', temp tables can just focus on adding 'A', whereas materialized views require you to handle 'B' as well due to the refresh?
This might be me not understanding the concept of incremental gains, thanks!
Your understanding is right!
thank you sir, you rock !!!
Materialized views are oftentimes like the fine print or the auction speak at the end of a commercial..... SO many limitations (depending on the vendor or edition or featureset).
It's like dating the perfect 10...... as long as you pay the 5k dinner bill and have the person home by 8pm because that person has a data engineering conference to speak at in the morning.... 😂😂😂😂
Intresting
Thanks!
So many collateral contexts to any of these...
Good