How do you know the WHERE clause is performed before the join (5:17)? Don't you have to use a subquery on the left-joined (t2) table? (Also, would this subquery be less performant?)
You should be able to tell from the execution details :) In the first stage of the query, the workers will read from distributed storage and filter the data before shuffling and joining (which usually happens in the next stage). You can see the WHERE clause and the JOIN applied in the details of the stage to understand the order. You don't need the where clause in a subquery to filter the data before joining, but subqueries can be helpful if you want to pre-aggregate the data before joining to reduce the amount of data joined.
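A minimal sketch of the two forms described above, using hypothetical tables t1 and t2 (the column names are assumptions for illustration):

```sql
-- The optimizer pushes the WHERE filter down to the first stage,
-- so workers filter t2 while reading from storage, before the shuffle/join:
SELECT t1.id, t2.value
FROM t1
JOIN t2 ON t1.id = t2.id
WHERE t2.region = 'US';

-- A subquery is worth it when you pre-aggregate, shrinking t2
-- before the join so less data is shuffled:
SELECT t1.id, agg.total
FROM t1
JOIN (
  SELECT id, SUM(value) AS total
  FROM t2
  WHERE region = 'US'
  GROUP BY id
) AS agg ON t1.id = agg.id;
```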
@@leighajarett221 so I don't need a WHERE clause in a subquery because the WHERE clause is always applied at the distributed level? Like every time? Or should I add a subquery to be sure? (And if I add that subquery to be sure, will I get worse performance?)
Nice one! At how much data processing do you recommend to start working on optimizing the queries? As in when will the effect really start to show an impact? 1gb of processing, 10, 1000 or 1tb?
That's a great question! I think it depends on your pricing model. For example, if you're paying on-demand for each query, you may want to start optimizing queries whose scanned GB translates to a meaningful cost savings. This also depends on how often the query is run - for example, if a query scanning 1TB of data is run only once, that's $5, but if you run it every day, that adds up!

I would try to pinpoint queries that run on a regular basis and consume the most resources, but set some threshold based on the total estimated ROI. For example: (the number of times the query will be run each year) x (the total cost of the query) > $1000 (or whatever is significant to your business).

Check out the pricing calculator if you're estimating query prices: cloud.google.com/products/calculator/
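That back-of-envelope math can be sketched as a quick query (the $5/TB on-demand rate is an assumption - check current pricing):

```sql
-- Rough ROI check for a query scanning ~1 TB, run once a day:
SELECT
  1 * 5       AS cost_per_run_usd,  -- 1 TB scanned x $5/TB (assumed rate)
  1 * 5 * 365 AS annual_cost_usd    -- daily runs for a year
-- annual_cost_usd = 1825 > 1000, so this query is worth optimizing.
```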
This is simply the most important video for users of BigQuery to watch 🔥
Perfect and easy examples to describe the impact of optimization.
amazing video for our new joiners :))
Great video! I thought it was going to be light, but it's very deep in knowledge.
Nice way of presenting... good stuff.
5:50
Does the common practice of putting WHERE 1=1 as the first expression have any effect on this?
Great video thanks Leigha
thanks for the great explanation. I like "stay curious"! :)
Really curious why such simple cases can't be handled automatically by the query optimizer (something other databases already do).
At 5:25, why does clustering based on common join keys improve the efficiency more than partitioning?
Clustering on join keys leads to the fewest shuffles needed to perform the join.
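A sketch of what that looks like in DDL (the dataset, table, and column names are hypothetical):

```sql
-- Clustering the table on the join key (customer_id) co-locates matching
-- rows in the same blocks, so the join needs less shuffling.
CREATE TABLE mydataset.orders_clustered
PARTITION BY DATE(order_date)
CLUSTER BY customer_id AS
SELECT * FROM mydataset.orders;
```

Partitioning still helps with pruning by date, but it's the CLUSTER BY on the join key that reduces shuffle work during the join.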
Can this concept also be applied to other SQL databases?
Some of these concepts could apply to other SQL database management systems (DBMSs). Also, some DBMSs are smart enough to apply some of these optimizations when generating the execution plan, without the user having to change their SQL query.
Thank you
You're welcome
A little surprised by how little the default optimizer can do for us...
thanks!
Google is liked by many businesses and people all over the world. Thank you Google.
Recommended is broken
You're awesome, yaar....