I've been tuning into both @SQLBI and Guy in a Cube's videos to enhance my Power BI Best Practices knowledge - what an epic crossover! 👏 Kudos to both of you for your valuable contributions! 👏
I've taken courses and consumed SQLbi and Guy in a Cube videos which are really good, but as a group you guys were great! We're starting strategy meetings for data governance, administration and modeling for a large enterprise system and appreciate your insights.
Great talk gentlemen. What kinda put the bow on the whole thing for me was the impromptu demo of how "follow up" meetings are formed at the very end of the segment. That really hit home, so thank you again!
Thanks! I hope you have more content on very large dataset issues & solutions! My take-away is "don't waste" and "Import whenever possible". In our case, the modeling team is using Direct Query to Views that sometimes have 50+ joins, that are non-Scalar (use ranges) that themselves are joined to others in the model. I'm just the front-end guy, but I've offered them the following suggestions: 1. Make all keys Integer. 2. Use in-memory Materialled Views for the most complex queries. 3. On SSAS, bring the most common tables into memory. 4. Place values most used as filters and parameters out of the Snowflakes (like Employee Status Code, Unit Code) and into their parent dimension tables, because they're always referenced and relationships are costly. However, the restrictions on Live Connect to an SSAS model (using the Analysis Services connector) are many. I struggle daily with how I can get by without being able to create columns or tables, how to join other datasets or models when needed, or in case I need to bypass their model and go straight to the OLTP, then how to preserve Row Level Security. A series would be A W E S O M E!!!
Yes, 4 billion. ;-) 14:18 . I do want to add that to Patrick's point it is a balance of what's in your model and what you need. 4 billion worked very well regardless, but after internal discussions we decided to leave the less-frequently-used grains in direct query and free up some P2 space for other models. So we are back down to slim 1.5B on that one model which is enabling us to deploy more models to.
Patrick! even if the load happens once a day stakeholders don't want to introduce any additional lag. The problem is yes it runs once a day but we don't know at what time exactly so setting up a refresh schedule becomes difficult. Also, partitions management in powerbi is still behind what we used to have in SSAS and incremental refresh is only available on premium.
Hi Marco, Alberto, Adam and Patrick I am fan and follower for four of you, thanks for the informative discussion, we are telecom operator and we have tremendous amount of data from all our nodes, we may need a consultation from your side to optimize our data models. Our backend is SSAS tabular and our front end is Tableau.
Thank you for sharing real-time solutions for real world problems. Adam was referring to dictionary on a column. Can you please elaborate on this. Thanks
Thanks for the great content. You guys talked about direct query being last resort in scenarios like latest data required, or client not willing to store data to Power BI. What I have seen with a client is having database side RLS as one of the reason of not using the import mode. Any thoughts around that or any other content I can go though related to this? I'm thinking now if there is any workaround to that.
What was that term brought up a couple times, "filter dags" (e.g. around the 49:00 mark)? Have not heard of it before and I don't think I even understood the word correctly. Many thanks for the great video folks, gave me a lot to think about when developing my next reports :)
It's "filtered aggs" meaning "filtered aggregations" - it's a specific optimization technique for SQL Server (materialized view, filtered indexes, ...) that could improve performance in a DirectQuery scenario.
It will be very interesting to have another discussion after the "new"Datamart feature will have been implemented in production with a few enterprise customers. It seems like the DataMart could be an amazing tool for handling multiple large data sources.
With the introduction of Direct Lake connection mode with data in Lakehouse, would you recommend Direct Lake or Import? Which one is more efficient and faster? Assume that I am using Fabric Lakehouse as a source.
We are talking about using Direct Query for large tables. However we need to remember that No more than 1 million rows are supported in the query. The visual is going to throw an error if it has to query 1 million rows. Need more settings to control how DQ behaves with source.
The limitation is related to the temporary tables moved to the formula engine. You can have 10 billion rows in DirectQuery, but you have to be aware of the limitation depending on the queries. Very generally speaking, dimension size could be an issue in a limited relationship, but there are many other elements in play. But yes, this limit is definitely something we'd like to customize. It is customizable in Analysis Services, not in Power BI (yet).
Hi Adam Saxton, Patrick LeBlanc, Alberto Ferrari, and Marco Russo I have been new to Power Bi where I'm trying to fetch the on- premise data into Azure cloud and then to visualized through Power Bi. I'm struggling here what azure architure to develop for Azure to fetch data from POS and what Azure components is to be used to get the project going and how this Data Modeling and relationship knowledge I have got can be implemented.
I heard you talken about text search in large dimensions. At least the filter visuals I have tested couldn’t fold a SQL query supporting SQL full text index search. Is this your experience to?
As per the Analysis Services Documentation, The cardinality of the columns should not be greater than 2 billion rows. I think this is also applied for Power Bi. Imagine having ticket number in a fact table that you need to display in your report and have more than 2 billion unique values.
The cardinality can be bigger, the issue is mainly related to performance when the column is used in a relationship. Another scenario is DISTINCTCOUNT, where the cardinality has a big role in performance. Approximate algorithms can be very useful there, take a look at this article: dax.tips/2019/12/20/dax-approx-distinct-count/ Absolute numbers also depend on the hardware, over time the "limit" can change because of hardware improvements. In 2012 1 million was a high cardinality, today 4 million rows could have reasonable performance on dedicated hardware (I would stay below 2m rows on cloud services, though).
I think the issue is business requirements change over time and it is expected that the existing model should be able to answer the future questions. That's why is it safe to keep as many columns as available. So if you have narrow fact tables there is a risk that in future your model will not serve the purpose.
In companies with high turnover, it can be easier to start over as opposed to trying to decompose what other teams implemented. If you can't decompose it in a week, chances are you could be better off starting from scratch. The big companies are actually the worst offenders.
Oracle's OBIEE enforce Star Schemas religiously. You simply think thrice before adding another attribute to your fact table. Let alone you CANNOT create standalone fact tables without any dimensions.
I'm dealing with a view in our Data Warehouse that has 57 columns and 225 million rows of records. I also have another table with 3 billion rows of records. My refreshes drains the life energy out of me because they take so long. I know it's not a good idea to have everyhting in a single view/table. I'm here trying to find something to show the company that what we have is bad approach. Anybody out there can give me a summary and clean insight that I can pass on to our data warehouse team?
Implement an incremental refresh: you build your house every day when you go camping, but you don't bring the furniture. Your historical data is like the content of the furniture, you would not relocate every day, right?
Wooow! Greetings! I got a question for you, big guys, on how do you welcome calculated columns in a fact table. Whenever the fact table gets fat enough, say 1 GB, the star schema does not shine anymore. What works for me in terms of performance and speeding up reports is to add calculated columns to the fact table. It is efficient not only if there is any complexity in the measure. I can notice the speed jump by staying away from such a simple function as SUMX to calculate Sales Value. Adding the additional calculated column to FactTable increases the total size of the report negligibly, so it is a relatively low cost. Even replacing measure SUMX with a pair of the calculated column of sales value and then using a measure with a simple SUM function increases the speed approximately five times. If the measure contains any grain of complexity, say the IF function, then the speed increase manifold. Am I missing something here?
Details are important. If you have an IF, the calculated column is faster, but it would be better to compute the column in the data source or in Power Query, because a calculated column in a large table is not a good idea for other reasons (lower compression, higher memory consumption at refresh time, longer processing time). If you have a simple expression that can be pushed to the storage engine (like Quantity * Price using two columns of the same table), usually you don't see any performance advantage at query time from the calculated column.
To see all 4 of these people in one video is kinda like seeing the Avengers of Big Data. I've definitely learned a lot from these guys!
I've been tuning into both @SQLBI and Guy in a Cube's videos to enhance my Power BI Best Practices knowledge - what an epic crossover! 👏 Kudos to both of you for your valuable contributions! 👏
I've taken courses and consumed SQLbi and Guy in a Cube videos which are really good, but as a group you guys were great! We're starting strategy meetings for data governance, administration and modeling for a large enterprise system and appreciate your insights.
Hear, hear!
This is one of the best Power BI videos ever! Why? Because a lot of concepts are being touched and should lead the spectator to investigate further.
Loved the entire session. There's so much you are doing for the bi community. Keep up the good work, guys.
Great talk gentlemen. What kinda put the bow on the whole thing for me was the impromptu demo of how "follow up" meetings are formed at the very end of the segment. That really hit home, so thank you again!
The Big Four of Power BI in the same panel! Thank you guys!
Thanks! I hope you have more content on very large dataset issues & solutions!
My take-away is "don't waste" and "Import whenever possible".
In our case, the modeling team is using Direct Query to Views that sometimes have 50+ joins, that are non-Scalar (use ranges) that themselves are joined to others in the model.
I'm just the front-end guy, but I've offered them the following suggestions:
1. Make all keys Integer.
2. Use in-memory Materialled Views for the most complex queries.
3. On SSAS, bring the most common tables into memory.
4. Place values most used as filters and parameters out of the Snowflakes (like Employee Status Code, Unit Code) and into their parent dimension tables, because they're always referenced and relationships are costly.
However, the restrictions on Live Connect to an SSAS model (using the Analysis Services connector) are many. I struggle daily with how I can get by without being able to create columns or tables, how to join other datasets or models when needed, or in case I need to bypass their model and go straight to the OLTP, then how to preserve Row Level Security.
A series would be A W E S O M E!!!
All masters together. Great!
Thanks for this great conversation guys !
Wait a minute, so you have ALL my favorite people in 1 video? I cant handle this :)
Yes, 4 billion. ;-) 14:18 .
I do want to add that to Patrick's point it is a balance of what's in your model and what you need. 4 billion worked very well regardless, but after internal discussions we decided to leave the less-frequently-used grains in direct query and free up some P2 space for other models. So we are back down to slim 1.5B on that one model which is enabling us to deploy more models to.
OMG my favorite Power BI People in 1 video. Love it!
You are freaking intelligent!!!!!
I love your channels.
Billion Thanks for sharing such a valuable knowledge
Fantastic session, thanks!
i love the way they discussed most
great discussion. My issue with import over DQ is that you lose the centrally managed RLS applied at the source.
Thank you guys!
Amazing!!! Thanks for sharing!!
Marco Alberto expecting a video on XMLA endpoints inc refresh
Patrick! even if the load happens once a day stakeholders don't want to introduce any additional lag.
The problem is yes it runs once a day but we don't know at what time exactly so setting up a refresh schedule becomes difficult.
Also, partitions management in powerbi is still behind what we used to have in SSAS and incremental refresh is only available on premium.
Hi Marco, Alberto, Adam and Patrick I am fan and follower for four of you, thanks for the informative discussion, we are telecom operator and we have tremendous amount of data from all our nodes, we may need a consultation from your side to optimize our data models.
Our backend is SSAS tabular and our front end is Tableau.
You can contact us on www.sqlbi.com/consulting/
Thank you for sharing real-time solutions for real world problems. Adam was referring to dictionary on a column. Can you please elaborate on this. Thanks
Read this: www.sqlbi.com/articles/measuring-the-dictionary-size-of-a-column-correctly/
Excelente video, saludos.
Thanks for the great content. You guys talked about direct query being last resort in scenarios like latest data required, or client not willing to store data to Power BI. What I have seen with a client is having database side RLS as one of the reason of not using the import mode. Any thoughts around that or any other content I can go though related to this? I'm thinking now if there is any workaround to that.
What was that term brought up a couple times, "filter dags" (e.g. around the 49:00 mark)? Have not heard of it before and I don't think I even understood the word correctly.
Many thanks for the great video folks, gave me a lot to think about when developing my next reports :)
It's "filtered aggs" meaning "filtered aggregations" - it's a specific optimization technique for SQL Server (materialized view, filtered indexes, ...) that could improve performance in a DirectQuery scenario.
It will be very interesting to have another discussion after the "new"Datamart feature will have been implemented in production with a few enterprise customers.
It seems like the DataMart could be an amazing tool for handling multiple large data sources.
With the introduction of Direct Lake connection mode with data in Lakehouse, would you recommend Direct Lake or Import? Which one is more efficient and faster? Assume that I am using Fabric Lakehouse as a source.
It depends. Read: www.sqlbi.com/blog/marco/2024/04/06/direct-lake-vs-import-mode-in-power-bi/
amazing Discussion
We are talking about using Direct Query for large tables. However we need to remember that No more than 1 million rows are supported in the query. The visual is going to throw an error if it has to query 1 million rows. Need more settings to control how DQ behaves with source.
The limitation is related to the temporary tables moved to the formula engine. You can have 10 billion rows in DirectQuery, but you have to be aware of the limitation depending on the queries. Very generally speaking, dimension size could be an issue in a limited relationship, but there are many other elements in play.
But yes, this limit is definitely something we'd like to customize. It is customizable in Analysis Services, not in Power BI (yet).
Thank you very much ! Your contribution is priceless !
Chris Adamson's Mastering DWH Aggregates is a good reference for solving these issues, but also Spark is not a bad alternative in some cases.
This was a deluxe video.
Hi Adam Saxton, Patrick LeBlanc, Alberto Ferrari, and Marco Russo I have been new to Power Bi where I'm trying to fetch the on- premise data into Azure cloud and then to visualized through Power Bi.
I'm struggling here what azure architure to develop for Azure to fetch data from POS and what Azure components is to be used to get the project going and how this Data Modeling and relationship knowledge I have got can be implemented.
Do more on dax guys like this
Here in Brazil, analysts use Pentaho a lot in the processing of big data. Do you indicate a better tool?
I like the part “no cached” mode instead of “direct query”. Lol.
How to calculate max consecutive negative number in a column by DAX?
congratulation on the knowlage you shere, i need it and i will folow you...from Spain
I heard you talken about text search in large dimensions. At least the filter visuals I have tested couldn’t fold a SQL query supporting SQL full text index search. Is this your experience to?
Yes, the best we've seen is using LIKE. However, *depending on the hardware*, the same search could be faster on a relational database.
I laughed so hard at this comment of Kevin Hunt:
Marco did you take all of Alberto's books?
what about telematics data, that gets big pretty fast.
As per the Analysis Services Documentation, The cardinality of the columns should not be greater than 2 billion rows. I think this is also applied for Power Bi. Imagine having ticket number in a fact table that you need to display in your report and have more than 2 billion unique values.
The cardinality can be bigger, the issue is mainly related to performance when the column is used in a relationship. Another scenario is DISTINCTCOUNT, where the cardinality has a big role in performance. Approximate algorithms can be very useful there, take a look at this article: dax.tips/2019/12/20/dax-approx-distinct-count/
Absolute numbers also depend on the hardware, over time the "limit" can change because of hardware improvements. In 2012 1 million was a high cardinality, today 4 million rows could have reasonable performance on dedicated hardware (I would stay below 2m rows on cloud services, though).
Advice on connecting with SQL
I think the issue is business requirements change over time and it is expected that the existing model should be able to answer the future questions. That's why is it safe to keep as many columns as available. So if you have narrow fact tables there is a risk that in future your model will not serve the purpose.
❤❤❤❤
In companies with high turnover, it can be easier to start over as opposed to trying to decompose what other teams implemented. If you can't decompose it in a week, chances are you could be better off starting from scratch. The big companies are actually the worst offenders.
Summary?
Oracle's OBIEE enforce Star Schemas religiously. You simply think thrice before adding another attribute to your fact table. Let alone you CANNOT create standalone fact tables without any dimensions.
"We have a small youtube channel"
What's that on Patrick's chest😂
I'm dealing with a view in our Data Warehouse that has 57 columns and 225 million rows of records. I also have another table with 3 billion rows of records. My refreshes drains the life energy out of me because they take so long. I know it's not a good idea to have everyhting in a single view/table. I'm here trying to find something to show the company that what we have is bad approach. Anybody out there can give me a summary and clean insight that I can pass on to our data warehouse team?
Implement an incremental refresh: you build your house every day when you go camping, but you don't bring the furniture. Your historical data is like the content of the furniture, you would not relocate every day, right?
Wooow! Greetings! I got a question for you, big guys, on how do you welcome calculated columns in a fact table. Whenever the fact table gets fat enough, say 1 GB, the star schema does not shine anymore. What works for me in terms of performance and speeding up reports is to add calculated columns to the fact table. It is efficient not only if there is any complexity in the measure. I can notice the speed jump by staying away from such a simple function as SUMX to calculate Sales Value. Adding the additional calculated column to FactTable increases the total size of the report negligibly, so it is a relatively low cost. Even replacing measure SUMX with a pair of the calculated column of sales value and then using a measure with a simple SUM function increases the speed approximately five times. If the measure contains any grain of complexity, say the IF function, then the speed increase manifold. Am I missing something here?
Details are important. If you have an IF, the calculated column is faster, but it would be better to compute the column in the data source or in Power Query, because a calculated column in a large table is not a good idea for other reasons (lower compression, higher memory consumption at refresh time, longer processing time).
If you have a simple expression that can be pushed to the storage engine (like Quantity * Price using two columns of the same table), usually you don't see any performance advantage at query time from the calculated column.