Hi Sir... Perfect, great explanation... Thank you for your effort...
I have a doubt:
After the join, the salting should be reversed (the keys unsalted) and only then should the group by be applied, right?
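If it helps, a minimal sketch of that reversal step (assuming the joined data frame is called joined, the salted keys look like "x_2", and orig_id is just an illustrative column name):

import org.apache.spark.sql.functions._

// strip the "_<salt>" suffix to recover the original key after the join
val unsalted = joined
  .withColumn("orig_id", regexp_replace(col("id"), "_\\d+$", ""))

// group by the original (unsalted) key so the aggregates come out correct
val grouped = unsalted
  .groupBy("orig_id")
  .count()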
Thanks, but if we have multiple columns as the key, how do we handle it?
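One way to handle a composite key, as a minimal sketch (assuming the skewed frame is bigDf, the other side is smallDf, the key columns are k1 and k2, and 10 salt buckets; all names here are illustrative):

import org.apache.spark.sql.functions._

val saltBuckets = 10

// skewed side: add a random salt column next to the existing key columns
val bigSalted = bigDf
  .withColumn("salt", floor(rand() * saltBuckets).cast("int"))

// other side: replicate each row once per possible salt value
val smallSalted = smallDf
  .withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// join on all the original key columns plus the salt
val joined = bigSalted.join(smallSalted, Seq("k1", "k2", "salt"))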
I would have appreciated it if you had run the salting code and shown it on the Spark UI, for better clarity on what is happening internally within Spark.
Amazing video.... How can we use the salting technique in PySpark for data skew?
Amazing video.. however, I don't know Scala. Can you please give an example of how to implement the salting technique with Spark SQL queries? That'll be of great help..
Will update SQL query
@@jeevanmadhur3732 waiting for the query
@@ashwinc9867 did you get it?
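Not the author's query, but a minimal Spark SQL sketch of the same salting idea (assuming the two data frames are registered as temp views big_tbl and small_tbl, the join key is id, there are 3 salt buckets, and spark is the active SparkSession):

val joined = spark.sql("""
  SELECT b.*, s.*
  FROM (
    -- skewed side: append a random salt 0..2 to the key
    SELECT *, CONCAT(id, '_', CAST(FLOOR(RAND() * 3) AS STRING)) AS salted_id
    FROM big_tbl
  ) b
  JOIN (
    -- small side: replicate each row once per salt value
    SELECT *, CONCAT(id, '_', CAST(salt AS STRING)) AS salted_id
    FROM small_tbl
    LATERAL VIEW explode(array(0, 1, 2)) t AS salt
  ) s
  ON b.salted_id = s.salted_id
""")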
Well, I must say, thanks a lot.....I have been searching for this kind of explanation.
Excellent. Thank you
This is really great, with crystal clear explanations....thanks a lot for sharing and spreading knowledge!
Excellent video..thanks for the explanation and sharing the code
Good work; it would be better if you showed the output of the salted dataframes and explained the UDF in more detail.
Excellent Description
But the join output will not be correct, because in the previous scenario it would have joined with all the matching ids, whereas with the new salting method it will join only on the newly salted key. That's weird.
Hey great video, could you also link the associated resources you referred to while making this video?
I have 2 questions:
First one: I think there is a mistake in your visual presentation of table 2 after salting. Why don't you have z_2 and z_3 there? Also, why are you using capital letters sometimes? That's confusing.
Second question: I don't get the benefit of key salting in general. How is this different from broadcasting your second table? Because you explode it, you will end up sending the whole table to every executor anyway? No one can give an answer to this question.
Amazing video..!!
beautifully explained, thank you very much :)
Can you please explain how to take the count of the data frame after appending the random number?
Hi Aravind, if I understand your question correctly, you want to take the count of the first data frame, where we are appending a random number:
val df1 = leftTable
  .withColumn(leftCol, concat(
    leftTable.col(leftCol), lit("_"), lit(floor(rand(123456) * 10))))
We can simply do:
df1.select(col("id")).count()
This should give the count of the first data frame's column.
For more details, you can refer to the git link below:
github.com/gjeevanm/SparkDataSkewness/blob/master/src/main/scala/com/gjeevan/DataSkew/RemoveDataSkew.scala
Great explanation, thanks for sharing this.
I think there is an off-by-one error:
you are using (0 to 3), which gives (0, 1, 2, 3),
but the random number range will be (0, 1, 2).
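If so, a minimal sketch of making the two ranges consistent (keeping 3 salt buckets; leftTable follows the earlier snippet's naming, the rest is illustrative):

import org.apache.spark.sql.functions._

// the salted keys on the skewed side take values 0, 1, 2 because rand() is in [0, 1)
val leftSalted = leftTable
  .withColumn("id", concat(col("id"), lit("_"), floor(rand(123456) * 3)))

// use an exclusive range for replicating the other side so it also covers exactly 0, 1, 2
val saltValues = (0 until 3)   // "0 to 3" would add a fourth value, 3, that never matches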
amazing sir! thanks a lot
Hi, are you missing something in the code? I used your code but it's throwing an exception for the lines below:
// join after eliminating data skewness
df3.join(
  df4,
  df3.col("id") df4.col("id")
)
  .show(100, false)
}
Hi,
Thanks for highlighting this. There was a small issue with the checked-in join code, which I have fixed now. Please pull the latest code and try it out.
@@jeevanmadhur3732 Thank you Jeevan. Your videos help us a lot :)
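For reference, a corrected version of that join would presumably look something like this (a sketch, assuming df3 and df4 are the two salted data frames):

// join after eliminating data skewness, comparing the salted keys with ===
df3.join(
    df4,
    df3.col("id") === df4.col("id")
  )
  .show(100, false)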
best