In union, duplicate rows are removed, right? To remove those duplicates, the data would have to be brought into a single partition, which means a shuffle. So how is UNION a narrow transformation?
We can say UNION ALL is a narrow transformation, because it does not remove duplicate rows.
Please explain, I'm confused about this.
Please reply, an interviewer asked this question in a DE interview.
@harshitgupta355
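The DataFrame methods union() and unionAll() in Spark do not behave like UNION in SQL.
In Spark:
1. unionAll() does the same thing as union(). It was deprecated in Spark 2.x, and since Spark 3.0 it is simply an alias for union().
2. union() appends the rows of 2 DFs that have the same number of columns (matched by position) and keeps duplicates, so it behaves like SQL UNION ALL.
3. Because duplicates are kept, rows never have to be compared across partitions, so there is no shuffle and it is a narrow transformation. To drop duplicates you call distinct() afterwards, and that step is the wide one.
Also check out unionByName(), which matches columns by name instead of by position.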
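To make point 3 concrete, here is a toy model in plain Python. It is not Spark code and not how Spark is implemented: a "DataFrame" is just a list of partitions, and `union_partitions` and `distinct_partitions` are made-up helper names for illustration. It only tries to show why appending partitions needs no data movement, while removing duplicates does.

```python
# Toy model: a "DataFrame" is a list of partitions, each partition a list of rows.
# union_partitions and distinct_partitions are hypothetical helpers, not Spark APIs.


def union_partitions(left, right):
    """Narrow: each output partition is exactly one input partition, untouched."""
    return left + right


def distinct_partitions(parts, num_out):
    """Wide: equal rows must land in the same partition, so every row is
    redistributed by hash (the shuffle) before duplicates are dropped."""
    buckets = [set() for _ in range(num_out)]
    for part in parts:
        for row in part:
            buckets[hash(row) % num_out].add(row)
    return [sorted(bucket) for bucket in buckets]


df1 = [[(1, "a"), (2, "b")], [(3, "c")]]      # 2 partitions
df2 = [[(2, "b")], [(4, "d")], [(1, "a")]]    # 3 partitions

unioned = union_partitions(df1, df2)
print(len(unioned))                   # 5 partitions: 2 + 3, nothing moved
print(sum(len(p) for p in unioned))   # 6 rows: duplicates are kept

deduped = distinct_partitions(unioned, num_out=2)
print(sum(len(p) for p in deduped))   # 4 rows: duplicates removed after the shuffle
```

In real PySpark you should be able to check the same thing with `df1.union(df2).rdd.getNumPartitions()`, which is normally the sum of the two inputs, and with `explain()` on the result: an Exchange step should only appear once you add `distinct()`.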