The TRUTH About High Performance Data Partitioning

  • Published 27 Oct 2024

COMMENTS • 33

  • @RaviSingh-dp6xc • 23 hours ago

    Again, great content, very clearly explained! 🙏

  • @VenkatakrishnaGangavarapu • 11 months ago +2

    Super, super detailed. Thanks for uploading. I wasn't able to understand this before, but now I can understand it clearly. Thanks a lot. Please make more of these in-depth topic videos whenever you are free. (You may not get the views and money that entertainment videos do, but you are helping people grow in this field; surely there are many people benefitting from your content. Please continue making videos like this.)

  • @sayedsamimahamed5324 • 7 months ago

    The best explanation on YouTube so far. Thank you very much.

  • @kartikjaiswal8923 • 1 month ago

    Love you, bro, for such a crisp explanation; the way you experiment and teach helps a lot!

  • @Fullon2 • 11 months ago +1

    Thanks for sharing your knowledge, your videos are amazing.

  • @iamexplorer6052 • 11 months ago +1

    Very detailed and explained in an understandable way. Great!

  • @AviE-c3j • 7 months ago

    Thanks for the detailed video. I have a few questions on partitioning. 1. How does Spark decide the number of partitions if we don't specify the properties, and is it good practice to do repartition(400, say) right after a read? 2. How do we decide the repartition value before writing to disk? If we pass a large number to repartition, will that be optimal?

  • @utsavchanda4190 • 10 months ago +1

    Good video. In fact, all of your videos are. One thing: in this video you mostly talked about actual physical partitions on disk. But towards the end, when you were talking about "maxpartitionbytes" and doing only a READ operation, you were talking about shuffle partitions, which are in memory and not on disk. I had found that hard to grasp for a very long time, so I wanted to confirm whether my understanding is right here.

    • @afaqueahmad7117 • 10 months ago

      Hey @utsavchanda4190, many thanks for the appreciation. To clarify: when talking about "maxpartitionbytes", I'm referring to the partitions that Spark reads from files into memory. These are not shuffle partitions; shuffling only comes into the picture with wide transformations (e.g. groupBy, joins). Therefore, "maxpartitionbytes" dictates how many partitions Spark reads from the files into DataFrames in memory (a small sketch follows at the end of this thread).

    • @utsavchanda4190 • 10 months ago +1

      @@afaqueahmad7117 that's right. And that is still in memory and not physical partitions, right? I think this video covers both, physical disk partitions as well as in-memory partitions.

    • @afaqueahmad7117 • 10 months ago

      Yes, those are partitions in memory :)
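
A minimal PySpark sketch of the distinction discussed in this thread (the path and sizes are hypothetical): `spark.sql.files.maxPartitionBytes` controls the partitions Spark creates in memory when reading files; shuffle partitions only come into play once a wide transformation runs.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-partitions-demo")
    # Cap each read split at 64 MB; a ~512 MB input would land in roughly 8 read partitions.
    .config("spark.sql.files.maxPartitionBytes", "64m")
    .getOrCreate()
)

df = spark.read.parquet("/data/songs")   # hypothetical input path
print(df.rdd.getNumPartitions())         # in-memory partitions created by the read

# Shuffle partitions are a separate knob and only matter after a wide
# transformation such as groupBy or a join.
agg = df.groupBy("listen_date").count()
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200 by default
```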

  • @EverythingDudes • 11 months ago +1

    Superb knowledge

  • @lunatyck05 • 10 months ago +1

    Great video as always. When can we get a video on setting up an IDE like yours? Really nice UI; Visual Studio, I believe?

    • @afaqueahmad7117 • 10 months ago

      Thanks for the appreciation. Yep, it's VS Code. It's quite simple, not a lot of stuff on top except the terminal. I can share the Medium article I referred to for setting it up :)

  • @anandchandrashekhar2933 • 4 months ago +1

    Thank you so much again! I have one follow-up question about partitioning during writes. If I use df.write but specify no partitioning column and don't use repartition, how many partitions does Spark write by default?
    Does it simply take the number of input partitions (total input size / 128 MB), or, if shuffling was involved and the default of 200 shuffle partitions was used, does it use that shuffle partition count?
    Thank you

    • @afaqueahmad7117 • 4 months ago

      Hey @anandchandrashekhar2933, so basically it should fall into 2 categories:
      1. If shuffling is performed: Spark will use the value of `spark.sql.shuffle.partitions` (defaults to 200) for the number of partitions during the write operation.
      2. If shuffling is not performed: Spark will use the current number of partitions in the DataFrame, which could be based on the input data's size or previous operations.
      Hope this clarifies :)
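
A quick sketch of the two cases above (paths and the column name are hypothetical; note that with AQE enabled, Spark 3.x may coalesce shuffle partitions, so 200 is only the configured default):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-partitions-demo").getOrCreate()

df = spark.read.parquet("/data/songs")          # hypothetical input

# Case 2: no shuffle before the write, so the partition count comes from the
# read (roughly total input size / spark.sql.files.maxPartitionBytes).
print(df.rdd.getNumPartitions())
df.write.mode("overwrite").parquet("/out/no_shuffle")

# Case 1: a wide transformation shuffles the data, so the write sees
# spark.sql.shuffle.partitions (200 by default) partitions.
agg = df.groupBy("artist_id").count()           # hypothetical column
print(agg.rdd.getNumPartitions())
agg.write.mode("overwrite").parquet("/out/after_shuffle")
```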

  • @vamsikrishnabhadragiri402 • 7 months ago +1

    Thanks for the informative videos. I have a question regarding repartition(4).partitionBy(key).
    Does it mean each of the 4 part files in a partition will be a separate partition while reading?
    Or does it consider the maxPartitionBytes specified and, depending on the sizes, create partitions (combining two or more part files) if their combined size is within the maxPartitionBytes limit?

    • @afaqueahmad7117 • 7 months ago +1

      Hey @vamsikrishnabhadragiri402, `spark.sql.files.maxPartitionBytes` will be taken into consideration when reading the files.
      If each of the 4 part files is smaller than `spark.sql.files.maxPartitionBytes`, e.g. each part is 64 MB and `spark.sql.files.maxPartitionBytes` is defined as 128 MB, then the 4 files (partitions) will be read separately. Spark does not take on the overhead of merging files to bring them up to 128 MB.
      Consider another example where each part is greater than `spark.sql.files.maxPartitionBytes` (as discussed in the video): each of those parts will be broken down into chunks of the size defined by `spark.sql.files.maxPartitionBytes` :)
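
A hedged sketch of the write and read-back described above (paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("repartition-partitionby-demo")
    .config("spark.sql.files.maxPartitionBytes", "128m")
    .getOrCreate()
)

df = spark.read.parquet("/data/songs")          # hypothetical input

# Write up to 4 part files inside each listen_date folder on disk.
(df.repartition(4)
   .write.mode("overwrite")
   .partitionBy("listen_date")
   .parquet("/out/by_date"))

# On read, a part file larger than 128 MB is split into ~128 MB read
# partitions; smaller part files are read as they are (no merging up to 128 MB).
back = spark.read.parquet("/out/by_date")
print(back.rdd.getNumPartitions())
```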

  • @danieldigital9333 • 6 months ago

    Hello, thanks for this video and for the whole course. I have a question about high-cardinality columns: say you have table A and table B, both with customer_id. You want to perform a join on this column; how do you alleviate the performance issues that occur?

  • @Amarjeet-fb3lk • 5 months ago

    At 16:39, when you use repartition(3), why are there 6 files?

    • @afaqueahmad7117 • 4 months ago

      Hey @Amarjeet-fb3lk, good question. I should have pulled the editor sidebar to the right for clarity. It's actually 3 data files; the remaining 3 are `.crc` files, which Spark creates for data integrity, to make sure the written file is not corrupted.
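
A quick way to verify this locally (the output path is hypothetical): listing a folder written by Spark shows the part-* parquet files alongside their hidden `.crc` checksum files and the `_SUCCESS` marker.

```python
import os

out = "/out/repartition_demo"        # hypothetical folder written with repartition(3)
for name in sorted(os.listdir(out)):
    print(name)

# Roughly the expected listing:
#   .part-00000-....snappy.parquet.crc
#   .part-00001-....snappy.parquet.crc
#   .part-00002-....snappy.parquet.crc
#   _SUCCESS
#   part-00000-....snappy.parquet
#   part-00001-....snappy.parquet
#   part-00002-....snappy.parquet
```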

  • @kvin007 • 10 months ago +1

    Thanks for the content, Afaque. Question regarding spark.sql.files.maxPartitionBytes: I was thinking this would be beneficial when reading a file whose size you know upfront. What about files whose size you don't know? Do you recommend repartition or coalesce in those cases to adjust the number of partitions of the DataFrame?

    • @afaqueahmad7117 • 10 months ago +1

      Hey @kvin007, you could use the technique for determining the size of a DataFrame explained here: ua-cam.com/video/1kWl6d1yeKA/v-deo.html at 23:30. The link used in that video is umbertogriffo.gitbook.io/apache-spark-best-practices-and-tuning/parallelism/sparksqlshufflepartitions_draft (a rough sketch of the idea follows this thread).

    • @kvin007 • 9 months ago +1

      @@afaqueahmad7117 awesome, thanks for the response!
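
This is not the exact method from the linked video; it is just one common sketch for sizing a DataFrame whose input size isn't known upfront. It leans on Spark's internal Catalyst statistics (a private API), so treat it as an approximation and an assumption on my part:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("size-estimate-demo").getOrCreate()

target_partition_bytes = 128 * 1024 * 1024            # aim for ~128 MB per partition

df = spark.read.parquet("/data/unknown_size")         # hypothetical input
stats = df._jdf.queryExecution().optimizedPlan().stats()
size_in_bytes = int(str(stats.sizeInBytes()))         # Catalyst's size estimate

num_partitions = max(1, size_in_bytes // target_partition_bytes)
df = df.repartition(num_partitions)
```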

  • @retenim28 • 10 months ago

    I'm a little confused: at minute 15:17, in a folder for a specific value of listen_date, you say there is only 1 file that corresponds to 1 partition. But I thought partitions are created based on the values of listen_date, so as far as I can see there are more than 30 partitions (each corresponding to a specific value of listen_date). After that you used the repartition function to change the number of files inside each folder. So the question is: is the number of partitions the number of listen_date folders, or the number of files inside each folder?

    • @afaqueahmad7117 • 10 months ago +1

      Hey @retenim28, each listen_date folder is a partition. So you're right in saying that each partition corresponds to a specific value of listen_date. Each unique value of listen_date results in a separate folder (a.k.a. partition). Each parquet file (those part-000.. files) inside a partition (folder) represents the actual physical storage of the data rows belonging to that partition.
      Therefore, to answer your question: number of partitions = number of listen_date folders.

    • @retenim28 • 10 months ago

      @@afaqueahmad7117 Oh, thank you sir, I just got the point. But I have another question: since Spark cares about the number of partitions, what is the advantage of creating more files per partition? The number of partitions stays the same, so the parallelism is the same whether we have 10 files inside a partition or 3.

    • @afaqueahmad7117 • 10 months ago +2

      Good question @retenim28. The level of parallelism during data processing (i.e. the number of tasks launched; 1 task = 1 partition) is determined by the number of partitions. However, the number of parquet files inside each partition plays a role in read/write I/O parallelism. When reading data from storage, Spark reads each of the parquet files in parallel even if they're part of the same partition, so it can assign more resources for a faster data load. The same is the case for writes. Just be cautious that we don't end up with too many parquet files (the small-file problem) or a few large files (leading to data skew). (A small sketch follows this thread.)

    • @retenim28 • 10 months ago

      @@afaqueahmad7117 Thank you very much, sir. I also watched the series about data skew; very clear explanation.
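
A small sketch tying this thread together (paths and columns hypothetical): the listen_date folders are the disk partitions, while the number of read tasks depends on the part files and `spark.sql.files.maxPartitionBytes`, not on the folder count alone.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("folders-vs-files-demo").getOrCreate()

# Hypothetical dataset written earlier with .write.partitionBy("listen_date").
df = spark.read.parquet("/out/by_date")

# Number of disk partitions = number of distinct listen_date folders.
print(df.select("listen_date").distinct().count())

# Number of read partitions (tasks) Spark creates for the load; this follows
# the part files and spark.sql.files.maxPartitionBytes, so more (reasonably
# sized) files per folder means more parallel read tasks.
print(df.rdd.getNumPartitions())
```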

  • @ComedyXRoad • 6 months ago

    thank you