Thank you so much. Your explanation, and how you integrate it with enterprise architecture, is what everyone is looking for. Really, I am not sure how you find the time, but thank you so much for your efforts in making such wonderful learning sessions.
Sudhan.. I realized it late, but I did realize it. Time is what we make for ourselves, not what we get :)
Very impressive explanation. I had read about what Spark is and how it works by going through its architecture from various sources, but it was never very clear. After going through your video, I completely understood how Spark works. Thanks a lot, Sir. Please keep making these kinds of videos.
Thank you so much for a clear explanation of so many topics relevant to a Spark developer role. It's really helpful even for day-to-day work.
Nice explanation of Apache Spark in terms of its architecture sir!
Thank you so much. Really appreciate what you do to spread information and your explanation is so clear and well thought out. Really helping out someone who is trying to break into the data industry like me! Thank you
Thank you so much for these videos. Getting such quality content for free is very rare these days. Please continue making such videos; they have really helped us a lot. Also, I would like to ask: when you say multiple nodes, does it mean multiple cores of the CPU, or entirely different CPUs?
It is both. Spark can leverage the individual cores of a single machine, and it can also run across multiple servers, each with its own CPUs.
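To make the two modes concrete, here is a hedged sketch of how they are typically chosen at submit time. The script name and executor counts are placeholders for illustration, not from the video:

```shell
# Single machine: "local[*]" runs tasks on all CPU cores of one box
spark-submit --master "local[*]" my_job.py

# Cluster: many separate servers, each with its own CPUs,
# managed by a resource manager such as YARN
spark-submit --master yarn --num-executors 10 --executor-cores 4 my_job.py
```

The same application code runs in both modes; only the master configuration changes.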
@@AIEngineeringLife Thank you so much. This was confusing me for a long time. Thanks for clearing it up.
If possible, please upload some videos on log-monitoring frameworks, and tell us the easiest way to manage logs in Spark.
ELK :)
Thanks for the explanation. Can you also give a demo of how Spark can be used for feature engineering, and how the same engineering steps can be called at serving time too, to avoid training-serving skew?
At 11:40 you say that Spark can also be used as storage... is that really the case? Spark relies on external storage, to my understanding. And thanks for putting together such comprehensive and understandable material.
Sachin.. I meant an end-to-end pipeline from data collection to data storage, not that Spark can be used as storage. Meaning I can create a pipeline with Spark that processes data all the way to the storage layer, be it raw or aggregated.
And another question, how do you compare the data platform architectures. For example, I see two patterns:
1. Cloud Analytical Databases - E.g. Snowflake/Redshift with an ETL tool
2. Big Data Platforms - Spark/Hive
What would be the threshold for choosing between 1 and 2?
Sachin.. Both have a purpose depending on the consuming application. Spark is good at heavy-lifting analytical workloads, and analytical DBs at low-latency querying for analytics. While you might be able to do it all in analytical databases, the cost of doing so is higher there.
I see Spark as a layer to onboard data and create the initial curated zones, with analytical databases then serving workloads that need low-latency querying and dashboards.
@@AIEngineeringLife Thanks a lot. Such a helpful and prompt response, and above all, so much clarity. Just a quick one: when you say "analytical workload", do you mean "ETL data pipeline"?
I prefer 2 first, then 1, as the structure follows.
Hi again, a basic question: in essence, is Spark an ETL tool, or ETL with massive distributed processing capabilities?
Yes, it is an ETL tool that also has built-in ML algorithms for machine learning, as well as a streaming framework.
Thanks for creating the videos, really helpful. At 10:21 you mentioned that on bare metal (a single-tenant physical server) Spark can run using YARN/Mesos. But YARN and Mesos work for clusters only.
I may not have understood it correctly. Could you please explain this part in detail?
No, I did not mean it that way.. My intention was to say that you can run either on a single node or in cluster mode. YARN and Mesos were only mentioned as references to various schedulers. Sorry if I confused you.
@@AIEngineeringLife Thank you for clarifying.
At 8:00, why is the data shuffled internally by Spark? Could you elaborate, please?
When you have an aggregate-based function, Spark has to order and assemble similar keys from different nodes if the data with similar keys is not partitioned to fit on a single node. Hence the shuffling.
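As a rough illustration (plain Python, not Spark internals), a groupBy-style aggregation has to route records with the same key to the same place before it can sum them; that routing step is the shuffle:

```python
from collections import defaultdict

# Toy model: rows for the same key start out spread across partitions
# that live on different nodes.
partitions = [
    [("a", 1), ("b", 2)],   # partition on node 1
    [("a", 3), ("c", 4)],   # partition on node 2
]

num_target = 2

# Shuffle: route each record to a target partition by hashing its key,
# so all records for one key end up together on one node.
shuffled = [defaultdict(list) for _ in range(num_target)]
for part in partitions:
    for key, value in part:
        shuffled[hash(key) % num_target][key].append(value)

# Reduce: each target partition can now aggregate its keys locally.
sums = {}
for bucket in shuffled:
    for key, values in bucket.items():
        sums[key] = sum(values)

print(sums)  # {'a': 4, 'b': 2, 'c': 4} (key order may vary)
```

Moving those records between nodes is network I/O, which is why shuffles are the expensive part of aggregations and joins.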
Hi. What's the difference/similarity between Apache Spark's DataFrame and pandas' DataFrame?
Apache Spark is distributed, so it can split a job across hundreds of servers and consolidate the results, which suits larger datasets. pandas runs on a single node and is suitable for datasets that can be processed in memory.
Both APIs have some similarities, but they are not exactly the same. Spark has the Koalas API, which is a pandas drop-in replacement.
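For a concrete comparison, here is a small runnable pandas example; the roughly equivalent PySpark call is shown only as a comment, and the column names are made up for illustration:

```python
import pandas as pd

# pandas: everything happens in one process, in memory
df = pd.DataFrame({"city": ["NY", "SF", "NY"], "amount": [10, 20, 5]})
totals = df.groupby("city", as_index=False)["amount"].sum()

# The PySpark equivalent distributes the same logical operation
# across executors (sketch, assuming an existing SparkSession `spark`):
#   sdf = spark.createDataFrame(df)
#   sdf.groupBy("city").sum("amount")
```

The APIs look similar at this level, but Spark evaluates lazily and pays shuffle costs for the groupBy, while pandas executes eagerly on one machine.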
Would you recommend Apache Spark for small data that will grow exponentially over a span of a few years?
Dikesh.. I will put it this way instead: if your data is going to stay small, choose a different framework. But if you are going to join that small data to other datasets, you can broadcast the small data, or load it into cache and process it.
Now, if Spark is your primary framework and you have both small and big data, then why not use Spark for both?
If not, and you think "a few years" means 3 years down the line, then setting up Spark now can be overhead, and maybe you can delay it, considering that new technology might come along in the meantime. Technology evolves rapidly :)
In the Spark architecture diagram, what is the connectivity between one worker node and another worker node for? Thanks!
One thing I can think of is the shuffle process, where data is moved across executors.
Hi Sir, kindly make a video on data pipelining and the ETL process.
Sure, Raja.. I do have it planned for the first half of this year. Will try to prioritize it.
@@AIEngineeringLife Thanks for the reply, sir. Eagerly waiting for the session on this.
Thank you. I appreciate your efforts!
Thanks. Learning Spark is exciting. Could you help me understand the data size at which we call it big data and process it with Spark? Is it in GBs?
Sheik.. Yes, I would say typically in the higher GBs. There are two aspects: one is data size, and the second is computing complexity. Workloads like ML might need iterative processing, and combined with a large dataset, Spark can distribute the ML as well.
@@AIEngineeringLife Thanks for reply. Helpful.
Can you please upload the presentation you used?
Well explained! Thanks
Please make video on complex event processing (rules egine) with spark
I will try to do it later in the year, Hitesh, once I complete my scalable ML videos. But I have a few overview videos on it, though not hands-on:
ua-cam.com/video/9-MqHMnaQPE/v-deo.html
ua-cam.com/video/mEiY5h6YKoU/v-deo.html
At 6:48, you mentioned that the customer data is distributed into multiple chunks (24 chunks for this data).
Then you said that these chunks are distributed across multiple systems (100s of systems).
I am assuming each chunk is a part of the data in some order?
What I am not able to understand is: since we have 24 chunks, how can we distribute them to 100s of systems?
Isn't there a 1-system-1-chunk relationship? And if not, how do 2 systems sharing the same chunk coordinate?
Ajeet.. It was just an "and" condition to say that Spark can utilize all the nodes, but not necessarily in all cases. In the case of 24 chunks, they can reside on 24 nodes or even fewer, but when you have 1000 chunks they can spread across all the nodes.
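A simple way to see this is to model chunk (partition) placement as round-robin assignment to nodes. This is a toy scheduler for illustration, not Spark's actual placement logic:

```python
def assign(num_partitions, num_nodes):
    """Assign partitions to nodes round-robin; a node may hold zero,
    one, or many partitions."""
    placement = {node: [] for node in range(num_nodes)}
    for p in range(num_partitions):
        placement[p % num_nodes].append(p)
    return placement

# 24 partitions on a 100-node cluster: only 24 nodes get any work,
# and each busy node holds exactly one partition.
busy = [node for node, parts in assign(24, 100).items() if parts]
print(len(busy))  # 24

# 1000 partitions on the same cluster: every node holds ~10 partitions.
sizes = [len(parts) for parts in assign(1000, 100).values()]
print(min(sizes), max(sizes))  # 10 10
```

So there is no fixed 1-system-1-chunk relationship: a partition lives on one node at a time (plus replicas at the storage layer), and the ratio of partitions to nodes determines how much of the cluster is used.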
Okay. In that case it's clear to me. Thanks
Hi Sir, thank you for explaining this so clearly to us. Appreciate your help. Can you please share the slides if possible, as they would be helpful for making notes and referring back. Thanks
Thank you, and I have uploaded the slides to my git repo here - github.com/srivatsan88/Mastering-Apache-Spark
@@AIEngineeringLife Thank you very much, Sir.
Hello, I am trying to learn data engineering, but there is so much information that I am struggling to focus. I'm starting to follow your channel today; is this a good place to start and get a detailed understanding?
Akash.. Yes. Mastering Apache Spark was created so that someone can start from scratch and master data engineering end to end.
Is this part of the full course?
Yes it is.. Here is the course - ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html
@@AIEngineeringLife Thanks, it is very informative. I have an interview tomorrow for a data engineer position. Is there any difference between data engineers and data scientists?
Can you please share some interview tips? Regards.
Data engineering is an activity within data science, though there are other roles as well. You can check these 2 videos for details - ua-cam.com/video/6oSzDU8kkB0/v-deo.html
and this ua-cam.com/play/PL3N9eeOlCrP6Y73-dOA5Meso7Dv7qYiUU.html
thanks!
This course goes from basic to advanced.