Apache Spark for Data Engineering and Analysis - Overview

  • Published 24 Oct 2024

COMMENTS • 58

  • @sudhan419
    @sudhan419 4 years ago +16

    Thank you so much. Your explanation, and the way you integrate it with enterprise architecture, is what everyone is looking for. I'm really not sure how you make the time, but thank you so much for your efforts in putting together such wonderful learning sessions.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +7

      Sudhan.. Realized late, but realized it. Time is what we make for ourselves, not what we get :)

  • @sanketmaheshwari1110
    @sanketmaheshwari1110 4 years ago +1

    Very impressive explanation. I had read about what Spark is and how it works by going through its architecture from various sources, but it was never very clear; after watching your video, I completely understood how Spark works. Thanks a lot, sir. Please keep making videos like this.

  • @avik999
    @avik999 4 years ago +2

    Thank you so much for a clear explanation of so many topics for the Spark developer role. It's really helpful even for the day-to-day role.

  • @ratulghosh3849
    @ratulghosh3849 3 years ago +1

    Nice explanation of Apache Spark in terms of its architecture, sir!

  • @seemunyum832
    @seemunyum832 3 years ago

    Thank you so much. I really appreciate what you do to spread information, and your explanation is so clear and well thought out. It really helps someone like me who is trying to break into the data industry! Thank you.

  • @tridipdas5445
    @tridipdas5445 3 years ago +1

    Thank you so much for these videos. Getting such quality content for free is very rare these days. Please continue making such videos; they have really helped us a lot. Also, I would like to ask: when you say multiple nodes, does that mean multiple cores of the CPU or entirely different CPUs?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      It is both. You can leverage individual cores as well as run on multiple servers, each with its own CPU.

    • @tridipdas5445
      @tridipdas5445 3 years ago

      @@AIEngineeringLife Thank you so much. This was confusing me for a long time. Thanks for clearing it up.
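
    [Editor's note] The reply above can be sketched in plain Python (a toy illustration, not the Spark API): the same partitioned computation can be scheduled onto the cores of one machine (Spark local mode) or onto executors spread across separate servers (cluster mode); only the scheduler changes.

```python
# Toy sketch (plain Python, not the Spark API): partitioned work scheduled
# onto the cores of a single machine. In cluster mode the pool would be
# executors on many servers, each with its own CPU.
from concurrent.futures import ThreadPoolExecutor

def process_partition(rows):
    # Work done independently per partition, e.g. a per-partition sum.
    return sum(rows)

data = list(range(100))
partitions = [data[i::4] for i in range(4)]  # split into 4 partitions

# "Local mode": the 4 partitions run in parallel on cores of one machine.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_partition, partitions))

total = sum(partials)  # the driver combines the partial results
print(total)  # prints 4950
```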

  • @rameshthamizhselvan2458
    @rameshthamizhselvan2458 4 years ago +2

    If possible, please upload some videos on log monitoring frameworks and the easiest way to manage logs in Spark.

  • @amithbk12man
    @amithbk12man 3 years ago

    Thanks for the explanation. Can you also give a demo of how Spark can be used for feature engineering, and how the same engineering methods can be called at serving time to avoid training-serving skew?

  • @sachinsurana1814
    @sachinsurana1814 3 years ago +1

    At 11:40 you say that Spark can also be used as storage. Is that really the case? To my understanding, Spark relies on external storage. And thanks for putting together such comprehensive and understandable material.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      Sachin.. I meant an end-to-end pipeline from data collection to data storage, not that Spark can be used as storage. Meaning I can create a pipeline with Spark that processes data all the way to the storage layer, be it raw or aggregated.

  • @sachinsurana1814
    @sachinsurana1814 3 years ago +1

    And another question: how do you compare data platform architectures? For example, I see two patterns:
    1. Cloud analytical databases - e.g. Snowflake/Redshift with an ETL tool
    2. Big data platforms - Spark/Hive
    What would be the threshold for choosing between 1 and 2?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      Sachin.. Both have a purpose based on the consuming application. Spark is good at heavy-lifting analytical workloads, and analytical DBs at low-latency querying for analytics. While you might be able to do everything in analytical databases, the cost of doing so is higher there.
      I see Spark as a layer to onboard data and create the initial curated zones, with analytical databases then serving workloads with low-latency querying and dashboards.

    • @sachinsurana1814
      @sachinsurana1814 3 years ago

      @@AIEngineeringLife Thanks a lot. Such a helpful and prompt response, and above all, so much clarity. Just a quick one: when you say "analytical workload", I guess you mean "ETL data pipeline"?

    • @avinash7003
      @avinash7003 1 year ago

      I prefer 2 first, then 1, as the structure follows.

  • @sachinsurana1814
    @sachinsurana1814 3 years ago +1

    Hi again, a basic question: in essence, is Spark an ETL tool, or ETL with massive distributed processing capabilities?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      Yes, it is an ETL tool that also has built-in ML algorithms for machine learning, as well as a streaming framework.

  • @roy11883
    @roy11883 4 years ago +1

    Thanks for creating the videos, really helpful. At 10:21 you mention that on bare metal (a single-tenant physical server) Spark can run using YARN/Mesos. But YARN and Mesos work for clusters only.
    I may not have understood it correctly. Could you please explain this part in detail?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      No, I did not mean it that way. My intention was to say you can run either on a single node or in cluster mode; YARN and Mesos were only mentioned as examples of schedulers. Sorry if I confused you.

    • @roy11883
      @roy11883 4 years ago

      @@AIEngineeringLife Thank you for clarifying.

  • @TheFunny298
    @TheFunny298 4 years ago +1

    At 8:00, why is the data shuffled internally by Spark? Could you elaborate, please?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +3

      When you have an aggregate-based function, Spark has to order and assemble similar keys from different nodes if data with similar keys is not partitioned to fit on a single node. Hence the shuffling.
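
      [Editor's note] The reply above can be sketched in plain Python (a toy illustration of the idea, not Spark internals): rows holding the same key start out on different nodes, so they are routed to a common partition before the aggregate runs.

```python
# Toy shuffle sketch (not Spark internals): rows with the same key sit on
# different nodes, so they must be routed ("shuffled") to a common partition
# before an aggregate like sum-by-key can run.
node_a = [("apple", 1), ("banana", 2), ("apple", 3)]
node_b = [("banana", 4), ("cherry", 5)]

num_partitions = 2
shuffled = {p: [] for p in range(num_partitions)}
for key, value in node_a + node_b:
    # In real Spark, this routing moves data over the network between executors.
    shuffled[hash(key) % num_partitions].append((key, value))

# After the shuffle, each partition aggregates its own keys independently.
totals = {}
for rows in shuffled.values():
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value

# totals == {"apple": 4, "banana": 6, "cherry": 5} (key order may vary)
```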

  • @kennylaikl299
    @kennylaikl299 4 years ago +1

    Hi. What's the difference/similarity between Apache Spark's DataFrame and pandas' DataFrame?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +2

      Apache Spark is distributed, so it can split a job across hundreds of servers and consolidate the results for larger datasets. Pandas runs on a single node and is suitable for datasets that can be processed in memory.
      Both APIs have some similarity but are not exactly the same. Spark has the Koalas API, which is a drop-in replacement for pandas.
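
      [Editor's note] A toy contrast of the single-node vs distributed processing styles mentioned above, in stdlib-only Python (neither pandas nor Spark is used; the chunked fold merely stands in for partition-wise processing):

```python
# Toy contrast (stdlib only): a pandas-style job materializes the whole
# dataset in one machine's memory, while a Spark-style job folds
# partition-sized chunks so no single step holds everything at once.
def chunks(n, size):
    # Yield ranges covering 0..n-1 in partition-sized pieces.
    for start in range(0, n, size):
        yield range(start, min(start + size, n))

n = 1_000_000

# "pandas-like": load everything into memory, then reduce in one shot.
whole = list(range(n))
total_single_node = sum(whole)

# "Spark-like": reduce each partition, then combine the partial results.
total_distributed = sum(sum(part) for part in chunks(n, 10_000))

assert total_single_node == total_distributed
```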

  • @dikeshshah8129
    @dikeshshah8129 4 years ago +1

    Would you recommend Apache Spark for small-sized data that would increase exponentially over a span of a few years?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +2

      Dikesh.. I will put it this way instead. If your data is going to stay small, then choose a different framework; but if you are going to join the small data to other datasets, then you can broadcast the small data or load it in cache and process it.
      Now, if Spark is your primary framework and you have small as well as big data, then why not Spark for both?
      If not, and you think "a few years" means 3 years down the line, then setting up Spark now can be overhead, and maybe you can delay it, considering new technology might come along in the meantime. Technology evolves rapidly :)
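
      [Editor's note] The broadcast idea mentioned above can be sketched in plain Python (a toy illustration, not the Spark API): the small table is copied to every partition of the large dataset, so each partition joins locally with no shuffle of the large side.

```python
# Toy sketch of a broadcast join (not the Spark API): the small lookup table
# is replicated to every partition of the large table, so each partition can
# join locally without shuffling the large side over the network.
small = {"US": "United States", "IN": "India"}  # small table, broadcast to all

large_partitions = [                  # large table, already split across nodes
    [("alice", "US"), ("bob", "IN")],
    [("carol", "US")],
]

joined = []
for partition in large_partitions:    # each partition joins against its copy
    for name, country_code in partition:
        joined.append((name, small[country_code]))

# joined == [("alice", "United States"), ("bob", "India"),
#            ("carol", "United States")]
```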

  • @kumarskasc
    @kumarskasc 4 years ago +1

    In the Spark architecture diagram, what is the connectivity between one worker node and another worker node for? Thanks!

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      One thing I can think of is the shuffle process, where data is moved across executors.

  • @rajapaul4076
    @rajapaul4076 3 years ago +1

    Hi sir, kindly make a video on data pipelining and the ETL process.

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      Sure, Raja.. I do have it planned for the first half of this year. Will try to prioritize it.

    • @rajapaul4076
      @rajapaul4076 3 years ago

      @@AIEngineeringLife Thanks for the reply, sir. Eagerly waiting for the session on this.

  • @ijeffking
    @ijeffking 4 years ago +2

    Thank you. I appreciate your efforts!

  • @Azureandfabricmastery
    @Azureandfabricmastery 3 years ago +1

    Thanks. Learning Spark is exciting. Could you help me understand the data size at which we call it big data and process it with Spark? Is it in the GBs?

    • @AIEngineeringLife
      @AIEngineeringLife  3 years ago +1

      Sheik.. Yes, I would say typically in the higher GBs. There are two aspects: one is data size and the second is computing complexity. Workloads like ML might need iterative processing, and combined with a large dataset, Spark might be able to distribute the ML as well.

    • @Azureandfabricmastery
      @Azureandfabricmastery 3 years ago

      @@AIEngineeringLife Thanks for the reply. Helpful.

  • @ueeabhishekkrsahu
    @ueeabhishekkrsahu 2 years ago

    Can you please upload the presentation you used?

  • @rajeshwarsehdev2318
    @rajeshwarsehdev2318 4 years ago +2

    Well explained! Thanks

  • @hiteshtaneja
    @hiteshtaneja 4 years ago +1

    Please make a video on complex event processing (rules engine) with Spark.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      I will try to do it later in the year, Hitesh, once I complete my scalable ML videos. But I have a few overview videos on it, though not hands-on:
      ua-cam.com/video/9-MqHMnaQPE/v-deo.html
      ua-cam.com/video/mEiY5h6YKoU/v-deo.html

  • @Ajeetsingh-uy4cy
    @Ajeetsingh-uy4cy 4 years ago +1

    At 6:48, you mention that the customer data is distributed into multiple chunks (24 chunks for this data).
    And then you say that these chunks are distributed across multiple systems (100s of systems).
    I am assuming each chunk is a part of the data in some order?
    What I am not able to understand is: since we have 24 chunks, how can we distribute them to 100s of systems?
    Isn't there a 1-system-1-chunk relationship? And if not, how do 2 systems sharing the same chunk coordinate?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago +1

      Ajeet.. It is just an "and" condition to say Spark can utilize all the nodes, but not necessarily in all cases. In the case of 24 chunks, they can reside on 24 nodes or even fewer, but when you have 1000 chunks they can spread across all the nodes.

    • @Ajeetsingh-uy4cy
      @Ajeetsingh-uy4cy 4 years ago

      Okay, in that case it's clear to me. Thanks.
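
      [Editor's note] The chunk-to-node relationship discussed above can be sketched in plain Python (a toy round-robin placement, not Spark's actual scheduler): with 24 chunks and 100 nodes, only 24 nodes receive work; with 1000 chunks, every node gets several.

```python
# Toy sketch of partition placement (not Spark's scheduler): chunks are
# assigned to worker nodes round-robin, so fewer chunks than nodes means
# some nodes stay idle, while more chunks than nodes keeps every node busy.
def assign(num_chunks, num_nodes):
    placement = {}
    for chunk in range(num_chunks):
        placement.setdefault(chunk % num_nodes, []).append(chunk)
    return placement

few = assign(24, 100)     # 24 chunks -> only 24 of the 100 nodes are used
many = assign(1000, 100)  # 1000 chunks -> all 100 nodes, 10 chunks each

print(len(few), len(many))  # prints: 24 100
```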

  • @gayathripujari9718
    @gayathripujari9718 4 years ago +1

    Hi sir, thank you for explaining this so clearly to us. Appreciate your help. Can you please share the slides if possible, as they would be helpful for making notes and referring back? Thanks.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Thank you, and I have uploaded the slides to my git repo here - github.com/srivatsan88/Mastering-Apache-Spark

    • @gayathripujari9718
      @gayathripujari9718 4 years ago

      @@AIEngineeringLife Thank you very much, sir.

  • @akashchandra2223
    @akashchandra2223 4 years ago +1

    Hello, I am trying to learn data engineering, but there is so much information that I am struggling to focus. I'm starting today to follow your channel; this is a good place to start and get a detailed understanding, correct?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Akash.. Yes. Mastering Apache Spark was created for someone to start from scratch and master data engineering end to end.

  • @Shahzada1prince
    @Shahzada1prince 4 years ago +1

    Is this part of the full course?

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Yes it is.. Here is the course - ua-cam.com/play/PL3N9eeOlCrP5PfpYrP6YxMNtt5Hw27ZlO.html

    • @Shahzada1prince
      @Shahzada1prince 4 years ago

      @@AIEngineeringLife Thanks, it is very informative. I have an interview tomorrow for a data engineer position. Is there any difference between data engineers and data scientists?
      Can you please share some interview tips? Regards.

    • @AIEngineeringLife
      @AIEngineeringLife  4 years ago

      Data engineering is one activity within data science, and there are other roles as well. You can check these 2 videos for details - ua-cam.com/video/6oSzDU8kkB0/v-deo.html
      and this ua-cam.com/play/PL3N9eeOlCrP6Y73-dOA5Meso7Dv7qYiUU.html

  • @demidrek-heyward
    @demidrek-heyward 4 years ago +1

    Thanks!

  • @navjotsingh-hl1jg
    @navjotsingh-hl1jg 1 year ago

    This course goes from basic to advanced.