Caching and Broadcasting | Apache Spark | Interview Questions

  • Published 15 Sep 2024
  • Hi Friends,
    In this video, I have explained caching and broadcasting in Apache Spark (see the short sketch just after this description)
    github.com/sra...
    Refer to these videos:
    Broadcast variables usage in Spark with Scala - • Broadcast variables us...
    Cache Vs Persist in Spark with Scala - Part 1 - • Cache Vs Persist in Sp...
    Storage Levels while persisting in Spark - Part 2 - • Storage Levels while p...
    Please subscribe to my channel for more interesting learnings.
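
    For reference, a minimal PySpark sketch of both ideas (the paths and column names below are hypothetical, not taken from the video):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.read.parquet("/data/events")  # hypothetical large table
    lookup = spark.read.parquet("/data/lookup")  # hypothetical small table

    # Caching: keep a DataFrame that is reused across several actions in memory
    events.cache()
    events.count()                                   # first action materializes the cache
    events.filter(F.col("status") == "ok").count()   # later actions read from the cache

    # Broadcasting: ship the small table to every executor so the join avoids a shuffle
    joined = events.join(F.broadcast(lookup), on="id", how="left")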

COMMENTS • 21

  • @sravankumar1767
    @sravankumar1767 1 year ago

    Superb explanation, Sravana.

  • @sandeshkhade5513
    @sandeshkhade5513 9 months ago +1

    Hi Sravana, I saw your videos; they give a very good understanding for clearing interviews. Could you please create a video on data warehousing related to Spark and data modeling?

    • @sravanalakshmipisupati6533
      @sravanalakshmipisupati6533  9 months ago

      Thanks for the feedback, Sandesh. Could you please elaborate on the topics you are looking for in data warehousing with Spark?

    • @sandeshkhade5513
      @sandeshkhade5513 9 months ago

      @@sravanalakshmipisupati6533 Yes. Data warehouses have the concept of SCD (slowly changing dimensions). Could you cover how this concept internally connects multiple tables and applies operations, and, from a data modeling perspective, what its use is on the Spark side?

  • @sravankumar1767
    @sravankumar1767 1 year ago +1

    Nice concept

  • @NaveenKumar-kb2fm
    @NaveenKumar-kb2fm 1 year ago +1

    Hi @sravana, I have a real-time scenario: how can we implement SCD Type 2 dynamically on multiple source tables, so that whenever updates come from any of those tables, the SCD tables pick up the updates or inserts? 😢

    • @sravanalakshmipisupati6533
      @sravanalakshmipisupati6533  1 year ago +1

      Hi Naveen, could you please share more details on the requirements?

    • @NaveenKumar-kb2fm
      @NaveenKumar-kb2fm 1 year ago

      @@sravanalakshmipisupati6533 Thank you very much. We take data from an on-premise client database using ADF and load it into ADLS (the source-to-landing pipeline); another pipeline then loads from ADLS into staging tables in the Synapse DWH. Next comes the SCD layer (a one-to-one copy of the staging tables). The source has more than 50 tables, staging is refreshed every day, and we have to load new updates and inserts into SCD Type 2 tables with a flag column while retaining each record's full history. Since we can't write a script per table, we want to do it dynamically: pick up all the tables from staging and apply SCD Type 2 to each of them.
      In our previous architecture we used the staging tables as the source for the foundation-layer tables (the final tables, after business logic is applied, on which views are created for BI reporting).
      In the new architecture there is an SCD Type 2 layer between staging and foundation, so we now have to use the SCD layer as the source for loading the foundation tables.

    • @sravanalakshmipisupati6533
      @sravanalakshmipisupati6533  1 year ago

      @@NaveenKumar-kb2fm could you please try this -
      from functools import reduce
      from pyspark.sql import functions as F

      def apply_scd_type2(source_df, target_df, key_columns, update_columns):
          # Assumes source_df has key_columns + update_columns; target_df also
          # carries start_date, end_date and current_flag
          src = (source_df
                 .withColumn("start_date", F.current_date())
                 .withColumn("end_date", F.lit(None).cast("date"))
                 .withColumn("current_flag", F.lit(True))
                 .alias("s"))
          # Split the target into its open (current) and already-closed versions
          open_tgt = target_df.filter(F.col("current_flag")).alias("t")
          closed_tgt = target_df.filter(~F.col("current_flag"))
          join_cond = [F.col("s." + k) == F.col("t." + k) for k in key_columns]
          # A row has changed when any tracked attribute differs (null-safe)
          changed = reduce(lambda a, b: a | b,
                           [~F.col("s." + c).eqNullSafe(F.col("t." + c)) for c in update_columns])
          # Close the open target versions whose tracked attributes changed
          expired = (open_tgt.join(src, join_cond, "inner").filter(changed)
                     .select("t.*")
                     .withColumn("end_date", F.current_date())
                     .withColumn("current_flag", F.lit(False)))
          unchanged = open_tgt.join(expired.select(*key_columns), key_columns, "left_anti")
          # Insert a fresh open version for every changed or brand-new source row
          changed_src = src.join(open_tgt, join_cond, "inner").filter(changed).select("s.*")
          brand_new = src.join(open_tgt, join_cond, "left_anti")
          return (closed_tgt.unionByName(unchanged).unionByName(expired)
                  .unionByName(changed_src).unionByName(brand_new))
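
      And a hedged usage sketch for running it dynamically over many staging tables (the database and table names and the "id" key below are hypothetical placeholders, not from your pipeline):

      for table in ["customers", "orders"]:  # hypothetical staging tables
          src = spark.table("staging." + table)
          tgt = spark.table("scd." + table)
          out = apply_scd_type2(src, tgt, key_columns=["id"],
                                update_columns=[c for c in src.columns if c != "id"])
          out.write.mode("overwrite").saveAsTable("scd." + table + "_updated")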

  • @user-dl3ck6ym4r
    @user-dl3ck6ym4r 10 months ago +1

    What can Hive and Sqoop do that Spark can't?

    • @sravanalakshmipisupati6533
      @sravanalakshmipisupati6533  10 months ago

      Hive, Sqoop, and Spark serve different purposes in the big data ecosystem. Each has its own strengths and use cases. Spark's in-memory processing and distributed nature can introduce overhead, making it less efficient for small to medium-sized datasets compared to simpler tools like Hive for certain use cases. Spark can be resource-intensive, especially for memory and CPU usage. Managing resources effectively requires careful tuning, and inefficient resource allocation can impact performance. Also, Spark might not be as optimal for interactive queries as Hive. Hive's query engine can be more efficient for certain types of ad-hoc queries.
      Often, organizations use a combination of Spark, Hive, Sqoop, and other tools to leverage the strengths of each in different parts of their data processing pipelines.

  • @KiranKumar-cg3yg
    @KiranKumar-cg3yg 1 year ago

    Ma'am, your video and the Jawan trailer were uploaded at the same time. But let me complete this knowledge session first and then watch the trailer.

  • @ramachandranselvaraj7215
    @ramachandranselvaraj7215 1 year ago

    Hi, your content is so useful. I have cracked several interviews as well 😊. Thank you!

    • @sravanalakshmipisupati6533
      @sravanalakshmipisupati6533  1 year ago

      Thanks a lot 🙏

    • @ramachandranselvaraj7215
      @ramachandranselvaraj7215 1 year ago

      @@sravanalakshmipisupati6533 Please post more interesting scenarios frequently, for study purposes. I also have some interesting interview questions to share; please let me know where to send them. Thank you 😊

    • @sravanalakshmipisupati6533
      @sravanalakshmipisupati6533  1 year ago

      @@ramachandranselvaraj7215 Thank you. Please mail me at sravana.pisupati@gmail.com and I will share them with our friends.

    • @ramachandranselvaraj7215
      @ramachandranselvaraj7215 1 year ago

      Sure, thank you!

  • @MasoodAhmed-x6n
    @MasoodAhmed-x6n 1 year ago

    Could you please send me the code to convert an XML file to CSV in Databricks?

    • @sravanalakshmipisupati6533
      @sravanalakshmipisupati6533  1 year ago

      ua-cam.com/video/shUcdR1DAdU/v-deo.html | ua-cam.com/video/fVZOjMTDvuM/v-deo.html | ua-cam.com/video/qkiz53baIks/v-deo.html
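
      For reference, a minimal sketch using the spark-xml library (the paths and the "record" row tag are hypothetical, the cluster needs the com.databricks:spark-xml package attached, and writing CSV assumes the XML rows are flat, since nested fields would need flattening first):

      df = (spark.read.format("xml")
            .option("rowTag", "record")   # hypothetical XML row element
            .load("/mnt/raw/input.xml"))  # hypothetical input path
      df.write.option("header", True).mode("overwrite").csv("/mnt/out/csv")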