Live Big Data Mock Interview | Technical Round 2 : PySpark | Slowly Changing Dimensions | Data Skew

  • Published 19 Dec 2024

COMMENTS • 10

  • @Sha-mu3pv
    @Sha-mu3pv 13 days ago

    17:20 distinct is not a narrow transformation

  • @rajeshd9925
    @rajeshd9925 8 months ago

    A window function will help solve this PySpark problem.

  • @akhiladevangamath1277
    @akhiladevangamath1277 25 days ago

    The interviewee has 9 years of experience as a data engineer.

  • @varadpadalkar4879
    @varadpadalkar4879 8 months ago

    Sir, can you please share the PySpark code for that problem?

  • @Blissful_Echoes-f6t
    @Blissful_Echoes-f6t 8 months ago

    thank u sir.

  • @VikasChauhan-h7o
    @VikasChauhan-h7o 5 months ago +1

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    # use 'spark' for a SparkSession ('sc' conventionally means SparkContext)
    spark = SparkSession \
        .builder \
        .master("local[*]") \
        .appName('example_spark') \
        .getOrCreate()

    # creating a dataframe
    data = [
        (2000, '2024-01-01'),
        (3000, '2024-01-02'),
        (45000, '2024-01-22'),
        (40000, '2024-02-02'),
        (13000, '2024-03-03')
    ]
    headers = ("revenue", "date")
    df = spark.createDataFrame(data, headers)
    df.show()

    # roll revenue up to the month, then take a running total over months
    df = df.withColumn('month', F.date_format(df.date, 'yyyy-MM'))
    df = df.groupBy('month').agg(F.sum('revenue').alias('revenue')).orderBy('month')
    my_window = (Window.orderBy('month')
                 .rowsBetween(Window.unboundedPreceding, 0))
    df_new = df.withColumn('cum_sum', F.sum('revenue').over(my_window))
    df_new.show()

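The monthly roll-up and running total in the comment above can be sanity-checked without a Spark session. This is a minimal plain-Python sketch of the same logic, using the sample data from the comment; it only mirrors the groupBy/window steps, it is not the Spark execution itself:

```python
from itertools import groupby

# same sample rows as the comment: (revenue, date)
data = [
    (2000, '2024-01-01'),
    (3000, '2024-01-02'),
    (45000, '2024-01-22'),
    (40000, '2024-02-02'),
    (13000, '2024-03-03'),
]

# group by the 'yyyy-MM' prefix of the date and sum revenue,
# mirroring the groupBy('month').agg(sum('revenue')) step
rows = sorted((d[:7], r) for r, d in data)
monthly = [(m, sum(r for _, r in grp))
           for m, grp in groupby(rows, key=lambda t: t[0])]

# running total over months in ascending order,
# mirroring the unboundedPreceding-to-current-row window
cum, out = 0, []
for month, revenue in monthly:
    cum += revenue
    out.append((month, revenue, cum))

print(out)
# [('2024-01', 50000, 50000), ('2024-02', 40000, 90000), ('2024-03', 13000, 103000)]
```

January's three rows sum to 50000, so the cumulative column grows 50000 → 90000 → 103000, which is what the Spark `df_new.show()` should print in its `cum_sum` column.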