Cumulative Salary - PySpark Interview Question

  • Published 27 Jan 2025
  • Hello Everyone,
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.sql.functions import col, sum
    from pyspark.sql.window import Window
    # Sample data: (ID, Name, Sal)
    data = [
        (1, "A", 1000),
        (2, "B", 2000),
        (3, "C", 3000),
        (4, "D", 4000),
    ]
    # Define the schema for the DataFrame
    schema1 = StructType([
        StructField("ID", IntegerType(), True),
        StructField("Name", StringType(), True),
        StructField("Sal", IntegerType(), True)
    ])
    # Assumes an active SparkSession named spark (e.g., in Databricks)
    df2 = spark.createDataFrame(data, schema=schema1)
    df2.show()
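    A minimal sketch of the running-total step the video works toward, using the imports above (Cum_Sal is an illustrative column name):
    # Cumulative salary: sum Sal from the first row through the current row, ordered by ID
    window_spec = Window.orderBy(col("ID")).rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df2.withColumn("Cum_Sal", sum(col("Sal")).over(window_spec)).show()
    # Expected Cum_Sal values: 1000, 3000, 6000, 10000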
    This series is for beginner and intermediate-level candidates who want to crack PySpark interviews.
    Here is the link to the course: www.geekcoders...
    #pyspark #interviewquestions #interview #pysparkinterview #dataengineer #aws #databricks #python

COMMENTS • 6

  • @chandanpatra1053 9 months ago +1

    If you solve this question in Spark SQL instead, is there any problem in terms of optimization?

    • @GeekCoders 9 months ago

      Same
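      Both the DataFrame API and Spark SQL compile to the same Catalyst-optimized plan, which is why the answer is "same". A quick way to verify, assuming the df2 setup from the description:
      df2.createOrReplaceTempView("emp")
      w = Window.orderBy("ID").rowsBetween(Window.unboundedPreceding, Window.currentRow)
      df2.withColumn("run_total", sum("Sal").over(w)).explain()
      spark.sql("select *, sum(Sal) over (order by ID rows between unbounded preceding and current row) as run_total from emp").explain()
      # The two printed physical plans should be identical apart from cosmetic IDs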

    • @chandanpatra1053 9 months ago +1

      @@GeekCoders then can you please tell: if it can be solved in Spark SQL, why go with PySpark? Please make a video explaining the cases where Spark SQL fails to perform some transformation and PySpark comes into the picture. Also, the question you solved is a running total, which can easily be solved using sum (case when). And can you please make a video on what to prepare for interviews, for people who are in a different domain of IT, want to enter data engineering, and show around 1.5 yrs of experience in Databricks? Which topics need to be covered, and with a good grasp? I will wait for your reply. 🙏🙏🙏

    • @rawat7203 9 months ago

      @@chandanpatra1053 pyspark > sparksql in terms of optimization

  • @tarunbhatt.1995 5 months ago +2

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    data = [(1, "A", 1000),
            (2, "B", 2000),
            (3, "C", 3000),
            (4, "D", 4000)]
    schema1 = StructType([StructField("ID", IntegerType(), True),
                          StructField("Name", StringType(), True),
                          StructField("Sal", IntegerType(), True)])
    df2 = spark.createDataFrame(data, schema=schema1)
    df2.show()
    # Register a temp view so the running total can be written in Spark SQL
    df2.createOrReplaceTempView('emp')
    df3 = spark.sql('select sum(sal) over (order by ID rows between unbounded preceding and current row) as Total_Sal from emp')
    df3.show()
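    Note on the frame clause: spelling out rows between unbounded preceding and current row is deliberate. With an order by and no explicit frame, Spark SQL defaults to range between unbounded preceding and current row, which treats tied ordering values as one group; the two frames only agree when the ordering key is unique.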

  • @souradeep.official 5 months ago

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.sql.functions import col, sum
    from pyspark.sql.window import Window
    data = [
        (1, "A", 1000),
        (1, "A", 5000),
        (2, "B", 2000),
        (3, "C", 3000),
        (4, "D", 4000),
    ]
    schema = StructType([
        StructField('id', IntegerType(), True),
        StructField('name', StringType(), True),
        StructField('salary', IntegerType(), True)
    ])
    df = spark.createDataFrame(data, schema)
    df.show(truncate=False)
    # Running total of salary from the first row up to the current row, ordered by id
    window_spec = Window.orderBy(col('id')).rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df1 = df.withColumn('cum_sal', sum(col('salary')).over(window_spec)).select(col('cum_sal'))
    df1.show(truncate=False)
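    One caveat with this variant: since id repeats (two rows with id 1), Window.orderBy(col('id')) does not fully determine row order, so the intermediate cum_sal for the tied rows can differ between runs; the totals from the second row onward are stable. Ordering by a unique key, or adding a tiebreaker column to orderBy, makes the running total deterministic.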