pyspark scenario based interview questions and answers

  • Published 11 Sep 2024

COMMENTS • 11

  • @Tech.S7
@Tech.S7 3 months ago

Thanks for the informative stuff.
Instead of specifying all the conditions in the join, we can specify just one condition (I mean the AND/OR conditions are not required).
It works and fetches the expected output.
Cheers!!
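
    A minimal sketch of the one-condition idea, assuming the problem's df has columns customer, start_location, end_location (one row per trip leg) and that locations are not shared across customers:

    from pyspark.sql.functions import col
    starts = df.select("customer", "start_location")
    ends = df.select("customer", "end_location")
    # single join condition, no AND/OR chain: a start that never appears as an end is the true start
    true_start = starts.join(ends, col("start_location") == col("end_location"), "leftanti")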

  • @PoojaM22
@PoojaM22 5 months ago

Awesome, bro! Please keep up the good work!

  • @siddharthchoudhary103
@siddharthchoudhary103 6 months ago

At the end, after finding the unique records, can we use collect_list with a groupBy on customer and then use indexes as the start and end location in withColumn?
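
    One way that suggestion could look, as a rough sketch. The trip_order column is hypothetical (not in the original): collect_list after a groupBy gives no ordering guarantee, so the lists need an explicit sort before indexing.

    from pyspark.sql.functions import collect_list, element_at, sort_array, struct
    # sort each customer's legs by the assumed trip_order, then index:
    # the first start_location is the trip start, the last end_location is the trip end
    agg = df.groupBy('customer').agg(
        sort_array(collect_list(struct('trip_order', 'start_location'))).alias('s'),
        sort_array(collect_list(struct('trip_order', 'end_location'))).alias('e'))
    result = (agg.withColumn('start', element_at('s', 1).getField('start_location'))
                 .withColumn('end', element_at('e', -1).getField('end_location'))
                 .drop('s', 'e'))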

  • @user-ju4ih5xr8e
@user-ju4ih5xr8e 6 months ago +3

Here is my solution:
from pyspark.sql.functions import concat, col
# create two dataframes, one for start locations and one for end locations
df1 = df.select('customer', 'start_location').alias('a')
df2 = df.select('customer', 'end_location').alias('b')
# anti-joins: keep starts that never appear as ends, and ends that never appear as starts
df3 = df1.join(df2, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')
df4 = df2.join(df1, concat(col('a.customer'), col('a.start_location')) == concat(col('b.customer'), col('b.end_location')), 'leftanti')
# final output: pair each customer's true start with their true end
df5 = df3.join(df4, ["customer"], 'inner')
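
    For reference, a guessed reconstruction of the input these solutions assume (the actual data in the video may differ): one row per trip leg, each customer's legs chaining start to end, with spark as the active SparkSession.

    data = [("c1", "Delhi", "Agra"), ("c1", "Agra", "Jaipur"), ("c2", "Pune", "Goa")]
    df = spark.createDataFrame(data, ["customer", "start_location", "end_location"])
    # expected output: (c1, Delhi, Jaipur) and (c2, Pune, Goa)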

  • @user-dv1ry5cs7e
@user-dv1ry5cs7e 5 months ago

with t1 AS (select customer, start_loc from travel_data where start_loc not in (select end_loc from travel_data)),
t2 AS (select customer, end_loc from travel_data where end_loc not in (select start_loc from travel_data))
select t1.customer, t1.start_loc, t2.end_loc from t2 join t1 on t2.customer = t1.customer
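
    To run this SQL from PySpark, a small sketch (assumes the same df as the other solutions, renamed to match this comment's start_loc/end_loc columns and registered under the view name the query uses):

    df.withColumnRenamed("start_location", "start_loc").withColumnRenamed("end_location", "end_loc").createOrReplaceTempView("travel_data")
    spark.sql("""
        with t1 AS (select customer, start_loc from travel_data where start_loc not in (select end_loc from travel_data)),
        t2 AS (select customer, end_loc from travel_data where end_loc not in (select start_loc from travel_data))
        select t1.customer, t1.start_loc, t2.end_loc from t2 join t1 on t2.customer = t1.customer
    """).show()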

  • @prabhatgupta6415
@prabhatgupta6415 6 months ago

    df1=df.select("customer","start_location")
    df2=df.select("customer","end_location")
    df3=df1.union(df2).groupBy("customer","start_location").agg(count("start_location").alias("count")).filter("count==1")
    df3.alias("a").join(df3.alias("b"),["customer"],"inner").filter("a.start_location

  • @user-tm4zj2zz8x
@user-tm4zj2zz8x 5 months ago

from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import StringType
# return the element of x that never occurs in y
def loc(x, y):
    a = [i for i in x if i not in y]
    return a[0]
loc_udf = udf(loc, StringType())
# gather each customer's start and end locations into lists
df1 = df.groupBy('customer').agg(collect_list('start_location').alias('start_list'), collect_list('end_location').alias('end_list'))
display(df1)
# the true start is the start never used as an end, and vice versa
df2 = df1.withColumn('start', loc_udf(df1.start_list, df1.end_list)).withColumn('end', loc_udf(df1.end_list, df1.start_list)).drop(*('start_list', 'end_list'))
display(df2)

  • @tradingwith10k10
@tradingwith10k10 3 months ago

No UDF, no join, no subquery:
from pyspark.sql.functions import collect_set, array_except
display(df.groupBy("customer")
.agg(collect_set("start_location").alias("start_list"), collect_set("end_location").alias("end_list"))
# the set difference start_list - end_list leaves only the true start (and vice versa)
.withColumn("start_location", array_except("start_list", "end_list").getItem(0))
.withColumn("end_location", array_except("end_list", "start_list").getItem(0))
.drop("start_list", "end_list"))
