Cumulative Salary - PySpark Interview Question
- Published Jan 27, 2025
- Hello Everyone,
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, sum  # note: this import shadows Python's built-in sum
from pyspark.sql.window import Window

# Sample data: (ID, Name, Sal)
data = [
    (1, "A", 1000),
    (2, "B", 2000),
    (3, "C", 3000),
    (4, "D", 4000),
]

# Define the schema for the DataFrame
schema1 = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Sal", IntegerType(), True)
])

# 'spark' is assumed to be an existing SparkSession (e.g. in a Databricks notebook)
df2 = spark.createDataFrame(data, schema=schema1)
df2.show()
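For reference, df2.show() on this data should print a table like the one below (exact column widths depend on the data):

+---+----+----+
| ID|Name| Sal|
+---+----+----+
|  1|   A|1000|
|  2|   B|2000|
|  3|   C|3000|
|  4|   D|4000|
+---+----+----+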
This series is for beginners and intermediate-level candidates who want to crack PySpark interviews.
Here is the link to the course: www.geekcoders...
#pyspark #interviewquestions #interview #pysparkinterview #dataengineer #aws #databricks #python
If you solve this question in Spark SQL, is there any problem in terms of optimization?
Same
@GeekCoders Then can you please tell: if it can be solved in Spark SQL, why go with PySpark? Please make a video explaining the cases where Spark SQL fails to perform some transformation and PySpark comes into the picture. Also, the question you solved is a running total, which can easily be solved using sum(case when). And could you make a video on what to prepare for interviews, for people who are in a different domain of IT, want to enter data engineering, and show around 1.5 years of experience in Databricks? Which topics must be covered, along with getting a good grasp of them? I will wait for your reply. 🙏🙏🙏
@chandanpatra1053 pyspark > sparksql in terms of optimization
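For what it's worth, both the DataFrame API and Spark SQL are compiled by the same Catalyst optimizer, so an equivalent query should yield the same physical plan either way. A minimal sketch to check this, assuming df2 from the description above and the emp temp view created in the snippet below:

from pyspark.sql.functions import col, sum as _sum
from pyspark.sql.window import Window

# The same running total expressed through the DataFrame API and through Spark SQL
w = Window.orderBy('ID').rowsBetween(Window.unboundedPreceding, Window.currentRow)
api_df = df2.withColumn('Total_Sal', _sum(col('Sal')).over(w))
sql_df = spark.sql("select *, sum(Sal) over (order by ID rows between unbounded preceding and current row) as Total_Sal from emp")

# If the printed physical plans match, neither approach has an optimization edge
api_df.explain()
sql_df.explain()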
data = [(1, "A", 1000),
        (2, "B", 2000),
        (3, "C", 3000),
        (4, "D", 4000)]

schema1 = StructType([StructField("ID", IntegerType(), True),
                      StructField("Name", StringType(), True),
                      StructField("Sal", IntegerType(), True)])

df2 = spark.createDataFrame(data, schema=schema1)
df2.show()

# Register a temp view so the running total can be expressed in Spark SQL
df2.createOrReplaceTempView('emp')

# Running (cumulative) total of Sal, ordered by ID
df3 = spark.sql('select sum(Sal) over (order by ID rows between unbounded preceding and current row) as Total_Sal from emp')
df3.show()
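As an aside, the sum(case when) approach mentioned in the comments can also produce this running total via a self-join. A sketch, assuming the emp view above and unique IDs; it returns the same totals (1000, 3000, 6000, 10000) but is usually slower than a window function because of the join:

df4 = spark.sql("""
    select a.ID,
           sum(case when b.ID <= a.ID then b.Sal else 0 end) as Total_Sal
    from emp a cross join emp b
    group by a.ID
    order by a.ID
""")
df4.show()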
data = [
    (1, "A", 1000),
    (1, "A", 5000),
    (2, "B", 2000),
    (3, "C", 3000),
    (4, "D", 4000),
]

schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('name', StringType(), True),
    StructField('salary', IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.show(truncate=False)

# Frame runs from the first row up to the current row, ordered by id
window_spec = Window.orderBy(col('id')).rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Cumulative salary; select() keeps only the running-total column
df1 = df.withColumn('cum_sal', sum(col('salary')).over(window_spec)).select(col('cum_sal'))
df1.show(truncate=False)
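Note that this data set has two rows with id = 1; ordering only by id with a rows-based frame makes the order of those two rows (and hence the first running total) nondeterministic. Assuming the input order is preserved, df1.show(truncate=False) would print:

+-------+
|cum_sal|
+-------+
|1000   |
|6000   |
|8000   |
|11000  |
|15000  |
+-------+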