14. explode(), split(), array() & array_contains() functions in PySpark
- Published 13 Oct 2024
- In this video, I explain how to use the explode(), split(), array() and array_contains() functions with ArrayType columns in PySpark.
Link for PySpark Playlist:
• 1. What is PySpark?
Link for PySpark Real Time Scenarios Playlist:
• 1. Remove double quote...
Link for Azure Synapse Analytics Playlist:
• 1. Introduction to Azu...
Link to Azure Synapse Real Time scenarios Playlist:
• Azure Synapse Analytic...
Link for Azure Databricks Playlist:
• 1. Introduction to Az...
Link for Azure Functions Playlist:
• 1. Introduction to Azu...
Link for Azure Basics Playlist:
• 1. What is Azure and C...
Link for Azure Data Factory Playlist:
• 1. Introduction to Azu...
Link for Azure Data Factory Real time Scenarios
• 1. Handle Error Rows i...
Link for Azure Logic Apps playlist
• 1. Introduction to Azu...
#PySpark #Spark #databricks #azuresynapse #synapse #notebook #azuredatabricks #PySparkcode #dataframe #WafaStudies #maheer #azure
This is worth watching. The speed I picked up after following you is unbelievable. Thank you so much for this amazing content; your explanation is without doubt the finest I have seen.
Thank you for your kind words ☺️
Thank you Maheer. You are doing very good work. Have you prepared notes for these videos, I mean slides or anything similar?
You are doing an amazing job brother. Keep it up. Thanks for all your contributions to data engineering tutorials.
Thank you ☺️
@@WafaStudies brother, can you try to upload the videos as quickly as you can, if you don't mind?
@@tarigopulaayyappa will try to upload them faster 😇
@@WafaStudies Thank you very much.
Awesome video, I could understand it thoroughly.
Thank you 😊
Thanks a lot for sharing, Maheer. Can we create a trial account for practice? As of now, I think Microsoft does not provide a free community trial subscription.
Thanks very much for the tutorial :) I have a query regarding reading in JSON files.
I have an array of structs where each struct has a different structure/schema.
Based on a certain property value of the struct, I apply a filter to get that nested struct. However, when I display it using printSchema(), it contains fields that do not belong to that object but are somehow being associated with it from the schemas of the other structs. How can I fix this issue?
Good content
Good video.
Thanks Maheer.
Welcome 🤗
When you used array()... what if the number of skills differs from row to row?
Nice video. How can we remove duplicates from an array column?
In the case of split(), what will happen if we give the delimiter as | instead of , ?
@WafaStudies
Are there any other ways to explode the array without the explode command?
I ask because I made a script with the explode command, but the performance is really bad and I'm looking for another way to do this.
Thank you!
Thank You Wafa..😁😊
Welcome 🤗
Please share the notebook details in the description so that it is easy for us to refer to, or share it in a GitHub repository.
Explains the usage of the explode(), split(), array() and array_contains() functions with ArrayType columns in PySpark.
----------------------------------------
data = [(1,'Maheer',['dotnet','azure']),(2,'Wafa',['java','aws'])]
schema = ['id', 'name', 'skills']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
-----
#explode()
from pyspark.sql.functions import explode,col
df.show()
df1 = df.withColumn('skill',explode(col('skills')))
df1.show()
-------------------------------------------
data = [(1,'Maheer','dotnet,azure'),(2,'Wafa','java,aws')]
schema = ['id', 'name', 'skills']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
-----
#split()
from pyspark.sql.functions import split,col
df.show()
df1 = df.withColumn('skills_array',split('skills',','))
df1.show()
--------------------------------------------
data = [(1,'Maheer','dotnet','azure'),(2,'Wafa','java','aws')]
schema = ['id', 'name', 'primaryskill', 'secondaryskill']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
------
#array()
from pyspark.sql.functions import array,col
df.show()
df1 = df.withColumn('skillsArray',array(col('primaryskill'),col('secondaryskill')))
df1.show()
---------------------------------------------
data = [(1,'Maheer',['dotnet','azure']),(2,'Wafa',['java','aws'])]
schema = ['id', 'name', 'skills']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
------
#array_contains()
from pyspark.sql.functions import array_contains,col
df.show()
df1 = df.withColumn('HasJavaSkill',array_contains('skills',value='java'))
df1.show()
-------------------------------------------------
Sir, how can we explode more than two columns, or even something like 150?
Thank you ❤️
Welcome 🤗
thanks Sir
Welcome
I am not sure why it was not working for me. I changed the script, and then I got both the skills and skill columns:
from pyspark.sql.functions import explode, col
# Sample data
data = [(1, 'abhishek', ['dotnet', 'azure']), (2, 'abhi', ['java', 'aws'])]
schema = ['id', 'name', 'skills']
# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()
# Apply explode function on the "skills" column and rename the exploded column
df1 = df.withColumn('skill', explode(col('skills'))).select('id', 'name', 'skills', 'skill')
df1.show()
Completed
0:48 You have muddled this part. Please explain it properly; it is causing a lot of confusion.