this is worthy to watch.... The speed I picked up after following you is unbelievable. thank you soo muchh for this amazing content and no doubt your explanation is finest ever I have seen.
Thank you for your kind words ☺️
You are doing an amazing job brother. Keep it up. Thanks for all your contributions to data engineering tutorials.
Thank you ☺️
@@WafaStudies Brother, can you try to upload the videos a bit more quickly, if you don't mind?
@@tarigopulaayyappa Will try to upload faster 😇
@@WafaStudies Thank you very much.
Awesome video, I can thoroughly understand it.
Thank you 😊
Good video.
Thanks Maheer.
Welcome 🤗
Thank you Maheer, you are doing great work. Have you prepared material for these videos, I mean slides or anything like that?
Thanks very much for the tutorial :) I have a query regarding reading JSON files.
I have an array of structs where each struct has a different structure/schema.
Based on a certain property value of the struct, I apply a filter to get that nested struct. However, when I display it using printSchema, it contains fields that do not belong to that object but are somehow being associated with it from the schemas of the other structs. How can I fix this issue?
Nice video! How can we remove duplicates from an array column?
Thanks a lot for sharing, Maheer. Can we create any trial account for practice? As of now, I think Microsoft does not provide a free community trial subscription.
Please drop the notebook details in the description so that it will be easy for us to refer to, or you can share them in a GitHub repository.
Explained the usage of the explode(), split(), array() & array_contains() functions with ArrayType columns in PySpark.
----------------------------------------
data = [(1,'Maheer',['dotnet','azure']),(2,'Wafa',['java','aws'])]
schema = ['id', 'name', 'skills']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
-----
#explode()
from pyspark.sql.functions import explode,col
df.show()
df1 = df.withColumn('skill', explode(col('skills')))
df1.show()
-------------------------------------------
data = [(1,'Maheer','dotnet,azure'),(2,'Wafa','java,aws')]
schema = ['id', 'name', 'skills']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
-----
#split()
from pyspark.sql.functions import split,col
df.show()
df1 = df.withColumn('skills_array',split('skills',','))
df1.show()
--------------------------------------------
data = [(1,'Maheer','dotnet','azure'),(2,'Wafa','java','aws')]
schema = ['id', 'name', 'primaryskill', 'secondaryskill']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
------
#array()
from pyspark.sql.functions import array,col
df.show()
df1 = df.withColumn('skillsArray', array(col('primaryskill'), col('secondaryskill')))
df1.show()
---------------------------------------------
data = [(1,'Maheer',['dotnet','azure']),(2,'Wafa',['java','aws'])]
schema = ['id', 'name', 'skills']
df = spark.createDataFrame(data=data,schema=schema)
df.display()
df.printSchema()
------
from pyspark.sql.functions import array_contains,col
df.show()
df1 = df.withColumn('HasJavaSkill',array_contains('skills',value='java'))
df1.show()
-------------------------------------------------
Thank You Wafa..😁😊
Welcome 🤗
Good content
When you used array(), what happens if the number of skills differs between records?
In the case of split, what will happen if we give the delimiter as | instead of ,?
Sir, how can we explode more than 2 columns, or even something like 150?
@WafaStudies
Are there any other ways to explode the array without the explode command?
I ask because I made a script with the explode command, but the performance is really bad and I'm looking for another way to do this.
Thank you!
Thank you ❤️
Welcome 🤗
0:48 It feels like you've mixed soap into the water here; please explain it properly, it's getting very confusing.
For me, I am not sure why it was not working. I changed the script, and then I got both the skills and skill columns:
from pyspark.sql.functions import explode, col
# Sample data
data = [(1, 'abhishek', ['dotnet', 'azure']), (2, 'abhi', ['java', 'aws'])]
schema = ['id', 'name', 'skills']
# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()
# Apply explode function on the "skills" column and rename the exploded column
df1 = df.withColumn('skill', explode(col('skills'))).select('id', 'name', 'skills', 'skill')
df1.show()
Thanks, sir.
Welcome
Completed