14. explode(), split(), array() & array_contains() functions in PySpark

  • Published 29 Nov 2024

COMMENTS • 31

  • @raghunathpanse3258
    @raghunathpanse3258 1 year ago +4

    This is worth watching. The speed at which I picked things up after following you is unbelievable. Thank you so much for this amazing content, and your explanation is no doubt the finest I have ever seen.

    • @WafaStudies
      @WafaStudies  1 year ago

      Thank you for your kind words ☺️

  • @deepjyotimitra1340
    @deepjyotimitra1340 2 years ago +4

    You are doing an amazing job, brother. Keep it up. Thanks for all your contributions to data engineering tutorials.

    • @WafaStudies
      @WafaStudies  2 years ago

      Thank you ☺️

    • @tarigopulaayyappa
      @tarigopulaayyappa 2 years ago +1

      @@WafaStudies Brother, can you try to upload the videos as quickly as you can, if you don't mind?

    • @WafaStudies
      @WafaStudies  2 years ago

      @@tarigopulaayyappa Will try to upload them faster 😇

    • @tarigopulaayyappa
      @tarigopulaayyappa 2 years ago

      @@WafaStudies Thank you very much.

  • @VivekKBangaru
    @VivekKBangaru 1 year ago +2

    Awesome video; I can thoroughly understand it.

  • @polakigowtam183
    @polakigowtam183 2 years ago +1

    Good video.
    Thanks, Maheer.

  • @Aelmasri-ht5sv
    @Aelmasri-ht5sv 1 year ago

    Thank you, Maheer. You are doing very good work. Have you prepared any material for these videos, I mean slides or anything similar?

  • @shreyaspatil4861
    @shreyaspatil4861 10 months ago

    Thanks very much for the tutorial :) I have a query regarding reading in JSON files.
    I have an array of structs where each struct has a different structure/schema.
    Based on a certain property value of a struct, I apply a filter to get that nested struct. However, when I display it using printSchema, it contains fields that do not belong to that object but are somehow being associated with it from the schemas of the other structs. How can I possibly fix this issue?
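
    A minimal sketch of one likely cause, assuming schema inference is merging the structs: when Spark infers a schema over an array of structs with different shapes, it builds the union of all fields, so every element appears to carry the other structs' fields as nulls. One workaround is to keep the value as a JSON string and re-parse it with an explicit, narrower schema via from_json() (the data and field names below are hypothetical):

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType

    # hypothetical data: each row carries one struct serialized as a JSON string
    data = [(1, '{"type": "a", "value": "x", "extra": "not wanted"}')]
    df = spark.createDataFrame(data, ['id', 'raw'])

    # explicit narrow schema: only the declared fields survive the parse
    narrow_schema = StructType([StructField('type', StringType()),
                                StructField('value', StringType())])
    df.withColumn('parsed', from_json(col('raw'), narrow_schema)).printSchema()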

  • @RakeshGandu-wb7eu
    @RakeshGandu-wb7eu 1 year ago

    Nice video. How can we remove duplicates from an array column?
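
    A minimal sketch of one way to do this, assuming Spark 2.4+ where array_distinct() is available:

    from pyspark.sql.functions import array_distinct, col

    data = [(1, 'Maheer', ['dotnet', 'azure', 'dotnet'])]
    df = spark.createDataFrame(data, ['id', 'name', 'skills'])

    # array_distinct() drops duplicate elements from the array column
    df.withColumn('skills_unique', array_distinct(col('skills'))).show()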

  • @phanidivi3613
    @phanidivi3613 2 years ago

    Thanks a lot for sharing, Maheer. Can we create any trial account for practice? As of now, Microsoft does not provide a free community trial subscription, I think.

  • @vasanthasworld2948
    @vasanthasworld2948 2 years ago

    Please drop the notebook details in the description so that it will be easy for us to refer to, or you can share them in a GitHub repository.

    • @DataWithNagar
      @DataWithNagar 1 year ago

      Explained the usage of the explode(), split(), array() & array_contains() functions with ArrayType columns in PySpark.
      ----------------------------------------
      data = [(1,'Maheer',['dotnet','azure']),(2,'Wafa',['java','aws'])]
      schema = ['id', 'name', 'skills']
      df = spark.createDataFrame(data=data,schema=schema)
      df.display()
      df.printSchema()
      -----
      # explode(): one output row per element of the ArrayType column
      from pyspark.sql.functions import explode,col
      df.show()
      df1 = df.withColumn('skill', explode(col('skills')))
      df1.show()
      -------------------------------------------
      data = [(1,'Maheer','dotnet,azure'),(2,'Wafa','java,aws')]
      schema = ['id', 'name', 'skills']
      df = spark.createDataFrame(data=data,schema=schema)
      df.display()
      df.printSchema()
      -----
      # split(): turn the delimited string column into an ArrayType column
      from pyspark.sql.functions import split,col
      df.show()
      df1 = df.withColumn('skills_array',split('skills',','))
      df1.show()
      --------------------------------------------
      data = [(1,'Maheer','dotnet','azure'),(2,'Wafa','java','aws')]
      schema = ['id', 'name', 'primaryskill', 'secondaryskill']
      df = spark.createDataFrame(data=data,schema=schema)
      df.display()
      df.printSchema()
      ------
      # array(): combine multiple columns into a single ArrayType column
      from pyspark.sql.functions import array,col
      df.show()
      df1 = df.withColumn('skillsArray', array(col('primaryskill'), col('secondaryskill')))
      df1.show()
      ---------------------------------------------
      data = [(1,'Maheer',['dotnet','azure']),(2,'Wafa',['java','aws'])]
      schema = ['id', 'name', 'skills']
      df = spark.createDataFrame(data=data,schema=schema)
      df.display()
      df.printSchema()
      ------
      # array_contains(): check whether the array column contains a given value
      from pyspark.sql.functions import array_contains,col
      df.show()
      df1 = df.withColumn('HasJavaSkill',array_contains('skills',value='java'))
      df1.show()
      -------------------------------------------------

  • @tarun007
    @tarun007 2 years ago +1

    Thank you, Wafa! 😁😊

  • @Sundar_Tenkasi
    @Sundar_Tenkasi 3 months ago

    Good content

  • @yosaki-fv9yy
    @yosaki-fv9yy 11 months ago

    When you used array(), what if the number of skills is different between records?
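
    A minimal sketch of what happens, assuming the missing skill arrives as a null column value: array() always produces one slot per column passed in, so shorter records end up with null elements, which can then be stripped with a higher-order filter (Spark 2.4+):

    from pyspark.sql.functions import array, col, expr

    # hypothetical rows where the second skill is missing for one record
    data = [(1, 'Maheer', 'dotnet', 'azure'), (2, 'Wafa', 'java', None)]
    df = spark.createDataFrame(data, ['id', 'name', 'primaryskill', 'secondaryskill'])

    # the array keeps a null slot for the missing skill...
    df1 = df.withColumn('skillsArray', array(col('primaryskill'), col('secondaryskill')))
    # ...which the SQL filter() higher-order function can drop
    df1.withColumn('skillsArray', expr('filter(skillsArray, x -> x IS NOT NULL)')).show()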

  • @sahilgarg7383
    @sahilgarg7383 8 months ago

    In the case of split, what will happen if we give the delimiter as | instead of ,?
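
    A short sketch of what to watch for, assuming the data is pipe-delimited: split() treats its delimiter as a regular expression, and | is a regex metacharacter, so it has to be escaped (or wrapped in a character class) to split on a literal pipe:

    from pyspark.sql.functions import split, col

    data = [(1, 'Maheer', 'dotnet|azure'), (2, 'Wafa', 'java|aws')]
    df = spark.createDataFrame(data, ['id', 'name', 'skills'])

    # '\\|' escapes the pipe; an unescaped '|' would match the empty string
    # and break the value into individual characters instead
    df.withColumn('skills_array', split(col('skills'), '\\|')).show()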

  • @mohitpande2006
    @mohitpande2006 1 year ago

    Sir, how can we explode more than 2 columns, or even something like 150?
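
    A minimal sketch of one common pattern, assuming the columns are arrays that should be expanded together rather than cross-joined: arrays_zip() pairs the arrays element-wise so a single explode() expands them in lockstep (the column names here are hypothetical, and the same pattern extends to more columns):

    from pyspark.sql.functions import arrays_zip, explode, col

    data = [(1, ['dotnet', 'azure'], ['2yr', '3yr'])]
    df = spark.createDataFrame(data, ['id', 'skills', 'experience'])

    # arrays_zip() builds an array of structs; explode() then yields one row per position
    df1 = df.withColumn('zipped', explode(arrays_zip(col('skills'), col('experience'))))
    df1.select('id',
               col('zipped.skills').alias('skill'),
               col('zipped.experience').alias('experience')).show()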

  • @julianalilian
    @julianalilian 1 year ago

    @WafaStudies
    Are there any other ways to explode the array without the explode command?
    I ask because I made a script with the explode command, but the performance is really bad and I'm looking for another way to do this.
    Thank you!
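
    A small sketch of an alternative spelling, with the caveat that it is not guaranteed to be faster: the SQL LATERAL VIEW form runs the same generator underneath, so it mainly changes readability; if rows with empty arrays are being lost, explode_outer() / LATERAL VIEW OUTER keeps them:

    data = [(1, 'Maheer', ['dotnet', 'azure']), (2, 'Wafa', ['java', 'aws'])]
    df = spark.createDataFrame(data, ['id', 'name', 'skills'])
    df.createOrReplaceTempView('people')

    # LATERAL VIEW explode() is the SQL equivalent of withColumn(..., explode(...))
    spark.sql("""
        SELECT id, name, skill
        FROM people
        LATERAL VIEW explode(skills) s AS skill
    """).show()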

  • @SonuKumar-fn1gn
    @SonuKumar-fn1gn 2 years ago +1

    Thank you ❤️

  • @maskally6398
    @maskally6398 3 months ago

    0:48 Did you put soap in the water here? Please explain it properly, it is getting very confusing.

  • @abhishekstatus_7
    @abhishekstatus_7 1 year ago

    For me, I am not sure why it was not working. I changed the script, and then I got both the skills and skill columns:
    from pyspark.sql.functions import explode, col
    # Sample data
    data = [(1, 'abhishek', ['dotnet', 'azure']), (2, 'abhi', ['java', 'aws'])]
    schema = ['id', 'name', 'skills']
    # Create DataFrame
    df = spark.createDataFrame(data, schema)
    df.show()
    # Apply explode function on the "skills" column and rename the exploded column
    df1 = df.withColumn('skill', explode(col('skills'))).select('id', 'name', 'skills', 'skill')
    df1.show()

  • @deepakk8758
    @deepakk8758 2 years ago +1

    Thanks, sir.

  • @vutv5742
    @vutv5742 9 months ago

    Completed