Nice explanation 👌 👍 👏
Nice explanation. I want all the columns in both the examples.
Nice and clear, thank you!
better than most lecturers
super bro
Once you have this column as an RDD post-transformation, how do you add it back to the existing data frame as a new column?
If you want to do this in a Spark data frame and store the results, use the pyspark.sql.functions.split function to split the string by a delimiter; this returns an array column, much like map does. Then, to get the same sort of effect as flatMap inside an existing data frame, you can use the pyspark.sql.functions.explode function on the array column of split values:

import pyspark.sql.functions as f
df = df.withColumn("split_values", f.split(f.col("product_descriptions"), " "))
df = df.withColumn("exploded", f.explode(f.col("split_values")))

Note that withColumn returns a new data frame, so the result has to be assigned (or the calls chained) for the second step to see the "split_values" column.
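Assuming a data frame with a single product_descriptions string column, a self-contained version looks roughly like this (the SparkSession setup and sample rows are made up just to show the shape of the output):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Made-up sample data for illustration
df = spark.createDataFrame(
    [("red cotton shirt",), ("blue denim jeans",)],
    ["product_descriptions"],
)

# split: one array of words per row (map-like, row count unchanged)
df = df.withColumn("split_values", f.split(f.col("product_descriptions"), " "))

# explode: one output row per array element (flatMap-like, row count grows)
df = df.withColumn("exploded", f.explode(f.col("split_values")))

df.show(truncate=False)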
Keep in mind, it depends on what you're trying to do. map and flatMap are useful if you want to pull a column out of a data frame and do other work with it outside the context of Spark; for instance, getting a list in Python to iterate through with another Python library. If you want to keep the data in the data frame, you're usually better off using the built-in Spark functions on the data frame columns directly; in some cases these call map and flatMap internally on the RDD, but it typically means less code for the same performance. In my experience there are also circumstances where the map and flatMap methods are slower, so sticking to the Spark/PySpark built-in column functions is generally best.
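To make the distinction concrete, roughly (reusing the product_descriptions column from above; the exact splitting logic is just for illustration):

import pyspark.sql.functions as f

# Leaving Spark: flatMap the column out into a plain Python list on the driver,
# e.g. to feed another Python library
words = (
    df.select("product_descriptions")
      .rdd
      .flatMap(lambda row: row["product_descriptions"].split(" "))
      .collect()
)

# Staying in Spark: the equivalent with built-in column functions,
# keeping the result as a data frame
words_df = df.select(
    f.explode(f.split(f.col("product_descriptions"), " ")).alias("word")
)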
You can build a data frame from an RDD using RDD.toDF(), but you will need some kind of index value to join it back onto the source data frame in a meaningful way. Because of the way Spark partitions data between executors, there is no inherent order to the data, which makes joining an RDD back (at scale) pointless without a column to join on. So this goes back to the point that using the built-in functions avoids all this hassle.
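If you do need to go the RDD route, one rough pattern is to tag each row with an index before leaving the data frame, carry that index through the RDD transformation, and join on it afterwards (the zipWithIndex approach and the upper-casing step here are just illustrative, not the only way):

from pyspark.sql import Row

# Tag every source row with a stable index before dropping to the RDD;
# cache so the index assignment is not recomputed separately on each branch
indexed = df.rdd.zipWithIndex().map(
    lambda pair: Row(idx=pair[1], **pair[0].asDict())
).toDF().cache()

# Do the RDD-level transformation, carrying the index along
transformed = indexed.rdd.map(
    lambda row: Row(idx=row["idx"], upper_desc=row["product_descriptions"].upper())
).toDF()

# Join the derived column back onto the indexed source data frame
result = indexed.join(transformed, on="idx", how="inner")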