PySpark flatMap()
PySpark flatMap() is a transformation operation that flattens an RDD (or the array/map columns of a DataFrame) after applying a function to every element, and returns a new PySpark RDD/DataFrame. In this article, you will learn the syntax and usage of PySpark flatMap() with examples.
First, let’s create an RDD from the list.
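Here is a minimal sketch; the sample sentences used below are illustrative data chosen just for this example.

```python
from pyspark.sql import SparkSession

# Create a SparkSession (the entry point for PySpark)
spark = SparkSession.builder.appName("flatMapExample").getOrCreate()

# Illustrative sample data: each element is a sentence
data = ["Project Gutenberg's",
        "Alice's Adventures in Wonderland",
        "Project Gutenberg's",
        "Adventures in Wonderland",
        "Project Gutenberg's"]

# Create an RDD from the Python list
rdd = spark.sparkContext.parallelize(data)

# Print every element of the RDD
for element in rdd.collect():
    print(element)
```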
For the sample data above, this yields the below output.
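```
Project Gutenberg's
Alice's Adventures in Wonderland
Project Gutenberg's
Adventures in Wonderland
Project Gutenberg's
```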
flatMap() Syntax
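Following is the syntax of flatMap() on an RDD, where f is the function applied to every element (it must return an iterable) and preservesPartitioning is an optional flag.

```python
# flatMap() syntax
flatMap(f, preservesPartitioning=False)
```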
flatMap() Example
Now, let’s see an example of how to apply a flatMap() transformation to an RDD. The example below first splits each record in the RDD by space and then flattens the result, so the resulting RDD contains a single word in each record.
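Continuing from the rdd created above, a sketch of this transformation looks like the following.

```python
# Split each record by space and flatten the result,
# so that every word becomes its own record
rdd2 = rdd.flatMap(lambda x: x.split(" "))

# Print every word
for element in rdd2.collect():
    print(element)
```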
This yields the below output.
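```
Project
Gutenberg's
Alice's
Adventures
in
Wonderland
Project
Gutenberg's
Adventures
in
Wonderland
Project
Gutenberg's
```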
Complete PySpark flatMap() Example
Below is a complete example of the flatMap() function that works with an RDD.
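The sketch below puts both steps together, again using the illustrative sample data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatMapExample").getOrCreate()

# Illustrative sample data
data = ["Project Gutenberg's",
        "Alice's Adventures in Wonderland",
        "Project Gutenberg's",
        "Adventures in Wonderland",
        "Project Gutenberg's"]

# Create an RDD from the list and print its elements
rdd = spark.sparkContext.parallelize(data)
for element in rdd.collect():
    print(element)

# flatMap(): split each record by space and flatten into individual words
rdd2 = rdd.flatMap(lambda x: x.split(" "))
for element in rdd2.collect():
    print(element)
```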
Using flatMap() Transformation on DataFrame
Unfortunately, PySpark DataFrame doesn’t have a flatMap() transformation; however, DataFrame has the explode() SQL function, which is used to flatten an array or map column. Below is a complete example.
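Here is a sketch assuming a small DataFrame with a name column and an array column knownLanguages; the sample rows are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("explodeExample").getOrCreate()

# Illustrative sample rows: a name and an array of known languages
data = [("James", ["Java", "Scala"]),
        ("Michael", ["Spark", "Java"]),
        ("Robert", ["CSharp", ""])]

df = spark.createDataFrame(data=data, schema=["name", "knownLanguages"])

# explode() produces one output row per element of the array column,
# which is the DataFrame equivalent of flattening with flatMap();
# the generated column is named "col" by default
df2 = df.select(df.name, explode(df.knownLanguages))
df2.show()
```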
This example flattens the array column “knownLanguages” and, for the sample rows above, yields the below output.
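```
+-------+------+
|   name|   col|
+-------+------+
|  James|  Java|
|  James| Scala|
|Michael| Spark|
|Michael|  Java|
| Robert|CSharp|
| Robert|      |
+-------+------+
```

Note that explode() names the generated column col by default; you can rename it with alias(), for example explode(df.knownLanguages).alias("language").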
Conclusion
In conclusion, we have learned how to apply the PySpark flatMap() transformation to flatten an RDD, and how to use the explode() function as an alternative to flatten array or map columns of a DataFrame.