PySpark SQL provides the split() function to convert a delimiter-separated String to an Array (StringType to ArrayType) column on a DataFrame. This is done by splitting a string column on a delimiter such as a space, comma, or pipe, and converting the result into an ArrayType column.
In this article, we will explain how to convert a String column to an Array column using the split() function, both on a DataFrame and in a SQL query.
Split() Function Syntax
PySpark SQL split() is grouped under Array Functions in the PySpark SQL functions module, with the below syntax.
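The signature, as exposed in the pyspark.sql.functions module (the optional limit parameter is available in Spark 3.0 and later):

```python
# Syntax of pyspark.sql.functions.split()
pyspark.sql.functions.split(str, pattern, limit=-1)
```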
The split() function takes as its first argument the DataFrame column of type String, and as its second argument the string delimiter that you want to split on. The delimiter can also be a regex pattern. This function returns a pyspark.sql.Column of type Array.
Before we start with usage, let's first create a DataFrame with a string column containing text separated by a comma delimiter.
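Here is a minimal sketch; the app name, sample names, and the extra columns (dob_year, gender, salary) are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SplitExample").getOrCreate()

# Sample data: the 'name' column holds comma-separated first, middle and last names
data = [("James,A,Smith", "2018", "M", 3000),
        ("Michael,Rose,Jones", "2010", "M", 4000),
        ("Robert,K,Williams", "2010", "M", 4000)]
columns = ["name", "dob_year", "gender", "salary"]

df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)
```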
This yields the below output. Notice that the name column contains tokens (first name, middle name, and last name) separated by commas.
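Assuming the sample rows above, the printSchema() and show() output would look like:

```
root
 |-- name: string (nullable = true)
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+------------------+--------+------+------+
|name              |dob_year|gender|salary|
+------------------+--------+------+------+
|James,A,Smith     |2018    |M     |3000  |
|Michael,Rose,Jones|2010    |M     |4000  |
|Robert,K,Williams |2010    |M     |4000  |
+------------------+--------+------+------+
```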
PySpark Convert String to Array Column
The below PySpark example snippet splits the String column name on the comma delimiter and converts it to an Array. If you do not need the original column, use drop() to remove it.
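A sketch of the conversion using the df created above; NameArray is an assumed name for the new column:

```python
from pyspark.sql.functions import split, col

# Split the comma-separated 'name' string into an ArrayType column,
# then drop the original string column since it is no longer needed
df2 = df.withColumn("NameArray", split(col("name"), ",")) \
        .drop("name")

df2.printSchema()
df2.show(truncate=False)
```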
This yields the below output. As you can see in the schema, NameArray is an array type.
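With the assumed sample data, the resulting schema and output would be:

```
root
 |-- dob_year: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- NameArray: array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------+------+------+----------------------+
|dob_year|gender|salary|NameArray             |
+--------+------+------+----------------------+
|2018    |M     |3000  |[James, A, Smith]     |
|2010    |M     |4000  |[Michael, Rose, Jones]|
|2010    |M     |4000  |[Robert, K, Williams] |
+--------+------+------+----------------------+
```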
Convert String to Array Column Using SQL Query
Since PySpark provides a way to execute raw SQL, let's learn how to write the same example using a Spark SQL expression.
In order to use raw SQL, you first need to create a temporary view using createOrReplaceTempView(). This creates a temporary view from the DataFrame, and the view is available for the lifetime of the current SparkSession.
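A sketch of the same conversion through raw SQL; the view name PERSON is an arbitrary choice:

```python
# Register the DataFrame as a temporary view so it can be queried with raw SQL
df.createOrReplaceTempView("PERSON")

# SPLIT() in Spark SQL mirrors the DataFrame split() function
spark.sql("SELECT SPLIT(name, ',') AS NameArray FROM PERSON") \
     .show(truncate=False)
```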
This yields the same output as the above example.
Complete Example
Below is the complete example of splitting a String type column based on a delimiter or pattern and converting it into an ArrayType column.
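Putting it all together, a complete runnable sketch under the same assumptions as above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("SplitExample").getOrCreate()

# Sample data with a comma-separated 'name' column (illustrative values)
data = [("James,A,Smith", "2018", "M", 3000),
        ("Michael,Rose,Jones", "2010", "M", 4000),
        ("Robert,K,Williams", "2010", "M", 4000)]
columns = ["name", "dob_year", "gender", "salary"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()
df.show(truncate=False)

# Convert the String column to an Array column and drop the original
df2 = df.withColumn("NameArray", split(col("name"), ",")) \
        .drop("name")
df2.printSchema()
df2.show(truncate=False)

# Same conversion using a raw SQL expression on a temporary view
df.createOrReplaceTempView("PERSON")
spark.sql("SELECT SPLIT(name, ',') AS NameArray FROM PERSON") \
     .show(truncate=False)
```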
Conclusion
In this short article, we have learned how to convert a string column into an array column by splitting the string on a delimiter, and we have also seen how to use the split() function in a PySpark SQL expression.