PySpark expr()
is a SQL function to execute SQL-like expressions and to use an existing DataFrame column value as an expression argument to Pyspark built-in functions. Most of the commonly used SQL functions are either part of the PySpark Column class or built-in pyspark.sql.functions
API, besides these PySpark also supports many other SQL functions, so in order to use these, you have to use expr()
function.
Below are 2 use cases of PySpark expr() funcion.
- First, allowing to use of SQL-like functions that are not present in
PySpark Column
type &pyspark.sql.functions
API. for exampleCASE WHEN
,regr_count()
. - Second, it extends the PySpark SQL Functions by allowing to use DataFrame columns in functions for expression. for example, if you wanted to add a month value from a column to a Date column. I will explain this in the example below.
1. PySpark expr() Syntax
Following is syntax of the expr() function.
expr()
the function takes SQL expression as a string argument, executes the expression, and returns a PySpark Column type. Expressions provided with this function are not compile-time safety like DataFrame operations.
2. PySpark SQL expr() Function Examples
Below are some of the examples of using expr() SQL function.
2.1 Concatenate Columns using || (similar to SQL)
If you have SQL background, you pretty much familiar using || to concatenate values from two string columns, you can use expr() expression to do exactly same.
2.2 Using SQL CASE WHEN with expr()
PySpark doesn’t have SQL Like CASE WHEN so in order to use this on PySpark DataFrame withColumn() or select(), you should use expr()
function with expression as shown below.
Here, I have used CASE WHEN
expression on withColumn()
by using expr()
, this example updates an existing column gender with the derived values, M for male, F for Female, and unknown for others
If you have any errors in the expression you will get the run time error but not during the compile time.
2.3 Using an Existing Column Value for Expression
Most of the PySpark function takes constant literal values but sometimes we need to use a value from an existing column instead of a constant and this is not possible without expr() expression. The below example adds a number of months from an existing column instead of a Python constant.
Note that Importing SQL functions are not required when using them with expr(). You see above add_months() is used without importing.
2.4 Giving Column Alias along with expr()
You can also use SQL like syntax to provide the alias name to the column expression.
2.5 Case Function with expr()
Below example converts long data type to String type.
2.7 Arithmetic Operations
expr() is also used to provide arithmetic operations, below examples add value 5 to increment
and creates a new column new_increment
2.8 Using Filter with expr()
Filter the DataFrame rows can done using expr() expression.
3. Complete Example of PySpark expr() Function
Conclusion
PySpark expr() function provides a way to run SQL like expression with DataFrames, here we have learned how to use expression with select(), withColumn() and to filter the DataFrame rows.
explained well!