PySpark's lit() function and its Scala counterpart typedLit() are used to add a new column to a DataFrame by assigning a literal or constant value. Both functions return a Column type.
In PySpark, lit() is imported from pyspark.sql.functions; typedLit() belongs to Spark's Scala API (org.apache.spark.sql.functions) and has no direct PySpark equivalent.
First, let’s create a DataFrame.
lit() Function to Add Constant Column
PySpark lit() function is used to add constant or literal value as a new column to the DataFrame.
From the API documentation: lit() creates a Column of a literal value. If the passed-in object is already a Column, it is returned directly; otherwise, a new Column is created to represent the literal value.
Let’s take a look at some examples.
Example 1: Simple Usage of lit() Function
Let's see how to create a new column with a constant value using the lit() SQL function. In the snippet below, we create a new column by adding the literal '1' to the PySpark DataFrame.
Adding the same constant literal to every record is rarely useful on its own, so let's look at another example.
Example 2: lit() Function with withColumn
The following example shows how to use the PySpark lit() function with withColumn to derive a new column based on a condition.
typedLit() Function
The difference between lit() and typedLit() is that typedLit() can also handle collection types, e.g., Array, Dictionary (map), etc.
Complete Example of How to Add Constant Column
We have learned multiple ways to add a constant literal value to a DataFrame using the PySpark lit() function, as well as the difference between lit() and typedLit().
When possible, prefer predefined PySpark functions over user-defined functions: Spark can optimize the built-ins, so they perform better. If your application is performance-critical, avoid custom UDFs, as their performance is not guaranteed.