In Spark and PySpark, let’s see how to merge/union two DataFrames with a different number of columns (different schemas). Since Spark 3.1, you can achieve this easily with the unionByName() transformation by passing allowMissingColumns with the value true. In older versions, this option is not available.
The difference between the unionByName() and union() transformations is that unionByName() resolves columns by name, not by position. In other words, unionByName() merges two DataFrames by column name instead of by position.
If you are using a version older than Spark 3.1, use the approach below to merge DataFrames with different columns.
Spark Merge Two DataFrames with Different Columns
In this section, I will cover a Spark with Scala example of how to merge two different DataFrames. First, let’s create DataFrames with different numbers of columns: df1 is missing the columns state and salary, and df2 is missing the column age.
Now create new DataFrames from the existing ones by adding the missing columns. The newly added columns contain null values; we add these constant columns using the lit() function.
Finally, merge the two DataFrames by column name using unionByName().
PySpark Merge Two DataFrames with Different Columns
To merge two DataFrames with different columns in PySpark, we will use a similar approach to the one explained above, together with the unionByName() transformation. First, let’s create DataFrames with different numbers of columns.
Now add the missing columns ‘state‘ and ‘salary‘ to df1, and ‘age‘ to df2, with null values.
Now merge/union the DataFrames using unionByName(). As noted above, unionByName() resolves columns by name rather than by position, so the differing column order of df1 and df2 does not matter.
Conclusion
In this article, you have learned, with PySpark examples, how merging two DataFrames with different columns can be done by adding the missing columns to each DataFrame and finally unioning them using unionByName().