The PySpark distinct() method is used to drop/remove duplicate rows (comparing all columns) from a DataFrame, while dropDuplicates() is used to drop rows based on one or more selected columns. In this article, we will learn how to use distinct() and dropDuplicates() with PySpark examples.
Before we start, let's first create a DataFrame with some duplicate rows and duplicate values in a few columns. We use this DataFrame to demonstrate how to get distinct rows over multiple columns.
In the above table, the record with employee name Robert has duplicate rows. As you can see, we have 2 rows with duplicate values on all columns, and 4 rows with duplicate values on the department and salary columns.
1. Get Distinct Rows (By Comparing All Columns)
The above DataFrame has a total of 10 rows, 2 of which have all values duplicated. Performing distinct() on this DataFrame should return 9 rows after removing the 1 duplicate row.
The distinct() method on a DataFrame returns a new DataFrame with the duplicate records removed.
Alternatively, you can run the dropDuplicates() method with no arguments, which also returns a new DataFrame after removing duplicate rows across all columns.
2. PySpark Distinct on Multiple Selected Columns
PySpark's distinct() does not take a list of columns to deduplicate on (i.e., to drop duplicate rows based on selected columns). However, dropDuplicates() provides another signature that takes one or more column names and eliminates rows that are duplicates on just those columns.
Note that calling dropDuplicates() on a DataFrame returns a new DataFrame with the duplicate rows removed; the original DataFrame is not modified.
This drops the 2 records that are duplicates on the department and salary columns, leaving 8 rows.
3. More Examples to Get Distinct Rows
Conclusion
In this PySpark SQL article, we have learned that the distinct() method gets the distinct rows of a DataFrame (comparing all columns), that dropDuplicates() with no arguments does the same, and that dropDuplicates() with a list of column names gets the distinct rows based on multiple selected columns.