PySpark - myTechMint

PySpark – partitionBy()

PySpark partitionBy() is a method of the pyspark.sql.DataFrameWriter class that partitions a large dataset (DataFrame) into smaller files based on one or more columns while …


PySpark – fillna() and fill()

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() replaces NULL/None values in all or selected DataFrame columns with zero (0), an empty string, a space, or any other constant literal value. While …


PySpark – sample() and sampleBy()

PySpark provides the pyspark.sql.DataFrame.sample(), pyspark.sql.DataFrame.sampleBy(), RDD.sample(), and RDD.takeSample() methods for drawing a random sample from a large dataset. In this article, I will explain them with Python examples. If …


PySpark – flatMap()

PySpark flatMap() is a transformation that applies a function to every element of an RDD and flattens the results into a new RDD. PySpark DataFrames have no flatMap() method; array/map DataFrame columns are flattened with explode() instead. In …


PySpark – unionByName()

This article shows how to merge/union two Spark or PySpark DataFrames that have a different number of columns (different schemas). In Spark 3.1, you can easily …


PySpark – distinct() and dropDuplicates()

The PySpark distinct() function removes duplicate rows (comparing all columns) from a DataFrame, while dropDuplicates() removes duplicates based on one or more selected columns. …


PySpark – withColumnRenamed()

Use PySpark withColumnRenamed() to rename a DataFrame column. We often need to rename one, several, or all columns of a PySpark DataFrame; you can do this …
