PySpark – sample() and sampleBy()

PySpark provides the pyspark.sql.DataFrame.sample(), pyspark.sql.DataFrame.sampleBy(), RDD.sample(), and RDD.takeSample() methods for drawing a random sample from a large dataset. In this article, I will explain them with Python examples. If …
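
To make the idea of fraction-based sampling concrete, here is a minimal plain-Python sketch. It is not PySpark code: `bernoulli_sample` is a hypothetical helper that keeps each row independently with probability `fraction`, which is roughly how sampling without replacement is usually described. As a consequence, the sample size is only approximately `fraction` times the input size, while a fixed seed makes the result repeatable.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed gives a repeatable sample
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100))
sample_a = bernoulli_sample(rows, fraction=0.1, seed=42)
sample_b = bernoulli_sample(rows, fraction=0.1, seed=42)

# The size is close to, but not guaranteed to be, 10% of the input.
print(len(sample_a), sample_a == sample_b)
```

In PySpark the comparable call would be along the lines of `df.sample(fraction=0.1, seed=42)`.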


PySpark – flatMap()

PySpark flatMap() is a transformation that applies a function to every element of an RDD and flattens the results into a new RDD; for DataFrames, the same idea applies to array/map columns. In …
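
The "apply a function, then flatten" behavior can be sketched in plain Python. This is only an illustration of the semantics, not PySpark code: `flat_map` is a hypothetical helper, and on a real RDD the equivalent would be `rdd.flatMap(f)`.

```python
from itertools import chain


def flat_map(func, elements):
    """Apply `func` to each element and flatten the resulting iterables.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    return list(chain.from_iterable(func(e) for e in elements))


lines = ["hello world", "apache spark"]

# Each line yields a list of words; flat_map merges them into one list.
words = flat_map(lambda line: line.split(" "), lines)
print(words)  # ['hello', 'world', 'apache', 'spark']
```

A plain map would instead return one list per line, that is, a list of lists.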


PySpark – unionByName()

In Spark or PySpark, let’s see how to merge (union) two DataFrames that have a different number of columns (a different schema). In Spark 3.1, you can easily …
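
As a rough picture of what a union by column name does, the plain-Python sketch below aligns rows by name rather than by position and fills columns missing on one side with `None`. It is not PySpark code: `union_by_name` is a hypothetical helper over lists of dicts, and it assumes both inputs are non-empty with uniform keys. In PySpark 3.1 and later, the comparable call is `df1.unionByName(df2, allowMissingColumns=True)`.

```python
def union_by_name(left, right, allow_missing_columns=False):
    """Union two lists of dict rows, matching columns by name.

    Hypothetical helper for illustration only; it is not part of PySpark.
    Assumes each input is non-empty and its rows share the same keys.
    """
    left_cols = list(left[0].keys())
    right_cols = list(right[0].keys())
    if set(left_cols) != set(right_cols) and not allow_missing_columns:
        raise ValueError("column sets differ; pass allow_missing_columns=True")
    # Keep the left-hand column order, then append columns only the right has.
    columns = left_cols + [c for c in right_cols if c not in left_cols]
    # Columns absent from a row are filled with None.
    return [{c: row.get(c) for c in columns} for row in left + right]


people = [{"name": "Ann", "age": 31}]
staff = [{"age": 45, "name": "Bob", "dept": "ops"}]

merged = union_by_name(people, staff, allow_missing_columns=True)
print(merged)
# [{'name': 'Ann', 'age': 31, 'dept': None},
#  {'name': 'Bob', 'age': 45, 'dept': 'ops'}]
```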


PySpark – collect()

PySpark RDD/DataFrame collect() is an action that retrieves all elements of the dataset (from all nodes) to the driver node. We should …


PySpark – show()

PySpark DataFrame show() displays the contents of a DataFrame as a table of rows and columns. By default, it shows only 20 …
