How to use the getNumPartitions() function in Python
In this method, we find the number of partitions of a data frame by calling the getNumPartitions() function on its underlying RDD.
Syntax: rdd.getNumPartitions()
- Return type: This function returns the number of partitions as an integer.
Stepwise Implementation:
Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file in which you want to know the number of partitions.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Finally, get the number of partitions using the getNumPartitions function.
print(data_frame.rdd.getNumPartitions())
Example:
In this example, we read a CSV file and obtained its current number of partitions using the getNumPartitions function.
Python
# Python program to get the current number of
# partitions using the getNumPartitions function

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/class_data.csv',
                                    sep=',',
                                    inferSchema=True,
                                    header=True)

# Get the current number of partitions
# using the getNumPartitions function
print(data_frame.rdd.getNumPartitions())
Output:
1
Get current number of partitions of a DataFrame – Pyspark
In this article, we are going to learn how to get the current number of partitions of a data frame using Pyspark in Python.
In many cases, we need to know the number of partitions of a large data frame. Sometimes, after partitioning the data, we need to verify that it has been partitioned correctly. There are several ways to get the current number of partitions of a data frame using Pyspark in Python.
Prerequisite
Note: When following the Pyspark installation article, install Python instead of Scala; the rest of the steps are the same.
Modules Required
Pyspark: PySpark is the Python API for Apache Spark. It lets you work with Spark from Python and provides DataFrame functionality similar to the Pandas library. It can be installed with the following command:
pip install pyspark