Ranking Function

The function returns the statistical rank of a given value for each row in a partition or group. The goal of this function is to provide consecutive numbering of the rows in the resultant column, set by the order selected in the Window.partition for each partition specified in the OVER clause. E.g. row_number(), rank(), dense_rank(), etc.

Creating Dataframe for demonstration:

Before we start with these functions, first we need to create a DataFrame. We will create a DataFrame that contains student details like Roll_No, Student_Name, Subject, Marks. After creating the DataFrame we will apply each Ranking function on this DataFrame df2.

Python3

# importing pyspark
from pyspark.sql.window import Window
import pyspark
 
# importing sparksession
from pyspark.sql import SparkSession
 
# creating a sparksession object and providing appName
spark = SparkSession.builder.appName("pyspark_window").getOrCreate()
 
# sample data for dataframe
sampleData = ((101, "Ram", "Biology", 80),
              (103, "Meena", "Social Science", 78),
              (104, "Robin", "Sanskrit", 58),
              (102, "Kunal", "Phisycs", 89),
              (101, "Ram", "Biology", 80),
              (106, "Srishti", "Maths", 70),
              (108, "Jeny", "Physics", 75),
              (107, "Hitesh", "Maths", 88),
              (109, "Kailash", "Maths", 90),
              (105, "Sharad", "Social Science", 84)
              )
 
# column names for dataframe
columns = ["Roll_No", "Student_Name", "Subject", "Marks"]
 
# creating the dataframe df
df2 = spark.createDataFrame(data=sampleData,
                            schema=columns)
 
# importing window from pyspark.sql.window
 
# creating a window partition of dataframe
windowPartition = Window.partitionBy("Subject").orderBy("Marks")
 
# print schema
df2.printSchema()
 
# show df
df2.show()

Output:

This is the DataFrame df2 on which we will apply all the Window ranking function.

Example 1: Using row_number().

row_number() function is used to gives a sequential number to each row present in the table. Let’s see the example:

Python3

# importing row_number() from pyspark.sql.functions
from pyspark.sql.functions import row_number
 
# applying the row_number() function
df2.withColumn("row_number", 
               row_number().over(windowPartition)).show()

Output:

In this output, we can see that we have the row number for each row based on the specified partition i.e. the row numbers are given followed by the Subject and Marks column.

Example 2: Using rank()

The rank function is used to give ranks to rows specified in the window partition. This function leaves gaps in rank if there are ties. Let’s see the example:

Python3

# importing rank() from pyspark.sql.functions
from pyspark.sql.functions import rank
 
# applying the rank() function
df2.withColumn("rank", rank().over(windowPartition)) \
    .show()

Output:

In the output, the rank is provided to each row as per the Subject and Marks column as specified in the window partition.

Example 3: Using percent_rank()

This function is similar to rank() function. It also provides rank to rows but in a percentile format. Let’s see the example:

Python3

# importing percent_rank() from pyspark.sql.functions
from pyspark.sql.functions import percent_rank
 
# applying the percent_rank() function
df2.withColumn("percent_rank",
               percent_rank().over(windowPartition)).show()

Output:

We can see that in the output the rank column contains values in a percentile form i.e. in the decimal format.

Example 4: Using dense_rank()

This function is used to get the rank of each row in the form of row numbers. This is similar to rank() function, there is only one difference the rank function leaves gaps in rank when there are ties. Let’s see the example:

Python3

# importing dense_rank() from pyspark.sql.functions
from pyspark.sql.functions import dense_rank
 
# applying the dense_rank() function
df2.withColumn("dense_rank",
               dense_rank().over(windowPartition)).show()

Output:

In the output, we can see that the ranks are given in the form of row numbers.

PySpark Window Functions

PySpark Window function performs statistical operations such as rank, row number, etc. on a group, frame, or collection of rows and returns results for each row individually. It is also popularly growing to perform data transformations. We will understand the concept of window functions, syntax, and finally how to use them with PySpark SQL and PySpark DataFrame API.

There are mainly three types of Window function:

Analytical Function
Ranking Function
Aggregate Function

To perform window function operation on a group of rows first, we need to partition i.e. define the group of data rows using window.partition() function, and for row number and rank function we need to additionally order by on partition data using ORDER BY clause.

Syntax for Window.partition:

Window.partitionBy(“column_name”).orderBy(“column_name”)

Syntax for Window function:

DataFrame.withColumn(“new_col_name”, Window_function().over(Window_partition))

Let’s understand and implement all these functions one by one with examples.

Tags:

#Python-Pyspark #Python #python

Analytical functions

Aggregate function

Ranking Function

Python3

Example 1: Using row_number().

Python3

Example 2: Using rank()

Python3

Example 3: Using percent_rank()

Python3

Example 4: Using dense_rank()

Python3

PySpark Window Functions

Similar Reads

Contact Us