Analytical functions

An analytic function computes a result for every input row over a group of rows, called a window partition, defined by a PARTITION BY and ORDER BY specification; it returns one result row for each input row. Examples: lead(), lag(), cume_dist().

Creating a DataFrame for demonstration:

Before we start with these functions, we first need to create a DataFrame. We will create a DataFrame that contains employee details: Employee_Name, Age, Department, and Salary. After creating the DataFrame we will apply each analytical function to this DataFrame df.

Python3
# importing Window from pyspark.sql.window
from pyspark.sql.window import Window
 
# importing sparksession
from pyspark.sql import SparkSession
 
# creating a sparksession object
# and providing appName
spark = SparkSession.builder.appName("pyspark_window").getOrCreate()
 
# sample data for dataframe
sampleData = (("Ram", 28, "Sales", 3000),
              ("Meena", 33, "Sales", 4600),
              ("Robin", 40, "Sales", 4100),
              ("Kunal", 25, "Finance", 3000),
              ("Ram", 28, "Sales", 3000),
              ("Srishti", 46, "Management", 3300),
              ("Jeny", 26, "Finance", 3900),
              ("Hitesh", 30, "Marketing", 3000),
              ("Kailash", 29, "Marketing", 2000),
              ("Sharad", 39, "Sales", 4100)
              )
 
# column names for dataframe
columns = ["Employee_Name", "Age",
           "Department", "Salary"]
 
# creating the dataframe df
df = spark.createDataFrame(data=sampleData,
                           schema=columns)
 
# creating a window partition of the dataframe,
# partitioned by Department and ordered by Age
windowPartition = Window.partitionBy("Department").orderBy("Age")
 
# print schema
df.printSchema()
 
# show df
df.show()


Output:

This is the DataFrame on which we will apply all the analytical functions.

Example 1: Using cume_dist()

The cume_dist() window function returns the cumulative distribution of values within a window partition, i.e. the fraction of rows whose ordering value is less than or equal to the current row's. It is equivalent to CUME_DIST in SQL. Let's see an example:

Python3
# importing cume_dist()
# from pyspark.sql.functions
from pyspark.sql.functions import cume_dist
 
# applying window function with
# the help of DataFrame.withColumn
df.withColumn("cume_dist",
              cume_dist().over(windowPartition)).show()


Output:

In the output, we can see that a new column named "cume_dist" has been added to df; it contains the cumulative distribution within each Department partition, ordered by the Age column.
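Conceptually, for each row cume_dist() divides the number of rows in the partition whose ordering value is less than or equal to the current row's value by the partition size. A plain-Python sketch of that per-partition formula (the helper name cume_dist_column is hypothetical, not part of the PySpark API):

```python
def cume_dist_column(ordered_values):
    # fraction of rows in the partition whose value is <= this row's value
    n = len(ordered_values)
    return [sum(1 for v in ordered_values if v <= x) / n
            for x in ordered_values]

# Ages in the Finance partition of the sample data, ordered by Age
print(cume_dist_column([25, 26]))               # [0.5, 1.0]

# Ages in the Sales partition; tied ages share a value
print(cume_dist_column([28, 28, 33, 39, 40]))   # [0.4, 0.4, 0.6, 0.8, 1.0]
```

Note how duplicate ordering values (the two 28-year-olds) receive the same cumulative distribution, just as in the DataFrame output.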

Example 2: Using lag()

The lag() function is used to access a previous row's data, at the offset defined in the function call, within the window partition. It is similar to LAG in SQL.

Python3
# importing lag() from pyspark.sql.functions
from pyspark.sql.functions import lag
 
df.withColumn("Lag", lag("Salary", 2).over(windowPartition)) \
    .show()


Output:

In the output, we can see that a "Lag" column has been added to df containing the lagged values. The first 2 rows of each partition hold null because we passed an offset of 2 along with the Salary column to lag(); every following row holds the Salary value from 2 rows earlier in its partition.
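Within a single ordered partition, lag() simply shifts the column down by the offset and pads the top with nulls. A plain-Python sketch of that behavior (lag_column is a hypothetical helper, not the PySpark API; PySpark's lag() also accepts a third default argument that replaces the leading nulls):

```python
def lag_column(ordered_values, offset=1, default=None):
    # value from `offset` rows before the current row in one
    # ordered partition; `default` (None ~ SQL NULL) pads the top
    shifted = [default] * offset + list(ordered_values)
    return shifted[:len(ordered_values)]

# Salaries in the Finance partition ordered by Age, offset 2:
# both rows fall inside the leading gap
print(lag_column([3000, 3900], 2))                    # [None, None]

# Salaries in the Sales partition ordered by Age
print(lag_column([3000, 3000, 4600, 4100, 4100], 2))  # [None, None, 3000, 3000, 4600]
```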

Example 3: Using lead()

The lead() function is used to access a following row's data, at the offset defined in the function call, within the window partition. It is similar to LEAD in SQL and is the exact opposite of lag()/LAG.

Python3
# importing lead() from pyspark.sql.functions
from pyspark.sql.functions import lead
 
df.withColumn("Lead", lead("Salary", 2).over(windowPartition)) \
    .show()


Output:
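lead() is the mirror image of lag(): it reads rows ahead instead of behind, so the nulls appear at the bottom of each partition rather than the top. A plain-Python sketch of the per-partition semantics (lead_column is a hypothetical helper, not the PySpark API):

```python
def lead_column(ordered_values, offset=1, default=None):
    # value from `offset` rows after the current row in one
    # ordered partition; `default` (None ~ SQL NULL) pads the bottom
    shifted = list(ordered_values)[offset:] + [default] * offset
    return shifted[:len(ordered_values)]

# Salaries in the Finance partition ordered by Age, offset 2
print(lead_column([3000, 3900], 2))                    # [None, None]

# Salaries in the Sales partition ordered by Age
print(lead_column([3000, 3000, 4600, 4100, 4100], 2))  # [4600, 4100, 4100, None, None]
```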

PySpark Window Functions

A PySpark window function performs statistical operations, such as rank or row number, over a group (frame or collection) of rows and returns a result for each row individually. Window functions are also widely used for data transformations. We will cover the concept of window functions, their syntax, and how to use them with the PySpark SQL and PySpark DataFrame APIs.

There are mainly three types of window functions:

  • Analytical Function
  • Ranking Function
  • Aggregate Function

To perform a window function operation on a group of rows, we first need to partition the data, i.e. define the groups of rows, using the Window.partitionBy() function; for the row number and rank functions we additionally need to order the rows within each partition using orderBy().

Syntax for Window.partition:

Window.partitionBy("column_name").orderBy("column_name")

Syntax for Window function:

DataFrame.withColumn("new_col_name", Window_function().over(Window_partition))

Let’s understand and implement all these functions one by one with examples.
