Aggregate function

An aggregate function or aggregation function is a function where the values of multiple rows are grouped to form a single summary value. The definition of the groups of rows on which they operate is done by using the SQL GROUP BY clause. E.g. AVERAGE, SUM, MIN, MAX, etc. 

Creating Dataframe for demonstration:

Before we start with these functions, we will create a new DataFrame that contains employee details like Employee_Name, Department, and Salary. After creating the DataFrame we will apply each Aggregate function on this DataFrame.

Python3




# importing pyspark
import pyspark
 
# importing sparksessio
from pyspark.sql import SparkSession
 
# creating a sparksession
# object and providing appName
spark = SparkSession.builder.appName("pyspark_window").getOrCreate()
 
# sample data for dataframe
sampleData = (("Ram", "Sales", 3000),
              ("Meena", "Sales", 4600),
              ("Robin", "Sales", 4100),
              ("Kunal", "Finance", 3000),
              ("Ram", "Sales", 3000),
              ("Srishti", "Management", 3300),
              ("Jeny", "Finance", 3900),
              ("Hitesh", "Marketing", 3000),
              ("Kailash", "Marketing", 2000),
              ("Sharad", "Sales", 4100)
              )
 
# column names for dataframe
columns = ["Employee_Name", "Department", "Salary"]
 
# creating the dataframe df
df3 = spark.createDataFrame(data=sampleData,
                            schema=columns)
 
# print schema
df3.printSchema()
 
# show df
df3.show()


Output:

This is the DataFrame df3 on which we will apply all the aggregate functions.

Example: Let’s see how to apply the aggregate functions with this example

Python3




# importing window from pyspark.sql.window
from pyspark.sql.window import Window
 
# importing aggregate functions
# from pyspark.sql.functions
from pyspark.sql.functions import col,avg,sum,min,max,row_number
 
# creating a window partition of dataframe
windowPartitionAgg  = Window.partitionBy("Department")
 
# applying window aggregate function
# to df3 with the help of withColumn
 
# this is average()
df3.withColumn("Avg",
               avg(col("salary")).over(windowPartitionAgg))
    #this is sum()
  .withColumn("Sum",
              sum(col("salary")).over(windowPartitionAgg))
    #this is min()
  .withColumn("Min",
              min(col("salary")).over(windowPartitionAgg))
    #this is max()
  .withColumn("Max",
              max(col("salary")).over(windowPartitionAgg)).show()


Output: 

In the output df, we can see that there are four new columns added to df.  In the code, we have applied all the four aggregate functions one by one. We got four output columns added to the df3 that contains values for each row. These four columns contain the Average, Sum, Minimum, and Maximum values of the Salary column.

 



PySpark Window Functions

PySpark Window function performs statistical operations such as rank, row number, etc. on a group, frame, or collection of rows and returns results for each row individually. It is also popularly growing to perform data transformations. We will understand the concept of window functions, syntax, and finally how to use them with PySpark SQL and PySpark DataFrame API. 

There are mainly three types of Window function:

  • Analytical Function
  • Ranking Function
  • Aggregate Function

To perform window function operation on a group of rows first, we need to partition i.e. define the group of data rows using window.partition() function, and for row number and rank function we need to additionally order by on partition data using ORDER BY clause. 

Syntax for Window.partition:

Window.partitionBy(“column_name”).orderBy(“column_name”)

Syntax for Window function:

DataFrame.withColumn(“new_col_name”, Window_function().over(Window_partition))

Let’s understand and implement all these functions one by one with examples.

Similar Reads

Analytical functions

An analytic function is a function that returns a result after operating on data or a finite set of rows partitioned by a SELECT clause or in the ORDER BY clause. It returns a result in the same number of rows as the number of input rows. E.g. lead(), lag(), cume_dist()....

Ranking Function

...

Aggregate function

...

Contact Us