Extract First and Last N Rows from a PySpark DataFrame

In this article, we will extract the first N rows and the last N rows from a DataFrame using PySpark in Python. To do this, we first create a sample DataFrame.

We have to create a Spark session object with SparkSession.builder, give the application a name with appName(), and then call the getOrCreate() method.

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

Finally, after creating the data as a nested list and the column names as a list, we pass both to the createDataFrame() method:

dataframe = spark.createDataFrame(data, columns)


# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data with 5 rows
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
print('Actual data in dataframe')
# display the dataframe that was just created
dataframe.show()


Extracting first N rows

We can extract the first N rows by using several methods which are discussed below with the help of some examples:

Method 1: Using head()

This function returns the top N rows of the given DataFrame as a list of Row objects.

Syntax: dataframe.head(n)


  • n specifies the number of rows to extract, counting from the first row
  • dataframe is the DataFrame created from the nested lists using PySpark


print("Top 2 rows ")
# extract top 2 rows
a = dataframe.head(2)
print(a)
print("Top 1 row ")
# extract top 1 row
a = dataframe.head(1)
print(a)


Top 2 rows

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

Top 1 row

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]

Method 2: Using first()

This function returns only the first row of the DataFrame, as a single Row object rather than a list.

Syntax: dataframe.first()

  • It doesn't take any parameters
  • dataframe is the DataFrame created from the nested lists using PySpark


print("Top row ")
# extract top row
a = dataframe.first()
print(a)


Top row

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Method 3: Using show() 

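This function prints the rows of the DataFrame in tabular form, starting from the top. Unlike head() and first(), it only displays the rows and does not return them.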

Syntax: dataframe.show(n)


  • dataframe is the input dataframe
  • n is the number of rows to be displayed from the top; if n is not specified, show() prints only the first 20 rows by default


# show() function to display
# the top 2 rows
dataframe.show(2)


Extracting Last N rows

Extracting the last rows means getting the last N rows from the given DataFrame. For this, we use the tail() function, which returns the last N rows as a list of Row objects. tail() is available in Spark 3.0 and later.

Syntax: dataframe.tail(n)


  • n is the number of rows to extract, counting from the last row
  • dataframe is the input DataFrame



print("Last 2 rows ")
# extract last 2 rows
a = dataframe.tail(2)
print(a)
print("Last 1 row ")
# extract last 1 row
a = dataframe.tail(1)
print(a)


Last 2 rows

[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
 Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

Last 1 row

[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

