How to check the schema of PySpark DataFrame?
In this article, we are going to check the schema of a PySpark DataFrame. We will use the employee DataFrame built in each example below for demonstration.
Method 1: Using df.schema
The schema attribute returns the DataFrame's columns along with their data types, as a StructType object.
Syntax: dataframe.schema
where dataframe is the input DataFrame
Code:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the schema of the dataframe
dataframe.schema
Output:
StructType(List(StructField(Employee ID,StringType,true), StructField(Employee NAME,StringType,true), StructField(Company Name,StringType,true)))
Method 2: Using schema.fields
It returns a list of StructField objects, one per column, each holding the column's name, data type, and nullability.
Syntax: dataframe.schema.fields
where dataframe is the input DataFrame
Code:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the list of StructField objects
dataframe.schema.fields
Output:
[StructField(Employee ID,StringType,true), StructField(Employee NAME,StringType,true), StructField(Company Name,StringType,true)]
Method 3: Using printSchema()
It prints the schema in a tree format, showing each column's name, data type, and nullability.
Syntax: dataframe.printSchema()
where dataframe is the input PySpark DataFrame
Code:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# print the schema in tree format
dataframe.printSchema()
Output:
root
 |-- Employee ID: string (nullable = true)
 |-- Employee NAME: string (nullable = true)
 |-- Company Name: string (nullable = true)