How to check the schema of PySpark DataFrame?
In this article, we are going to check the schema of a PySpark DataFrame. We will use the employee DataFrame built in each example below for demonstration.
Method 1: Using df.schema
The schema attribute returns the DataFrame's columns along with their data types, as a StructType object.
Syntax: dataframe.schema
where dataframe is the input DataFrame
Code:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the schema of the dataframe
dataframe.schema
Output:
StructType(List(StructField(Employee ID,StringType,true), StructField(Employee NAME,StringType,true), StructField(Company Name,StringType,true)))
Method 2: Using schema.fields
It returns a list of StructField objects, one per column, each holding the column's name, data type, and nullability.
Syntax: dataframe.schema.fields
where dataframe is the input DataFrame
Code:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the list of StructField objects
dataframe.schema.fields
Output:
[StructField(Employee ID,StringType,true), StructField(Employee NAME,StringType,true), StructField(Company Name,StringType,true)]
Method 3: Using printSchema()
It prints the schema in a tree format, showing each column's name, data type, and nullability.
Syntax: dataframe.printSchema()
where dataframe is the input PySpark DataFrame
Code:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# print the schema in tree format
dataframe.printSchema()
Output:
root
 |-- Employee ID: string (nullable = true)
 |-- Employee NAME: string (nullable = true)
 |-- Company Name: string (nullable = true)