Add new column with default value in PySpark dataframe
In this article, we are going to see how to add a new column with a default value in a PySpark DataFrame.
There are three ways to add a column with a default value to a PySpark DataFrame:
- Using pyspark.sql.DataFrame.withColumn(colName, col)
- Using pyspark.sql.DataFrame.select(*cols)
- Using pyspark.sql.SparkSession.sql(sqlQuery)
Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)
It adds a column, or replaces an existing column that has the same name, and returns a new DataFrame containing all existing columns plus the new one. The column expression must be an expression over this DataFrame; adding a column that belongs to some other DataFrame will raise an error.
Syntax: pyspark.sql.DataFrame.withColumn(colName, col)
Parameters:
- colName: a string with the name of the new column.
- col: a Column expression for the new column.
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

# Import the modules
import pandas as pd
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create the DataFrame from a pandas DataFrame
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham', 'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore', 'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame :")
df.show()
Output:
Add a new column with Default Value:
Python3
# Add a new column with null values
from pyspark.sql.functions import lit

df = df.withColumn("Rewards", lit(None))
df.show()

# Add a new constant column
df = df.withColumn("Bonus Percent", lit(0.25))
df.show()
Output:
Method 2: Using pyspark.sql.DataFrame.select(*cols)
We can use pyspark.sql.DataFrame.select() to create a new column in a DataFrame and set it to a default value. It projects a set of expressions and returns a new DataFrame.
Syntax: pyspark.sql.DataFrame.select(*cols)
Parameters:
- cols: column names (string) or expressions (Column).
Returns: DataFrame
We will use the same simple DataFrame created in Method 1.
Add a new column with Default Value:
Python3
# Add a new column with null values
from pyspark.sql.functions import lit

df = df.select('*', lit(None).alias("Rewards"))

# Add a new constant column
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()
Output:
Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)
We can use pyspark.sql.SparkSession.sql() to create a new column in a DataFrame and set it to a default value. It returns a DataFrame representing the result of the given query.
Syntax: pyspark.sql.SparkSession.sql(sqlQuery)
Parameters:
- sqlQuery: a string containing the SQL query to execute.
Returns: DataFrame
We will use the same simple DataFrame created in Method 1.
Add a new column with Default Value:
Python3
# Add columns to the DataFrame using SQL
df.createOrReplaceTempView("GFG_Table")

# Add a new column with null values
df = spark.sql("select *, null as Rewards from GFG_Table")

# Add a new constant column (a numeric literal, so the column
# is numeric rather than string)
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()
Output: