Add new column with default value in PySpark dataframe
In this article, we are going to see how to add a new column with a default value in a PySpark DataFrame.
There are three ways to add a column with a default value to a PySpark DataFrame:
- Using pyspark.sql.DataFrame.withColumn(colName, col)
- Using pyspark.sql.DataFrame.select(*cols)
- Using pyspark.sql.SparkSession.sql(sqlQuery)
Method 1: Using pyspark.sql.DataFrame.withColumn(colName, col)
It adds a column, or replaces an existing column that has the same name, and returns a new DataFrame containing all existing columns plus the new one. The column expression must be an expression over this DataFrame; adding a column that belongs to some other DataFrame will raise an error.
Syntax: pyspark.sql.DataFrame.withColumn(colName, col)
Parameters:
- colName: a string with the name of the new column.
- col: a Column expression for the new column.
Returns: DataFrame
First, create a simple DataFrame.
Python3
import findspark
findspark.init()

# Import the modules
import pandas as pd
from pyspark.sql import SparkSession

# Create the session
spark = SparkSession.builder.getOrCreate()

# Create the DataFrame from a pandas DataFrame
pandas_df = pd.DataFrame({
    'Name': ['Anurag', 'Manjeet', 'Shubham', 'Saurabh', 'Ujjawal'],
    'Address': ['Patna', 'Delhi', 'Coimbatore', 'Greater noida', 'Patna'],
    'ID': [20123, 20124, 20145, 20146, 20147],
    'Sell': [140000, 300000, 600000, 200000, 600000]
})
df = spark.createDataFrame(pandas_df)

print("Original DataFrame :")
df.show()
Output:
Add a new column with Default Value:
Python3
# Add a new column with null values
from pyspark.sql.functions import lit

df = df.withColumn("Rewards", lit(None))
df.show()

# Add a new constant column
df = df.withColumn("Bonus Percent", lit(0.25))
df.show()
Output:
Method 2: Using pyspark.sql.DataFrame.select(*cols)
We can use pyspark.sql.DataFrame.select() to create a new column in a DataFrame and set it to a default value. It projects a set of expressions and returns a new DataFrame.
Syntax: pyspark.sql.DataFrame.select(*cols)
Parameters:
- cols: column names (string) or expressions (Column).
Returns: DataFrame
We will use the same simple DataFrame created in Method 1.
Add a new column with Default Value:
Python3
# Add a new column with null values
from pyspark.sql.functions import lit

df = df.select('*', lit(None).alias("Rewards"))

# Add a new constant column
df = df.select('*', lit(0.25).alias("Bonus Percent"))
df.show()
Output:
Method 3: Using pyspark.sql.SparkSession.sql(sqlQuery)
We can use pyspark.sql.SparkSession.sql() to create a new column in a DataFrame and set it to a default value. It returns a DataFrame representing the result of the given query.
Syntax: pyspark.sql.SparkSession.sql(sqlQuery)
Parameters:
- sqlQuery: a string containing the SQL query to execute.
Returns: DataFrame
We will use the same simple DataFrame created in Method 1.
Add a new column with Default Value:
Python3
# Add columns to the DataFrame using SQL
df.createOrReplaceTempView("GFG_Table")

# Add a new column with null values
df = spark.sql("select *, null as Rewards from GFG_Table")

# Add a new constant column (a numeric literal, so the column
# is numeric rather than string)
df.createOrReplaceTempView("GFG_Table")
df = spark.sql("select *, 0.25 as Bonus_Percent from GFG_Table")
df.show()
Output: