How to use DataFrame.withColumn() in Python
DataFrame.withColumn(colName, col) can be used to extract a substring from a column's data when combined with PySpark's substring() function.
Syntax: DataFrame.withColumn(colName, col)
Parameters:
- colName: str, name of the new column
- col: Column, a column expression for the new column
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
We will make use of the pyspark’s substring() function to create a new column “State” by extracting the respective substring from the LicenseNo column.
Syntax: pyspark.sql.functions.substring(str, pos, len)
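Note that substring() is 1-based: pos = 1 refers to the first character. As an informal mental model (not the official definition), substring(str, pos, len) applied to a string s behaves like the Python slice s[pos - 1 : pos - 1 + len]; a plain-Python sketch of that mapping:

```python
def substr_like(s: str, pos: int, length: int) -> str:
    """Mimic the 1-based indexing of pyspark.sql.functions.substring
    with an ordinary 0-based Python slice."""
    return s[pos - 1 : pos - 1 + length]

# First two characters of a sample license number -- the state code
print(substr_like("MH201411094334", 1, 2))  # MH
```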
Example 1: Extracting a single substring column.
Python
from pyspark.sql.functions import substring

reg_df.withColumn('State', substring('LicenseNo', 1, 2)).show()
Output:
Here, we have created a new column "State" whose value is a substring of the "LicenseNo" column. The arguments (1, 2) indicate that we start from the first character and extract 2 characters.
Example 2: Extracting multiple substring columns
Extracting State Code as ‘State’, Registration Year as ‘RegYear’, Registration ID as ‘RegID’, Expiry Year as ‘ExpYr’, Expiry Date as ‘ExpDt’, Expiry Month as ‘ExpMo’.
Python
from pyspark.sql.functions import substring

reg_df \
    .withColumn('State', substring('LicenseNo', 1, 2)) \
    .withColumn('RegYear', substring('LicenseNo', 3, 4)) \
    .withColumn('RegID', substring('LicenseNo', 7, 8)) \
    .withColumn('ExpYr', substring('ExpiryDate', 1, 4)) \
    .withColumn('ExpMo', substring('ExpiryDate', 6, 2)) \
    .withColumn('ExpDt', substring('ExpiryDate', 9, 2)) \
    .show()
Output:
The above code demonstrates how the withColumn() method can be chained to derive multiple substring columns. Each withColumn() call adds one new column to the returned DataFrame, and the original columns are retained as well.
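The (pos, len) offsets used in the chain above can be sanity-checked in plain Python against one sample row (LicenseNo "MH201411094334", ExpiryDate "2024-11-19"), remembering that substring's 1-based pos corresponds to the 0-based slice start pos - 1:

```python
license_no, expiry = "MH201411094334", "2024-11-19"  # sample row values

# Each (pos, len) pair from the withColumn() chain, applied as 0-based slices
fields = {
    "State":   license_no[0:2],    # substring('LicenseNo', 1, 2)
    "RegYear": license_no[2:6],    # substring('LicenseNo', 3, 4)
    "RegID":   license_no[6:14],   # substring('LicenseNo', 7, 8)
    "ExpYr":   expiry[0:4],        # substring('ExpiryDate', 1, 4)
    "ExpMo":   expiry[5:7],        # substring('ExpiryDate', 6, 2)
    "ExpDt":   expiry[8:10],       # substring('ExpiryDate', 9, 2)
}
print(fields)
# {'State': 'MH', 'RegYear': '2014', 'RegID': '11094334',
#  'ExpYr': '2024', 'ExpMo': '11', 'ExpDt': '19'}
```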
How to check for a substring in a PySpark dataframe?
In this article, we are going to see how to check for a substring in PySpark dataframe.
A substring is a contiguous sequence of characters within a larger string. For example, "learning pyspark" is a substring of "I am learning pyspark from w3wiki". Let us look at different ways in which we can check for a substring in one or more columns of a PySpark dataframe.
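For a single Python string, the built-in in operator captures exactly this idea of a contiguous substring; in PySpark, the analogous column-level predicate is Column.contains(). A short sketch using the article's own example sentence (the filter() line is illustrative, assuming the reg_df dataframe defined below):

```python
sentence = "I am learning pyspark from w3wiki"

# Plain Python: the `in` operator tests for a contiguous substring
print("learning pyspark" in sentence)  # True
print("pyspark learning" in sentence)  # False (not contiguous in that order)

# In PySpark, the equivalent column-level check would look like, e.g.:
#   reg_df.filter(reg_df["LicenseNo"].contains("MH")).show()
```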
Creating Dataframe for demonstration:
Python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Column names for the dataframe
columns = ["LicenseNo", "ExpiryDate"]

# Row data for the dataframe
data = [
    ("MH201411094334", "2024-11-19"),
    ("AR202027563890", "2030-03-16"),
    ("UP202010345567", "2035-12-30"),
    ("KN201822347800", "2028-10-29"),
]

# Create the dataframe using the above values
reg_df = spark.createDataFrame(data=data, schema=columns)

# View the dataframe
reg_df.show()
Output:
In the above dataframe, LicenseNo is composed of three pieces of information: a 2-letter state code, a 4-digit year of registration, and an 8-digit registration number.