How to split a column with comma separated values in PySpark’s Dataframe?
In this article, we are going to learn how to split a column with comma-separated values in a data frame in PySpark using Python.
This is part of the data processing stage in which raw data is prepared for visualization. We may get data in which a column contains comma-separated values, which are difficult to visualize directly, so we first split that data into separate columns so that visualization becomes easy.
Steps to split a column with comma-separated values in PySpark’s Dataframe
Below are the steps to perform the splitting operation on columns in which comma-separated values are present.
Step 1: First of all, import the required libraries, i.e. SparkSession and functions. The SparkSession library is used to create the session, while the functions library gives access to all the built-in functions available for the data frame.
from pyspark.sql import SparkSession
from pyspark.sql import functions
Step 2: Now, create a Spark session using the getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, create a Spark context.
sc = spark_session.sparkContext
Step 4: Read the CSV file or create the data frame using the createDataFrame() function.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
or
data_frame = spark_session.createDataFrame((Declare_the_dataset))
Step 5: Split the column on the comma separator using functions.split() and put the parts in a list.
df2 = df.select('Roll_no', functions.split('column_to_be_split', ', ').alias('column_to_be_split'))
Step 6: Obtain the number of values in each row using the functions.size() function.
df_sizes = df2.select(functions.size('column_to_be_split').alias('column_to_be_split'))
Step 7: In this step, we get the maximum size among all the column sizes available for each row.
nb_columns = df_sizes.agg(functions.max('column_to_be_split')).collect()[0][0]
Step 8: Here, we split the data frame column into different columns in the data frame.
split_df = df2.select(*[df2['column_to_be_split'][i] for i in range(nb_columns)])
Step 9: Next, create a list defining the column names which you want to give to the split columns.
names = ['column_name_1', 'column_name_2', 'column_name_3']
Step 10: Now, obtain all the column names of a data frame in a list.
total_columns = split_df.columns
Step 11: Then, run a loop to rename the split columns of the data frame.
for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i], names[i - 1])
Step 12: Finally, display the updated data frame.
split_df.show()
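To clarify what steps 5 through 8 compute, here is a plain-Python analogue of the same logic (no Spark required; the sample names and the shorter third row are made up for illustration). It shows the list of parts per row, the maximum number of parts, and how indexing past the end of a shorter list yields a null, just as indexing past the end of an array column yields null in PySpark:

```python
# Plain-Python sketch of the splitting logic (illustration only, no Spark)
rows = [
    ('1', 'Arun, Kumar, Chaudhary'),
    ('2', 'Aniket, Singh, Rajpoot'),
    ('3', 'Ishita, Rai'),  # a shorter row, to show the null padding
]

# Step 5: split each value on ', ' into a list of parts
split_rows = [(roll, full.split(', ')) for roll, full in rows]

# Steps 6-7: sizes per row, then the maximum size across all rows
nb_columns = max(len(parts) for _, parts in split_rows)

# Step 8: index each list up to nb_columns; missing entries become None,
# mirroring how df2['col'][i] is null past the end of the array in Spark
split_table = [
    [roll] + [parts[i] if i < len(parts) else None
              for i in range(nb_columns)]
    for roll, parts in split_rows
]

print(nb_columns)      # 3
print(split_table[0])  # ['1', 'Arun', 'Kumar', 'Chaudhary']
print(split_table[2])  # ['3', 'Ishita', 'Rai', None]
```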
Example 1:
In this example, we have created a data frame in which there is one column ‘Full_Name’ containing multiple values ‘First_Name, Middle_Name, and Last_Name’ separated by a comma ‘,’ as follows:
We have split the “Full_Name” column into multiple columns by splitting its values on the comma and putting the parts in a list. Then, we obtained the maximum number of parts across the rows and split the list into separate columns. Later on, we put the names of the new columns in a list and assigned those names to the new columns by running a for loop.
Python3
# Python program to split a column
# with comma separated values

# Import the SparkSession and functions libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create a Spark context
sc = spark_session.sparkContext

# Create the data frame using the createDataFrame function
df = spark_session.createDataFrame(
    sc.parallelize([['1', 'Arun, Kumar, Chaudhary'],
                    ['2', 'Aniket, Singh, Rajpoot'],
                    ['3', 'Ishita, Rai, Pundir']]),
    ["Roll_no", "Full_Name"])

# Split the column on ', ' and put the parts into a list
df2 = df.select('Roll_no',
                functions.split('Full_Name', ', ').alias('Full_Name'))

# Obtain the number of values in each row
df_sizes = df2.select(functions.size('Full_Name').alias('Full_Name'))

# Get the maximum size among all rows
nb_columns = df_sizes.agg(functions.max('Full_Name')).collect()[0][0]

# Split the list column into separate data frame columns
split_df = df2.select('Roll_no',
                      *[df2['Full_Name'][i] for i in range(nb_columns)])

# Column names to give to the split columns
names = ['First Name', 'Middle Name', 'Last Name']

# Get all the column names of the data frame in a list
total_columns = split_df.columns

# Rename the split columns of the data frame
for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i], names[i - 1])

# Display the split data frame
split_df.show()
Output:
Example 2:
In this example, we have uploaded the CSV file (link), i.e., basically, a dataset of 6×5, in which there is one column having multiple values separated by a comma “,” as follows:
We have split that column into multiple columns by splitting its values on the comma and putting the parts in a list. Then, we obtained the maximum number of parts across the rows and split the list into separate columns. Later on, we put the names of the new columns in a list and assigned those names to the new columns by running a for loop.
Python3
# Python program to split a column
# with comma separated values

# Import the SparkSession and functions libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
df = spark_session.read.csv('/content/car_data.csv',
                            sep=',', inferSchema=True, header=True)

# Split the column on ', ' and put the parts into a list
df2 = df.select('Car Id',
                functions.split('Car Details', ', ').alias('Car Details'))

# Obtain the number of values in each row
df_sizes = df2.select(functions.size('Car Details').alias('Car Details'))

# Get the maximum size among all rows
nb_columns = df_sizes.agg(functions.max('Car Details')).collect()[0][0]

# Split the list column into separate data frame columns
split_df = df2.select('Car Id',
                      *[df2['Car Details'][i] for i in range(nb_columns)])

# Column names to give to the split columns
names = ['Car Name', 'Door Count', 'Fuel Type', 'Car Type']

# Get all the column names of the data frame in a list
total_columns = split_df.columns

# Rename the split columns of the data frame
for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i], names[i - 1])

# Display the split data frame
split_df.show()
Output: