Spark – Split array to separate column

Apache Spark is a powerful big data processing engine that can analyze enormous amounts of data in parallel across distributed clusters of computers. PySpark is a Python-based interface for Apache Spark that lets Python programmers write Spark applications quickly and easily.

How to Use the Function split() in Python

In this example, the “split” function is first imported from the “pyspark.sql.functions” module and a SparkSession is created. Next, a PySpark DataFrame is created with two columns, “id” and “fruits”, and two rows holding the values (1, “apple,orange,banana”) and (2, “grape,kiwi,peach”). Using the “split” function, the “fruits” column is split on commas into an array of strings and stored in a new column, “fruit_list”. Then a new DataFrame, “df3”, is built by selecting the first three elements of the “fruit_list” array with the “getItem” method and aliasing them as “fruit1”, “fruit2”, and “fruit3”. Finally, the “show” method is called on both DataFrames to display the output.

Syntax: split(str: Column, pattern: str) -> Column

The split function returns a new PySpark Column object representing an array of strings. Each element of the array is a substring of the original column, produced by splitting it on the specified pattern.

The split function takes two parameters:

  • str: The PySpark column to split. This can be a string column, a column expression, or a column name.
  • pattern: The string to split the column on. It is interpreted as a regular expression, as the sketch below illustrates.
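
Because the pattern is treated as a regular expression, regex metacharacters must be escaped, and a character class can match several delimiters at once. The following is a minimal sketch of both points; the app name, column names, and data here are illustrative and not part of the main example below.

Python3

# a short, self-contained demonstration of regex patterns in split
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName('split_pattern_demo').getOrCreate()

demo = spark.createDataFrame([("a;b|c", "1.2.3")], ["mixed", "dotted"])

demo.select(
    # "[;|]" is a regex character class: split on ";" or "|"
    split(demo.mixed, "[;|]").alias("parts"),
    # "." is a regex metacharacter, so a literal dot must be escaped
    split(demo.dotted, "\\.").alias("version")
).show()

# +---------+---------+
# |    parts|  version|
# +---------+---------+
# |[a, b, c]|[1, 2, 3]|
# +---------+---------+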

Python3

# importing packages
from pyspark.sql.functions import split
from pyspark.sql import SparkSession
 
# creating a spark session
spark = SparkSession.builder.appName('split_array_to_columns').getOrCreate()
 
# Create a PySpark DataFrame
df = spark.createDataFrame([(1, "apple,orange,banana"),
                            (2, "grape,kiwi,peach")], ["id", "fruits"])
 
# Split the "fruits" column into an array of strings
df = df.withColumn("fruit_list", split(df.fruits, ","))
 
# Show the resulting DataFrame
df.show()
 
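# Select the first three elements of "fruit_list" as separate columns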
df3 = df.select(
    df.fruit_list.getItem(0).alias('fruit1'),
    df.fruit_list.getItem(1).alias('fruit2'),
    df.fruit_list.getItem(2).alias('fruit3')
)
 
df3.show()


Output:

+---+-------------------+--------------------+
| id|             fruits|          fruit_list|
+---+-------------------+--------------------+
|  1|apple,orange,banana|[apple, orange, b...|
|  2|   grape,kiwi,peach|[grape, kiwi, peach]|
+---+-------------------+--------------------+

+------+------+------+
|fruit1|fruit2|fruit3|
+------+------+------+
| apple|orange|banana|
| grape|  kiwi| peach|
+------+------+------+
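
Since Spark 3.0, split also accepts an optional third argument, limit, which caps the number of elements in the result; once the limit is reached, the rest of the string is kept, unsplit, as the last element. A short sketch, reusing the “df” defined above:

Python3

# limit=2 produces at most two elements; the remainder is not split
df.select(split(df.fruits, ",", 2).alias("pair")).show(truncate=False)

# +----------------------+
# |pair                  |
# +----------------------+
# |[apple, orange,banana]|
# |[grape, kiwi,peach]   |
# +----------------------+

Relatedly, under Spark's default (non-ANSI) settings, getItem returns null rather than raising an error when a row's array has fewer elements than the requested index, so selecting a fixed number of columns is safe even for ragged data.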

