How to use the getItem() function in PySpark
In this example, we first create a DataFrame with two columns, "id" and "fruits", where "fruits" is an array column. To split the fruits array into separate columns, we use getItem() together with col() to create one new column per array position. getItem() is a method of the PySpark Column class that extracts a single element from an array column in a DataFrame. It takes an integer index and returns the element at that position, counting from 0. The arrays in this example have different lengths, so some rows have no element at the higher positions. With ANSI mode disabled, Spark returns null for those positions; with ANSI mode enabled, an out-of-range index may raise an error instead.
Syntax: getItem(index)
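The index parameter is the zero-based position of the element to extract from the array column. In the PySpark API the parameter is named key, because the same method also looks up a value by key in a map column. Since Spark 3.0, passing a Column as the argument is no longer supported; to index by another column, use the indexing operator instead, for example col('fruits')[col('idx')], where idx is a hypothetical integer column.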
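To make the indexing rule concrete without starting Spark, here is a plain-Python model of what happens to one row's array. It is only a sketch: get_item and split_row are hypothetical helpers, not part of PySpark, and the model assumes the non-ANSI behaviour in which a missing position becomes null (None in Python).

```python
from typing import Any, List, Optional, Sequence


def get_item(values: Optional[Sequence[Any]], index: int) -> Any:
    """Hypothetical stand-in for getItem(index) applied to a single row."""
    # Assumption: a null array, a negative index, or an index past the end
    # all produce None, mirroring Spark's null result with ANSI mode off.
    if values is None or index < 0 or index >= len(values):
        return None
    return values[index]


def split_row(values: Optional[Sequence[Any]], width: int) -> List[Any]:
    """Return one value per position 0..width-1, padding with None."""
    return [get_item(values, i) for i in range(width)]


rows = [(1, ['apple', 'banana', 'orange']),
        (2, ['grape', 'kiwi', 'pineapple', 'watermelon']),
        (3, ['peach', 'pear'])]

for row_id, fruits in rows:
    # Each row becomes an id plus four positional values
    print(row_id, split_row(fruits, 4))
```

The shortest row ends up with two values followed by two None entries, which is why the split columns must be nullable.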
Python3
# importing packages
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# creating a spark session
spark = SparkSession.builder.appName('split_array_to_columns').getOrCreate()

data = [(1, ['apple', 'banana', 'orange']),
        (2, ['grape', 'kiwi', 'pineapple', 'watermelon']),
        (3, ['peach', 'pear'])]

# creating a dataframe
df = spark.createDataFrame(data, ['id', 'fruits'])
df.show()

# splitting the fruits column into multiple columns:
# one column per array position, named fruit1 to fruit4
df = df.select('id', *[col('fruits').getItem(i).alias(f'fruit{i + 1}')
                       for i in range(0, 4)])
df.printSchema()
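Output: df.show() prints the three input rows, and df.printSchema() prints a schema with the id column followed by four string columns, fruit1 to fruit4.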
Spark – Split array to separate column
Apache Spark is a distributed big data processing engine that analyzes large datasets in parallel across a cluster of machines. PySpark is the Python interface to Apache Spark, which lets Python programmers write Spark applications quickly and easily.