Examples of Splitting the Rows of a Spark RDD by Delimiter
Let us see a few examples of splitting the rows of a Spark RDD based on different delimiters.
Example 1: Splitting Rows by comma
In this example, suppose we have an RDD of strings where each row contains a list of values separated by commas. We use the parallelize method to create an RDD from comma-separated strings, then define a function, split_row, that splits a row on the comma delimiter using the split method. The map transformation applies split_row to each row. Finally, we use the collect action, which pulls the RDD's entire data set back to the driver and returns it as a list. We loop through the list and print each entry to the console.
Python3
# Import required modules
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SplitRowsByDelimiter").getOrCreate()

# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["apple,orange,banana", "carrot,tomato,potato"])

# Define a function to split each row by a comma
def split_row(row):
    return row.split(",")

# Apply the function to each row using the map transformation
split_rdd = rdd.map(split_row)

# Collect the result as a list of lists
result = split_rdd.collect()

# Print the result
for row in result:
    print(row)
Output:
['apple', 'orange', 'banana']
['carrot', 'tomato', 'potato']
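The split logic itself runs as plain Python on each row, so it can be checked locally before shipping it to Spark. A small sketch (no Spark session needed) shows one behavior worth knowing: str.split keeps empty strings for consecutive delimiters, which is how missing fields show up in the result.

```python
# The same split_row function used in the Spark example above
def split_row(row):
    return row.split(",")

# A well-formed row splits into its three values
print(split_row("apple,orange,banana"))  # → ['apple', 'orange', 'banana']

# Consecutive commas produce an empty string, marking a missing field
print(split_row("apple,,banana"))        # → ['apple', '', 'banana']
```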
Example 2: Splitting Rows by tab delimiter
In this example, suppose we have an RDD of strings where each row contains a list of values separated by tabs. We use the parallelize method to create an RDD from tab-separated strings. Splitting each row on the "\t" character divides it into an array of values, which is returned as the new element.
We apply a transformation to each row of the RDD using the map() method. The parameter to map() is a lambda function that specifies the transformation applied to each element; in this instance, we use the split() method to divide each row by the delimiter.
Python3
# Import required modules
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SplitRowsByDelimiter").getOrCreate()

# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["foo\tbar\tbaz", "hello\tworld"])

# Define the delimiter
delimiter = "\t"

# Split the rows by the delimiter
split_data = rdd.map(lambda row: row.split(delimiter))

# Print the resulting RDD
for row in split_data.collect():
    print(row)
Output:
['foo', 'bar', 'baz']
['hello', 'world']
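A related option, if only the first field needs separating (for example, a key followed by a tab-separated payload), is the optional maxsplit argument of str.split. This is a small local sketch, not part of the original example:

```python
delimiter = "\t"
row = "hello\tworld\tagain"

# Split on every tab
print(row.split(delimiter))     # → ['hello', 'world', 'again']

# Split only once: everything after the first tab stays together
print(row.split(delimiter, 1))  # → ['hello', 'world\tagain']
```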
Example 3: Splitting Rows by space delimiter
In this example, suppose we have an RDD of strings where each row contains a list of values separated by spaces. Calling the split method with no arguments divides each row into an array of values on whitespace, returning the array as the new element.
Python3
# Import required modules
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SplitRowsByDelimiter").getOrCreate()

# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["Geeks for Geeks", "hello world"])

# Define a function to split each row by whitespace
def split_row(row):
    return row.split()

# Apply the function to each row using the map transformation
split_rdd = rdd.map(split_row)

# Collect the result as a list of lists
result = split_rdd.collect()

# Print the result
for row in result:
    print(row)
Output:
['Geeks', 'for', 'Geeks']
['hello', 'world']
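Note that split() with no arguments behaves differently from split(" "): the former splits on any run of whitespace (spaces, tabs, newlines) and drops empty strings, while the latter splits on single space characters only. A small local sketch illustrates the difference:

```python
row = "Geeks  for\tGeeks"

# No-argument split: any whitespace counts, and runs are collapsed
print(row.split())     # → ['Geeks', 'for', 'Geeks']

# Explicit " " delimiter: the doubled space yields an empty string,
# and the tab is not treated as a delimiter
print(row.split(" "))  # → ['Geeks', '', 'for\tGeeks']
```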
How to Split the Rows of a Spark RDD by Delimiter
Apache Spark is a powerful distributed computing system for processing huge datasets. A fundamental concept in Spark is the Resilient Distributed Dataset (RDD), an immutable distributed collection of objects. Splitting the rows of an RDD based on a delimiter is a typical Spark task, and it is helpful for parsing structured data such as CSV or TSV files. In this article, we learned how to split the rows of a Spark RDD based on a delimiter in Python.