Examples of Splitting the Rows of a Spark RDD by Delimiter

Let us look at a few examples of splitting the rows of a Spark RDD based on different delimiters.

Example 1: Splitting Rows by comma delimiter

In this example, let us say we have an RDD of strings where each row contains a list of values separated by commas. We use the parallelize method to generate an RDD from comma-separated strings. We then define a function split_row, which uses the split method to split a row on the comma delimiter, and apply it to every row with the map transformation. Finally, we use the collect action to gather the result: the collect() function pulls the RDD's entire dataset back to the driver and returns it as a list. We then loop through that list and print each entry to the console.

Python3




# Import required modules
from pyspark.sql import SparkSession
 
# Create a SparkSession
spark = SparkSession.builder.appName("SplitRowsByDelimiter").getOrCreate()
 
# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["apple,orange,banana", "carrot,tomato,potato"])
 
# define a function to split each row by a comma
def split_row(row):
    return row.split(",")
 
# apply the function to each row using the map transformation
split_rdd = rdd.map(split_row)
 
# collect the result as a list of arrays
result = split_rdd.collect()
 
# print the result
for row in result:
    print(row)


Output:

['apple', 'orange', 'banana']
['carrot', 'tomato', 'potato']
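
Note that map keeps one list per row. If, instead, each individual value should become its own element of the resulting RDD, a flatMap transformation can be used in place of map. The snippet below is a minimal sketch that reuses the rdd and split_row names from the example above.

# flatMap flattens the per-row lists into a single RDD of values
flat_rdd = rdd.flatMap(split_row)

# collect and print the flattened values
print(flat_rdd.collect())
# ['apple', 'orange', 'banana', 'carrot', 'tomato', 'potato']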

Example 2: Splitting Rows by tab delimiter

In this example, let us say we have an RDD of strings where each row contains a list of values separated by tabs. We use the parallelize function to generate an RDD from tab-separated strings. To split the rows by the tab delimiter, we split each row on the '\t' character, which divides the row into an array of values and returns that array as a new element.

We apply a transformation to each row of the RDD using the map() method. The map() function takes as its parameter a lambda function that specifies the transformation to apply to each element; in this case, the lambda uses the split() method to divide each row by the delimiter.

Python3




# Import required modules
from pyspark.sql import SparkSession
 
# Create a SparkSession
spark = SparkSession.builder.appName("
                  SplitRowsByDelimiter").getOrCreate()
 
# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["foo\tbar\tbaz",
                                      "hello\tworld"])
 
# Define the delimiter
delimiter = "\t"
 
# Split the rows by the delimiter
split_data = rdd.map(lambda row: row.split(delimiter))
 
# Print the resulting RDD
for row in split_data.collect():
    print(row)


Output:

['foo', 'bar', 'baz']
['hello', 'world']
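
In practice, tab-separated rows often come from a file rather than from parallelize. As a rough sketch, the same splitting logic can be applied to an RDD loaded with textFile(); the file path data.tsv used here is purely a hypothetical placeholder, and the spark session and delimiter variable are reused from the example above.

# load a tab-separated file as an RDD of lines (hypothetical path)
lines = spark.sparkContext.textFile("data.tsv")

# split every line on the tab delimiter, exactly as above
split_lines = lines.map(lambda row: row.split(delimiter))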

Example 3: Splitting Rows by space delimiter

In this example, let us say we have an RDD of strings where each row contains a list of values separated by spaces. We use the split method with no arguments, which splits each row on whitespace, dividing it into an array of values and returning that array as a new element.

Python3




# Import required modules
from pyspark.sql import SparkSession
 
# Create a SparkSession
spark = SparkSession.builder.appName(
    "SplitRowsByDelimiter").getOrCreate()
 
# Create an RDD with sample data
rdd = spark.sparkContext.parallelize(["Geeks for Geeks",
                                      "hello world"])
 
# define a function to split each row by a space delimiter
def split_row(row):
    return row.split()
 
# apply the function to each row using the map transformation
split_rdd = rdd.map(split_row)
 
# collect the result as a list of arrays
result = split_rdd.collect()
 
# print the result
for row in result:
    print(row)


Output:

['Geeks', 'for', 'Geeks']
['hello', 'world']
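
The three examples above differ only in the delimiter passed to split. As a closing sketch, the same logic can be wrapped in a small reusable helper; the name split_rdd_by_delimiter is a hypothetical name chosen for this illustration, and the helper reuses the spark session created above.

# A generic helper: split every row of an RDD by the given delimiter.
# Passing delimiter=None splits on any run of whitespace, as in Example 3.
def split_rdd_by_delimiter(rdd, delimiter=None):
    return rdd.map(lambda row: row.split(delimiter))

# usage with a comma-separated RDD, as in Example 1
comma_rdd = spark.sparkContext.parallelize(["apple,orange,banana"])
print(split_rdd_by_delimiter(comma_rdd, ",").collect())
# [['apple', 'orange', 'banana']]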


