How to take a random row from a PySpark DataFrame?
In this article, we are going to learn how to take a random row from a PySpark DataFrame in the Python programming language.
Method 1: Using the PySpark sample() method
PySpark provides various sampling methods that return a random sample of rows from a given PySpark DataFrame.
Here are the details of the sample() method :
Syntax : DataFrame.sample(withReplacement, fraction, seed)
It returns a subset of the DataFrame.
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
fraction : float, optional
Fraction of rows to generate, in the range [0.0, 1.0].
seed : int, optional
Used to reproduce the same random sampling.
Example:
In this example, we need to pass a fraction as a float in the range [0.0, 1.0]. Using the formula:
Number of rows needed = Fraction * Total Number of rows
Since we want a single row, the fraction we need is 1 / total number of rows.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession

# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Taking a sample of df and storing it in df2.
# Please note that the second argument here is the fraction
# of the data set we need (the fraction is a float):
# number of rows = fraction * total number of rows
df2 = df.sample(False, 1.0 / len(df.collect()))

# Printing the sample row, which is a DataFrame
df2.show()
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

+-------+--------+
|Letters|Position|
+-------+--------+
|      b|       2|
+-------+--------+
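Note that fraction-based sampling is approximate: a given fraction can occasionally return zero rows or more than one row. If exactly one random row is required, a common alternative (not part of the example above, shown here only as a minimal sketch) is to order the DataFrame by a random value and keep the first row:

Python

from pyspark.sql.functions import rand

# Shuffle the rows using a random column and keep only the first one.
# Unlike fraction-based sampling, this always returns exactly one row.
one_random_row = df.orderBy(rand(seed=42)).limit(1)
one_random_row.show()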
Method 2: Using the takeSample() method
We first convert the PySpark DataFrame to an RDD. A Resilient Distributed Dataset (RDD) is the simplest and most fundamental data structure in PySpark: an immutable, distributed collection of objects of any data type.
We can get the RDD of a DataFrame using DataFrame.rdd and then call the takeSample() method on it.
Syntax of takeSample() :
takeSample(withReplacement, num, seed=None)
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
num : int
the number of sample values
seed : int, optional
Used to reproduce the same random sampling.
Returns : It returns a list of num sampled elements of the RDD (Row objects, in our case).
Example: In this example, we call the takeSample() method on the RDD with num = 1 to get a single Row object; num is the number of samples to return.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Getting the RDD object from the DataFrame
rdd = df.rdd

# Taking a single sample from the RDD
# by passing num = 1 to the takeSample() function
rdd_sample = rdd.takeSample(withReplacement=False, num=1)
print(rdd_sample)
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

[Row(Letters='c', Position=3)]
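Since takeSample() returns a plain Python list of Row objects rather than a DataFrame, it can be turned back into a single-row DataFrame if needed. A minimal sketch, reusing random_row_session and rdd_sample from the example above:

Python

# rdd_sample is a list like [Row(Letters='c', Position=3)],
# so it can be passed straight to createDataFrame()
df_sampled = random_row_session.createDataFrame(rdd_sample)
df_sampled.show()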
Method 3: Convert the PySpark DataFrame to a Pandas DataFrame and use the sample() method
We can use the toPandas() function to convert a PySpark DataFrame to a Pandas DataFrame. This method should only be used if the resulting Pandas DataFrame is expected to be small, as all the data is loaded into the driver's memory. This is an experimental method.
We will then use the sample() method of the Pandas library. It returns a random sample from an axis of the Pandas DataFrame.
Syntax : PandasDataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Example:
In this example, we will be converting our PySpark DataFrame to a Pandas DataFrame and using the Pandas sample() function on it.
Python
# importing the library and
# its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession

# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Converting the DataFrame to a Pandas DataFrame
# and taking a sample row
pandas_random = df.toPandas().sample()

# Converting the sample back into a PySpark DataFrame
df_random = random_row_session.createDataFrame(pandas_random)

# Showing our randomly selected row
df_random.show()
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

+-------+--------+
|Letters|Position|
+-------+--------+
|      b|       2|
+-------+--------+
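The Pandas sample() method also accepts the other parameters listed in the syntax above; for example, n selects a fixed number of rows and random_state makes the draw reproducible. A small sketch building on the df from the example above:

Python

# Take two distinct random rows, reproducibly, from the Pandas copy
pandas_two_rows = df.toPandas().sample(n=2, random_state=0)
print(pandas_two_rows)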