How to use the takeSample() method in Python

We first convert the PySpark DataFrame to an RDD. A Resilient Distributed Dataset (RDD) is the most fundamental data structure in PySpark: an immutable, distributed collection of elements of any data type.

We can get the RDD of a DataFrame through its DataFrame.rdd attribute and then call the takeSample() method on that RDD.

Syntax of takeSample():

takeSample(withReplacement, num, seed=None) 

Parameters

withReplacement : bool

Whether to sample with replacement (True) or without replacement (False). This argument is required and has no default value.

num : int

The number of elements to return in the sample.

seed : int, optional

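Seed for the random number generator. Passing the same seed makes the sampling reproducible.

Returns : A Python list of up to num elements sampled from the RDD (Row objects when the RDD comes from a DataFrame). The whole sample is loaded into the driver's memory, so this method is intended for small values of num.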

Example: In this example, we call takeSample() on the RDD with num = 1, so the method returns a list containing a single Row object.

Python




# Importing SparkSession, the entry point
# for creating DataFrames
from pyspark.sql import SparkSession
  
# creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session'
).getOrCreate()
  
# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']
  
# Creating a DataFrame
df = random_row_session.createDataFrame(data,
                                        columns)
  
# Printing the DataFrame
df.show()
  
# Getting RDD object from the DataFrame
rdd = df.rdd
  
# Taking a single sample from the RDD by
# passing num = 1 to takeSample(); the result is a list
rdd_sample = rdd.takeSample(withReplacement=False,
                            num=1)
print(rdd_sample)


Output

+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

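[Row(Letters='c', Position=3)]

Because the row is picked at random and no seed is passed, the Row printed on the last line may differ from run to run.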
