How to use the takeSample() method in Python
We first convert the PySpark DataFrame to an RDD. A Resilient Distributed Dataset (RDD) is the simplest and most fundamental data structure in PySpark: an immutable, distributed collection of elements of any data type.
We can get the RDD underlying a DataFrame via DataFrame.rdd and then call the takeSample() method on it.
Syntax of takeSample() :
takeSample(withReplacement, num, seed=None)
Parameters :
withReplacement : bool, optional
Sample with replacement or not (default False).
num : int
The number of sample values to return.
seed : int, optional
Seed for the random number generator; pass the same seed to reproduce the same sample.
Returns : A list of num sampled elements (here, Row objects) from the RDD.
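To make the parameter semantics concrete, here is a rough pure-Python analogue of what takeSample() does (this is only an illustrative sketch using the standard random module, not PySpark's actual distributed implementation; the function name take_sample is hypothetical):

```python
import random

def take_sample(population, with_replacement=False, num=1, seed=None):
    # Hypothetical pure-Python analogue of RDD.takeSample() --
    # a sketch of the semantics, not the PySpark implementation.
    rng = random.Random(seed)
    if with_replacement:
        # Each draw is independent, so an element may appear more than once.
        return [rng.choice(population) for _ in range(num)]
    # Without replacement, the sample size is capped at the population size.
    return rng.sample(population, min(num, len(population)))

data = [('a', 1), ('b', 2), ('c', 3), ('d', 4)]
print(take_sample(data, num=1, seed=42))
```

As with the real method, passing the same seed reproduces the same sample, and sampling without replacement can never return more elements than the collection holds.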
Example: In this example, we call the takeSample() method on the RDD with num = 1 to get a single Row object back in a list.
Python
# Importing the library and its SparkSession functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Creating a session to make DataFrames
random_row_session = SparkSession.builder.appName(
    'Random_Row_Session').getOrCreate()

# Pre-set data for our DataFrame
data = [['a', 1], ['b', 2], ['c', 3], ['d', 4]]
columns = ['Letters', 'Position']

# Creating a DataFrame
df = random_row_session.createDataFrame(data, columns)

# Printing the DataFrame
df.show()

# Getting the RDD object from the DataFrame
rdd = df.rdd

# Taking a single sample from the RDD
# by passing num=1 to takeSample()
rdd_sample = rdd.takeSample(withReplacement=False, num=1)
print(rdd_sample)
Output :
+-------+--------+
|Letters|Position|
+-------+--------+
|      a|       1|
|      b|       2|
|      c|       3|
|      d|       4|
+-------+--------+

[Row(Letters='c', Position=3)]
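Note that takeSample() returns a list even when num = 1, and each element is a Row object whose fields can be read by name or by position. Row objects behave much like named tuples, so the access pattern can be sketched with collections.namedtuple standing in for pyspark.sql.Row (an assumption made here only so the snippet runs without a Spark installation; in real code you would index the Row objects returned by rdd.takeSample() directly):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumed shape of the sampled result above).
Row = namedtuple('Row', ['Letters', 'Position'])
rdd_sample = [Row(Letters='c', Position=3)]

row = rdd_sample[0]               # takeSample() returns a list, even for num=1
print(row.Letters, row.Position)  # field access by name
print(row[0], row[1])             # field access by position
```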