PySpark – Create dictionary from data in two columns
In this article, we are going to see how to create a dictionary from data in two columns in PySpark using Python.
Method 1: Using a dictionary comprehension
Here we will create a DataFrame with two columns and then convert it into a dictionary using a dictionary comprehension.
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data.
rows = [['John', 54], ['Adam', 65],
        ['Michael', 56], ['Kelsey', 37],
        ['Chris', 49], ['Jonathan', 28],
        ['Anthony', 26], ['Esther', 48],
        ['Rachel', 52], ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# A dictionary comprehension is used here:
# the Name column is the key and the Age
# column is the value.
# collect() returns a list of Row objects,
# one for each row of the DataFrame.
# To reverse the key/value pairs, you can use
# {row['Age']: row['Name'] for row in df_pyspark.collect()}
result_dict = {row['Name']: row['Age']
               for row in df_pyspark.collect()}

# Printing a few key:value pairs of
# our final resultant dictionary
print(result_dict['John'])
print(result_dict['Michael'])
print(result_dict['Adam'])
Output:

+--------+---+
|    Name|Age|
+--------+---+
|    John| 54|
|    Adam| 65|
| Michael| 56|
|  Kelsey| 37|
|   Chris| 49|
|Jonathan| 28|
| Anthony| 26|
|  Esther| 48|
|  Rachel| 52|
|  Joseph| 56|
| Richard| 49|
+--------+---+

54
56
65
Method 2: Converting the PySpark DataFrame to pandas and using the to_dict() method
Here are the details of the to_dict() method:
Syntax: DataFrame.to_dict(orient='dict')
Parameters:
- orient : str {'dict', 'list', 'series', 'split', 'records', 'index'}
- Determines the type of the values of the dictionary.
Return: It returns a Python dictionary corresponding to the DataFrame.
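To make the effect of the orient parameter concrete, here is a small standalone pandas sketch (using a two-row toy DataFrame, not the PySpark data above) comparing three common orients:

```python
import pandas as pd

# A tiny toy DataFrame just to illustrate orient
df = pd.DataFrame({'Name': ['John', 'Adam'], 'Age': [54, 65]})

# orient='dict' (the default): {column -> {index -> value}}
print(df.to_dict())
# {'Name': {0: 'John', 1: 'Adam'}, 'Age': {0: 54, 1: 65}}

# orient='list': {column -> [list of values]}
print(df.to_dict(orient='list'))
# {'Name': ['John', 'Adam'], 'Age': [54, 65]}

# orient='records': a list with one dict per row
print(df.to_dict(orient='records'))
# [{'Name': 'John', 'Age': 54}, {'Name': 'Adam', 'Age': 65}]
```

The example below uses orient='list', which produces the same "column name to list of values" shape as the other methods in this article.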
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data.
rows = [['John', 54], ['Adam', 65],
        ['Michael', 56], ['Kelsey', 37],
        ['Chris', 49], ['Jonathan', 28],
        ['Anthony', 26], ['Esther', 48],
        ['Rachel', 52], ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# Convert the PySpark DataFrame to a
# pandas DataFrame
df_pandas = df_pyspark.toPandas()

# Convert the pandas DataFrame into a
# dictionary; orient='list' maps each
# column name to a list of its values
result = df_pandas.to_dict(orient='list')

# Print the dictionary
print(result)
Output:

+--------+---+
|    Name|Age|
+--------+---+
|    John| 54|
|    Adam| 65|
| Michael| 56|
|  Kelsey| 37|
|   Chris| 49|
|Jonathan| 28|
| Anthony| 26|
|  Esther| 48|
|  Rachel| 52|
|  Joseph| 56|
| Richard| 49|
+--------+---+

{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}
Method 3: Iterating over the columns of the DataFrame
Here we iterate through the columns and build a dictionary in which the keys are the column names and the values are lists of the column values.
For this, we first need to convert the PySpark DataFrame to a pandas DataFrame.
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data.
rows = [['John', 54], ['Adam', 65],
        ['Michael', 56], ['Kelsey', 37],
        ['Chris', 49], ['Jonathan', 28],
        ['Anthony', 26], ['Esther', 48],
        ['Rachel', 52], ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

result = {}

# Convert the PySpark DataFrame to a
# pandas DataFrame
df_pandas = df_pyspark.toPandas()

# Traverse through each column
for column in df_pandas.columns:
    # Add the column name as the key and the
    # list of the column's values as the value
    result[column] = df_pandas[column].values.tolist()

# Print the dictionary
print(result)
Output:

+--------+---+
|    Name|Age|
+--------+---+
|    John| 54|
|    Adam| 65|
| Michael| 56|
|  Kelsey| 37|
|   Chris| 49|
|Jonathan| 28|
| Anthony| 26|
|  Esther| 48|
|  Rachel| 52|
|  Joseph| 56|
| Richard| 49|
+--------+---+

{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}