PySpark – Create dictionary from data in two columns
In this article, we are going to see how to create a dictionary from data in two columns in PySpark using Python.
Method 1: Using a dictionary comprehension
Here we will create a DataFrame with two columns and then convert it into a dictionary using a dictionary comprehension.
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data.
rows = [['John', 54], ['Adam', 65],
        ['Michael', 56], ['Kelsey', 37],
        ['Chris', 49], ['Jonathan', 28],
        ['Anthony', 26], ['Esther', 48],
        ['Rachel', 52], ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# A dictionary comprehension is used here:
# the Name column is the key and the Age
# column is the value.
# collect() returns a list of Row objects,
# one for each row of the DataFrame.
# To reverse the key/value pairs, you can use
# {row['Age']: row['Name'] for row in df_pyspark.collect()}
result_dict = {row['Name']: row['Age']
               for row in df_pyspark.collect()}

# Printing a few key:value pairs of
# our final resultant dictionary
print(result_dict['John'])
print(result_dict['Michael'])
print(result_dict['Adam'])
Output:

+--------+---+
|    Name|Age|
+--------+---+
|    John| 54|
|    Adam| 65|
| Michael| 56|
|  Kelsey| 37|
|   Chris| 49|
|Jonathan| 28|
| Anthony| 26|
|  Esther| 48|
|  Rachel| 52|
|  Joseph| 56|
| Richard| 49|
+--------+---+

54
56
65
Method 2: Converting the PySpark DataFrame to pandas and using the to_dict() method
Here are the details of the to_dict() method:
Syntax: DataFrame.to_dict(orient='dict')
Parameters:
- orient : str {'dict', 'list', 'series', 'split', 'records', 'index'}
- Determines the type of the values of the dictionary.
Return: It returns a Python dictionary corresponding to the DataFrame.
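To make the effect of the orient parameter concrete, here is a small standalone pandas sketch (using a two-row toy DataFrame, not the PySpark data above) comparing three common orients:

```python
import pandas as pd

# A tiny toy DataFrame just to illustrate orient
df = pd.DataFrame({'Name': ['John', 'Adam'], 'Age': [54, 65]})

# orient='dict' (the default): {column -> {index -> value}}
print(df.to_dict())
# {'Name': {0: 'John', 1: 'Adam'}, 'Age': {0: 54, 1: 65}}

# orient='list': {column -> [list of values]}
print(df.to_dict(orient='list'))
# {'Name': ['John', 'Adam'], 'Age': [54, 65]}

# orient='records': a list with one dict per row
print(df.to_dict(orient='records'))
# [{'Name': 'John', 'Age': 54}, {'Name': 'Adam', 'Age': 65}]
```

The example below uses orient='list', which produces the same "column name to list of values" shape as the other methods in this article.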
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data.
rows = [['John', 54], ['Adam', 65],
        ['Michael', 56], ['Kelsey', 37],
        ['Chris', 49], ['Jonathan', 28],
        ['Anthony', 26], ['Esther', 48],
        ['Rachel', 52], ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

# Convert the PySpark DataFrame to a
# pandas DataFrame
df_pandas = df_pyspark.toPandas()

# Convert the pandas DataFrame into a
# dictionary; orient='list' maps each
# column name to a list of its values
result = df_pandas.to_dict(orient='list')

# Print the dictionary
print(result)
Output:

+--------+---+
|    Name|Age|
+--------+---+
|    John| 54|
|    Adam| 65|
| Michael| 56|
|  Kelsey| 37|
|   Chris| 49|
|Jonathan| 28|
| Anthony| 26|
|  Esther| 48|
|  Rachel| 52|
|  Joseph| 56|
| Richard| 49|
+--------+---+

{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}
Method 3: Iterating over the columns of the DataFrame
Here we iterate through the columns and build a dictionary in which the keys are the column names and the values are lists of the column values.
For this, we first need to convert the PySpark DataFrame to a pandas DataFrame.
Python
# importing pyspark
# make sure you have installed the pyspark library
import pyspark

# Importing and creating a SparkSession
# to work on DataFrames
# The session name is 'Practice_Session'
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName(
    'Practice_Session').getOrCreate()

# Creating a DataFrame using the createDataFrame()
# method, with hard-coded data.
rows = [['John', 54], ['Adam', 65],
        ['Michael', 56], ['Kelsey', 37],
        ['Chris', 49], ['Jonathan', 28],
        ['Anthony', 26], ['Esther', 48],
        ['Rachel', 52], ['Joseph', 56],
        ['Richard', 49]]
columns = ['Name', 'Age']
df_pyspark = spark_session.createDataFrame(rows, columns)

# printing the DataFrame
df_pyspark.show()

result = {}

# Convert the PySpark DataFrame to a
# pandas DataFrame
df_pandas = df_pyspark.toPandas()

# Traverse through each column
for column in df_pandas.columns:
    # Add the column name as the key and the
    # list of the column's values as the value
    result[column] = df_pandas[column].values.tolist()

# Print the dictionary
print(result)
Output:

+--------+---+
|    Name|Age|
+--------+---+
|    John| 54|
|    Adam| 65|
| Michael| 56|
|  Kelsey| 37|
|   Chris| 49|
|Jonathan| 28|
| Anthony| 26|
|  Esther| 48|
|  Rachel| 52|
|  Joseph| 56|
| Richard| 49|
+--------+---+

{'Name': ['John', 'Adam', 'Michael', 'Kelsey', 'Chris', 'Jonathan', 'Anthony', 'Esther', 'Rachel', 'Joseph', 'Richard'], 'Age': [54, 65, 56, 37, 49, 28, 26, 48, 52, 56, 49]}