How to use drop() function In Python
We can join the dataframes using joins like inner join and after this join, we can use the drop method to remove one duplicate column.
Syntax: dataframe.join(dataframe1,dataframe.column_name == dataframe1.column_name,”inner”).drop(dataframe.column_name)
where,
- dataframe is the first dataframe
- dataframe1 is the second dataframe
- inner specifies inner join
- drop() will delete the common column and delete first dataframe column
Example: Join two dataframes based on ID and remove duplicate ID in first dataframe
Python3
# inner join on two dataframes # and remove duplicate column dataframe.join(dataframe1, dataframe. ID = = dataframe1. ID , "inner" ).drop(dataframe. ID ).show() |
Output:
How to avoid duplicate columns after join in PySpark ?
In this article, we will discuss how to avoid duplicate columns in DataFrame after join in PySpark using Python.
Create the first dataframe for demonstration:
Python3
# importing module import pyspark # importing sparksession from pyspark.sql module from pyspark.sql import SparkSession # creating sparksession and giving an app name spark = SparkSession.builder.appName( 'sparkdf' ).getOrCreate() # list of employee data data = [[ "1" , "sravan" , "company 1" ], [ "2" , "ojaswi" , "company 1" ], [ "3" , "rohith" , "company 2" ], [ "4" , "sridevi" , "company 1" ], [ "5" , "bobby" , "company 1" ]] # specify column names columns = [ 'ID' , 'NAME' , 'Company' ] # creating a dataframe from the lists of data dataframe = spark.createDataFrame(data, columns) dataframe.show() |
Output:
Create a second dataframe for demonstration:
Python3
# list of employee data data1 = [[ "1" , "45000" , "IT" ], [ "2" , "145000" , "Manager" ], [ "6" , "45000" , "HR" ], [ "5" , "34000" , "Sales" ]] # specify column names columns = [ 'ID' , 'salary' , 'department' ] # creating a dataframe from the lists of data dataframe1 = spark.createDataFrame(data1, columns) dataframe1.show() |
Output:
Contact Us