Joining DataFrames
Use df.join()
We can join one dataframe with another using the join statement. Let us see various examples of joins below.
Example 1: Specify join columns as a String or Sequence of String
If the column name in both dataframes are same we can simply write the names of the columns on which we want to join.
val joined_df = class_df.join(result_df, Seq("id"))
// OR
val joined_df = class_df.join(result_df, "id")
joined_df.show()
Output:
As it can be seen above the dataframes were joined by specifying the name of the common column.
Example 2: Specify join condition using expressions
If the column names are not same then we can use the join expressions to specify the match condition. The code for this type of join is as follows:
import spark.implicits._
val joined_df = class_df.join(result_df, class_df("id") === result_df("id")).select(result_df("Id"), $"Name", $"Subject", $"Score")
joined_df.show()
Output:
As seen above the join is performed using the expression.
We can also specify the join condition in both of the above examples. Let us try to do a left join of the class dataframe with the result dataframe. We will remove the last id from the result dataframe to verify that the left join is actually performed. The code for the left join is as follows:
val result_filtered_df = result_df.filter("id in (1,2)")
val joined_df = class_df.join(result_filtered_df, "Id", "left")
joined_df.show()
Output:
As seen above, the left join was performed successfully. The missing record for results was filled with NULL values. Similarly, we can perform left join with example 2 as well.
Using SQL
We can also join the two dataframes using sql. To do that we will first need to convert the dataframes to views. We will then join the views and store the result to another dataframe. Let us see how to perform the join using SQL.
class_df.createOrReplaceTempView("class_df_view")
result_df.createOrReplaceTempView("result_df_view")
var joined_df = spark.sql("SELECT cl.Id, Name, Subject, Score FROM class_df_view cl INNER JOIN result_df_view rs ON cl.Id = rs.ID")
joined_df.show()
Output:
As seen above the views were joined to create a new dataframe. This method helps those familiar with the SQL syntax and allows for easy migration from SQL projects. Although the views are extra but since they are temporary they will be deleted after the session ends.
How to Join Two DataFrame in Scala?
Scala stands for scalable language. It is a statically typed language although unlike other statically typed languages like C, C++, or Java, it doesn’t require type information while writing the code. The type verification is done at the compile time. Static typing allows us to build safe systems by default. Smart built-in checks and actionable error messages, combined with thread-safe data structures and collections, prevent many tricky bugs before the program first runs.
Contact Us