Building Sample DataFrames

Let us build two sample DataFrames in Scala that we can join.

Scala
import org.apache.spark.sql.SparkSession

object joindfs {
  def main(args: Array[String]): Unit = {
    // Start a local Spark session that uses a single worker thread.
    val spark: SparkSession = SparkSession.builder().master("local[1]").getOrCreate()

    // Students in the class: (Id, Name).
    val class_columns = Seq("Id", "Name")
    val class_data    = Seq((1, "Dhruv"), (2, "Akash"), (3, "Aayush"))
    val class_df = spark.createDataFrame(class_data).toDF(class_columns: _*)

    // Marks per student and subject: (Id, Subject, Score).
    val result_column = Seq("Id", "Subject", "Score")
    val result_data   = Seq(
      (1, "Maths", 98), (2, "Maths", 99), (3, "Maths", 94),
      (1, "Physics", 95), (2, "Physics", 97), (3, "Physics", 99))
    val result_df = spark.createDataFrame(result_data).toDF(result_column: _*)

    // Print both DataFrames to the console.
    class_df.show()
    result_df.show()

    spark.stop()
  }
}

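Output:

class_df

+---+------+
| Id|  Name|
+---+------+
|  1| Dhruv|
|  2| Akash|
|  3|Aayush|
+---+------+

result_df

+---+-------+-----+
| Id|Subject|Score|
+---+-------+-----+
|  1|  Maths|   98|
|  2|  Maths|   99|
|  3|  Maths|   94|
|  1|Physics|   95|
|  2|Physics|   97|
|  3|Physics|   99|
+---+-------+-----+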

Explanation:

Here we have created two DataFrames:

  1. class_df is the class DataFrame, which holds information about the students in a classroom.
  2. result_df is the result DataFrame, which holds the students' marks in Maths and Physics.

Both share the Id column, so they can be combined into a single DataFrame that contains both student and result information.

Let us see how to join these dataframes now.

How to Join Two DataFrames in Scala?

Scala stands for "scalable language". It is statically typed, but unlike statically typed languages such as C, C++, or Java, which traditionally require explicit type declarations, Scala usually does not need type annotations in the code because the compiler infers most types. Types are still verified at compile time. Static typing allows us to build safe systems by default. Built-in checks and actionable error messages, combined with thread-safe data structures and collections, prevent many tricky bugs before the program first runs.
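The point about inferred types can be illustrated with a small sketch. The object name TypeInferenceSketch and the values in it are made up for illustration; the commented-out line shows the kind of mistake the compiler is expected to reject before the program runs.

```scala
object TypeInferenceSketch {
  // No annotations here: the compiler infers Int, String and List[Int].
  val count = 3
  val name = "Dhruv"
  val scores = List(98, 95)

  // An explicit annotation is still allowed and is checked at compile time.
  val total: Int = scores.sum

  // Uncommenting the next line is a compile-time type mismatch, not a runtime error:
  // val wrong: Int = name

  def main(args: Array[String]): Unit = {
    println(s"$name has $count ids on record, total score $total")
  }
}
```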

Understanding Dataframe and Spark

A DataFrame is a data structure in Apache Spark. Spark is used to build distributed applications, i.e. code that can run on many machines at the same time....

Joining DataFrames

Use df.join()...
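As a minimal sketch of a df.join() call on the sample data, the example below assumes that Id is the shared key column and that an inner join is wanted. The names JoinSketch, sampleFrames and joinOnId are illustrative and not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinSketch {
  // Rebuilds the two sample DataFrames used in this article.
  def sampleFrames(spark: SparkSession): (DataFrame, DataFrame) = {
    val classDf = spark
      .createDataFrame(Seq((1, "Dhruv"), (2, "Akash"), (3, "Aayush")))
      .toDF("Id", "Name")
    val resultDf = spark
      .createDataFrame(Seq(
        (1, "Maths", 98), (2, "Maths", 99), (3, "Maths", 94),
        (1, "Physics", 95), (2, "Physics", 97), (3, "Physics", 99)))
      .toDF("Id", "Subject", "Score")
    (classDf, resultDf)
  }

  // Inner join on the shared "Id" column (assumed join type).
  // Passing the key as a Seq of names keeps a single Id column in the result.
  def joinOnId(classDf: DataFrame, resultDf: DataFrame): DataFrame =
    classDf.join(resultDf, Seq("Id"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    val (classDf, resultDf) = sampleFrames(spark)
    joinOnId(classDf, resultDf).show()
    spark.stop()
  }
}
```

Each student appears once per subject in the joined DataFrame, because every Id in the class data matches two rows in the result data.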

Conclusion

In this article we have seen how to join two DataFrames in Scala. Broadly, this can be done either with the Scala join function or with SQL syntax. The join function can itself be called in two ways, using strings or expressions. The string form can be used when the join columns have the same names in both DataFrames; in that case, the join columns are given as a list of strings. When the column names differ, we can use an expression to specify the join condition. The SQL method creates temporary views from the DataFrames and performs the join on them. It then returns the result of the join as a new, joined DataFrame....
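The expression and SQL variants described above might look like the following sketch. To make the key columns differ, the result data uses a hypothetical column name StudentId instead of Id; the view names class_view and result_view and the object ExprAndSqlJoinSketch are also assumptions made for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExprAndSqlJoinSketch {
  // Sample data from this article; "StudentId" is a hypothetical rename
  // so that the two key columns do not share a name.
  def sampleFrames(spark: SparkSession): (DataFrame, DataFrame) = {
    val classDf = spark
      .createDataFrame(Seq((1, "Dhruv"), (2, "Akash"), (3, "Aayush")))
      .toDF("Id", "Name")
    val resultDf = spark
      .createDataFrame(Seq(
        (1, "Maths", 98), (2, "Maths", 99), (3, "Maths", 94),
        (1, "Physics", 95), (2, "Physics", 97), (3, "Physics", 99)))
      .toDF("StudentId", "Subject", "Score")
    (classDf, resultDf)
  }

  // Expression-based join: the condition names one column from each side.
  // Both key columns (Id and StudentId) remain in the result.
  def joinByExpression(classDf: DataFrame, resultDf: DataFrame): DataFrame =
    classDf.join(resultDf, classDf("Id") === resultDf("StudentId"), "inner")

  // SQL-based join: register temporary views, then query them.
  def joinBySql(spark: SparkSession, classDf: DataFrame, resultDf: DataFrame): DataFrame = {
    classDf.createOrReplaceTempView("class_view")
    resultDf.createOrReplaceTempView("result_view")
    spark.sql(
      """SELECT c.Id, c.Name, r.Subject, r.Score
        |FROM class_view c
        |JOIN result_view r ON c.Id = r.StudentId""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    val (classDf, resultDf) = sampleFrames(spark)
    joinByExpression(classDf, resultDf).show()
    joinBySql(spark, classDf, resultDf).show()
    spark.stop()
  }
}
```

The SQL version selects only the columns it needs, while the expression version keeps every column from both sides, so a select or drop may be wanted afterwards.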
