DEV Community

Sukumaar Mane

Apache Spark Unit Testing Strategies

A recipe (guide) for writing unit tests for Apache Spark with Scala, aimed mainly at beginners.


Recipe complexity level: ◼️◻️◻️◻️
Recipe prerequisites:

  • Some knowledge of Big Data, Apache Spark, Scala, Java.

Recipe ingredients:


Unit testing?
In computer programming, unit testing is a software testing method by which individual units of source code—sets of one or more computer program modules together with associated control data, usage procedures, and operating procedures—are tested to determine whether they are fit for use. (Wikipedia)

Writing the unit tests before writing the code they exercise is a brilliant strategy used in TDD.

TDD?
Test-driven development (TDD) is a software development process relying on software requirements being converted to test cases before software is fully developed, and tracking all software development by repeatedly testing the software against all test cases. (Wikipedia)

I will skip the advantages of writing unit tests and of TDD, because there are so many that they need a separate article.

This code sample uses the (super awesome) ScalaTest 😎🤩 testing framework.

ScalaTest:

  • According to its home page, it is the most flexible and most popular testing tool in the Scala ecosystem.
  • Among its many features, it lets you write tests in several styles.
    • Each style is designed to address a particular set of needs (see the ScalaTest documentation on selecting testing styles).
    • Each style has its own base class, such as AnyFunSuite or AnyFlatSpec.
    • JUnit lovers can use AnyFunSuite.
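
To make the idea of styles concrete, here is a minimal sketch of the same check written in two styles. The class names ListFunSuite and ListFlatSpec are made up for illustration, and the sketch assumes the full scalatest artifact is on the test classpath.

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.funsuite.AnyFunSuite

// xUnit-like style: each test is a named block.
class ListFunSuite extends AnyFunSuite {
  test("reverse puts the last element first") {
    assert(List(1, 2, 3).reverse.head == 3)
  }
}

// Specification-like style: each test reads as a sentence about a subject.
class ListFlatSpec extends AnyFlatSpec {
  "A reversed list" should "start with the original last element" in {
    assert(List(1, 2, 3).reverse.head == 3)
  }
}
```

Both classes verify the same behavior; only the way the test is named and laid out differs, so the choice is mostly about what reads best for your team.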

Project Creation:
Directory structure of my project:

.
├── build.sbt
└── src
    ├── main
    │   └── scala
    │       └── sukumaar
    │           └── App.scala  # no important code here
    └── test
        └── scala
            └── sukumaar
                ├── AppTest.scala
                └── TraitSparkSessionTest.scala

build.sbt I used:

name := "sample-spark-scala-project"
version := "1.0"
scalaVersion := "2.12.13"

val sparkVersion = "2.4.0"

libraryDependencies += 
    "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += 
    "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += 
    "org.scalatest" %% "scalatest" % "3.2.9" % Test

/*
// If you are only going to use AnyFunSuite, you can depend on
// just the funsuite module instead of the full scalatest artifact:
libraryDependencies += 
    "org.scalatest" %% "scalatest-funsuite" % "3.2.9" % Test
*/

Import this project into your favorite IDE.
If you prefer the CLI (like a mature developer), run sbt clean compile
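
Before running Spark tests from sbt, two optional settings are worth considering. This is a hedged sketch, not part of the build.sbt I used: both keys are standard sbt settings, but whether you need them depends on your machine and test suite.

```scala
// Optional additions to build.sbt (assumptions, not in the original build)

// Run tests in a separate JVM so Spark's memory use and class loading
// stay isolated from the sbt process itself.
Test / fork := true

// Run test suites one at a time; useful when they share one local SparkSession.
Test / parallelExecution := false
```

If the sample test already passes without them, you can leave both out.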


The steps I followed:
Of course you can change the package name; if you do, change the directory names in the previous step to match.

Step 1: Add this to TraitSparkSessionTest.scala

package sukumaar
trait TraitSparkSessionTest {}

Step 2: Replace the contents of TraitSparkSessionTest.scala with this

package sukumaar

import org.apache.spark.sql.SparkSession

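trait TraitSparkSessionTest {

  // One local SparkSession for every test class that mixes in this trait.
  // getOrCreate() reuses an existing session instead of building a new one;
  // local[2] runs Spark inside the test JVM with two worker threads.
  protected val sparkSession: SparkSession = SparkSession
    .builder()
    .appName("sample-spark-scala-project")
    .master("local[2]")
    .getOrCreate()
}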

Step 3: Add this to AppTest.scala

package sukumaar
class AppTest {}

The trick is that every test class containing Spark test cases should use this one sparkSession object, unless a test genuinely needs more than one Spark session.
Because the object is a member of the TraitSparkSessionTest trait, the trait can simply be mixed into any test class that needs it.
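
As a rough illustration of that sharing, here is a sketch of a second test class that mixes in the same trait. AnotherSparkTest is a hypothetical class that is not in my project; it assumes the file sits next to AppTest.scala in the sukumaar package.

```scala
package sukumaar

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical second test class, shown only to illustrate the mixin.
class AnotherSparkTest extends AnyFunSuite with TraitSparkSessionTest {

  test("getOrCreate returns the session the trait already built") {
    // Asking the builder again returns the existing session
    // rather than starting a second one in the same JVM.
    val again = SparkSession.builder().getOrCreate()
    assert(again eq sparkSession)
  }
}
```

Because the builder in the trait ends with getOrCreate(), every test class that mixes it in ends up with the same session as long as the tests run in one JVM.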

Step 4: Replace the contents of AppTest.scala with this

package sukumaar
import org.scalatest.funsuite.AnyFunSuite

class AppTest extends AnyFunSuite
  with TraitSparkSessionTest { // the same trait that holds sparkSession

  // Brings in the implicit conversions that add toDS() and toDF()
  // to RDDs and local collections
  import sparkSession.implicits._

  test("this is sample test") {

    val rdd = sparkSession
      .sparkContext
      .parallelize(Seq("A", "B"))

    val ds = rdd.toDS()

    // cache() only pays off when more than one action runs on the same
    // Dataset/DataFrame (here: show() and count()); with a single action
    // it is not required
    ds.cache()

    // for the sake of debugging only; otherwise avoid show() in tests
    ds.show()

    assert(ds.count() == 2) // assertion
  }
}


Done. Now go and run your tests 😇
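
Once the sample test passes, the same pattern extends to testing your own transformations. The sketch below is an assumption about how that could look, not code from my repository: Transformations.toUpper is a hypothetical function that would normally live under src/main, and the test compares collected results against a small in-memory input.

```scala
package sukumaar

import org.apache.spark.sql.Dataset
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation under test.
object Transformations {
  def toUpper(ds: Dataset[String]): Dataset[String] = {
    // map() needs an Encoder[String]; take it from the Dataset's own session
    import ds.sparkSession.implicits._
    ds.map(_.toUpperCase)
  }
}

class TransformationsTest extends AnyFunSuite with TraitSparkSessionTest {

  import sparkSession.implicits._

  test("toUpper upper-cases every element") {
    val input = Seq("a", "b").toDS()

    // collect() is fine here because the test data is tiny;
    // sorting makes the comparison independent of partition order
    val result = Transformations.toUpper(input).collect().toSeq.sorted

    assert(result == Seq("A", "B"))
  }
}
```

Keeping the transformation in a plain function that takes and returns a Dataset means the test only needs a small local input and the shared sparkSession.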

Full source code link: https://github.com/sukumaar/spark-scala-unit-test-example


Software versions used:
jdk 1.8.0_312 
scala 2.12.15
sbt 1.6.2
os: ubuntu 20.04.4 lts
