How to run pyspark with additional Spark packages

Jakub T (digitaldisorder) ・2 min read

When trying to run the tests of my PySpark jobs with delta.io I hit the following problem:

 Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
E                       at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
E                       at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
E                       at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
E                       at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
E                       at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
E                       at scala.util.Try$.apply(Try.scala:192)
E                       at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
E                       at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
E                       at scala.util.Try.orElse(Try.scala:84)
E                       at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
E                       ... 13 more

This happens because the delta.io packages are not included in a default Spark installation.

When writing Spark applications in Scala, you would typically add the dependencies to your build file, or pass them with the --packages or --jars command-line arguments when launching the app.

To make PySpark download and load the Delta packages, we can set the PYSPARK_SUBMIT_ARGS environment variable:

export PYSPARK_SUBMIT_ARGS='--packages io.delta:delta-core_2.11:0.5.0 pyspark-shell'
pytest

Then you can run the tests as before:

pytest tests/delta_job.py

============================================================================================================================= test session starts ==============================================================================================================================
platform darwin -- Python 3.7.5, pytest-5.2.0, py-1.8.0, pluggy-0.13.0
rootdir: /Users/kuba/work/delta-jobs/
plugins: requests-mock-1.7.0, mock-1.13.0, flaky-3.6.1, cov-2.8.1
collected 4 items

tests/delta_job.py ....                                                                                                                                                                                                                              [100%]

============================================================================================================================== 4 passed in 26.32s ==============================================================================================================================

Everything now works as expected.
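
If exporting the variable in every shell is inconvenient, it can also be set from Python, as long as that happens before PySpark starts its JVM (that is, before the first SparkSession or SparkContext is created). The sketch below is only an illustration: the helper name set_submit_args and the idea of calling it from a conftest.py are my assumptions, not part of the original setup.

```python
import os

# Maven coordinates copied from the export command used earlier.
DELTA_PACKAGE = "io.delta:delta-core_2.11:0.5.0"


def set_submit_args(packages):
    """Hypothetical helper: set PYSPARK_SUBMIT_ARGS for the given packages.

    --packages takes a comma-separated list of Maven coordinates.
    "pyspark-shell" is kept as the last token, as in the export command.
    """
    value = "--packages {} pyspark-shell".format(",".join(packages))
    os.environ["PYSPARK_SUBMIT_ARGS"] = value
    return value


# Call this at import time (for example at the top of conftest.py),
# before any code creates a SparkSession.
set_submit_args([DELTA_PACKAGE])
```

With this in place, a plain pytest run should pick up the same arguments as the exported variable.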

Of course, you can pass any available spark-submit command-line argument this way. More about them here: https://spark.apache.org/docs/latest/submitting-applications.html
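
As a rough sketch of combining several options, the snippet below builds the value of PYSPARK_SUBMIT_ARGS from a list of (flag, value) pairs. The helper build_submit_args is hypothetical, and --driver-memory is only used as an example of a second spark-submit option. Each token is shell-quoted on the assumption that PySpark splits the variable using shell-like rules.

```python
import shlex


def build_submit_args(options):
    """Hypothetical helper: join spark-submit options into one string.

    options is a list of (flag, value) pairs; "pyspark-shell" is
    appended as the final token.
    """
    tokens = []
    for flag, value in options:
        tokens.extend([flag, value])
    tokens.append("pyspark-shell")
    # Quote each token so values containing spaces survive splitting.
    return " ".join(shlex.quote(token) for token in tokens)


args = build_submit_args([
    ("--packages", "io.delta:delta-core_2.11:0.5.0"),
    ("--driver-memory", "2g"),
])
print(args)
```

The resulting string can be exported as PYSPARK_SUBMIT_ARGS in the same way as the single --packages example.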
