Configuring Apache Spark for Apache Iceberg

Apache Iceberg

Apache Iceberg is quickly becoming the industry standard for interfacing with data on data lakes. When people first try out Iceberg, they often do so using Apache Spark.

Often, to start Spark you may run a command like this:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
    --conf spark.sql.catalog.icebergcatalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.icebergcatalog.type=hadoop \
    --conf spark.sql.catalog.icebergcatalog.warehouse=$PWD/warehouse

You can test the command above in a Docker container with this command:

docker run -it --name spark34 alexmerced/spark34

Then run the following SQL statement:

CREATE TABLE icebergcatalog.names (name STRING) USING iceberg;
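
To confirm the catalog is wired up, you can insert and read back a row against the same table created above:

INSERT INTO icebergcatalog.names VALUES ('Alex');
SELECT * FROM icebergcatalog.names;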

The goal of this article is to help demystify this process so you can configure Spark for Iceberg in any context.

Spark Configurations

Spark configs can be defined in a few ways:

  • As flags when starting Spark
  • Using imperative function calls

Flags

There are several flags you can pass to the spark-shell or spark-sql commands.

  • --packages identifies the Maven packages you want installed and usable during the Spark session, for example:
--packages org.projectnessie:nessie-spark-3.2-extensions:0.43.0,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178
  • --jars lets you pass paths to existing jar files on the machine, for example:
--jars /home/docker/iceberg-spark-runtime-3.2_2.12-1.0.0.jar,/home/docker/nessie-spark-extensions-3.2_2.12-0.44.0.jar
  • --conf lets you configure many properties of your Spark session in a "key"="value" format; a combined invocation using these flags follows this list. For example:
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
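
Putting these flags together, a full launch command might look like the sketch below (the versions mirror the earlier examples and are illustrative; match them to your Spark and Iceberg versions):

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive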

Function Calls

Similarly, you can configure your Spark Session directly from your Scala or Python Spark scripts:

SCALA

import org.apache.spark.SparkConf

val conf: SparkConf = new SparkConf()
conf.setAppName("yo")
conf.set("spark.jars.packages", "org.projectnessie:nessie-spark-3.2-extensions:0.43.0,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178")
conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")


PYTHON

from pyspark import SparkConf

conf = SparkConf()

conf.set(
    "spark.jars.packages",
    "org.projectnessie:nessie-spark-3.2-extensions:0.43.0,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178",
)

conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
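
To actually launch a session from this configuration, pass the SparkConf to the SparkSession builder. Here is a minimal end-to-end sketch (the catalog name and warehouse path are illustrative):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Pull in the Iceberg runtime for Spark 3.3 / Scala 2.12
conf.set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0")
# Register a catalog named "icebergcatalog" backed by a local Hadoop warehouse
conf.set("spark.sql.catalog.icebergcatalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.icebergcatalog.type", "hadoop")
conf.set("spark.sql.catalog.icebergcatalog.warehouse", "/tmp/warehouse")  # illustrative path

spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS icebergcatalog.names (name STRING) USING iceberg")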

Packages

Here is a list of some of the packages you may want to include with your --packages flag and why.

Library | Purpose
org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0 | The main Iceberg library; make sure to use the one matching the versions of Spark and Iceberg you are using
org.projectnessie:nessie-spark-extensions-3.3_2.12:0.44.0 | Library for working with Nessie-based catalogs like Dremio Arctic
software.amazon.awssdk:bundle:2.17.178, software.amazon.awssdk:url-connection-client:2.17.178 | Libraries for working with AWS S3

Configurations

Configuration | Purpose | Possible Values
spark.sql.extensions | Configures any extensions of SQL support in Spark; separate multiple extensions with commas | "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
spark.sql.catalog.CatalogName | Configures a new catalog under whatever name you want (just change CatalogName to the desired name) pointing to a particular implementation of SparkCatalog | org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.CatalogName.io-impl | Changes the IO implementation of the target catalog, mainly for writing to object storage | org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.CatalogName.warehouse | Determines where new tables will be written for that particular catalog | A file or S3 path
spark.sql.catalog.CatalogName.catalog-impl | The particular type of catalog you're using | Nessie: org.apache.iceberg.nessie.NessieCatalog, Hive: N/A, JDBC: org.apache.iceberg.jdbc.JdbcCatalog, AWS Glue: org.apache.iceberg.aws.glue.GlueCatalog

Amazon S3 Credentials

If your warehouse is on Amazon S3, you can pass your AWS credentials to the Hadoop S3A filesystem like so (when using S3FileIO, the AWS SDK instead picks up credentials from its default chain, including the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables):

--conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
--conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY

Environmental Variables

In a lot of the upcoming snippets I'll be using different environment variables. If you want to use the snippets as-is, then define the following environment variables:

export TOKEN=XXXXXX
export WAREHOUSE=s3a://.../
export AWS_ACCESS_KEY_ID=XXXXXX
export AWS_SECRET_ACCESS_KEY=XXXXXX
export URI=https://...
export AWS_REGION=us-east-1
export DB_USERNAME=xxxxxx
export DB_PASSWORD=xxxxx
  • TOKEN: auth token for Nessie catalogs
  • WAREHOUSE: where tables will be written
  • AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY: AWS credentials
  • URI: the Hive, Nessie, or JDBC URI
  • DB_USERNAME and DB_PASSWORD: credentials for a JDBC database like MySQL or Postgres

Sample Configurations

Below you'll find sample configurations for many catalog types for reference.

If you want to try some of these in a locally running Docker container with Spark, use this command to spin one up.

docker run -it -p 8080:8080 --name spark-playground alexmerced/spark33playground

NOTE: port 8080 is exposed in case you want to run a Jupyter Notebook server in the container.

HIVE

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.hive=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive.type=hive \
    --conf spark.sql.catalog.hive.uri=$URI \
    --conf spark.sql.catalog.hive.warehouse=$WAREHOUSE
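
Once connected, tables in this catalog are addressed through the hive prefix; for example (the namespace is illustrative and assumes a database named default exists in your metastore):

CREATE TABLE hive.default.names (name STRING) USING iceberg;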

AWS GLUE

spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178" \
    --conf spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.glue.warehouse=$WAREHOUSE \
    --conf spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
    --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
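
With the Glue catalog configured, Glue databases act as the catalog's namespaces; for example (the database name is illustrative and must already exist in Glue, or be created first with CREATE NAMESPACE):

CREATE TABLE glue.mydb.names (name STRING) USING iceberg;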

NESSIE/DREMIO ARCTIC

Nessie catalogs provide git-like semantics (merging, branching, rollbacks) for your Iceberg tables, and if you want to get hands-on to see how this works, try this Dremio Arctic tutorial.

spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0,org.projectnessie:nessie-spark-extensions-3.3_2.12:0.44.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178" \
--conf spark.sql.extensions="org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions" \
--conf spark.sql.catalog.nessie.uri=$URI \
--conf spark.sql.catalog.nessie.ref=main \
--conf spark.sql.catalog.nessie.authentication.type=BEARER \
--conf spark.sql.catalog.nessie.authentication.token=$TOKEN \
--conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
--conf spark.sql.catalog.nessie.warehouse=$WAREHOUSE \
--conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.nessie.io-impl=org.apache.iceberg.aws.s3.S3FileIO
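
Once connected, the Nessie SQL extensions add branch commands to Spark SQL. A short sketch (the branch and namespace names are illustrative, and exact syntax can vary across Nessie versions):

CREATE BRANCH IF NOT EXISTS etl IN nessie;
USE REFERENCE etl IN nessie;
CREATE TABLE nessie.db.names (name STRING) USING iceberg;
MERGE BRANCH etl INTO main IN nessie;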

JDBC

This allows you to use Postgres, MySQL or any JDBC compatible database as a catalog.

Make sure to pass the right username and password for your database.

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0 \
    --conf spark.sql.catalog.jdbc=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.jdbc.warehouse=$WAREHOUSE \
    --conf spark.sql.catalog.jdbc.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
    --conf spark.sql.catalog.jdbc.uri=$URI \
    --conf spark.sql.catalog.jdbc.jdbc.verifyServerCertificate=true \
    --conf spark.sql.catalog.jdbc.jdbc.useSSL=true \
    --conf spark.sql.catalog.jdbc.jdbc.user=$DB_USERNAME \
    --conf spark.sql.catalog.jdbc.jdbc.password=$DB_PASSWORD
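
Note that the JDBC catalog also needs your database's JDBC driver on the classpath, which the sample above does not include. For Postgres, for instance, you might add the driver to --packages like so, followed by the same --conf flags shown above (the driver coordinates are illustrative; match them to your database and version):

spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0,org.postgresql:postgresql:42.5.0"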

Conclusion

Once you've configured your Spark session, keep in mind you must prefix tables with the configured catalog name in all queries. For example, if I configured the catalog with:

    --conf spark.sql.catalog.mycoolcatalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \

Then creating a table would look like:

CREATE TABLE mycoolcatalog.table1 (name STRING) USING iceberg;

Aside from that, consult the Iceberg documentation or this Apache Iceberg 101 article for more information.

