Alex Merced

Posted on Nov 22, 2022 • Edited on Jan 8

Configuring Apache Spark for Apache Iceberg

#spark #iceberg #datalake

Get Started with Dremio For Free Today

Apache Iceberg

Apache Iceberg is quickly becoming the industry standard for interfacing with data on data lakes. A lot of the time when people first try out Iceberg they do so using Apache Spark.

Often to start spark up you may run a command like this:

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1 \
    --conf spark.sql.catalog.icebergcatalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.icebergcatalog.type=hadoop \
    --conf spark.sql.catalog.icebergcatalog.warehouse=$PWD/warehouse

You can test out the command above in a docker container with this command:

docker run -it --name spark34 alexmerced/spark34

Then run the following sql statement

CREATE TABLE icebergcatalog.names (name STRING) USING iceberg;

The goal of this article is help demystify this so you can configure Spark for Iceberg in any context.

Spark Configurations

Spark configs can be defined in a few ways:

As flags when starting Spark
Using Imperitive function calls

Flags

There are are several flags you can pass to the spark-shell or spark-sql commands.

--package will identify maven packages you want installed and useable during the Spark session

--packages org.projectnessie:nessie-spark-3.2-extensions:0.43.0,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178

--jar will allow you to pass paths to existing

--jars /home/docker/iceberg-spark-runtime-3.2_2.12-1.0.0.jar,/home/docker/nessie-spark-extensions-3.2_2.12-0.44.0.jar

--conf will allow you configure many properties of your spark session in a "key"="value" format.

    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \

Function Calls

Similarly, you can configure your Spark Session directly from your Scala or Python Spark scripts:

SCALA

val conf: SparkConf = new SparkConf()
  conf.setAppName("yo")
  conf.set("spark.jars.packages", "org.projectnessie:nessie-spark-3.2-extensions:0.43.0,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178")
  conf.set(spark.sql.catalog.spark_catalog","org.apache.iceberg.spark.SparkSessionCatalog")

PYTHON

conf = SparkConf()

conf.set(
    "spark.jars.packages",
    "org.projectnessie:nessie-spark-3.2-extensions:0.43.0,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178",
)

conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")

Packages

Here is a list of the some of the packages you may want to include with your --packages flag and why.

Library	Purpose
org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0	The main iceberg library, make sure to use the right one for the version of spark and iceberg you are using
org.projectnessie:nessie-spark-extensions-3.3_2.12:0.44.0	Library for working with Nessie based catalogs like Dremio Arctic
software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178	Libraries for working with AWS S3

Configurations

Configuration	Purpose	Possible Values
spark.sql.extensions	Configure any extensions of SQL support in Spark. Each extension can be seperated with a comma	"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
spark.sql.catalog.CatalogName	This configures a new catalog under whatever name you want (just change CatalogName to desired name) to a particular implementation of SparkCatalog	org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.CatalogName.io-impl	This changes IO implementations of the target catalog, mainly changed for writing to Object Storage	org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.CatalogName.warehouse	Determines where new tables will be written to for that particular catalog	A file or S3 path
spark.sql.catalog.CatalogName.catalog-impl	The particular type of catalog your using	Nessie - `org.apache.iceberg.nessie.NessieCatalog`, Hive - NA, JDBC - `org.apache.iceberg.jdbc.JdbcCatalog`, AWS Glue - `org.apache.iceberg.aws.glue.GlueCatalog`

Amazon S3 Credentials

--conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY

Environmental Variables

In a lot of the upcoming snippets I'll be using different environmental variables. If you want to use the snippets as is, then define the follow environmental variables:

export TOKEN=XXXXXX
export WAREHOUSE=s3a://.../
export AWS_ACCESS_KEY_ID=XXXXXX
export AWS_SECRET_ACCESS_KEY=XXXXXX
export URI=https://...
export AWS_REGION=us-east-1
export DB_USERNAME=xxxxxx
export DB_PASSWORD=xxxxx

TOKEN: auth token for Nessie catalogs
WAREHOUSE: Where tables will be written
AWS_XXXX= AWS Credentials
URI= The Hive, Nessie or JDBC URI
DB_USERNAME and DB_PASSWORD are for using a JDBC database like mySQL or Postgres

Sample Configurations

Below you'll find sample configurations for many catalog types for reference.

If you want to try some of these in a locally running Docker container with Spark, use this command to spin one up.

docker run -it -p 8080:8080 --name spark-playground alexmerced/spark33playground

NOTE: port 8080 is exposed in case you want run a Jupyter Notebook server in the container

HIVE

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0\
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.hive=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive.type=hadoop \
    --conf spark.sql.catalog.hive.uri=$URI \
    --conf spark.sql.catalog.hive.warehouse=$WAREHOUSE

AWS GLUE

spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178" \
    --conf spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.glue.warehouse=$WAREHOUSE \
    --conf spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
    --conf spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
    --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY \
    --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY

NESSIE/DREMIO ARCTIC

Nessie catalogs provide you git-like semantics with your Iceberg catalogs (Merging, Branching, Rollbacks) and if you want to get hands on to see how this works try this Dremio Arctic tutorial.

spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0,org.projectnessie:nessie-spark-extensions-3.3_2.12:0.44.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178" \
--conf spark.sql.extensions="org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions" \
--conf spark.sql.catalog.nessie.uri=$ARCTIC_URI \
--conf spark.sql.catalog.nessie.ref=main \
--conf spark.sql.catalog.nessie.authentication.type=BEARER \
--conf spark.sql.catalog.nessie.authentication.token=$TOKEN \
--conf spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
--conf spark.sql.catalog.nessie.warehouse=$WAREHOUSE \
--conf spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.nessie.io-impl=org.apache.iceberg.aws.s3.S3FileIO

JDBC

This allows you to use Postgres, MySQL or any JDBC compatible database as a catalog.

Make sure to pass the right username and password for your database.

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.0.0 \
    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.jdbc.warehouse=$WAREHOUSE \
    --conf spark.sql.catalog.jdbc.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
    --conf spark.sql.catalog.jdbc.uri=$URI \
    --conf spark.sql.catalog.jdbc.jdbc.verifyServerCertificate=true \
    --conf spark.sql.catalog.jdbc.jdbc.useSSL=true \
    --conf spark.sql.catalog.jdbc.jdbc.user=$DB_USERNAME \
    --conf spark.sql.catalog.jdbc.jdbc.password=$DB_PASSWORD

Conclusion

Once you've configured your Spark session keep in mind you must use the configured namespace for all queries. For example if I configured the catalog with:

    --conf spark.sql.catalog.mycoolcatalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \

Then creating a table would look like:

CREATE TABLE mycoolcatalog.table1 (name STRING) USING iceberg;

Aside from that consult the Iceberg Documentation or this Apache Iceberg 101 article for more information.

DEV Community

Configuring Apache Spark for Apache Iceberg

Get Started with Dremio For Free Today

Apache Iceberg

Spark Configurations

Flags

Function Calls

Packages

Configurations

Amazon S3 Credentials

Environmental Variables

Sample Configurations

Conclusion

Get Started with Dremio For Free Today

Top comments (0)

Read next

LobeChat uses Namespace for action labels in DevTools configuration

Building a subscription tracker Desktop and iOS app with compose multiplatform — Offline data

Conda + JuiceFS: Enhancing AI Development Environment Sharing

Learning Agentic AI with UC Berkeley’s LLM Agents Course