Let’s look into the infrastructure that we need to run our Spark applications. Spark supports several cluster managers. These include:
- Standalone mode.
- Hadoop YARN.
- Apache Mesos.
These cluster managers maintain sets of machines on which we can deploy Spark applications. Each of them takes a different approach to managing machines, so we need to keep this in mind.
Deploying our Cluster
We have two options for where we should deploy our clusters. We can either do this on-premise or in the public cloud.
Organizations that have their own datacentres may find deploying Spark clusters on-premise a viable option. This gives you full control over the hardware used for your clusters, but those clusters will be fixed in size.
If our data analytics demands are elastic, a fixed-size cluster makes our analytics workloads hard to run. If our clusters are too small, they won’t be able to handle computationally large jobs. If they are too large, resources will sit idle. You’ll also have to select and manage your own storage system, including implementing georeplication, disaster recovery and so on.
If you do opt for the on-premise solution, you’ll need a cluster manager that allows you to run many Spark applications and dynamically reassign resources between them. YARN and Mesos have better support for this and can also run non-Spark workloads.
Public Cloud solution
Deploying Spark in the cloud is increasingly popular and often a more viable option than deploying it on-premise.
We have far more flexibility when deploying big data workloads. We can launch and shut down clusters as we need to, so if we have a massive job that requires hundreds of machines, we only pay for them while we use them. We can even control the size of those machines to further optimize costs.
The cloud also allows us to decouple storage systems from our clusters. For example, we can deploy Azure Blob Storage and just spin up clusters as and when we need to for our Spark workloads. We can run Spark natively against cloud storage systems without having to manage everything on-premise.
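As a sketch of what this looks like in practice, here is a hypothetical submission of a job that reads directly from Azure Blob Storage. The account name, container, storage key and script name are all placeholders, and the `fs.azure.account.key` property comes from the hadoop-azure connector:

```
# Hypothetical example: running a Spark job directly against Azure Blob Storage.
# myaccount, mycontainer, <storage-key> and my_job.py are placeholders.
spark-submit \
  --conf spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net=<storage-key> \
  my_job.py wasbs://mycontainer@myaccount.blob.core.windows.net/input/
```

The cluster that runs this job can be destroyed as soon as it finishes; the data stays in Blob Storage.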
Using Cluster Managers
Let’s go over the three cluster managers supported by Spark.
Standalone mode

This lightweight cluster manager was built specifically for Spark. We can run multiple Spark applications on the same cluster, and it scales to a large number of machines. However, standalone mode clusters can only run Spark workloads.
If you’re just starting out with Spark and have no prior experience using Mesos or YARN, this is the best place to start.
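A minimal standalone setup can be sketched as follows. The host name is a placeholder, and the script names are those shipped in recent Spark releases under sbin/:

```
# Hypothetical sketch: a minimal standalone cluster.
./sbin/start-master.sh                            # starts a master at spark://<host>:7077
./sbin/start-worker.sh spark://master-host:7077   # registers a worker with that master
spark-submit --master spark://master-host:7077 my_app.py   # submit an application to it
```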
Hadoop YARN

YARN is a framework for cluster resource management and job scheduling. Spark has native support for the Hadoop YARN cluster manager. We can run jobs on YARN by specifying yarn as the master when we submit Spark jobs from the command line.
We need to bear in mind a couple of YARN configurations and how they affect our Spark applications. Instead of using the master node IP when defining the --master parameter, we use yarn. Spark will find the YARN config files using the HADOOP_CONF_DIR environment variable.
If you’re using HDFS (Hadoop Distributed File System) as your data store, you’ll need to include 2 config files on Spark’s classpath:
- hdfs-site.xml: provides default behaviors for the HDFS client
- core-site.xml: sets the default file system name
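Putting these pieces together, a YARN submission might look like the sketch below. The config directory and application name are placeholders:

```
# Hypothetical example: submitting an application to YARN.
# Spark reads the YARN (and HDFS) config files from this directory.
export HADOOP_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
```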
Apache Mesos

Mesos was actually started by several members of the original Spark team. Here’s a quote by one of them describing Mesos:
“Apache Mesos abstracts CPU, memory, storage and other compute resources away from machines, enabling fault-tolerant and elastic distributed systems to easily be built and run effectively”
Essentially, Mesos is a datacentre-scale cluster manager that manages both short- and long-lived applications. You would choose this cluster manager only if your business already uses Mesos.
Submitting Spark applications to Apache Mesos is similar to other cluster managers. When using Mesos, you should opt for cluster mode over client mode, as client mode needs more configuration, especially when it comes to distributing resources around the cluster.
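As a sketch, a cluster-mode submission goes through the MesosClusterDispatcher; the dispatcher address and application URL below are placeholders. Note that in cluster mode the application file must be at a location the cluster can reach, which is why a URL is used here:

```
# Hypothetical example: cluster-mode submission via the MesosClusterDispatcher.
spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  http://repo.example.com/my_app.py
```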
Securing our Spark Applications

Spark provides low-level capabilities for running our Spark applications more securely. These operate outside Spark itself and are network-based: authentication, network encryption, and TLS and SSL configuration.
We can also tune some network configurations. This is great for custom deployment configurations of our Spark clusters, particularly when we need to use proxies between certain nodes.
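A few of these settings can be sketched as spark-defaults.conf entries. The property names come from Spark’s security and configuration documentation; the values and port numbers below are illustrative placeholders:

```
# Hypothetical spark-defaults.conf entries; values are placeholders.
spark.authenticate            true    # shared-secret authentication between Spark processes
spark.network.crypto.enabled  true    # AES-based encryption for RPC traffic
spark.ssl.enabled             true    # TLS for the web UIs and file server
spark.driver.port             7078    # pinning ports helps when traffic must pass a proxy or firewall
spark.blockManager.port       7079
```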
Scheduling our Spark Applications
Each Spark application runs an independent set of executor processes, and cluster managers provide the facilities for scheduling across Spark applications. Within each Spark application, multiple jobs can run concurrently if they were submitted by different threads, and Spark has a scheduler to allocate resources within each application.
If there are multiple users that need to share the cluster and run different Spark applications, there are different ways to manage resource allocation, depending on our cluster manager.
The easiest method is to perform static partitioning of resources: each application is given the maximum amount of resources it can use and holds onto them for its entire duration. Alternatively, dynamic allocation can scale applications up or down based on the number of pending tasks. If you want users to share memory and executor resources, you can launch a single Spark application and use thread scheduling within it to serve multiple requests in parallel.
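Static partitioning can be sketched with two configuration properties. These property names apply to standalone and coarse-grained Mesos deployments, and the values below are illustrative placeholders:

```
# Hypothetical spark-defaults.conf entries for static partitioning
# (standalone / Mesos coarse-grained mode); values are placeholders.
spark.cores.max        8    # cap on the total cores the application may claim
spark.executor.memory  4g   # memory held per executor for the application's lifetime
```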
If we want to run multiple applications on the same cluster, Spark has the ability to dynamically adjust resources our application uses based on its workload.
This is disabled by default. To enable it, you need to set spark.dynamicAllocation.enabled to true, set up an external shuffle service on each worker node in the cluster, and set spark.shuffle.service.enabled to true.
The external shuffle service allows executors to be removed without deleting the shuffle files written by them.
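The settings above can be sketched as spark-defaults.conf entries. The two enabled flags are the ones named in the text; the min/max executor bounds are optional extras with placeholder values:

```
# Hypothetical spark-defaults.conf entries enabling dynamic allocation.
spark.dynamicAllocation.enabled       true
spark.shuffle.service.enabled         true   # requires the external shuffle service on each worker
spark.dynamicAllocation.minExecutors  1      # placeholder lower bound
spark.dynamicAllocation.maxExecutors  20     # placeholder upper bound
```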
Other things to consider
There are other things you need to think about when selecting a cluster manager and how you set it up.
For example, YARN is great when you’re developing applications that use HDFS, but not much else. Compute and storage are coupled together, so you need to scale both instead of just one.
Mesos helps with this a little, but it still requires a significant investment, which only pays off if it will run more than just Spark applications. Standalone mode is the lightest-weight cluster manager, but you’ll have to build for yourself much of the infrastructure you would otherwise get from YARN or Mesos.
Another consideration that you have to take into account is managing different Spark versions. You’ll need to spend time managing different setups if you are running various Spark workloads with different Spark versions.
Configuring a basic monitoring solution is also critical to help users debug Spark jobs that run on our clusters. This will be the focus of my next blog post.
In this post, I discussed various options for deploying our Spark applications along with their merits and faults. This knowledge may be treated as a reference point for you rather than being something that you should definitely know. Spark configuration can be a pretty deep subject, so if you’re keen to find out more, I’d recommend referring to the Spark documentation.
That’s all from me for now. In my next post, I’ll discuss methods that we can use to debug and monitor our Spark applications.
As always, if you have questions let me know in the comments and I’ll do my best to answer.
See you later!