
On Server Administration In Data Engineering

Nathan Epstein ・8 min read


Cloud computing is almost always a good idea, serverless computing is sometimes a good idea, and you probably shouldn't be managing your own machines on premises.

Intro Notes

It should come as no surprise that data analysis pipelines require compute resources for the various steps they include. Downloading data requires computation, as does reading and transforming data, as does building models for prediction. As the engineers responsible for building such pipelines, we need to make informed decisions about the infrastructure we use to execute the computations associated with deploying predictive models. We have a wide range of options. These include - but are certainly not limited to - running executables on local machines, running individual cloud servers, managing clusters of cloud machines, and delegating computation to anonymous cloud machines. It is possible to identify contexts in which any of these approaches is an appropriate choice, and it is a valuable exercise to examine their associated tradeoffs. Through this examination, we can build a deeper understanding of how to evaluate infrastructure choices in our own data systems.

The Base Case: Local Computing

The first and simplest option is to run compute on a local machine. The strengths and weaknesses here are reasonably clear. A single local machine is easy to administer but is likely to run into limitations quickly. Running compute on your local machine is certainly the fastest and easiest way to get started. The environment can be heavily customized and processes can be run on demand without the overhead of SSH or other remote communication methods. But the advantages mostly end there. Local compute comes with operational fragility and is inherently unscalable, so almost any production use case will hit bottlenecks which require more complex server options.

Cloud Computing

The next option is to run compute on a single cloud machine. This has many of the same advantages as a single local machine. It is similarly straightforward to administer and allows for simple centralization of process and resource management. On top of this, managed cloud computing services afford additional benefits which are essential for many production use cases.

The foremost of these benefits is resource availability. Use of a third-party cloud provider allows us to delegate responsibility for ensuring that compute resources are provided without disruption. In the case of a self-administered local machine, we are responsible for resolving any issues (software failures, hardware failures, power outages, etc.) which might cause our infrastructure to become unavailable. This is undesirable in that it requires us to devote attention to concerns outside our core competency and objectives - the construction of data pipelines. With a managed cloud, we sidestep this issue. If a machine goes down, a new one is provided. Our infrastructure concerns are limited to the setup of the relevant software environment.

Another related concern is disaster recovery. On a local machine, we painstakingly construct our software environment to match our computing needs. The various packages, programming languages, and libraries are installed. Versions are selected in order to be compatible with each other and with our application needs. Application code is written and arranged according to a deliberate file structure. This machine setup is a meaningful amount of work which, without appropriate tooling, can be quite painful to replicate. So if our locally administered machine is made permanently unavailable - whether through a software failure, physical damage to the machine, or physical depreciation over time - recovery can be an expensive affair. Can we ameliorate this issue with appropriate tooling? Of course. But there isn't really a compelling reason to do so. If we're making use of a managed cloud provider, then any machine replacement will be abstracted away. Physical resources will be replaced by the cloud provider without requiring any attention or thought on our end.

Additionally, third party cloud providers will typically have telemetry offerings which are quite useful from an operational perspective. This can include monitoring of network IO, CPU usage, and status checks. Being able to monitor these things is valuable for identifying patterns of resource usage and, in turn, determining the necessary machine resources for compute tasks. It's certainly possible to implement this telemetry ourselves - either through custom implementations or the use of open source software - but this is, again, disadvantageous. To the extent that we can delegate responsibility for concerns which are not related to the core objective of building data pipelines, we are generally well served by doing so.

A common use of this telemetry is resource scaling. We may view our metrics and determine that the compute resources we have are not well matched to the needs of the application. We may have a larger machine than is required and would be just as happy with a less expensive resource. Or perhaps we have identified resource bottlenecks and need to scale up. Making these adjustments is a non-trivial undertaking when managing servers ourselves. Either we need to purchase a new machine or make physical alterations. Both of these require technical expertise which is far removed from the central problem of constructing data analysis pipelines. But with a cloud provider, the transition is as simple as selecting the preferred resource. The physical migration which occurs is abstracted from us.
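As a rough illustration of how utilization metrics might feed a sizing decision, the sketch below suggests a vCPU count from a series of CPU readings. The function name `recommend_vcpus`, the 50% target utilization, and the sample readings are all assumptions made for this example; they are not taken from any provider's tooling.

```python
import math
from statistics import quantiles


def recommend_vcpus(cpu_samples, current_vcpus, target_utilization=0.5):
    """Suggest a vCPU count that keeps 95th-percentile load near the target.

    cpu_samples: CPU utilization readings as fractions between 0.0 and 1.0
    (hypothetical data; real readings would come from provider telemetry).
    """
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = quantiles(cpu_samples, n=20)[-1]
    # Number of vCPUs that are effectively busy at the 95th percentile.
    busy_vcpus = p95 * current_vcpus
    return max(1, math.ceil(busy_vcpus / target_utilization))


# An 8-vCPU machine idling at 20% versus a 4-vCPU machine pinned at 90%.
oversized = recommend_vcpus([0.2] * 100, current_vcpus=8)
undersized = recommend_vcpus([0.9] * 100, current_vcpus=4)
print(oversized, undersized)
```

With these made-up readings the sketch suggests 4 vCPUs for the first machine and 8 for the second, which matches the two cases described above: one machine can be scaled down, the other needs to be scaled up.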

Managed cloud providers also offer resource standardization. This means that if we do decide to make a scale adjustment, which entails an alteration of the underlying physical infrastructure (either in the form of a modification or a new machine), we don't have to worry about our software functioning differently. Virtualization is handled by the cloud provider, which affords us the capacity to move our application across different machines without worrying about our environment. Of course, we can use virtualization on a local machine and impose a shared environment on future machines, but this is additional responsibility we'd prefer to delegate.

Horizontal Scaling

As our compute needs increase, we will likely need to scale horizontally rather than vertically. That is, we may need additional servers rather than larger ones. This is intuitive both because there are limits to the size of a single machine and because costs tend to scale in a super-linear fashion. Each incremental increase in machine size comes with an increasingly higher cost. This leads to the result that it is more cost effective to distribute compute across many small machines than a few large ones.
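The cost argument can be made concrete with a toy model. The sketch below assumes that the price of a machine grows as a power of its size with an exponent greater than one; this power-law curve, the exponent of 1.5, and the unit price are illustrative assumptions and do not describe any real provider's pricing.

```python
def fleet_cost(total_units, machine_count, unit_price=1.0, exponent=1.5):
    """Cost of splitting total_units of compute evenly across machine_count machines.

    Assumes a hypothetical super-linear price curve:
    price(machine) = unit_price * size ** exponent, with exponent > 1.
    """
    per_machine_units = total_units / machine_count
    return machine_count * unit_price * per_machine_units ** exponent


one_large = fleet_cost(total_units=64, machine_count=1)
many_small = fleet_cost(total_units=64, machine_count=16)
print(one_large, many_small)
```

Under this assumed curve, sixteen small machines deliver the same total capacity as one large machine at a fraction of the price. The model ignores the coordination overhead of running a cluster.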

This capacity to scale comes with a complexity cost. Distributed computation requires coordination of resources across the various machines. The form that this communication takes will be a function of the compute being done. There are many tools for managing machine groups which warrant their own detailed treatment. Applications involving the composition of several jobs distributed over a cluster may call for orchestration tools like Kubernetes. Distributing analysis of large data sets across many machines can be done with frameworks such as Hadoop and Spark. In many cases, coordination of machines can be handled manually via API calls or other forms of inter-process communication. Whatever the tooling used to manage the complexity of distributed compute, its advantage over single-machine computing is the capacity for arbitrary horizontal scaling.

Of course, we have the option of whether to achieve horizontal scale via local or cloud machines. In the case of local machines, this means procuring the necessary quantity of servers, physically maintaining them, and configuring the appropriate software to coordinate computing among them. The tradeoffs associated with this approach roughly mirror those of running compute on a single local machine. There are potential benefits in the way of customizability, information security, and cost. Conversely, horizontal scaling using a managed cloud provider affords the benefits of flexibility, comparative ease of management, reliability, and pre-built tooling.

Using managed cloud resources also leads to an important organizational benefit. Because these offerings have a broad user base, there is a comparatively large potential labor supply. That is, there are more hirable individuals with the expertise to manage common cloud infrastructure than there are with the expertise to manage niche deployments.

As data pipelines become more complex and resource intensive, the need for horizontal scaling typically follows. Certain organizations, particularly very large ones, may have specific needs which warrant the maintenance of physical computing infrastructure. However, many organizations find that the use of a virtual private cloud is the appropriate means of achieving the horizontal scale required by their pipelines.

Serverless Computing

Another computation framework which has emerged more recently is serverless computing. Of course, there are actually servers which handle compute, but their administration is abstracted from the end user. In the serverless compute model, application code is executed by a cloud service provider using physical machine resources that they provision and administer. The client of the serverless compute is only responsible for specifying the executable and associated metadata (e.g. timing, function inputs, etc.).

As a comparatively nascent space, the options within serverless computing are evolving rapidly. In addition to serverless compute, commercial offerings exist for serverless databases in which the scaling and management of the database is abstracted from the user by the cloud provider. It seems reasonable to expect that both the variety and quality of such offerings will continue to grow quickly.

The primary advantage of the serverless framework is the ease of administration. Because this work is abstracted from the client, the need for both effort and expertise on this front is removed. This allows users to focus on the particulars of their application logic and not need to think about the infrastructure which is responsible for the execution.

An additional advantage is cost. Depending on the usage pattern, serverless compute is often cheaper than having dedicated machines. For systems in which compute is intermittent and there are long periods of machine resource underutilization, serverless compute is likely to be a cost effective solution. Existing serverless compute offerings charge for the compute time used so if dedicated machines sit idle, they will have a high cost relative to their on-demand counterparts.

Another, related, benefit of serverless compute is the elasticity of resources. Machines are requisitioned by the cloud provider to accommodate the application at runtime, so effectively arbitrary changes in scale are possible. If the system has no work to complete, then no physical resources are claimed or paid for. As work is demanded by the system, the appropriate amount of compute resources is acquired for the duration of the tasks.

There are important tradeoffs to consider when transitioning to a serverless architecture. While the benefits of serverless are significant, it is not the correct choice for all computing contexts.

First, there are systems for which serverless computing would be meaningfully more expensive. We highlighted that alternation between bursts of compute and periods of idleness is a usage pattern which is handled in a cost effective manner by serverless compute. The inverse is also true. If resource usage is consistently high, then a dedicated machine is likely a cheaper option; perhaps significantly so.
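A back-of-the-envelope comparison can show where the crossover sits. The sketch below assumes a serverless offering billed per GB-second of memory and a dedicated machine billed at a flat monthly rate; the prices, the memory size, and the function names are hypothetical placeholders, not quotes from any provider.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def serverless_monthly_cost(busy_seconds, memory_gb, price_per_gb_second):
    """Pay only for the seconds in which code is actually running."""
    return busy_seconds * memory_gb * price_per_gb_second


def break_even_utilization(dedicated_monthly_cost, memory_gb, price_per_gb_second):
    """Fraction of the month the workload must be busy before dedicated is cheaper."""
    full_month = serverless_monthly_cost(SECONDS_PER_MONTH, memory_gb, price_per_gb_second)
    return dedicated_monthly_cost / full_month


# Hypothetical prices: a 4 GB function at 0.00002 per GB-second
# versus a comparable dedicated machine at 50 per month.
threshold = break_even_utilization(50.0, memory_gb=4, price_per_gb_second=0.00002)
print(f"dedicated wins above {threshold:.0%} utilization")
```

With these placeholder numbers the crossover falls at roughly a quarter of the month. A pipeline that is busy less often than that favors serverless, while one that runs near-continuously favors the dedicated machine. Real pricing usually includes per-request fees and free tiers, which this sketch leaves out.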

There are also performance costs to serverless compute. Serverless computing is an on demand model which means that utilized resources need to be acquired at runtime. This also applies to the loading of dependencies. Rather than being a one time process on a dedicated machine, this will be a recurring process for each run of the application code. This spin up process comes with a latency cost.
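The spin-up cost can be illustrated with a simulated function handler. In the sketch below, `load_model` stands in for importing heavy libraries or reading a model from storage, and the 50 ms delay is an arbitrary placeholder. The module-level cache reflects the assumption that a provider may keep a runtime alive between closely spaced invocations; whether and for how long that happens varies by vendor.

```python
import time

_model = None  # kept only while the (simulated) runtime stays alive


def load_model():
    """Stand-in for dependency loading; the delay is an arbitrary placeholder."""
    time.sleep(0.05)
    return {"weights": [0.1, 0.2]}


def handler(event):
    """Hypothetical function entry point: score one record, report timing."""
    global _model
    started = time.perf_counter()
    cold = _model is None
    if cold:
        # First invocation in a fresh runtime pays the loading cost.
        _model = load_model()
    prediction = sum(w * x for w, x in zip(_model["weights"], event["features"]))
    elapsed = time.perf_counter() - started
    return {"prediction": prediction, "cold_start": cold, "seconds": elapsed}


first = handler({"features": [1, 2]})
second = handler({"features": [1, 2]})
print(first["cold_start"], second["cold_start"])
```

The first call pays for loading and the second does not. If every run of a pipeline step lands on a fresh runtime, every run looks like the first call.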

Another drawback to serverless is the comparative inability to customize the machine on which application code is run. Managed compute services generally provide a particular environment in which your dependencies must be built. While this may not be a major concern for many applications, it may complicate the deployment of applications which have intricate and particular dependencies. The serverless deployment of Docker images, which would serve to ameliorate this issue, can involve additional complexity and is not universally supported by major cloud providers. The prevalence of templated runtimes over fully customizable alternatives presents an additional roadblock for the deployment of applications using less common programming languages.

An additional concern is telemetry. A primary feature of serverless computing is that the user experience of server administration is hands off. While this is typically a benefit, there are circumstances in which detailed monitoring of the executing machine - beyond just process logs - is desirable but not available.

The last major concern is vendor lock-in. Serverless computing is provided by a managed cloud provider according to vendor-specific interfaces. This means that building systems around a serverless architecture entails committing to a particular vendor and accepting that there will be costs associated with changing providers.

Concluding Notes

Management of compute resources is an essential component of building data pipelines. While there are no universal rules of server administration, it is still important to understand the essential tradeoffs in order to make informed infrastructure decisions. Hopefully, the above is a useful starting point in highlighting the competing concerns at play within your own data pipelines.

