What is Apache Zeppelin?
First of all, it is worth asking: what is a notebook interface? A notebook is an interface for interactively running code; it lets you explore and visualize data. You can mix narrative, rich media, and data in a single place.
Now we can proceed with the definition of Apache Zeppelin. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at regular intervals.
It's easy to mix languages in the same notebook. You can write some code and then use markdown to document it all together. You can also easily convert your notebook into a presentation style, perhaps for presenting to management or for use as a dashboard.
What Does Serverless Mean?
The idea behind serverless is that you as a developer shouldn't need to care about the server infrastructure. You pay to run the code without concerns about what type of physical infrastructure is running below.
There are quite a few advantages to serverless. Scalability essentially comes for free. Because you're just paying to run logic, the cloud provider can easily dedicate more hardware to run your code. Also, you pay by code execution rather than having a fixed rate. Even more, the cloud provider manages the server software and hardware. You shouldn't need to worry about that. Finally, serverless frees up developers to focus on what they're good at - coding.
Solution Requirements
Build a serverless infrastructure to run Apache Zeppelin and persist notebook files. The solution must be publicly available and provide login and logout capability. Also, the compute platform must automatically shut down after 30 minutes of inactivity.
High-level Architecture
The diagram below shows the high-level architecture. As you can see, it is a serverless infrastructure: you operate Apache Zeppelin through a public endpoint, while Amazon Elastic File System stores the notebook files. An Amazon CloudWatch custom metric counts the log lines, and the AWS Fargate container is shut down after 30 minutes of inactivity.
The only missing feature in this architecture is the login and logout capability. In this case, Apache Zeppelin provides Shiro for notebook authentication. Apache Shiro is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management. Here, you can find a step-by-step guide about how Shiro works. This example uses the default configuration.
Infrastructure as Code Description
The solution uses AWS SAM. The template defines a global configuration for the Lambda functions and the public API you use to access Apache Zeppelin. The stack deployment provides the URL as an output value.
Amazon API Gateway
Amazon API Gateway is used as the front door to interact with the application; it exposes the URL the user can use to trigger operations and use Serverless Apache Zeppelin.
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: !Ref ServiceName
Outputs:
  ZeppelinApi:
    Description: "API Gateway endpoint URL for Serverless Apache Zeppelin"
    Value: !Sub "https://${ZeppelinApi}.execute-api.${AWS::Region}.amazonaws.com/${ServiceName}/"
Elastic File System
When provisioned, each Amazon ECS task hosted on AWS Fargate receives ephemeral storage for bind mounts; everything on the disk is lost after container termination. To persist notebook files, the solution uses Amazon Elastic File System; all notebooks on EFS are preserved after container termination. The Access Point configuration allows Apache Zeppelin to have write permissions on Amazon Elastic File System.
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    Type: 'AWS::EFS::AccessPoint'
    Properties:
      FileSystemId: !Ref FileSystem
      PosixUser:
        Uid: "500"
        Gid: "500"
        SecondaryGids:
          - "2000"
      RootDirectory:
        CreationInfo:
          OwnerGid: "500"
          OwnerUid: "500"
          Permissions: "0777"
        Path: !Sub "/${ServiceName}"
  FileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      PerformanceMode: generalPurpose
      FileSystemTags:
        - Key: ServiceName
          Value: !Ref ServiceName
  MountTarget1:
    [Availability Zone A Configuration]
  MountTarget2:
    [Availability Zone B Configuration]
  MountTarget3:
    [Availability Zone C Configuration]
Amazon CloudWatch Custom Metric
To provide an auto-shutdown feature, the Serverless Apache Zeppelin solution uses a custom metric. AWS Fargate saves logs into an Amazon CloudWatch log group, and an Amazon CloudWatch metric filter counts the log lines that contain INFO. If the count stays at or below the alarm threshold for six consecutive 5-minute periods (about 30 minutes), the alarm publishes a message to Amazon Simple Notification Service to terminate the task.
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ShutdownSnsTopic:
    [description later in this post]
  ZeppelinLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/ecs/fargate-${ServiceName}"
      RetentionInDays: 1
  ActivityMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref ZeppelinLogGroup
      FilterPattern: "INFO"
      MetricTransformations:
        - MetricValue: "1"
          MetricNamespace: !Sub "${ServiceName}/Actions"
          MetricName: "ActionsCount"
  ZeppelinActionsCountAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ZeppelinActionsCountAlarm
      MetricName: ActionsCount
      Namespace: !Sub "${ServiceName}/Actions"
      Statistic: SampleCount
      Period: '300'
      EvaluationPeriods: '6'
      TreatMissingData: breaching
      Threshold: '1'
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
        - !Ref ShutdownSnsTopic
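To make the alarm settings above more concrete, here is a small Python sketch of the condition they describe. It is a simplification for reasoning about the configuration, not how CloudWatch evaluates alarms internally. The function name `should_shut_down` and the list-of-counts input are assumptions made for illustration: each entry is the number of matching log lines in one 5-minute period, and `None` stands for a period with no data, which the alarm treats as breaching.

```python
from typing import Optional, Sequence

PERIOD_SECONDS = 300      # Period in the alarm definition
EVALUATION_PERIODS = 6    # EvaluationPeriods in the alarm definition
THRESHOLD = 1             # Threshold in the alarm definition


def should_shut_down(counts: Sequence[Optional[int]]) -> bool:
    """Return True when the last six periods are all at or below the threshold.

    A missing data point (None) counts as breaching, mirroring
    TreatMissingData: breaching. This is a hypothetical helper, not an AWS API.
    """
    if len(counts) < EVALUATION_PERIODS:
        return False  # not enough history to cover the evaluation window
    recent = counts[-EVALUATION_PERIODS:]
    return all(c is None or c <= THRESHOLD for c in recent)


if __name__ == "__main__":
    idle = [None] * 6                            # no log lines at all
    active = [None, None, 12, None, None, None]  # one busy period in the window
    print(should_shut_down(idle))
    print(should_shut_down(active))
    print(PERIOD_SECONDS * EVALUATION_PERIODS // 60, "minutes")
```

Six periods of 300 seconds give the 30-minute inactivity window from the requirements.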
AWS Fargate
Here is the AWS Fargate cluster and task definition. The Serverless Apache Zeppelin solution uses Shiro to enable login and logout capability. As stated here, you can create a shiro.ini file by executing the cp command. You can find it in the EntryPoint property of the container definition.
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ZeppelinLogGroup:
    [...]
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Join ['', [!Ref ServiceName, Cluster]]
  ZeppelinTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - "FARGATE"
      Cpu: !Ref ContainerCPU
      Memory: !Ref MemoryHardLimit
      NetworkMode: "awsvpc"
      TaskRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ExecutionRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Image: "apache/zeppelin:0.10.0"
          EntryPoint:
            - /bin/bash
            - -c
            - |
              cp conf/shiro.ini.template conf/shiro.ini
              /usr/bin/tini -- bin/zeppelin.sh
          Command: ["echo", "done!"]
          MemoryReservation: !Ref MemorySoftLimit
          Memory: !Ref MemoryHardLimit
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              Protocol: tcp
            - ContainerPort: 4040
              Protocol: tcp
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref ZeppelinLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Sub 'ecs-${ServiceName}-awsvpc'
          MountPoints:
            - ContainerPath: !Ref ZeppelinPersistNotebookPath
              SourceVolume: !Sub "${ServiceName}"
              ReadOnly: false
      Volumes:
        - Name: !Sub "${ServiceName}"
          EFSVolumeConfiguration:
            AuthorizationConfig:
              IAM: ENABLED
              AccessPointId: !Ref AccessPoint
            FilesystemId: !Ref FileSystem
            TransitEncryption: ENABLED
AWS Lambda | Workflow
Below is the high-level workflow: how the implementation works, and how the task is created and shut down.
Start Serverless Apache Zeppelin
When a request arrives, AWS Lambda first checks whether the Apache Zeppelin container is running.
If it is, AWS Lambda returns an HTTP 302 redirect to the Apache Zeppelin public IP. If it is not, AWS Lambda moves on to the next step and checks whether an Apache Zeppelin container already exists but is not yet running.
If one exists, AWS Lambda returns static web content: a loading page that refreshes automatically every 20 seconds. If none exists, AWS Lambda starts a new Apache Zeppelin container and returns the same loading page. On each refresh, the client receives the notebook interface if the container is running; otherwise, it receives the loading page again. Once you have the notebook interface, you must provide your user credentials to use Apache Zeppelin.
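The start-up decision described above can be sketched in Python as a pure function. This is a minimal illustration, not the post's actual Lambda code: the function name `decide`, the `LOADING_PAGE` markup, and the default port 8080 are assumptions, and the lookup of the task status and public IP (for example through the ECS API) is left to the caller. The return value uses the API Gateway Lambda proxy response shape (`statusCode`, `headers`, `body`).

```python
from typing import Optional, Tuple

# Hypothetical loading page; the meta tag refreshes the browser every 20 seconds.
LOADING_PAGE = """<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="refresh" content="20">
    <title>Starting Apache Zeppelin</title>
  </head>
  <body><p>Apache Zeppelin is starting. This page refreshes every 20 seconds.</p></body>
</html>"""


def decide(task_status: Optional[str],
           public_ip: Optional[str],
           port: int = 8080) -> Tuple[dict, bool]:
    """Return (proxy_response, start_new_task) for the current container state.

    task_status is None when no task exists, "RUNNING" when Zeppelin is up,
    and any other value (for example "PENDING") while the task is starting.
    """
    if task_status == "RUNNING" and public_ip:
        # Container is up: redirect the browser to the Zeppelin endpoint.
        redirect = {
            "statusCode": 302,
            "headers": {"Location": f"http://{public_ip}:{port}/"},
            "body": "",
        }
        return redirect, False

    # Container is missing or still starting: serve the loading page.
    loading = {
        "statusCode": 200,
        "headers": {"Content-Type": "text/html"},
        "body": LOADING_PAGE,
    }
    # Only request a new task when none exists yet.
    return loading, task_status is None


if __name__ == "__main__":
    response, start = decide(None, None)
    print(response["statusCode"], start)
```

Keeping the decision separate from the AWS calls makes the three branches easy to check without deploying anything.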
Shut Down Serverless Apache Zeppelin
If the custom metric shows no activity for about 30 minutes, the alarm publishes a message to Amazon Simple Notification Service, and an AWS Lambda function terminates the task. The Amazon SNS topic is the trigger for the AWS Lambda function.
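As a rough sketch of the shutdown side, the Python function below reads an SNS-triggered Lambda event and stops the task only when the CloudWatch alarm reports the `ALARM` state. The function name `handle_shutdown` and the injected `stop_task` callable are assumptions for illustration; in a real function that callable would wrap the ECS call that stops the running task. The event fields used here (`Records`, `Sns`, `Message`, `NewStateValue`) follow the usual shape of a CloudWatch alarm notification delivered through SNS.

```python
import json
from typing import Callable


def handle_shutdown(event: dict, stop_task: Callable[[], None]) -> bool:
    """Stop the Zeppelin task if any SNS record carries an alarm in ALARM state.

    stop_task is a hypothetical callback supplied by the caller; it is invoked
    at most once. Returns True when a stop was requested.
    """
    for record in event.get("Records", []):
        # The alarm details arrive as a JSON string inside the SNS message.
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            stop_task()
            return True
    return False


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"Sns": {"Message": json.dumps({
                "AlarmName": "ZeppelinActionsCountAlarm",
                "NewStateValue": "ALARM",
            })}}
        ]
    }
    print(handle_shutdown(sample_event, lambda: print("stopping task")))
```

Checking the alarm state guards against stopping the task on notifications that report a return to `OK`, should the topic ever receive them.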
Usage Suggestions & Improvements
Apache Zeppelin supports Amazon S3 for persisting notebook files. As stated here, you can use ZEPPELIN_NOTEBOOK_STORAGE, ZEPPELIN_NOTEBOOK_S3_BUCKET, and ZEPPELIN_NOTEBOOK_S3_USER as environment variables.
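For illustration, the helper below builds those three variables in the `Name`/`Value` shape used by the `Environment` property of a container definition. It is only a sketch: the function name `s3_notebook_environment` and the sample bucket and user values are hypothetical, and the storage class `org.apache.zeppelin.notebook.repo.S3NotebookRepo` is assumed from the Zeppelin storage documentation, so check it against the version you deploy.

```python
from typing import Dict, List


def s3_notebook_environment(bucket: str, user: str) -> List[Dict[str, str]]:
    """Return container environment entries that switch Zeppelin to S3 storage.

    The bucket and user values are supplied by the caller; nothing here
    contacts AWS or validates that the bucket exists.
    """
    return [
        {
            "Name": "ZEPPELIN_NOTEBOOK_STORAGE",
            # Assumed class name for the S3 notebook repository.
            "Value": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
        },
        {"Name": "ZEPPELIN_NOTEBOOK_S3_BUCKET", "Value": bucket},
        {"Name": "ZEPPELIN_NOTEBOOK_S3_USER", "Value": user},
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for entry in s3_notebook_environment("my-notebook-bucket", "zeppelin"):
        print(entry["Name"], "=", entry["Value"])
```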
On the other hand, Amazon Elastic File System offers a very generic solution that can be used for various purposes; the only limit is your imagination. Since Amazon EFS is a file system, you don't have to deal with Amazon S3 object storage. In this case, you can simply package your application in a Docker container and run it on AWS Fargate in place of Apache Zeppelin.
For example, you can run Serverless Visual Studio Code; check the container here.
Another improvement related to Serverless Apache Zeppelin on AWS is configuring Amazon DynamoDB as an external database for Shiro users.
What will be your next application to deploy as Serverless?