Danny Chan

Posted on May 1, 2023

AWS Hong Kong EMR Workshop

EMR deployment
EC2, EKS, Outposts, Serverless

Open-source frameworks
Spark, Hive

Cost optimization

transient clusters, Reserved Instances, Spot Instances, instance fleets (multiple AZs, high availability)
AWS Graviton2 instances: better cost and performance, around 30% lower cost
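To see how these levers combine, here is a minimal back-of-the-envelope sketch. The hourly price, node count, and running hours are hypothetical placeholders, not AWS prices; only the "around 30% lower" figure comes from the notes above.

```python
def cluster_cost(nodes: int, hourly_price: float, hours: float, discount: float = 0.0) -> float:
    """Cost of running `nodes` instances for `hours`, with a fractional discount applied."""
    return nodes * hourly_price * hours * (1.0 - discount)


# Hypothetical placeholders, not real AWS prices.
HOURLY_PRICE = 0.20  # assumed On-Demand price per node-hour
NODES = 10

# Long-running cluster: up 24 hours a day for 30 days.
always_on = cluster_cost(NODES, HOURLY_PRICE, hours=24 * 30)
# Transient cluster: only up 4 hours a day while jobs run.
transient = cluster_cost(NODES, HOURLY_PRICE, hours=4 * 30)
# Same transient cluster with the ~30% lower cost noted for Graviton2.
transient_graviton = cluster_cost(NODES, HOURLY_PRICE, hours=4 * 30, discount=0.30)

print(f"always-on:              {always_on:.2f}")
print(f"transient:              {transient:.2f}")
print(f"transient + 30% lower:  {transient_graviton:.2f}")
```

Spot or Reserved Instance discounts could be plugged into the same `discount` parameter, using whatever rate applies to your account.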

Scaling feature
scale up and down based on demand.
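One way to express demand-based scaling for EMR on EC2 is a managed scaling policy. The sketch below only builds the request body; I am assuming the "scaling feature" from the workshop maps to EMR managed scaling, and the cluster ID and limits are made-up values.

```python
def managed_scaling_request(cluster_id: str, min_units: int, max_units: int) -> dict:
    """Build the arguments for the EMR PutManagedScalingPolicy call."""
    if not 0 < min_units <= max_units:
        raise ValueError("need 0 < min_units <= max_units")
    return {
        "ClusterId": cluster_id,
        "ManagedScalingPolicy": {
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": min_units,
                "MaximumCapacityUnits": max_units,
            }
        },
    }


# "j-EXAMPLE" is a placeholder cluster ID, 2 and 20 are illustrative limits.
request = managed_scaling_request("j-EXAMPLE", min_units=2, max_units=20)
print(request)

# With boto3 installed and AWS credentials configured, this could be sent with:
#   boto3.client("emr").put_managed_scaling_policy(**request)
```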

EC2 enhancements
reduced cluster start-up time and task node start time, lower cost with Spark shuffle, reduced cost and better performance with EBS gp3 volumes

EMR on EKS
job templates let data engineers simplify jobs with common params, Spark SQL runner: run a SQL script directly through the API, DynamoDB connector
EMR 6.9.0: about 20% better run time than OSS Spark 3.3.0
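The "common params" idea can be sketched with a plain Python helper that fills in shared settings and leaves only the script path per job. This is not the managed job template feature itself, just the same idea applied to the EMR on EKS StartJobRun request shape. The virtual cluster ID, role ARN, and S3 paths are placeholders.

```python
# Shared settings every job reuses (all values are placeholders).
COMMON = {
    "virtualClusterId": "vc-EXAMPLE",
    "executionRoleArn": "arn:aws:iam::111122223333:role/example-job-role",
    "releaseLabel": "emr-6.9.0-latest",
}


def spark_submit_job(name: str, entry_point: str, spark_params: str = "") -> dict:
    """Request for a regular Spark application (script or JAR)."""
    driver = {"entryPoint": entry_point, "sparkSubmitParameters": spark_params}
    return {**COMMON, "name": name, "jobDriver": {"sparkSubmitJobDriver": driver}}


def spark_sql_job(name: str, sql_script: str) -> dict:
    """Request that runs a SQL script directly, without a wrapper application."""
    driver = {"entryPoint": sql_script}
    return {**COMMON, "name": name, "jobDriver": {"sparkSqlJobDriver": driver}}


etl = spark_submit_job("daily-etl", "s3://example-bucket/jobs/etl.py",
                       "--conf spark.executor.instances=4")
report = spark_sql_job("daily-report", "s3://example-bucket/sql/report.sql")
print(etl["jobDriver"])
print(report["jobDriver"])

# With boto3 installed and credentials configured, either dict could be sent with:
#   boto3.client("emr-containers").start_job_run(**etl)
```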

Zero rename

S3 has no native rename: renaming a file is a copy and replace, which performs poorly
transactional data lakes work at record level
atomic changes, read/write isolation, high-throughput ingestion, small-file compaction, row-level upserts and deletes
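To make the rename cost concrete, here is a toy in-memory object store (a hypothetical class, not an S3 client). It shows that "renaming" a prefix means copying every byte and then deleting the source, which is why committers that avoid renames help.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store that has no rename operation."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}
        self.bytes_copied = 0

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def rename_prefix(self, src: str, dst: str) -> None:
        """Emulate a directory rename as copy + delete for every object."""
        for key in [k for k in self.objects if k.startswith(src)]:
            data = self.objects[key]
            self.objects[dst + key[len(src):]] = data  # copy to the new key
            self.bytes_copied += len(data)             # work grows with data size
            del self.objects[key]                      # delete the old key


store = ToyObjectStore()
for i in range(3):
    store.put(f"tmp/part-{i}", b"x" * 1000)

store.rename_prefix("tmp/", "final/")
print(sorted(store.objects))  # ['final/part-0', 'final/part-1', 'final/part-2']
print(store.bytes_copied)     # 3000
```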

Transactional data lakes

ACID, record level, SQL, Spark and Flink support
query: PrestoDB, Trino, Flink, Hive support
query across partitions and files
Hudi, Iceberg: disaster recovery, concurrency with Glue, merge on read (MoR) support, time travel support in Spark SQL and Trino SQL
Delta Lake
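As an illustration of a record-level upsert into a merge-on-read table, the sketch below builds Apache Hudi write options. The table and column names are hypothetical, and the block only constructs the options; it does not need Spark to run.

```python
def hudi_upsert_options(table: str, record_key: str, precombine: str, partition: str) -> dict:
    """Hudi DataSource write options for a record-level upsert into a MoR table."""
    return {
        "hoodie.table.name": table,
        # Column that uniquely identifies a record (drives the upsert).
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    }


# "orders", "order_id", "updated_at" and "order_date" are made-up names.
options = hudi_upsert_options("orders", "order_id", "updated_at", "order_date")
for key, value in options.items():
    print(f"{key} = {value}")

# With PySpark and the Hudi bundle available, a DataFrame `df` could be written with:
#   df.write.format("hudi").options(**options).mode("append").save("s3://<bucket>/<path>")
```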

EMR Serverless
Apache Airflow

Security

isolation: private subnets
authentication: LDAP
encryption
audit: using Apache Ranger, AWS Lake Formation

Workflow
Spark driver -> pending executor pods -> Cluster Autoscaler (CA), with an Auto Scaling group behind each node group -> API (node provisioning)

Karpenter

replaces CA and the Auto Scaling group per node group
automatically selects the right node type for the job, scales out faster, scales in when there are no more jobs
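The node selection idea can be sketched as: add up what the pending pods request, then pick the cheapest node type that fits. This is a toy model only; real Karpenter also weighs scheduling constraints, multiple nodes, and Spot availability. The catalog names, sizes, and prices are invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class NodeType:
    name: str
    cpu: float
    memory_gib: float
    hourly_price: float


def pick_node(pending_pods: Sequence[Tuple[float, float]],
              catalog: Sequence[NodeType]) -> Optional[NodeType]:
    """Return the cheapest single node that fits all pending (cpu, memory_gib) requests."""
    need_cpu = sum(cpu for cpu, _ in pending_pods)
    need_mem = sum(mem for _, mem in pending_pods)
    fitting = [n for n in catalog if n.cpu >= need_cpu and n.memory_gib >= need_mem]
    return min(fitting, key=lambda n: n.hourly_price) if fitting else None


# Hypothetical catalog, not real instance types or prices.
CATALOG = [
    NodeType("small", cpu=4, memory_gib=16, hourly_price=0.20),
    NodeType("medium", cpu=8, memory_gib=32, hourly_price=0.40),
    NodeType("large", cpu=16, memory_gib=64, hourly_price=0.80),
]

# Three pending executor pods as (cpu, memory_gib) requests: 5 CPU and 20 GiB in total.
pending = [(2, 8), (2, 8), (1, 4)]
choice = pick_node(pending, CATALOG)
print(choice.name if choice else "no single node fits")  # medium
```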

EMR on EKS

multiple versions on the same cluster, multi-AZ, jobs start quickly with no provisioning delay
master, core (driver) and task (executor) instances map onto auto-scaled instances (Spot, to save cost)

Auto pod tuning
automatically resizes existing pods from small to bigger, based on real-time CPU and memory utilization, which avoids manually tuning driver and executor resources
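A rough way to think about utilization-based sizing: take the observed peak and add headroom. This is my own simplified heuristic, not the algorithm the service uses, and the samples and the 20% headroom are invented.

```python
from typing import Sequence


def recommend(samples: Sequence[int], headroom_percent: int = 20) -> int:
    """Recommend a resource request: observed peak plus headroom, rounded up."""
    if not samples:
        raise ValueError("need at least one utilization sample")
    peak = max(samples)
    # Integer ceiling division avoids floating-point rounding surprises.
    return (peak * (100 + headroom_percent) + 99) // 100


# Hypothetical utilization samples for one executor pod.
cpu_millicores = [400, 650, 900]
memory_mib = [1500, 2100, 1800]

print("cpu request (m):", recommend(cpu_millicores))   # 1080
print("memory request (MiB):", recommend(memory_mib))  # 2520
```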

Managed Apache Flink
streaming analytics

Challenge
scaling data workloads to 1000 nodes, network configuration

Modernize data platform on EKS
infrastructure as code, performance benchmark reports, data workloads: Spark, Kafka, Ray

Amazon EKS

observability: Prometheus, Fluent Bit, OTel
delivery: Argo CD, Flux, Crossplane
reliability: Karpenter, Cluster Autoscaler, KEDA
security: Cilium, Gatekeeper

Data on EKS adoption
cluster management, add-on management, team management, workload management

Virtual cluster
maps to a k8s namespace (one namespace per project)
job run API: used to submit a Spark application, e.g. a Spark JAR
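The namespace-per-project mapping can be sketched by building one CreateVirtualCluster request per project. The EKS cluster name and project names below are placeholders, and the block only builds the request bodies.

```python
def virtual_cluster_request(name: str, eks_cluster: str, namespace: str) -> dict:
    """Arguments for EMR on EKS CreateVirtualCluster, bound to one k8s namespace."""
    return {
        "name": name,
        "containerProvider": {
            "id": eks_cluster,  # name of the EKS cluster
            "type": "EKS",
            "info": {"eksInfo": {"namespace": namespace}},
        },
    }


# Hypothetical projects, each with its own namespace on a shared EKS cluster.
projects = ["marketing", "finance"]
requests = [
    virtual_cluster_request(f"{p}-vc", eks_cluster="example-eks", namespace=p)
    for p in projects
]
for r in requests:
    print(r["name"], "->", r["containerProvider"]["info"]["eksInfo"]["namespace"])

# With boto3 installed and credentials configured, each could be sent with:
#   boto3.client("emr-containers").create_virtual_cluster(**r)
```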

app: e.g. Hive

jobs: jobs run within an app
workers: drivers and executors for a job

Workflow
application scheduler -> pending pods -> Karpenter -> EC2
nodes can be of different sizes

Benefit
cost optimization through consolidation: 1 big node is cheaper than 3 small nodes
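One possible reason consolidation pays off is that every node reserves some capacity for system daemons, so fewer, larger nodes leave more room for workloads. The sketch below illustrates that with invented numbers: sizes, prices, and the 0.5 vCPU overhead are all assumptions.

```python
def cost_per_usable_cpu(nodes: int, cpu_per_node: float,
                        price_per_node: float, overhead_cpu: float) -> float:
    """Hourly cost divided by the vCPUs left after per-node system overhead."""
    usable = nodes * (cpu_per_node - overhead_cpu)
    return (nodes * price_per_node) / usable


# Same total vCPUs and same total hourly price, split differently (hypothetical values).
three_small = cost_per_usable_cpu(nodes=3, cpu_per_node=4, price_per_node=0.20, overhead_cpu=0.5)
one_big = cost_per_usable_cpu(nodes=1, cpu_per_node=12, price_per_node=0.60, overhead_cpu=0.5)

print(f"3 small nodes: {three_small:.4f} per usable vCPU-hour")
print(f"1 big node:    {one_big:.4f} per usable vCPU-hour")
```

With these assumptions the single big node leaves 11.5 usable vCPUs against 10.5 for the three small ones, so each usable vCPU costs less.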

Workshop link
https://catalog.workshops.aws
