
Danny Chan

AWS Hong Kong EMR Workshop

EMR deployment options
EC2, EKS, Outposts, Serverless

open source frameworks
Spark, Hive

cost optimization

  • transient clusters, Reserved Instances, Spot Instances, instance fleets (multiple AZs, high availability)
  • AWS Graviton2 instances: cost and performance gains, around 30% lower cost
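
To make the cost options above more concrete, here is a minimal sketch of a transient cluster request that mixes On-Demand and Spot capacity in instance fleets on Graviton2 instance types. It only builds a dictionary and sends nothing to AWS. The field names follow my understanding of the boto3 `run_job_flow` request shape; the cluster name, subnet IDs, capacities, and instance choices are assumptions for illustration, and required fields such as roles, applications, and steps are left out.

```python
def build_cluster_request(subnet_ids):
    """Build a partial request dict for a transient EMR cluster with instance fleets.

    All names and sizes are hypothetical; roles, applications and steps are omitted.
    """
    primary_fleet = {
        "Name": "primary",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [{"InstanceType": "m6g.xlarge"}],
    }
    core_fleet = {
        "Name": "core",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,
        "InstanceTypeConfigs": [{"InstanceType": "m6g.xlarge", "WeightedCapacity": 1}],
    }
    task_fleet = {
        "Name": "task",
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": 8,
        # Several Graviton2 instance types give Spot more capacity pools to draw from.
        "InstanceTypeConfigs": [
            {"InstanceType": "m6g.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "m6g.2xlarge", "WeightedCapacity": 2},
            {"InstanceType": "r6g.xlarge", "WeightedCapacity": 1},
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }
    return {
        "Name": "nightly-etl",  # hypothetical cluster name
        "ReleaseLabel": "emr-6.9.0",
        "Instances": {
            "InstanceFleets": [primary_fleet, core_fleet, task_fleet],
            # Subnets in different AZs let the fleet pick an AZ with capacity.
            "Ec2SubnetIds": list(subnet_ids),
            # Transient cluster: terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }


request = build_cluster_request(["subnet-aaa", "subnet-bbb"])
print(request["Instances"]["InstanceFleets"][2]["TargetSpotCapacity"])
```

The primary and core fleets stay On-Demand in this sketch so that losing Spot capacity only affects task nodes.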

scaling feature
scales up and down based on demand

EMR on EC2 enhancements
reduced cluster start-up time and task node start time, lower cost with Spark shuffle, reduced cost and better performance with EBS gp3 volumes

EMR on EKS
job templates let data engineers simplify jobs with common parameters, Spark SQL runner (run SQL scripts directly through the API), DynamoDB connector
EMR 6.9.0: run time about 20% better than OSS Spark 3.3.0
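
The job template idea can be sketched in plain Python: common parameters live in a template with defaults, and each job only supplies what differs. This mimics the concept only and is not the EMR on EKS job template API. The template fields, parameter names, defaults, and S3 path are all hypothetical.

```python
from string import Template

# Hypothetical template: placeholders use the ${Name} form.
JOB_TEMPLATE = {
    "entryPoint": "${EntryPoint}",
    "sparkSubmitParameters": (
        "--conf spark.executor.instances=${ExecutorCount} "
        "--conf spark.executor.memory=${ExecutorMemory}"
    ),
}

# Defaults shared by every job that uses the template.
DEFAULTS = {"ExecutorCount": "4", "ExecutorMemory": "4G"}


def render_job(template, params):
    """Fill template placeholders with defaults overridden by per-job params.

    Raises KeyError if a placeholder has neither a default nor a supplied value.
    """
    values = {**DEFAULTS, **params}
    return {key: Template(text).substitute(values) for key, text in template.items()}


job = render_job(JOB_TEMPLATE, {"EntryPoint": "s3://example-bucket/jobs/etl.py"})
print(job["sparkSubmitParameters"])
```

A data engineer then submits a job with one parameter instead of repeating the full Spark configuration each time.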

zero rename

  • S3 has no real rename: renaming a file is a copy and replace, so performance is low
  • transactional data lakes, record level
  • atomic changes, read/write isolation, high-throughput ingestion, small-file compaction, row-level upserts and deletes
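
What row-level upserts and deletes with read/write isolation mean can be shown with a toy in-memory table. This is a conceptual sketch only, not how Hudi, Iceberg, or Delta Lake implement it; the record layout and the `op` values are made up for the example.

```python
def apply_changes(snapshot, changes):
    """Return a new table snapshot with record-level upserts and deletes applied.

    `snapshot` maps record key -> row. The input snapshot is never modified,
    so readers that still hold it keep seeing a consistent view.
    """
    updated = dict(snapshot)
    for change in changes:
        key = change["key"]
        if change["op"] == "delete":
            updated.pop(key, None)
        elif change["op"] == "upsert":
            # Insert when the key is new, overwrite when it already exists.
            updated[key] = change["row"]
        else:
            raise ValueError(f"unknown op: {change['op']}")
    return updated


v1 = {"u1": {"name": "Ann", "city": "HK"}, "u2": {"name": "Bob", "city": "SG"}}
v2 = apply_changes(v1, [
    {"op": "upsert", "key": "u2", "row": {"name": "Bob", "city": "TW"}},
    {"op": "upsert", "key": "u3", "row": {"name": "Cat", "city": "JP"}},
    {"op": "delete", "key": "u1"},
])
print(sorted(v2))
```

The new snapshot only becomes visible once it is complete, which is the atomic change property in miniature.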

transactional data lakes

  • ACID, record level, SQL, Spark and Flink support
  • query: PrestoDB, Trino, Flink, Hive support
  • query across partitions and files
  • Hudi, Iceberg: disaster recovery, concurrency with Glue, merge on read (MoR) support, time travel support in Spark SQL and Trino SQL
  • Delta Lake
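
As one example of the table formats above, this sketch builds the Spark write options for an upsert into a Hudi merge-on-read table, plus the read option for a time travel query. The option keys are Hudi configuration names as I know them; the table name, field names, and timestamp are hypothetical, and the Spark calls are left as comments because they need a running cluster.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Return Hudi write options for record-level upserts into a merge-on-read table."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # Key that identifies a record, so an incoming row updates the right one.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two rows share a key, the larger precombine value wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def hudi_time_travel_options(instant):
    """Return Hudi read options to query the table as of a past instant."""
    return {"as.of.instant": instant}


write_opts = hudi_upsert_options("orders", "order_id", "updated_at", "order_date")
read_opts = hudi_time_travel_options("2023-01-01 00:00:00")

# With a Spark session and a DataFrame `df` (not created here), usage would look like:
#   df.write.format("hudi").options(**write_opts).mode("append").save(table_path)
#   spark.read.format("hudi").options(**read_opts).load(table_path)
print(write_opts["hoodie.datasource.write.table.type"])
```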

EMR Serverless
Apache Airflow

Security

  • isolation: private subnet
  • authentication: LDAP
  • encryption
  • audit: using Ranger, AWS Lake Formation

Workflow
Spark driver -> pending executor pods -> Cluster Autoscaler (CA), Auto Scaling group with node groups -> API (node provisioning)

Karpenter

  • replaces CA and Auto Scaling groups with node groups
  • automatically selects the right node type for the processing job, scales out faster, scales in when there are no more jobs
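
The node selection step can be sketched as: add up what the pending pods request, then pick the cheapest node type that fits. This is a simplified illustration of the idea and not Karpenter's actual algorithm. The node catalog, sizes, and prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    cpu_millicores: int
    memory_mib: int
    hourly_price_cents: int  # hypothetical price


# Hypothetical catalog of node sizes.
CATALOG = [
    NodeType("small", 4000, 16384, 20),
    NodeType("medium", 8000, 32768, 40),
    NodeType("large", 16000, 65536, 80),
]


def pick_node_type(pending_pods, catalog=CATALOG):
    """Return the cheapest node type that fits all pending pods, or None.

    None means either nothing is pending (no node needed) or no single
    node type in the catalog is large enough.
    """
    if not pending_pods:
        return None
    need_cpu = sum(pod["cpu_millicores"] for pod in pending_pods)
    need_mem = sum(pod["memory_mib"] for pod in pending_pods)
    fitting = [
        node for node in catalog
        if node.cpu_millicores >= need_cpu and node.memory_mib >= need_mem
    ]
    return min(fitting, key=lambda node: node.hourly_price_cents, default=None)


executors = [{"cpu_millicores": 2000, "memory_mib": 6144} for _ in range(3)]
choice = pick_node_type(executors)
print(choice.name if choice else "no node needed")
```

A real provisioner also accounts for system overhead, Spot availability, and splitting pods across several nodes, all of which this sketch ignores.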

EMR on EKS

  • multiple versions on the same cluster, multi-AZ, jobs start quickly with no provisioning delay
  • master, core (driver) and task (executor) instances map to instances from one autoscaling pool (Spot, to save cost)

auto pod tuning
automatically resizes existing pods from small to bigger, based on real-time CPU and memory utilization, to avoid manually tuning driver and executor resources
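
One way to picture utilization-based tuning: take recent usage samples, pick a high percentile, and add headroom. This is a rough sketch of the general idea, not the algorithm EMR on EKS uses. The percentile, headroom, and sample values are assumptions.

```python
def recommend_request(samples, percentile=95, headroom_percent=20):
    """Recommend a resource request from observed usage samples.

    Uses the nearest-rank percentile of the samples plus a headroom margin.
    Integer arithmetic only, so results are exact (units: millicores or MiB).
    """
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: ceil(percentile / 100 * n), at least 1.
    rank = max(1, -(-percentile * len(ordered) // 100))
    observed = ordered[rank - 1]
    # Ceiling of observed * (100 + headroom) / 100.
    return -(-observed * (100 + headroom_percent) // 100)


cpu_samples_millicores = [1000, 1200, 1500, 1800, 2000]
print(recommend_request(cpu_samples_millicores))
```

Running this on driver and executor samples separately gives per-role requests without hand-tuning each job.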

Managed Apache Flink
streaming analytics

challenge
scaling data workloads to 1000 nodes, network configuration

modernize data platform on EKS
infrastructure as code, performance benchmark reports, data workloads: Spark, Kafka, Ray

Amazon EKS

  • observability: Prometheus, Fluent Bit, OTel
  • delivery: Argo CD, Flux, Crossplane
  • reliability: Karpenter, Cluster Autoscaler, KEDA
  • security: Cilium, Gatekeeper

Data on EKS adoption
cluster management, add-on management, team management, workload management

virtual cluster
maps to a k8s namespace (one namespace per project)
job run API: used to submit a Spark application (Spark JAR)
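
A job run against a virtual cluster can be sketched as a request dictionary. The field names follow my understanding of the EMR on EKS `StartJobRun` request as exposed by the boto3 `emr-containers` client. The job name, IDs, role ARN, JAR path, and class name are hypothetical, and nothing is sent to AWS here.

```python
def build_job_run(virtual_cluster_id, execution_role_arn, jar_uri, main_class):
    """Build a job run request that submits a Spark JAR to a virtual cluster."""
    return {
        "name": "sample-spark-job",  # hypothetical job name
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": "emr-6.9.0-latest",
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": jar_uri,
                "entryPointArguments": [],
                "sparkSubmitParameters": (
                    f"--class {main_class} --conf spark.executor.instances=2"
                ),
            }
        },
    }


job_request = build_job_run(
    virtual_cluster_id="vc-example-id",
    execution_role_arn="arn:aws:iam::111122223333:role/example-job-role",
    jar_uri="s3://example-bucket/jars/app.jar",
    main_class="com.example.Main",
)

# With AWS credentials configured, submission would look like:
#   boto3.client("emr-containers").start_job_run(**job_request)
print(job_request["jobDriver"]["sparkSubmitJobDriver"]["sparkSubmitParameters"])
```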

app: Hive

  • jobs: jobs within an app
  • workers: drivers and executors for a job

Workflow
application scheduler -> pending pods -> Karpenter -> EC2
nodes can have different sizes

Benefit
cost optimization through consolidation: 1 big node can be cheaper than 3 small nodes
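
A small worked example shows how one bigger node can end up cheaper: when pods pack poorly onto small nodes, each small node carries only one pod and the rest of its capacity is wasted. All sizes, the per-node system overhead, and the prices are made-up numbers for the arithmetic, not real EC2 figures.

```python
def nodes_needed(pod_count, pod_cpu_m, node_cpu_m, overhead_m):
    """Number of identical nodes needed to run `pod_count` identical pods."""
    usable = node_cpu_m - overhead_m  # capacity left after system daemons
    pods_per_node = usable // pod_cpu_m
    if pods_per_node == 0:
        raise ValueError("pod does not fit on this node size")
    return -(-pod_count // pods_per_node)  # ceiling division


def hourly_cost_cents(pod_count, pod_cpu_m, node_cpu_m, overhead_m, price_cents):
    """Hourly cost of the nodes needed for the pods, in cents."""
    return nodes_needed(pod_count, pod_cpu_m, node_cpu_m, overhead_m) * price_cents


# 3 executor pods of 2500 millicores each, 500 millicores reserved per node.
small = hourly_cost_cents(3, 2500, node_cpu_m=4000, overhead_m=500, price_cents=20)
big = hourly_cost_cents(3, 2500, node_cpu_m=8000, overhead_m=500, price_cents=40)
print(small, big)
```

With these numbers the three small nodes cost 60 cents per hour and the single big node costs 40, because each small node fits only one pod while the big node fits all three.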

Workshop link
https://catalog.workshops.aws

Workshop

[Three images, no descriptions provided]
