A batch event-driven pipeline could contain the following steps:
- A file containing new records would be dropped into an S3 landing bucket directory by a source system.
- A Lambda function, using the S3 landing bucket directory as its trigger, would either initialize a new transient EMR cluster with the new file as a Spark step or add the new file as a step to an already-running EMR cluster. The check for an existing cluster could be performed against the EMR cluster name (see the Lambda sketch after this list).
- The EMR Spark job would perform the data transformations and enrichments using third-party data held in an RDS database, and the produced file would be exported to an output bucket directory (a Spark sketch also follows the list).
- An Athena table would point at the S3 output bucket directory so users can perform analytics on the produced data.
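A minimal sketch of such a Lambda handler, assuming a placeholder cluster name (`batch-pipeline-cluster`) and a job script already staged in S3 (`s3://my-artifacts/jobs/transform.py`); the roles, instance type, and release label are also illustrative:

```python
import boto3

emr = boto3.client("emr")

CLUSTER_NAME = "batch-pipeline-cluster"  # hypothetical cluster name

def spark_step(bucket: str, key: str) -> dict:
    # Spark step that processes the newly landed file.
    return {
        "Name": f"process-{key}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "s3://my-artifacts/jobs/transform.py",  # placeholder script
                f"s3://{bucket}/{key}",
            ],
        },
    }

def lambda_handler(event, context):
    # The S3 put event carries the landing bucket and object key.
    s3_record = event["Records"][0]["s3"]
    bucket = s3_record["bucket"]["name"]
    key = s3_record["object"]["key"]

    # Check for an existing cluster by name.
    clusters = emr.list_clusters(ClusterStates=["WAITING", "RUNNING"])["Clusters"]
    existing = next((c for c in clusters if c["Name"] == CLUSTER_NAME), None)

    if existing:
        # Reuse the running cluster: queue the new file as an extra step.
        emr.add_job_flow_steps(JobFlowId=existing["Id"],
                               Steps=[spark_step(bucket, key)])
    else:
        # No cluster yet: launch a transient one with the step attached.
        emr.run_job_flow(
            Name=CLUSTER_NAME,
            ReleaseLabel="emr-6.15.0",
            Applications=[{"Name": "Spark"}],
            Instances={
                "InstanceGroups": [{
                    "Name": "primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                }],
                # Transient: terminate once all steps have finished.
                "KeepJobFlowAliveWhenNoSteps": False,
            },
            Steps=[spark_step(bucket, key)],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
```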
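The Spark job itself could look roughly like the sketch below. The RDS endpoint, database, table, and join key are all placeholders, and a matching JDBC driver jar is assumed to be available on the cluster:

```python
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enrich-records").getOrCreate()

# The Lambda passes the landed file's S3 path as the first step argument.
input_path = sys.argv[1]
records = spark.read.option("header", "true").csv(input_path)

# Third-party reference data read from RDS over JDBC (placeholder endpoint;
# in practice the credentials would come from Secrets Manager).
reference = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-rds-host:5432/refdb")
    .option("dbtable", "public.third_party_data")
    .option("user", "etl_user")
    .option("password", "change-me")
    .load()
)

# Hypothetical enrichment: join the landed records with the reference data.
enriched = records.join(reference, on="record_id", how="left")

# Write to the output directory that the Athena table points at.
enriched.write.mode("append").parquet("s3://my-output-bucket/enriched/")
```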
EMR Configurations
Step Concurrency - should be greater than one only if the data pipeline can handle parallel step executions.
Auto-Termination - if enabled, the cluster is transient. This is suggested for unpredictable and unscheduled loads, as keeping an EMR cluster running indefinitely is not cost-effective. The EMR cluster incurs no costs while bootstrapping. Both settings are shown in the sketch below.
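A brief sketch of how both settings could be applied to a running cluster via boto3; the cluster id is a placeholder. (At launch time, the same values can be passed to run_job_flow as StepConcurrencyLevel and AutoTerminationPolicy.)

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder cluster id

# Allow up to three steps to run in parallel; leave this at the
# default of 1 unless the pipeline tolerates concurrent executions.
emr.modify_cluster(ClusterId=cluster_id, StepConcurrencyLevel=3)

# Terminate the cluster after one hour of idleness, so an unpredictable
# load pattern does not leave it running (and billing) indefinitely.
emr.put_auto_termination_policy(
    ClusterId=cluster_id,
    AutoTerminationPolicy={"IdleTimeout": 3600},  # seconds
)
```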