How do you merge millions of small files in a S3 bucket to larger single files to a separate bucket daily?

#aws #database #devops #datascience

We have a situation where we're currently uploading events to s3 in real-time. The result is roughly 30 million tiny json files (<1kb) per day. These files sit in a raw layer bucket with the following folder format "#{folder_name}/#{year}/#{month}/#{day}/#{hour}/#{minute}/#{System.os_time()}-#{file_name}.#{file_ext}". We want to send this to a data warehouse for analytics but need the files to be much larger (150-200mb). What solutions are there for merging json files from a s3 bucket back to a separate s3 bucket. I have tried developing a lambda to tackle this problem but it was not enough since all the files must be downloaded in /tmp and lambda ran out of memory. Please help :)

Latest comments (5)

Kishan B • Mar 17 '22

whitfin.io/quickly-concatenating-f...

This tool does the job of concatenating within S3 without requiring a download.. It uses the multipart api of S3 to achieve the same.

peterb • Sep 17 '21 • Edited

Redshift Spectrum does an excellent job of this, you can read from S3 and write back to S3 (parquet etc) in one command as a stream

e.g. take lots of jsonl event files and make some 1 GB parquet files
First
create external table mytable (....)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://bucket/folderforjson/path/yesr/month/day ...'

Then
upload ('select columns from mytable where ...')
to 's3://bucket/folderforparquet/year/month/day...'
iam_role 'arn:aws:iam::123456789:role/prod....-role'
format parquet
partition by (year, month, day)
include
cleanpath

You can buy Redshift by the hour, and Redshift Spectrum is $5 per TB

Tal Rofe • Aug 27 '21

Why do you need the entire data in your lambda? You should use streaming & multi-part uploading to S3

just a martian • Aug 28 '21

1 min of files exceeds 500mb breaking tmp

Thomas Werner • Aug 27 '21

I think it mostly depends what the expected target output format is. If you simply need to concatenate all events as a list (JSON array), then it could be probably done by opening an output stream for a file in the target S3 bucket, and writing each JSON file to it one after the other. Memory consumption should be constant, given that all input JSON files are the same size. You only need to make sure that the list of event file paths / handles is not loaded into a collection all at once, so you don't run out of memory.

But it sounds like you need to apply more complicated merge logic? What's an example for an event file and what's the expected result format?

DEV Community

How do you merge millions of small files in a S3 bucket to larger single files to a separate bucket daily?

Latest comments (5)

Read next

A Closer Look at AUROC and AUPRC under Class Imbalance

Twenty Constructionist Things to Do with Artificial Intelligence and Machine Learning

Kubernetes: Monolithic vs Microservice

Collection of Hands-on Exercises to Get Started with Apache Iceberg