DEV Community

Discussion on: How do you merge millions of small files in a S3 bucket to larger single files to a separate bucket daily?

peterb • Edited on

Redshift Spectrum does an excellent job of this: you can read from S3 and write back to S3 (Parquet, etc.) in a single command, streaming the data through.

e.g. take lots of JSONL event files and produce some ~1 GB Parquet files (the SerDe and output-format classes below are the usual choices for JSON data):
create external table mytable (....)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://bucket/folderforjson/path/year/month/day ...'

unload ('select columns from mytable where ...')
to 's3://bucket/folderforparquet/year/month/day...'
iam_role 'arn:aws:iam::123456789:role/prod....-role'
format parquet
partition by (year, month, day)
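Since the question asks for a daily run, the UNLOAD statement needs the current date substituted into both the source filter and the target prefix. A minimal Python sketch of that templating step (table name, bucket paths, and role ARN are placeholders, not from the original comment; the resulting string would then be submitted to the cluster, e.g. via the Redshift Data API):

```python
from datetime import date

def build_unload_sql(run_date: date) -> str:
    # Hypothetical table, bucket, and role names — substitute your own.
    # Numeric partition comparisons avoid having to escape single quotes
    # inside UNLOAD's quoted SELECT text.
    y, m, d = run_date.year, run_date.month, run_date.day
    inner = (
        f"select * from mytable "
        f"where year = {y} and month = {m} and day = {d}"
    )
    return (
        f"unload ('{inner}')\n"
        f"to 's3://bucket/folderforparquet/{y}/{m:02d}/{d:02d}/'\n"
        "iam_role 'arn:aws:iam::123456789012:role/my-spectrum-role'\n"
        "format parquet"
    )

print(build_unload_sql(date(2021, 3, 7)))
```

Running this daily (cron, EventBridge, Airflow, etc.) gives one dated Parquet prefix per day in the target bucket.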

You can buy Redshift by the hour, and Redshift Spectrum costs $5 per TB of data scanned.