Flattening a directory structure on AWS S3

#infrastructure #aws #s3 #bash

Originally published in Streaver's blog.

The problem

The other day I had to flatten a small tree of files on AWS S3, something in the order of hundreds of files. After some trial and error, I ended up with a pretty straightforward series of piped commands that got the job done.

The problem was simple: we wanted to migrate from a structure like s3://<your-bucket>/posts/:postId/images/:imageId to a flat directory structure like s3://<your-bucket>/posts/images/:imageId, so that images would be a global thing shared by all posts instead of living under a specific post folder.

You will need a bare shell and the AWS CLI set up with your credentials. I will assume you already have all of that.

Step 1: Checking for name collisions

We could have post1/image1.png and post2/image1.png, so we have to check if the flattening is possible. Doing that is pretty simple: a few shell commands piped together give the answer in a matter of seconds.

We start by getting a list of all the files we want to migrate. For that we use:

$> aws s3 ls <your-bucket> --recursive
2021-06-07 12:24:28     390966 posts/1/images/image1.png
2021-06-08 13:09:16     120346 posts/1/images/image2.png
2021-06-07 12:23:37     144935 posts/2/images/image3.png
2021-06-07 12:23:37     144945 posts/3/images/image3.png
...

and we get a list with a timestamp, the size, and the path.

The second step is to filter out the information that we do not need:

$> aws s3 ls <your-bucket> --recursive | awk '{print $4}'
posts/1/images/image1.png
posts/1/images/image2.png
posts/2/images/image3.png
posts/3/images/image3.png

for that we use awk, which is a handy shell command (actually, it is a pattern-directed scanning and processing language) that lets you scan and detect patterns and extract information very quickly.
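If the bucket also holds keys that are not part of the migration, awk can filter them out in the same pass. The following is a minimal sketch, assuming keys contain no whitespace; the function name old_layout_keys is made up for illustration and is not part of the AWS CLI.

```shell
# Keep only keys that match the old layout posts/<postId>/images/<file>.
# Reads `aws s3 ls --recursive` output on stdin and prints the matching keys.
# Assumption: keys contain no whitespace, so the key is always field 4.
old_layout_keys() {
  awk '{
    n = split($4, part, "/")
    # Exactly four segments: posts / <postId> / images / <non-empty filename>
    if (n == 4 && part[1] == "posts" && part[3] == "images" && part[4] != "")
      print $4
  }'
}
```

It would be used as aws s3 ls <your-bucket> --recursive | old_layout_keys, which skips folder placeholders and anything already sitting under posts/images/.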

Since we are going to be flattening the directory structure, checking for duplicates only requires the filenames, not the full paths. To get them, the third step is:

$> aws s3 ls <your-bucket> --recursive | awk '{print $4}' | xargs -I {} basename {}
image1.png
image2.png
image3.png
image3.png

Here we use xargs which allows executing a command for each line that you pipe into it, and we combine it with basename to extract the filename of a given path.
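Running basename through xargs starts one process per file. That is fine for hundreds of files, but the same result can be had in a single process. A small sketch of that alternative, where basenames is a hypothetical helper name:

```shell
# Print only the filename (the last "/"-separated segment) of each key on stdin.
basenames() {
  awk -F/ '{ print $NF }'
}
```

For example, aws s3 ls <your-bucket> --recursive | awk '{print $4}' | basenames prints the same list as the xargs version above.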

The last step is to check for duplicated filenames. For that, we can use uniq -d, which prints the duplicated lines. Since uniq only compares adjacent lines, we sort the list first. If we put it all together and run it, we will know if we have duplicates:

$> aws s3 ls <your-bucket> --recursive | awk '{print $4}' | xargs -I {} basename {} | sort | uniq -d
image3.png
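Knowing that a filename collides is only half of the answer; to fix it you also need to know which paths are involved. Here is a sketch of one way to list them, assuming keys contain no whitespace. The function name colliding_keys is hypothetical.

```shell
# Reads `aws s3 ls --recursive` output on stdin and prints each duplicated
# filename followed by the full keys that share it.
colliding_keys() {
  awk '{
    key = $4
    n = split(key, part, "/")
    name = part[n]
    if (name == "") next              # skip folder placeholder keys
    count[name]++
    paths[name] = paths[name] "\n  " key
  }
  END {
    for (name in count)
      if (count[name] > 1) print name ":" paths[name]
  }'
}
```

Used as aws s3 ls <your-bucket> --recursive | colliding_keys, an empty output means the flattening is safe.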

If the output is empty, we are good and the flattening can be done straight away. If not, we need to rename files so that none of them collide after the flattening. I will leave the renaming up to you, since it will depend on how you are writing and reading files. In our case, I renamed posts/3/images/image3.png to posts/3/images/image4.png.
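If there are too many collisions to rename by hand, one option is to derive the new name from the old path. The sketch below uses a made-up scheme, <postId>-<filename>, purely as an illustration; it is not what was done in this migration, and whatever reads the images would have to know about the new names. The function name prefixed_names is hypothetical.

```shell
# Hypothetical naming scheme: <postId>-<filename>.
# Reads old-layout keys (posts/<postId>/images/<file>) on stdin and prints
# "old-key new-key" pairs for review.
prefixed_names() {
  awk -F/ '{ print $0 " posts/images/" $2 "-" $NF }'
}
```

Printing the pairs first, instead of renaming right away, makes it easy to check the mapping before touching the bucket.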

Step 2: Moving the files

The next step is to move the files or duplicate them. If you need a zero-downtime migration, you can copy first and delete later. If you don’t need a zero-downtime migration, you can move the files from the old location to the new one, and there will be a short period where files can't be found.
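The copy-first, delete-later variant can be sketched as two separate passes over the same key list. This is a rough outline under a few assumptions: keys contain no whitespace, the function names are made up, and AWS_CMD is a hypothetical override that lets you rehearse the run with echo instead of the real CLI.

```shell
# AWS_CMD is a hypothetical override; set AWS_CMD=echo to only print the calls.
AWS_CMD=${AWS_CMD:-aws}

# Phase 1: copy every key read from stdin to the flat location.
copy_flat() {
  bucket=$1
  while IFS= read -r key; do
    "$AWS_CMD" s3 cp "s3://$bucket/$key" "s3://$bucket/posts/images/${key##*/}"
  done
}

# Phase 2: once everything reads from the new location, delete the old keys.
remove_old() {
  bucket=$1
  while IFS= read -r key; do
    "$AWS_CMD" s3 rm "s3://$bucket/$key"
  done
}
```

Saving the key list to a file before phase 1 and feeding that same file to phase 2 keeps the delete pass from touching the freshly copied files.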

We will use the xargs command to move each file as we see it.

$> aws s3 ls <your-bucket> --recursive | awk '{print $4}' | xargs -I {} sh -c 'aws s3 mv s3://<your-bucket>/{} s3://<your-bucket>/posts/images/$(basename {})'

Let’s break it down! The first parts are the same as before. The xargs part starts similarly, but instead of running the command directly, it replaces every {} with the file path and hands the resulting command string to a subshell (sh -c). The subshell is what evaluates $(basename ...) for each file.

If you want to run it but are not sure if things will go as planned, you can add the --dryrun option to the aws s3 mv command, and that will simulate and print the result of the command without it happening.
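Because xargs pastes the path straight into the command string, a key containing characters such as $ or ; would be interpreted by the subshell. A slightly more defensive sketch passes the key as a positional argument instead and forwards any extra options, such as --dryrun, to aws s3 mv. The function name flatten_keys and the AWS_CMD override are assumptions made for this example.

```shell
# Reads keys on stdin and moves each one to posts/images/<filename>.
# Usage: ... | flatten_keys <bucket> [extra `aws s3 mv` options, e.g. --dryrun]
# AWS_CMD is a hypothetical override (AWS_CMD=echo prints the calls instead).
flatten_keys() {
  bucket=$1
  shift
  # The key reaches the subshell as $3, so it is never parsed as shell code.
  xargs -I {} sh -c '
    aws_cmd=$1 bucket=$2 key=$3
    shift 3
    "$aws_cmd" s3 mv "s3://$bucket/$key" "s3://$bucket/posts/images/${key##*/}" "$@"
  ' sh "${AWS_CMD:-aws}" "$bucket" {} "$@"
}
```

A first rehearsal could be aws s3 ls <your-bucket> --recursive | awk '{print $4}' | flatten_keys <your-bucket> --dryrun, dropping --dryrun once the printed operations look right.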

Some personal notes

This is a pretty handy trick to have if you need to move files. It works well with small amounts of files. It is also resistant to an internet connection failure; if your connection fails (or you kill the process, or it dies), you can rerun it, and it will pick up from where it left off.

If you have larger amounts of files, you can launch a bunch of background processes running this command, and it should work (you probably want to paginate results or smartly split them). On the other hand, if the number of files is too large, maybe a copy-on-write or copy-on-read strategy works better, but you will need some programming for that.
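One simple way to get several moves running at once is the -P option of xargs, which is not part of POSIX but is available in both the GNU and BSD versions. The following is only a sketch: flatten_parallel and AWS_CMD are hypothetical names, the default of 4 workers is arbitrary, and keys are assumed to contain no whitespace or quotes.

```shell
# Reads keys on stdin and moves them with up to <workers> `aws s3 mv`
# processes at a time.
# Usage: ... | flatten_parallel <bucket> [workers]
flatten_parallel() {
  bucket=$1
  workers=${2:-4}
  # $1 = CLI to run, $2 = bucket, $3 = key (substituted by xargs for {})
  xargs -P "$workers" -I {} sh -c '
    "$1" s3 mv "s3://$2/$3" "s3://$2/posts/images/${3##*/}"
  ' sh "${AWS_CMD:-aws}" "$bucket" {}
}
```

The moves finish in no particular order, so the output lines of the different workers may be interleaved.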

I hope this has been useful! If you have questions, please let me know in the comments.